Date of Award

6-2024

Degree Name

Doctor of Philosophy

Department

Evaluation

First Advisor

Brooks Applegate, Ph.D.

Second Advisor

Ya Zhang, Ph.D.

Third Advisor

Warren Lacefield, Ph.D.

Abstract

This dissertation examined classification outcome differences among four popular individual supervised machine learning (ISML) models (logistic regression, decision tree, support vector machine, and multilayer perceptron) when predicting minor class membership within imbalanced datasets. The study context and the theoretical population sampled focus on one aspect of the larger problem of student retention and dropout prediction in higher education (HE): identification.

This study differs from current literature by implementing an experimental design approach with simulated data that closely mirror HE situational and student data. Specifically, this study tested the predictive ability of the four ISML classification models (CLS) under experimentally manipulated conditions: total sample size (TS), minor class proportion (MCP), training-to-testing sample size ratio (TTSS), and the application of bagging techniques during model training (BAG). Using this 4-between, 1-within mixed design, five different outcome measures (precision, recall/sensitivity, specificity, F1-score, and AUC) were examined and analyzed individually.
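For readers less familiar with the five outcome measures, the following is a minimal illustrative sketch of how they are conventionally computed for a binary problem in which the minor class is coded 1. It is not the dissertation's code: the function names, the 0.5 decision threshold, and the ten-case toy data are assumptions made only for illustration.

```python
from typing import Dict, Sequence


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall/sensitivity, specificity, and F1 for the class coded 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # minor class correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # major class wrongly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # major class correctly passed
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # minor class missed
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (minor, major) pairs in which the minor-class case
    receives the higher score; tied scores count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case from each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 10 cases, minor class proportion of 0.2.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.1, 0.05, 0.2, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, scores)
print(metrics)
```

In this toy case one of the two minor-class cases is missed, so recall is only 0.5 even though specificity is 0.875, which illustrates why the measures are examined individually rather than summarized by a single accuracy figure.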

For each outcome measure, findings revealed multiple statistically significant interactions among classifier models and design variables. Simple effect analyses of these interactions highlighted how TS, MCP, TTSS, and BAG affect each of the five measures of classification performance differently. For instance, the presence of interactions involving MCP underscores the importance of informed modeling of class distribution for enhancing overall model predictive capability and performance.

Such insights regarding how the experimental variables can critically affect different measures of classification success advance our understanding of how these four ISML models might be optimized for the prediction of student-at-risk status within imbalanced datasets. This dissertation provides a framework for using these or similar ISML models more effectively in HE. It points toward the development of predictive modeling methods that are more useful and perhaps more equitable by demonstrating empirically the impact of one of the most challenging aspects of implementing machine learning in HE: maximizing the accurate identification of the minority class. This work contributes to the use of machine learning in HE and will help inform its use in smaller and larger educational research communities by providing strategies for improving the prediction of student dropout.

Access Setting

Dissertation-Open Access
