Date of Award

4-1-2023

Degree Name

Doctor of Philosophy

Department

Statistics

First Advisor

Kevin H. Lee, Ph.D.

Second Advisor

Joshua Naranjo, Ph.D.

Third Advisor

Hyun Bin Kang, Ph.D.

Fourth Advisor

Jinseok Kim, Ph.D.

Keywords

High dimensional, LightGBM, machine learning, statistics, variable selection, XGBoost

Abstract

As data continue to grow rapidly in size and complexity, efficient and effective statistical methods are needed to detect the important variables (features). Variable selection is one of the most crucial problems in statistical applications. It arises whenever one wants to model the relationship between a response and its predictors. The goal is to reduce the predictors to a minimal set of explanatory variables that are truly associated with the response of interest, thereby improving model accuracy. Selecting the truly influential variables while controlling the False Discovery Rate (FDR) without sacrificing power has been a challenge in variable selection research and its applications. The recently proposed knockoff filter is a general framework that operates by generating knockoff variables, which are cheap to construct and mimic the correlation structure of the original variables. These knockoffs serve as negative controls for statistical inference. We propose an extension of the knockoff framework to machine learning using one of the fastest and most accurate gradient boosting techniques, the Light Gradient Boosting Machine (LightGBM). Machine learning is increasingly used in problems where no prior knowledge of the model's functional form is required. We use SHAP values both to interpret the black-box machine learning model and as a feature importance measure for identifying the important variables. The proposed method performed better in several respects and, across many data sets, proved to be faster and more efficient. It also identified the important variables related to each individual class more accurately, and was more applicable, than the traditional methods. We evaluate the proposed method through an extensive simulation study in terms of the FDR, the power to identify the important variables, and the computational time. The original knockoff filter and XGBoost were used for comparison. The proposed method was also applied to real data, and the results are discussed.
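
For readers unfamiliar with the knockoff filter mentioned above, its selection step can be sketched as follows. This is a minimal illustration of the standard knockoff+ threshold, not the dissertation's implementation. It assumes each variable j already has a statistic W_j, taken here (as an assumption) to be the importance of the original variable, for example its mean absolute SHAP value, minus the importance of its knockoff copy. The function names and the input format are hypothetical.

```python
def knockoff_plus_threshold(w, q):
    """Return the knockoff+ threshold for statistics w at target FDR level q.

    w : list of floats, one statistic per variable (assumed to be
        original importance minus knockoff importance).
    q : target false discovery rate, e.g. 0.1.
    """
    # Candidate thresholds are the distinct nonzero magnitudes of the statistics.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        # Large negative statistics act as negative controls: they estimate
        # how many of the large positive statistics are false discoveries.
        estimated_false = 1 + sum(1 for x in w if x <= -t)
        selected = max(1, sum(1 for x in w if x >= t))
        if estimated_false / selected <= q:
            return t
    # No threshold meets the target level, so nothing is selected.
    return float("inf")


def knockoff_select(w, q):
    """Return the indices of variables whose statistic reaches the threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]
```

The smallest threshold that satisfies the bound is used, so as many variables as possible are selected while the estimated proportion of false discoveries stays at or below q. If no threshold qualifies, the sketch selects no variables.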

Access Setting

Dissertation-Open Access
