An Exploration of Data Mining Approach in Prediction of the Use of Physical and Occupational Therapy in US Adults

Date of Award


Degree Name

Doctor of Philosophy


Interdisciplinary Health Sciences

First Advisor

Dr. Amy B. Curtis

Second Advisor

Dr. Carl Lee

Third Advisor

Dr. Stacie Fruth


Data mining, physical therapy, occupational therapy, predictive analytics, machine learning, healthcare services research


Physical therapists and occupational therapists appear to be in over- or undersupply relative to the adult population at the county and state levels in the US. The supply of physical therapy (PT) and occupational therapy (OT) services could be optimized more precisely if the utilization of PT/OT could be predicted from the characteristics of an adult. Prior prediction analyses of the utilization of PT/OT services are either outdated or limited by sampling. Publicly available survey data on national, yearly samples of US adults offer an opportunity to address this gap in knowledge. This opportunity can be better leveraged with emerging methods of data mining or machine learning, which combine computation with statistics. Data mining can allow for future automation of forecasts of PT/OT utilization and consequently provide dynamic support for business and policy decisions to optimize the supply of PT/OT services. Therefore, the aim of this dissertation is to build and validate machine learning models to predict the use of PT/OT services in the US adult population using publicly available survey data.

Methods: Logistic regression, neural network, and decision tree models were first trained on 2012 National Health Interview Survey (NHIS) data on US adults (n = 34,083) and compared on their prediction of whether a sampled adult used PT/OT services. To seek further gains in the generalizability of predictive modeling, averaged models based on ensemble theory were then built and compared. These included decision tree variants that use bootstrapping (bagging, random forest) and gradient boosting. The stability of explanatory variables was examined across models, and variables important for prediction were identified. Finally, the best of these models were empirically tested on the 2013 (n = 34,296) and 2014 (n = 36,359) NHIS samples for predictive accuracy.
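To illustrate the model-averaging step described in the Methods, the following is a minimal sketch that assumes "averaging" means taking the unweighted mean of each model's predicted probability of PT/OT use for each respondent. The function names, the 0.5 cutoff, and the probability values are hypothetical and are not taken from the dissertation.

```python
from statistics import fmean


def average_ensemble(prob_by_model):
    """Return the unweighted mean predicted probability per respondent.

    prob_by_model holds one list of probabilities per model, all scored
    on the same respondents in the same order.
    """
    n = len(prob_by_model[0])
    if any(len(probs) != n for probs in prob_by_model):
        raise ValueError("all models must score the same respondents")
    # zip(*...) regroups the values so each tuple holds one respondent's
    # probabilities across all models.
    return [fmean(per_respondent) for per_respondent in zip(*prob_by_model)]


def classify(probs, threshold=0.5):
    """Label a respondent as a PT/OT user (1) if probability >= threshold."""
    return [1 if p >= threshold else 0 for p in probs]


# Hypothetical predicted probabilities for four respondents (illustration only).
logit_p = [0.10, 0.60, 0.30, 0.80]  # logistic regression
nnet_p = [0.20, 0.40, 0.50, 0.90]   # neural network
tree_p = [0.00, 0.80, 0.40, 0.70]   # decision tree

ensemble_p = average_ensemble([logit_p, nnet_p, tree_p])
labels = classify(ensemble_p)
```

Averaging probabilities rather than hard class labels keeps each model's degree of confidence in the combined prediction; whether the dissertation weighted the models equally is an assumption of this sketch.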

Results: Models built on 2012 data showed promising receiver operating characteristic (ROC) curve indexes, ranging from 0.722 to 0.823. The best model was the ensemble that averaged the logistic regression, neural network, and decision tree models. This model also performed consistently well when empirically tested on 2013 (misclassification rate = 9.32%) and 2014 (10.35%) data, though differences across models were small overall. Input variables that were significant for their predictive association with the use of PT/OT in more than half of the models included having seen a medical specialist, a higher number of office visits, having been hospitalized, having a health problem that requires special equipment, a higher frequency of strength activity, surgery, joint pain/aching/stiffness, difficulty standing for 2 hours without special equipment, difficulty pushing large objects without special equipment, and low back pain.
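The two performance measures reported in the Results can be sketched as follows. This assumes the ROC curve index is the area under the ROC curve, computed here as the share of (user, non-user) pairs in which the user receives the higher score, with ties counted as one half. The function names and example values are hypothetical and do not reproduce the study's data.

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of respondents whose predicted label differs from the truth."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def roc_index(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the probability that a randomly chosen user (label 1) is scored
    higher than a randomly chosen non-user (label 0); ties count as 0.5.
    The pairwise loop is quadratic, so this is for illustration only.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical observed PT/OT use and model scores for five respondents.
y_true = [0, 0, 1, 1, 0]
scores = [0.1, 0.7, 0.6, 0.8, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

rate = misclassification_rate(y_true, y_pred)
auc = roc_index(y_true, scores)
```

The misclassification rate depends on the chosen cutoff, while the ROC index summarizes ranking quality across all cutoffs, which is one reason to report both.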

Conclusions: A data mining approach that deploys multiple models and model averaging to predict whether an adult will use PT/OT can potentially be translated to practice to support business and policy decisions aimed at matching the supply of PT/OT services to the needs of population units in the US. Future research may explore local-level data sources such as electronic health records, consistent with privacy protection laws, to drive prediction analyses.

Access Setting

Dissertation-Abstract Only

Restricted to Campus until

