Date of Defense

12-3-2025

Date of Graduation

12-2025

Department

Statistics

First Advisor

Kevin Lee

Second Advisor

Geumchan Hwang

Abstract

The Northwoods League (NWL) is a collegiate summer wooden-bat baseball league that provides college athletes with an opportunity to maintain and enhance their skills during the offseason while competing in game-like conditions. Although prior research has explored player performance in other summer leagues, such as the Cape Cod Baseball League, academic investigation into predictors of draft outcomes in the NWL remains limited. This study addresses this gap by evaluating whether pitch-tracking metrics collected via Trackman systems can predict a pitcher’s likelihood of selection in the Major League Baseball (MLB) draft and by identifying which performance characteristics are most strongly associated with draft probability.

Raw pitch-level data from the NWL spanning 2020-2025, comprising 964,931 observations and 100 variables, were aggregated into pitcher-level career statistics, resulting in a refined dataset of 1,725 pitchers, 60 of whom were drafted. Predictive features captured velocity, spin, movement, release mechanics, usage, and batted ball outcomes. Three complementary machine learning models were developed: logistic regression, Random Forest, and XGBoost. Model performance was assessed using area under the receiver operating characteristic curve (AUC-ROC), sensitivity, specificity, and balanced accuracy. Severe class imbalance (drafted vs. undrafted) was addressed through stratified train-test splitting and class-specific weighting.
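As a rough illustration of the imbalance handling described above, the sketch below performs a stratified split and computes class weights for a label vector with the abstract's counts (60 drafted, 1,665 undrafted). The 20% test fraction, the seed, the helper names, and the "balanced" weighting rule n / (k × n_c) are assumptions made for illustration only; the abstract does not state which split ratio or weighting scheme the thesis used.

```python
import random
from collections import Counter


def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class at the same rate."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def balanced_class_weights(labels):
    """Weight each class by n / (k * n_c), so rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {y: n / (k * c) for y, c in counts.items()}


# Label vector with the counts reported in the abstract (1 = drafted).
labels = [1] * 60 + [0] * 1665

# Assumed 80/20 split; the thesis's actual ratio is not given here.
train_idx, test_idx = stratified_split(labels, test_frac=0.2, seed=42)

# Weights are computed on the training labels only.
weights = balanced_class_weights([labels[i] for i in train_idx])
```

Under these assumptions the drafted class receives a weight many times larger than the undrafted class, which is the general effect class-specific weighting is meant to have on a rare positive class.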

Logistic regression achieved the highest AUC (0.893), indicating strong overall discriminative ability. XGBoost had a lower AUC (0.837) but still detected drafted pitchers at a practically useful rate once the classification threshold was lowered; even so, logistic regression showed better true/false positive ratios at most thresholds. Key predictors identified consistently across models included average fastball velocity, steeper approach angles, consistent spin orientations, control within an at-bat, and changeup contact rate. These findings align with conventional scouting priorities and indicate that objective pitch-tracking metrics can complement traditional evaluations.
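The effect of lowering the classification threshold can be sketched with a small example. The labels and predicted probabilities below are invented toy values, not results from the thesis, and the function name is hypothetical; the sketch only shows how sensitivity, specificity, and balanced accuracy shift as the threshold moves.

```python
def confusion_rates(y_true, scores, threshold):
    """Return (sensitivity, specificity, balanced accuracy) at a threshold.

    Assumes y_true contains at least one positive and one negative label.
    """
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, (sensitivity + specificity) / 2


# Toy data (hypothetical): 3 drafted pitchers, 7 undrafted.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.62, 0.35, 0.18, 0.40, 0.22, 0.15, 0.10, 0.08, 0.05, 0.03]

for threshold in (0.5, 0.3, 0.15):
    sens, spec, bal = confusion_rates(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  sensitivity={sens:.3f}  "
          f"specificity={spec:.3f}  balanced_accuracy={bal:.3f}")
```

In this toy case, each lower threshold recovers more of the drafted pitchers at the cost of more false positives, which is the trade-off a scouting department would weigh when choosing a cutoff.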

The study has practical applications for MLB scouting departments and Northwoods League teams, providing a data-driven framework for prioritizing prospects and identifying overlooked talent. Limitations include a small sample of drafted pitchers, lack of differentiation by draft round, and potential changes in scouting emphasis over time. Future work could expand the approach to other collegiate or developmental leagues, incorporate temporal validation, or integrate additional contextual factors such as player age, eligibility, and injury history. Overall, the study demonstrates the utility of Trackman data in predicting draft outcomes and highlights the potential for analytics to augment subjective scouting assessments in professional baseball.

Access Setting

Honors Thesis-Open Access
