Date of Award


Degree Name

Doctor of Philosophy


Civil and Construction Engineering

First Advisor

Dr. Valerian Kwigizile

Second Advisor

Dr. Jun-Seok Oh

Third Advisor

Dr. Kevin Lee


Text mining, unstructured data, transportation safety, data mining, crowdsourced data, machine learning


The unprecedented growth in the volume and influx of structured and unstructured data has overwhelmed the capacity of conventional data management systems to organize, analyze, and extract useful information in a timely fashion. Structured data sources follow a predefined pattern, which makes data preprocessing and information retrieval relatively easy for current technologies designed to handle structured, repeatable data. Unstructured data, by contrast, usually exists in an unorganized format that offers little or no insight unless it is indexed and stored in an organized fashion. This inherent format makes data preprocessing and information extraction more difficult. As a result, despite the vast amount of unstructured data available, most decisions are based mainly on information extracted from structured data.

The objective of this research is to explore text and data mining methods that can be leveraged in the transportation safety domain to better integrate unstructured textual information into the decision-making process. Case studies in the field of transportation are explored using police crash narratives from Michigan and self-reported collision and near-miss reports from a crowdsourcing platform. Each case study covers a distinct data and text mining approach. In transportation safety, millions of police crash report narratives describing crash scenarios are generated each year in the US. Apart from these official police reports, road users have access to crowdsourcing platforms on which they can describe incidents, such as near misses and collisions, that occur while sharing the road space. The information contained in these unstructured textual sources can offer salient knowledge that helps improve existing infrastructure safety and services. The advantages and challenges of combining extracted textual information with traditional structured crash data are thoroughly discussed.

The first case study evaluates a way of integrating structured crash metadata with unstructured crash narratives. The proposed procedure is tested on pedestrian crossing-related crashes at undesignated midblock locations. Both structured crash data and report narratives are used to discern the human, environmental, and roadway factors associated with these crashes. The main emphasis is on the contribution of crash narratives to understanding the patterns and causes of pedestrian crashes. The textual features extracted from the crash narratives indicated that the most important predictor of pedestrian fatalities was a pedestrian wearing dark clothing while crossing the road. Clothing information was available only in the crash narratives. Further, the ability of the Random Forest model to predict fatalities among pedestrians crossing at undesignated midblock locations improved when the textual features extracted from the crash narratives were incorporated in model calibration. The case study highlights the importance of incorporating information from unstructured textual sources in transportation safety studies.
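
As a minimal sketch of the integration idea in the first case study, the Python snippet below derives a binary indicator from a narrative and places it alongside structured fields in a single feature row. The field names, the example record, and the regular expression are illustrative assumptions, not the dissertation's actual variables or extraction rules, and the Random Forest step itself is omitted.

```python
import re

# Hypothetical pattern: the abstract says only that dark clothing was
# mentioned in narratives, not how those mentions were detected.
DARK_CLOTHING = re.compile(r"\b(dark|black)\s+(clothing|clothes)\b", re.IGNORECASE)


def build_features(record):
    """Merge structured crash fields with a narrative-derived indicator."""
    features = {
        # Structured fields (assumed names) copied or recoded directly.
        "dark_lighting": int(record["lighting"] == "dark"),
        "speed_limit": record["speed_limit"],
    }
    # Text-derived feature: 1 if the narrative mentions dark clothing.
    features["narr_dark_clothing"] = int(
        bool(DARK_CLOTHING.search(record["narrative"]))
    )
    return features


# Invented example record, for illustration only.
crash = {
    "lighting": "dark",
    "speed_limit": 45,
    "narrative": "Pedestrian wearing dark clothing crossed midblock and was struck.",
}
row = build_features(crash)
```

Rows built this way could be passed to any tabular classifier, so a model fitted with and without the narrative-derived columns can be compared on the same crashes.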

The second case study evaluates and proposes efficient ways of automating information extraction using text analytics and data mining. The analysis uses reports of crashes at signal-controlled intersections in Michigan involving at-fault drivers who were issued a “fail to yield” or “disregard traffic control” hazardous action citation. Semantic n-gram feature analysis is used to discern the most likely crash scenario at signal-controlled intersections for each hazardous action. Support vector machines and boosted classification trees are developed using unigram and bigram features, under different n-gram feature deployment scenarios, to predict hazardous action citations. Further, the developed text-based algorithm proved promising in detecting possible errors made by police officers when coding hazardous actions in crash reports. State agencies can use these findings and the proposed methodology to improve future editions of their crash reporting manuals by providing detailed descriptions of crash contributing factors.
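
To illustrate what unigram and bigram features look like, the following Python sketch counts one-word and two-word sequences in a single narrative. The tokenizer and the sample sentence are assumptions made for illustration; the dissertation's actual preprocessing and classifiers are not reproduced here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(text, n_values=(1, 2)):
    """Count every n-gram in the text for each n in n_values."""
    tokens = tokenize(text)
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Invented narrative, for illustration only.
narrative = "Driver 1 failed to yield while turning left on a green light."
features = ngram_counts(narrative)
```

Count dictionaries like `features` can be stacked into a document-term matrix, which is the usual input to classifiers such as support vector machines or boosted trees.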

The third case study covers another aspect of text mining, namely topic modeling. Topic models are unsupervised probabilistic models that enable users to search and explore documents based on the underlying themes that make up each document. This case study explores the prevalence and co-occurrence of themes in fatal traffic crashes using structural topic modeling and network topology. The study uses Michigan fatal traffic crash narratives to generate topics, which are categorized mainly into pre-crash events, crash locations, and parties involved in a crash. Various topics are discovered, and topic prevalence is observed to vary across crash types. The centrality of topics and the associations between them are also observed to vary across crash types. Further, the results indicate that crash typing and consistency checks can be automated with a reasonable level of accuracy using latent themes extracted from the crash narratives. The proposed text-based framework can therefore form part of advanced, rigorous quality control of police crash reports and other safety-related reports.
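
The co-occurrence and centrality analysis can be sketched as follows, assuming per-narrative topic proportions are already available from a fitted topic model. The topic names, the proportions, and the 0.2 presence threshold are hypothetical, and degree centrality is used here as one simple network measure; the abstract does not state which measures the study applied.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(doc_topics, threshold=0.2):
    """Count how often two topics are both prominent in the same document."""
    edges = Counter()
    for proportions in doc_topics:
        present = sorted(t for t, p in proportions.items() if p >= threshold)
        for a, b in combinations(present, 2):
            edges[(a, b)] += 1
    return edges


def degree_centrality(edges, topics):
    """Share of the other topics that each topic is linked to."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {t: degree[t] / (len(topics) - 1) for t in topics}


# Invented topic proportions for three narratives, for illustration only.
doc_topics = [
    {"lane_departure": 0.5, "rural_road": 0.25, "pedestrian": 0.25},
    {"lane_departure": 0.6, "rural_road": 0.4},
    {"pedestrian": 0.7, "intersection": 0.3},
]
topics = ["intersection", "lane_departure", "pedestrian", "rural_road"]
edges = cooccurrence_edges(doc_topics)
centrality = degree_centrality(edges, topics)
```

Running the same computation separately on the narratives of each crash type would make it possible to compare topic centrality across crash types.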

The fourth case study extends the topic modeling work by incorporating an advanced machine learning technique, namely Artificial Neural Networks. Artificial Neural Networks (ANNs), sometimes known as connectionist systems, provide a framework that allows different machine learning algorithms to work together in solving complex tasks. Exploratory text mining, topic modeling, and an ANN are used to study self-reported cyclist near-miss and collision reports. The benefit of using text mining and machine learning in this case study is the ability to automatically provide a broad snapshot of near-miss and collision events from the textual data. Using the proposed text-based ANN framework, this study not only exposes topics that led to near misses but also ranks topics by how likely each topic’s scenario is to result in a collision. The methodology helps identify the most critical topics related to cyclist safety, which require in-depth analysis and discussion to produce actionable insights.

Lastly, an online tool is created that brings together the text and data mining features explored in the case studies. The tool provides a simple-to-use graphical user interface so that users with limited statistical and programming skills can still extract information from textual data. Users upload textual data and the associated metadata. The tool then automatically preprocesses the textual data and produces ready-to-use results based on the user’s preferences. The interactive tool can help planners, engineers, and other stakeholders in the transportation safety domain harness the power of text and data mining.

Access Setting

Dissertation-Open Access