Date of Award


Degree Name

Doctor of Philosophy


Department

Computer Science

First Advisor

Dr. Ajay Gupta

Second Advisor

Dr. Rajeev Agrawal

Third Advisor

Dr. Xiaozhong Liu

Fourth Advisor

Dr. Edward Eckel


Keywords

machine learning, supervised learning, unsupervised learning, natural language processing, digital indexing, ontology

Abstract


Scientific research papers document the work of scientists around the world and are spread across a multitude of technical conference proceedings and other publications. Given the volume of this literature, automating the extraction of the key areas of research and providing access to the resulting information would make literature searches considerably easier for researchers and, in turn, help them further their research agendas. With this goal in mind, the focus of our research is to design, analyze, and implement machine learning algorithms that extract useful information from research publications for researchers across a wide spectrum of scientific fields.

Across scientific fields, many topics are studied and developed within different subject areas. Identifying trending topics and giving them a structure can be especially challenging because authors of research papers represent topics subjectively. These challenges are exacerbated by the fact that the majority of the data in research papers is text, and complete, efficient mining of text data still has many open problems. Our research addresses some of these challenges and aims to make browsing, searching, and summarizing state-of-the-art research across scientific publications easier, especially for a new entrant to a scientific field.

To automate the extraction of useful information, we characterize the data in terms of the type of information or knowledge that we seek from research publications. Specifically, for Computer Science publications, we characterize words or phrases from the text as representing the topics, specific problem areas, and techniques presented in research papers. We do this by investigating the features of a word or phrase that make it a potential candidate for representing a topic, mining information from strategic locations in research papers. We present a methodology to learn the topics that represent the current state-of-the-art research in a given time period, within a subject area of a scientific field. The precision and recall of our model show that it achieves consistently good results.
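
As a rough illustration of the idea in the preceding paragraph, the sketch below scores candidate phrases by where they occur in a paper and then measures precision and recall against a hand-labeled set. It is not the dissertation's model: the choice of title, keywords, and abstract as the "strategic locations", the two-location rule in `is_topic`, and all function and field names are assumptions made for this example.

```python
# Assumption: which sections count as "strategic locations" is not stated in
# the abstract; title, keywords, and abstract are used here for illustration.
STRATEGIC = ("title", "keywords", "abstract")


def candidate_features(paper, phrase):
    """Return location flags and a body frequency count for one phrase."""
    p = phrase.lower()
    feats = {f"in_{loc}": p in paper.get(loc, "").lower() for loc in STRATEGIC}
    feats["body_count"] = paper.get("body", "").lower().count(p)
    return feats


def is_topic(feats):
    """Toy rule standing in for a learned classifier (hypothetical):
    accept a phrase that appears in at least two strategic locations."""
    return sum(feats[f"in_{loc}"] for loc in STRATEGIC) >= 2


def precision_recall(predicted, gold):
    """Precision and recall of a predicted topic set against a labeled set."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


# Made-up example paper and labels, for demonstration only.
paper = {
    "title": "Topic Modeling for Digital Libraries",
    "keywords": "topic modeling, digital libraries",
    "abstract": "We study topic modeling of research papers.",
    "body": "Topic modeling groups documents by theme.",
}
candidates = ["topic modeling", "digital libraries", "research papers"]
predicted = {c for c in candidates if is_topic(candidate_features(paper, c))}
gold = {"topic modeling", "research papers"}
precision, recall = precision_recall(predicted, gold)
```

In a real setting the hand-written rule would be replaced by a classifier trained on such features, and the labeled set would come from expert annotation.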

In the field of computing, the Association for Computing Machinery Computing Classification System (ACM CCS) is an indexing scheme whose groups of topics are used to index research articles in digital libraries. To facilitate literature search, we use the topics we have learned and present a technique that generates new clusters, or groups, which provide insight into how these learned topics can be incorporated into the existing groups of the ACM CCS. We also evaluate how the existing groups may need to be rearranged to reflect the current state of research. We have performed extensive experiments using digital libraries of research publications in the field of Computer Science to illustrate and validate our techniques.
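
One simple way to picture how learned topics might be matched against existing groups is sketched below. It uses Jaccard overlap of word tokens, which is a stand-in chosen for this example and not the clustering technique developed in the dissertation. The group names, their members, the threshold value, and the function names are all illustrative assumptions rather than actual ACM CCS content.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between the word sets of two phrases."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def place_topics(topics, groups, threshold=0.2):
    """Attach each topic to its best-matching group, or mark it unplaced.

    Unplaced topics are candidates for new groups. The threshold is an
    arbitrary value chosen for this sketch.
    """
    placed = {name: [] for name in groups}
    unplaced = []
    for topic in topics:
        best, best_score = None, 0.0
        for name, members in groups.items():
            score = max((jaccard(topic, m) for m in members), default=0.0)
            if score > best_score:
                best, best_score = name, score
        if best is not None and best_score >= threshold:
            placed[best].append(topic)
        else:
            unplaced.append(topic)
    return placed, unplaced


# Toy groups and topics, for demonstration only.
groups = {
    "Machine learning": ["supervised learning", "unsupervised learning"],
    "Information retrieval": ["document indexing", "query processing"],
}
topics = ["deep learning", "semantic indexing", "quantum annealing"]
placed, unplaced = place_topics(topics, groups)
```

Here "deep learning" and "semantic indexing" each share one word with a member of an existing group, while "quantum annealing" matches nothing and is left as a candidate for a new group.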

Access Setting

Dissertation-Open Access