Unsupervised Learning With Word Embeddings Captures Quiescent Knowledge From COVID-19 And Materials Science Literature
Date of Award
Doctor of Philosophy
Dr. Elise de Doncker
Dr. Alvis Fong
Dr. Pnina Ari-Gur
COVID-19 drugs, COVID-19 vaccines, giant magnetocaloric effect, laser powder bed fusion, unsupervised learning, word2vec
Millions of scientific papers are published each year, and the scientific literature continues to grow at a rapid pace. Massive scientific knowledge thus exists as unstructured text, but the sheer volume of publications makes it difficult, if not impossible, for researchers to keep up to date with discoveries, even within a narrow scientific area. This flood of information also makes it hard to find the implicit, hidden connections, relationships, and dependencies that may guide the direction of future research or lead to valuable new insights. There is therefore a need for algorithms or models that can scan the text of millions of papers to uncover new scientific knowledge and search for hidden connections within it. For computer algorithms to utilize this resource, the text must be converted into numbers, representing each word in some mathematical form. This is where artificial intelligence and machine learning can help: advanced algorithms in machine learning and natural language processing can make large databases more useful and easier to handle for both researchers and clinicians. We used Word2Vec for our implementation and trained several unsupervised word-embedding models on different data sets in materials science and in the medical field to extract hidden knowledge, relations, and interactions, exploiting the observation that words appearing in similar contexts often have similar meanings. We developed three main models. The first is trained on the additive manufacturing (AM) literature, targeting the powder bed fusion (PBF) processes, such as selective laser sintering (SLS), selective laser melting (SLM), and direct metal laser sintering (DMLS), with the goal of extracting new knowledge to improve AM processes and address material properties depending on the process used. Other properties inherent to the materials, such as the giant magnetocaloric effect, are also addressed in a dedicated model.
The second model is trained on the COVID-19 drug literature to investigate what insights can be obtained about candidate drugs for treating COVID-19. Finally, the third model is trained on the COVID-19 vaccine literature to predict good vaccine candidates. We thus demonstrate how word embeddings can help extract hidden knowledge from the published literature in very distinct areas of research.
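The distributional principle underlying these word-embedding models (words that appear in similar contexts receive similar vectors) can be illustrated with a minimal count-based sketch. This is not the dissertation's Word2Vec pipeline, and the toy corpus and window size are assumptions chosen purely for illustration; Word2Vec learns dense vectors with a neural objective, but the same context-similarity intuition drives both.

```python
# Minimal sketch (not the dissertation's pipeline): sparse count-based
# context vectors illustrate the distributional idea behind Word2Vec.
# The toy corpus and the window size below are illustrative assumptions.
from collections import Counter
from math import sqrt

corpus = [
    "laser melting of the metal powder",
    "laser sintering of the metal powder",
    "the vaccine induced an immune response",
    "the drug induced an immune response",
]
sentences = [s.split() for s in corpus]

def context_vector(word, window=2):
    """Counter of words co-occurring with `word` within +/- window tokens."""
    counts = Counter()
    for toks in sentences:
        for i, tok in enumerate(toks):
            if tok == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(w for w in toks[lo:hi] if w != word)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "melting" and "sintering" share contexts; "melting" and "vaccine" barely do.
sim_close = cosine(context_vector("melting"), context_vector("sintering"))
sim_far = cosine(context_vector("melting"), context_vector("vaccine"))
print(sim_close > sim_far)  # True
```

In a trained Word2Vec model the analogous query would rank "sintering" among the nearest neighbors of "melting", which is the mechanism the three models above exploit to surface candidate materials, drugs, and vaccines from context alone.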
Gharaibeh, Tasnim H., "Unsupervised Learning With Word Embeddings Captures Quiescent Knowledge From COVID-19 And Materials Science Literature" (2022). Dissertations. 3823.