Exploring the Impact of Pretrained Bidirectional Language Models on Protein Secondary Structure Prediction
Date of Award
2018

Degree Name
Master of Science

Thesis Committee
Dr. Ajay Gupta
Dr. James Springstead
Dr. Jinxia Deng

Keywords
Neural network, recurrent, protein, structure, language model

Access Setting
Masters Thesis - Open Access
Protein secondary structure prediction (PSSP) involves determining the local conformations of the peptide backbone in a folded protein, and is often the first step in resolving a protein's global folded structure. Accurate structure prediction has important implications for understanding protein function and de novo protein design, with progress in recent years being driven by the application of deep learning methods such as convolutional and recurrent neural networks. Language models pretrained on large text corpora have been shown to learn useful representations for feature extraction and transfer learning across problem domains in natural language processing, most notably in instances where the amount of labeled data for supervised learning is limited. This presents the possibility that pretrained language models can have a positive impact on PSSP, as sequenced proteins vastly outnumber those proteins whose folded structures are known.
In this work, we pretrain a large bidirectional language model (BDLM) on a nonredundant dataset of one million protein sequences drawn from UniRef50. The outputs and intermediate-layer activations of this BDLM are then incorporated as input features into a bidirectional recurrent neural network (RNN) trained to predict secondary structure. We provide an empirical assessment of the impact the representations learned by our pretrained model have on PSSP, and analyze the mutual information between these representations and secondary structure labels.
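To illustrate the kind of analysis the abstract refers to, the following is a minimal sketch of a plug-in mutual information estimate between a discretized representation feature and secondary structure labels. The variable names, the binarized feature, and the toy three-state (H/E/C) label sequence are hypothetical and not taken from the thesis; a real analysis would discretize continuous BDLM activations and use far more samples.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # empirical joint counts
    px = Counter(xs)               # marginal counts of X
    py = Counter(ys)               # marginal counts of Y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy * log2( p_xy / (p_x * p_y) )
        mi += p_xy * log2(p_xy * n * n / (px[x] * py[y]))
    return mi

# Hypothetical data: one binarized feature vs. 3-state labels (helix/strand/coil)
feature = [0, 0, 1, 1, 0, 1, 0, 1]
labels  = ["H", "H", "E", "E", "H", "E", "C", "C"]
print(round(mutual_information(feature, labels), 3))  # prints 0.75
```

A higher estimate indicates that the feature carries more information about the secondary structure label; in practice such plug-in estimates are biased upward for small samples, so corrections or larger datasets are needed for reliable conclusions.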
Daudert, Dillon G., "Exploring the Impact of Pretrained Bidirectional Language Models on Protein Secondary Structure Prediction" (2018). Masters Theses. 3806.