Date of Award


Degree Name

Master of Science


Computer Science

First Advisor

Dr. Ajay Gupta

Second Advisor

Dr. James Springstead

Third Advisor

Dr. Jinxia Deng

Access Setting

Masters Thesis-Open Access


Protein secondary structure prediction (PSSP) involves determining the local conformations of the peptide backbone in a folded protein and is often the first step in resolving a protein's global folded structure. Accurate structure prediction has important implications for understanding protein function and for de novo protein design, and recent progress has been driven by deep learning methods such as convolutional and recurrent neural networks. In natural language processing, language models pretrained on large text corpora have been shown to learn representations that are useful for feature extraction and transfer learning across problem domains, most notably when labeled data for supervised learning is limited. This suggests that pretrained language models could benefit PSSP, since sequenced proteins vastly outnumber proteins whose folded structures are known.

In this work, we pretrain a large bidirectional language model (BDLM) on a nonredundant dataset of one million protein sequences gathered from UniRef50. The outputs and intermediate layer activations of this BDLM are incorporated into a bidirectional recurrent neural network (RNN) trained to predict secondary structure. We empirically assess the impact of the representations learned by our pretrained model on PSSP and analyze the mutual information between these representations and secondary structure labels.
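
To make the mutual information analysis concrete, the following is a minimal sketch of a plug-in estimate of I(X; Y) between a discretized representation and per-residue secondary structure labels. It is an illustration only, not the estimator used in this work: the function name `mutual_information`, the assumption that activations have already been discretized into integer bins, the three-state labels (H, E, C), and the toy data are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(features, labels):
    """Plug-in estimate of I(X; Y) in bits for two equal-length discrete sequences.

    `features` is assumed to hold one discretized activation per residue
    (hypothetical preprocessing); `labels` holds one structure label per residue.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    n = len(features)
    if n == 0:
        raise ValueError("sequences must be non-empty")

    joint = Counter(zip(features, labels))  # counts of (x, y) pairs
    count_x = Counter(features)             # marginal counts of x
    count_y = Counter(labels)               # marginal counts of y

    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy data, made up for illustration: one binned activation per residue.
toy_labels = ["H", "H", "E", "E", "C", "C"]
informative_bins = [0, 0, 1, 1, 2, 2]    # bin identifies the label exactly
uninformative_bins = [0, 0, 0, 0, 0, 0]  # constant, carries no information

mi_informative = mutual_information(informative_bins, toy_labels)
mi_uninformative = mutual_information(uninformative_bins, toy_labels)
```

In this toy case the informative bins recover the full label entropy (log2 of 3 bits for three equally frequent labels), while the constant feature yields zero. On real data, a plug-in estimate of this kind is biased upward for small samples and depends on the binning scheme, so it is best read as a relative comparison between representations.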