Masters Theses

Exploring the Impact of Pretrained Bidirectional Language Models on Protein Secondary Structure Prediction

Dillon G. Daudert

Date of Award

12-2018

Degree Name

Master of Science

Department

Computer Science

First Advisor

Dr. Ajay Gupta

Second Advisor

Dr. James Springstead

Third Advisor

Dr. Jinxia Deng

Keywords

Neural network, recurrent, protein, structure, language model

Access Setting

Masters Thesis-Open Access

Abstract

Protein secondary structure prediction (PSSP) involves determining the local conformations of the peptide backbone in a folded protein, and is often the first step in resolving a protein's global folded structure. Accurate structure prediction has important implications for understanding protein function and for de novo protein design, and recent progress has been driven by deep learning methods such as convolutional and recurrent neural networks. Language models pretrained on large text corpora have been shown to learn useful representations for feature extraction and transfer learning across problem domains in natural language processing, most notably where labeled data for supervised learning is limited. This raises the possibility that pretrained language models can improve PSSP, since sequenced proteins vastly outnumber proteins whose folded structures are known.

In this work, we pretrain a large bidirectional language model (BDLM) on a nonredundant dataset of one million protein sequences gathered from UniRef50. The outputs and intermediate layer activations of this BDLM are incorporated into a bidirectional recurrent neural network (RNN) trained to predict secondary structure. We provide an empirical assessment of the impact the representations learned by our pretrained model have on PSSP and analyze the mutual information of these representations with secondary structure labels.

Recommended Citation

Daudert, Dillon G., "Exploring the Impact of Pretrained Bidirectional Language Models on Protein Secondary Structure Prediction" (2018). Masters Theses. 3806.
https://scholarworks.wmich.edu/masters_theses/3806

Included in

Programming Languages and Compilers Commons
