Date of Award
Doctor of Philosophy
Dr. Ajay Gupta
Dr. Ala Al-Fuqaha
Dr. James Springstead
Dr. Alexey Nesvizhskii
Proteomics, machine learning, optimization, Big Data, Comet, parallel processing
Despite the availability of several protein search engines, the growing volume of MS/MS data and the growing size of protein databases make more efficient data analysis and reduction methods important. Improving the accuracy and performance of protein identification is a main goal of the proteomics research community. In this research, a holistic solution for improving search performance is developed.
Most current search engines apply the SEQUEST style of searching protein databases to identify MS/MS spectra. SEQUEST involves three main phases: (i) indexing the protein databases, (ii) matching and ranking the MS/MS spectra, and (iii) filtering the matches and reporting the final proteins. Technical analysis of each phase revealed several potential improvements, which have been implemented in a holistic, multidimensional approach. This dissertation focuses on the challenges and limitations of current protein search engines and provides solutions to address them. Primarily, the indexing and searching phases are optimized.
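As a rough illustration of what the indexing phase involves, the following minimal sketch digests protein sequences into peptides, sorts them by mass, and retrieves candidates for a precursor mass by binary search. It is not the indexing method developed in the dissertation: the simplified trypsin rule (no missed cleavages), the mass tolerance, and all function names are assumptions made for this example.

```python
import bisect
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the summed residue masses


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def build_index(proteins):
    """Return parallel lists (masses, peptides), sorted by mass."""
    peptides = {pep for prot in proteins for pep in tryptic_peptides(prot)}
    entries = sorted((peptide_mass(pep), pep) for pep in peptides)
    return [m for m, _ in entries], [pep for _, pep in entries]


def candidates(index, precursor_mass, tol=0.02):
    """Peptides whose mass lies within +/- tol Da of the precursor mass."""
    masses, peptides = index
    lo = bisect.bisect_left(masses, precursor_mass - tol)
    hi = bisect.bisect_right(masses, precursor_mass + tol)
    return peptides[lo:hi]


index = build_index(["MKWVTFISLLR", "GASPKAAR"])
print(candidates(index, peptide_mass("AAR")))
```

Sorting once and answering each query by binary search is what keeps the later matching phase from scanning the whole database for every spectrum.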
This dissertation describes the indexing phase of a commonly used search engine and provides an alternative solution, based on code and data structure optimizations, that makes indexing more efficient and less computationally intensive. This method may be applied in metaproteomics studies, where large protein databases are typically used to identify proteins from complex samples and individual organisms are not identified in advance. In the searching phase, a deep learning algorithm and several shallow learning algorithms are tested to reduce the computational load of the matching process. The main objective is to reduce the unnecessary load introduced by “possibly irrelevant” MS/MS spectra. The deep learning algorithm may be especially useful when the protein(s) of interest are present at lower cellular or tissue concentrations, while the other algorithms may be more useful for concentrated or more highly expressed proteins. To improve identification accuracy, particle swarm optimization (PSO) is used to configure the search engine parameters, thereby optimizing the matching process. Experimentally, the PSO model shows encouraging results and addresses some limitations of previous work on parameter configuration.
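To make the PSO step concrete, the sketch below shows a generic particle swarm that maximizes a score over bounded parameters. It is a textbook PSO under stated assumptions, not the dissertation's model: the two parameters (a precursor tolerance and a fragment bin width), their bounds, and `toy_score` are hypothetical stand-ins. In practice the score would come from running the search engine with the candidate parameters and measuring identification quality.

```python
import random


def pso_maximize(score, bounds, n_particles=20, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Generic PSO: returns (best_position, best_score) within bounds."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]          # per-particle best positions
    best_val = [score(p) for p in pos]      # per-particle best scores
    g = max(range(n_particles), key=best_val.__getitem__)
    gbest_pos, gbest_val = best_pos[g][:], best_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (gbest_pos[d] - pos[i][d]))
                # Clamp the new position to the allowed parameter range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = score(pos[i])
            if val > best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val > gbest_val:
                    gbest_pos, gbest_val = pos[i][:], val
    return gbest_pos, gbest_val


def toy_score(params):
    """Hypothetical objective standing in for a real search-engine run."""
    tolerance, bin_width = params
    return -((tolerance - 10.0) / 10.0) ** 2 - ((bin_width - 0.4) / 0.4) ** 2


# Assumed bounds: tolerance in [1, 50], bin width in [0.01, 1.0].
best_params, best_score = pso_maximize(toy_score, [(1.0, 50.0), (0.01, 1.0)])
print(best_params, best_score)
```

Because each score evaluation would correspond to a full search run, the swarm size and iteration count set the cost of tuning directly.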
Because search engines differ in coverage and their results overlap, combining the results of multiple search engines is proposed to increase the reliability of identification. However, although implementation in cloud or distributed environments is straightforward, the transfer of MS/MS spectra among the systems’ various units is a major concern and should be handled carefully. The impact of transferring the raw spectra to the computing nodes is presented in this research, and two different approaches, one using peak sampling and one using machine learning, have been developed as spectra reduction methods. The results of both solutions show significant reduction in MS/MS data size, thereby mitigating unnecessary communication and computation, while also potentially reducing cost in cloud-based pay-as-you-go environments.
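One simple form of peak sampling can be sketched as follows. The abstract does not specify the sampling scheme, so keeping the `n` most intense peaks is an assumption made for illustration, as are the function names and the `(m/z, intensity)` tuple representation.

```python
import heapq


def top_n_peaks(peaks, n=50):
    """Keep the n most intense (mz, intensity) peaks, returned in m/z order."""
    kept = heapq.nlargest(n, peaks, key=lambda peak: peak[1])
    return sorted(kept)


def reduction_ratio(original, reduced):
    """Fraction of peaks removed, a rough proxy for transfer savings."""
    return 1.0 - len(reduced) / len(original) if original else 0.0


# Hypothetical four-peak spectrum, reduced to its two strongest peaks.
spectrum = [(100.0, 5.0), (150.0, 50.0), (200.0, 1.0), (250.0, 20.0)]
reduced = top_n_peaks(spectrum, n=2)
print(reduced, reduction_ratio(spectrum, reduced))
```

Applying such a reduction before spectra are shipped to the computing nodes is what would cut communication volume; whether identification quality is preserved has to be checked empirically.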
Maabreh, Majdi Ahmad Mosa, "A Holistic Computational Approach to Boosting the Performance of Protein Search Engines" (2018). Dissertations. 3221.