Date of Award


Degree Name

Doctor of Philosophy


Computer Science

First Advisor

Dr. Fahad Saeed

Second Advisor

Dr. Ajay Gupta

Third Advisor

Dr. Alvis Fong

Fourth Advisor

Dr. Seung-Hee Bae


proteomics, mass-spectrometry, high-performance computing, GPU, simulation, dimensionality-reduction


Mass spectrometry (MS)-based proteomics utilizes high-performance liquid chromatography in tandem with high-throughput mass spectrometers. These experiments produce MS data sets at a speed and volume that can easily reach the petascale, creating storage and computational problems for large-scale systems biology studies. Each spectrum output by a mass spectrometer may consist of thousands of peaks, all of which must be processed to deduce the corresponding peptide. However, only a small percentage of the peaks in a spectrum are useful for further processing; most are either noise or otherwise uninformative. Our experiments have shown that 90 to 95% of the peaks are not required for reliable results. Processing them is therefore largely redundant and hinders high-throughput processing of big MS data. Existing pre-processing algorithms for noise removal or spectral denoising are limited in their data-reduction capability and are compute intensive; in most cases these pre-processing stages create an additional compute bottleneck in the proteomics software pipeline.

One way to attack this problem is to develop data-aware algorithms that minimize redundant computation. In addition, the continuous increase in the speed and size of proteomics data calls for high-performance computing solutions. In this study we propose a new data-reduction algorithm that exploits the high noise content of MS/MS data and uses a weighted random sampling technique to drastically reduce the number of computations. Our results show a speed gain of over 100x relative to existing tools, with comparable accuracy on experimental data. To support rapid adoption and development of high-performance computing solutions in proteomics, and in big data studies in general, we introduce a template-based strategy for developing optimized GPU-based algorithms for omics data. The proposed template outlines generic methods for tackling critical GPU-centric bottlenecks and details how to implement optimized, scalable GPU algorithms for a given big data problem. As a case study, we demonstrate the template by implementing a GPU version of the proposed data-reduction algorithm.
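To make the idea of weighted random sampling over spectrum peaks concrete, the following is a minimal Python sketch, not the dissertation's algorithm. It assumes that peak intensity serves as the sampling weight and uses the Efraimidis-Spirakis key method for sampling without replacement; the function name `sample_peaks`, the `keep_fraction` parameter, and the default of 10% are hypothetical choices for illustration.

```python
import heapq
import math
import random


def sample_peaks(peaks, keep_fraction=0.1, seed=None):
    """Keep a weighted random subset of (mz, intensity) peaks.

    Assumption: intensity is used as the sampling weight, so intense
    peaks are more likely to survive. This is an illustrative sketch,
    not the algorithm proposed in the dissertation.
    """
    rng = random.Random(seed)
    k = max(1, int(len(peaks) * keep_fraction))
    keyed = []
    for mz, intensity in peaks:
        if intensity <= 0:
            continue  # zero-weight peaks can never be selected
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
        # Efraimidis-Spirakis: ranking by log(u) / w is equivalent to
        # ranking by u ** (1 / w), but numerically safer.
        keyed.append((math.log(u) / intensity, mz, intensity))
    kept = heapq.nlargest(k, keyed)
    # Return the surviving peaks in m/z order, as in a normal spectrum.
    return sorted((mz, intensity) for _, mz, intensity in kept)


if __name__ == "__main__":
    # 90 weak peaks and 10 strong peaks; keep roughly 10% of them.
    spectrum = [(100.0 + i, 1.0) for i in range(90)]
    spectrum += [(500.0 + i, 1000.0) for i in range(10)]
    print(sample_peaks(spectrum, keep_fraction=0.1, seed=1))
```

Selecting the top `k` keys with a heap keeps the cost at O(n log k) per spectrum, and because each peak's key is computed independently, the same scheme maps naturally onto one thread per peak in a GPU setting.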

This study also explores methods for benchmarking novel proteomics algorithms and introduces a highly configurable data simulator that generates user-controlled ground-truth data for assessing new algorithms.
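As a rough illustration of what ground-truth simulation can look like, the sketch below generates a labeled MS/MS spectrum for a known peptide: singly charged b- and y-ion peaks are marked as signal and uniformly random peaks as noise. It is a hypothetical minimal example, not the simulator described in the dissertation; the function names, intensity ranges, and the `noise_peaks` parameter are assumptions made for illustration. The residue masses are standard monoisotopic values.

```python
import random

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_mzs(peptide):
    """Return sorted m/z values of singly charged b and y ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    ions = []
    prefix = 0.0
    for m in masses[:-1]:  # one b/y pair per peptide bond
        prefix += m
        ions.append(prefix + PROTON)                  # b ion
        ions.append(total - prefix + WATER + PROTON)  # y ion
    return sorted(ions)


def simulate_spectrum(peptide, noise_peaks=50, seed=None):
    """Return sorted (mz, intensity, is_signal) tuples.

    Assumed, illustrative intensity model: signal peaks are drawn
    from [50, 100] and noise peaks from [1, 20].
    """
    rng = random.Random(seed)
    upper = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON
    peaks = [(mz, rng.uniform(50.0, 100.0), True)
             for mz in fragment_mzs(peptide)]
    peaks += [(rng.uniform(50.0, upper), rng.uniform(1.0, 20.0), False)
              for _ in range(noise_peaks)]
    return sorted(peaks)


if __name__ == "__main__":
    for mz, intensity, is_signal in simulate_spectrum("PEPTIDE", 5, seed=0):
        print(f"{mz:10.4f} {intensity:7.2f} {'signal' if is_signal else 'noise'}")
```

Because every peak carries an `is_signal` label, a data-reduction method can be scored directly by how many signal peaks it retains and how many noise peaks it discards.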

Access Setting

Dissertation-Open Access