Publication Date
2016
Document Type
Technical Report
Abstract
Modern proteomics studies utilize high-throughput mass spectrometers that can produce data at an astonishing rate. These big Mass Spectrometry (MS) datasets can easily reach peta-scale, creating storage and analytic problems for large-scale systems biology studies. Each spectrum consists of thousands of peaks that have to be processed to deduce the peptide. However, only a small percentage of peaks in a spectrum are useful for peptide deduction; most of the peaks are either noise or not informative for a given spectrum. This redundant processing of non-useful peaks is a bottleneck for streaming high-throughput processing of big MS data. One way to reduce the amount of computation required in a high-throughput environment is to eliminate non-useful peaks. Existing noise-removal algorithms are limited in their data-reduction capability and are compute-intensive, making them unsuitable for big data and high-throughput environments. In this paper we introduce a novel low-complexity data-reduction strategy for the analysis of big MS data, based on classification, quantization, and sampling of MS peaks. Our algorithm, called MS-REDUCE, is capable of eliminating noisy peaks as well as peaks that do not contribute to peptide deduction before any peptide deduction is attempted. Our experiments have shown up to 100x speedup over existing state-of-the-art noise elimination algorithms while maintaining comparably high-quality matches. Using our approach we were able to process a million spectra in just under an hour on a moderate server. The developed tool and strategy will be made available to the wider proteomics and parallel computing communities at the authors' webpages after the paper is published.
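The abstract describes a pipeline of classifying peaks by intensity, quantizing them into classes, and sampling each class. The following is a minimal illustrative sketch of that general idea only, not the authors' MS-REDUCE implementation; the function name reduce_spectrum, the number of intensity classes, and the per-class sampling rates are assumptions made purely for illustration.

```python
# Illustrative sketch of intensity-based quantization and per-class sampling
# of spectrum peaks. NOT the MS-REDUCE algorithm itself: class count and
# sampling rates are hypothetical placeholders.
import random
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def reduce_spectrum(peaks: List[Peak],
                    num_classes: int = 4,
                    sampling_rates: Tuple[float, ...] = (0.1, 0.2, 0.5, 1.0),
                    seed: int = 0) -> List[Peak]:
    """Quantize peaks into intensity classes, then randomly sample each class.

    Low-intensity classes (mostly noise) are sampled sparsely; the
    highest-intensity class is kept (almost) in full.
    """
    if not peaks:
        return []
    rng = random.Random(seed)
    intensities = [inten for _, inten in peaks]
    lo, hi = min(intensities), max(intensities)
    width = (hi - lo) / num_classes or 1.0  # avoid zero width for flat spectra

    # Classification / quantization: assign each peak to an intensity class.
    classes: List[List[Peak]] = [[] for _ in range(num_classes)]
    for mz, inten in peaks:
        idx = min(int((inten - lo) / width), num_classes - 1)
        classes[idx].append((mz, inten))

    # Sampling: keep a rate-dependent random subset of each class.
    reduced: List[Peak] = []
    for cls, rate in zip(classes, sampling_rates):
        k = max(1, int(round(rate * len(cls)))) if cls else 0
        reduced.extend(rng.sample(cls, k))
    return sorted(reduced)  # return retained peaks ordered by m/z


if __name__ == "__main__":
    demo_rng = random.Random(1)
    demo = [(100.0 + i * 0.1, demo_rng.uniform(1.0, 100.0)) for i in range(1000)]
    kept = reduce_spectrum(demo)
    print(f"kept {len(kept)} of {len(demo)} peaks")
```

Under such a scheme, most of the data reduction comes from aggressively down-sampling the low-intensity classes, where noise peaks concentrate, while high-intensity peaks that drive peptide deduction are largely preserved.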