With the advent of high-throughput next-generation sequencing (NGS) techniques, the amount of data being generated presents challenges including the storage, analysis, and transport of huge datasets. One solution for storing and transmitting data is compression using specialized compression algorithms. However, these specialized algorithms scale poorly as dataset size increases, and the best available solutions can take hours to compress gigabytes of data. In this paper we introduce paraDSRC, a parallel implementation of the DSRC algorithm using a message-passing model that reduces the compression time complexity by a factor of O(1/p). Our experimental results show that paraDSRC achieves compression times that are 43% to 99% faster than DSRC and compression throughputs of up to 8.4 GB/s on a moderate-size cluster. For many of the datasets used in our experiments, super-linear speedups were registered, making the implementation strongly scalable. We also show that paraDSRC is more than 25.6x faster than comparable parallel compression algorithms. The code will be made available on the authors' website if the paper is accepted.
Sandino N. V. Perez and Fahad Saeed, "A Parallel Algorithm for Compression of Big Next-Generation Sequencing (NGS) Datasets", IEEE International Workshop on Parallelism in Bioinformatics (PBio), Proceedings of Parallel and Distributed Processing with Applications (IEEE ISPA-15), Helsinki, Finland, August 2015.