A Hybrid MPI-OpenMP Strategy to Speedup the Compression of Big Next-Generation Sequencing Datasets
DNA sequencing has moved into the realm of Big Data due to the rapid development of high-throughput, low cost Next-Generation Sequencing (NGS) technologies. Sequential data compression solutions that once were sufficient to efficiently store and distribute this information are now falling behind. In this paper we introduce phyN- GSC, a hybrid MPI-OpenMP strategy to speedup the compression of big NGS data by combining the features of both distributed and shared memory architectures. Our algorithm balances work-load among processes and threads, alleviates memory latency by exploiting locality, and accelerates I/O by reducing excessive read/write operations and inter-node message exchange. To make the algorithm scalable, we introduce a novel timestamp-based file structure that allows us to write the compressed data in a distributed and non-deterministic fashion while retaining the capability of reconstruct- ing the dataset with its original order. Our experimental results show that phyNGSC achieved compression times for big NGS datasets that were 45% to 98% faster than NGS-specific sequential compressors with throughputs of up to 3GB/s. Our theoretical analysis and experimental results suggest strong scalability with some datasets yielding super-linear speedups and constant efficiency. We were able to compress 1 terabyte of data in under 8 minutes compared to more than 5 hours taken by NGS-specific compression algorithms running sequentially. Compared to other parallel solutions, phyNGSC achieved up to 6x speedups while maintaining a higher compression ratio. The code for this implementation is available at https://github.com/pcdslab/PHYNGSC.