Publication Date

Fall 2017

Document Type

Technical Report

Abstract

It is now possible to compress and decompress large-scale Next-Generation Sequencing (NGS) files by taking advantage of high-performance computing techniques. To this end, we have recently introduced a scalable hybrid parallel algorithm, called phyNGSC, which allows fast compression as well as decompression of big FASTQ datasets using distributed and shared memory programming models via MPI and OpenMP. In this paper, we present the design and implementation of a novel parallel data structure that lessens the dependency on decompression and facilitates the handling of DNA sequences in their compressed state, using fine-grained decompression in a technique identified as in compresso data processing. Using our data structure, compression and decompression throughputs of up to 8.71 GB/s and 10.12 GB/s, respectively, were observed. Our proposed structure and methodology bring us one step closer to compressive genomics and sublinear analysis of big NGS datasets. The code for this implementation is available at https://github.com/pcdslab/PHYNGSD
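
The idea of fine-grained decompression mentioned in the abstract can be illustrated with a minimal sketch. This is not the phyNGSC data structure or its on-disk format, which the abstract does not describe; it only assumes a generic scheme in which reads are compressed in independent blocks and an offset index allows a single block to be decompressed on demand. The function names `compress_blocks` and `fetch_read`, the block size, and the use of zlib are all hypothetical choices made for illustration.

```python
import zlib


def compress_blocks(reads, block_size=2):
    """Compress reads in independent blocks; return (blob, index).

    Illustrative only: zlib and the (offset, length) index are assumptions,
    not the format used by phyNGSC.
    """
    blob = bytearray()
    index = []  # one (offset, length) pair per compressed block
    for start in range(0, len(reads), block_size):
        chunk = "\n".join(reads[start:start + block_size]).encode("ascii")
        packed = zlib.compress(chunk)
        index.append((len(blob), len(packed)))
        blob.extend(packed)
    return bytes(blob), index


def fetch_read(blob, index, read_no, block_size=2):
    """Decompress only the block that holds read number read_no."""
    block_no, within = divmod(read_no, block_size)
    offset, length = index[block_no]
    chunk = zlib.decompress(blob[offset:offset + length])
    return chunk.decode("ascii").split("\n")[within]


# Toy DNA reads (made up for the example, not taken from any dataset).
reads = ["ACGTACGT", "TTGACCAA", "GGGCCCAT", "ATATATCG", "CAGTCAGT"]
blob, index = compress_blocks(reads)
print(fetch_read(blob, index, 3))
```

Because each block is self-contained, retrieving one read touches only the bytes of its block, and separate blocks could in principle be handled by separate processes or threads.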

Published Citation

Sandino Vargas-Pérez, Fahad Saeed, "Scalable Data Structure to Compress Next-Generation Sequencing Files and its Application to Compressive Genomics", Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2017), Kansas City, MO, USA, November 13-16, 2017.
