A Hybrid Parallel Approach to High-Performance Compression of Big Genomic Files and in Compresso Data Processing
Date of Award
Doctor of Philosophy
Dr. Fahad Saeed
Dr. Ajay Gupta
Dr. Todd Barkman
IIPC, FASTQ, parallel, compression, genomics, Big Data
Due to the rapid development of high-throughput low cost Next-Generation Sequencing, genomic file transmission and storage is now one of the many Big Data challenges in computer science. Highly specialized compression techniques have been devised to tackle this issue, but sequential data compression has become increasingly inefficient and existing parallel algorithms suffer from poor scalability. Even the best available solutions can take hours to compress gigabytes of data, making the use of these techniques for large-scale genomics prohibitively expensive in terms of time and space complexity.
This dissertation responds to the aforementioned problem by presenting a novel hybrid parallel approach to speed up the compression of big genomic datasets by combining the features of both distributed and shared memory architectures and parallel programing models. The algorithm that the approach relies on has been developed with several goals in mind: to balance the work load among processes and threads, to alleviate memory latency by exploiting locality, and to accelerate I/O by reducing excessive read/write operations and inter-node message exchange. To make the algorithm scalable, an innovative timestampbased file structure was designed. It allows the compressed data to be written in a distributed and non-deterministic fashion while retaining the capability of decompressing the dataset back to its original state. In an effort to lessen the dependency on decompression, the proposed file structure facilitates the handling of DNA sequences in their compressed state by using fine-grained decompression in a technique that is identified as in compresso data manipulation.
Theoretical analysis and experimental results from this research suggest strong scalability, with many datasets yielding super-linear speedups and constant efficiency. Performance measurements were executed to test the limitations imposed by Amdahl’s law when doubling the number of processing units. The compression of a FASTQ file of size 1 terabyte took less than 3.5 minutes, reporting a 90% time decrease against compression algorithms with multithreaded parallelism and more than 98% time decrease for those running sequential code—i.e., 10 to 50 times faster, respectively—improving the execution time proportionally to the added system resources. In addition, the proposed approach achieved better compression ratios by employing an entropy-based encoder optimized to work close to its Shannon entropy and a dictionary-based encoder with variable-length codes. Consequently, in compresso data manipulation was accelerated for FASTQ to FASTA format conversion, basic FASTQ file statistics, and DNA sequence pattern finding, extraction, and trimming. Findings of this research provide evidence that hybrid parallelism can also be implemented using CPU+GPU models, potentially increasing the computational power and portability of the approach.
Vargas Pérez, Sandino N., "A Hybrid Parallel Approach to High-Performance Compression of Big Genomic Files and in Compresso Data Processing" (2017). Dissertations. 3189.