Date of Award

12-2017

Degree Name

Doctor of Philosophy

Department

Computer Science

First Advisor

Dr. Fahad Saeed

Second Advisor

Dr. Elise DeDoncker

Third Advisor

Dr. Leszek Lilien

Fourth Advisor

Dr. Todd Barkman

Keywords

DNA, Internet of Things, Big Genomic Data, Network Transfer Protocols, Deep Learning, Data Encoding

Abstract

In the age of Big Genomics Data, institutes such as the National Human Genome Research Institute (NHGRI), the 1000 Genomes Project, and the international cancer sequencing consortium face the challenge of sharing large volumes of data among internationally dispersed sample collectors, data analyzers, and researchers, a process that until now has been plagued by unreliable transfers and slow connection speeds. These problems stem from the inherent throughput bottlenecks of traditional transfer technologies. One suggested solution is to use the cloud as an infrastructure for addressing the storage and analysis challenges. However, transferring and sharing genomics datasets between biological laboratories and to or from the cloud remains a bottleneck because of the sheer amount of data and the limitations of network bandwidth. Transfer challenges can therefore be addressed either by increasing the bandwidth or by minimizing the data size during the transfer phase.

One way to increase the efficiency of data transmission is to increase the bandwidth, which is not always possible due to resource limitations. Another way to maximize channel capacity utilization is to decrease the number of bits that must be transmitted for a given dataset. Traditionally, big genomics datasets are transmitted between two geographical locations using general-purpose protocols such as the hypertext transfer protocol (HTTP) and the file transfer protocol (FTP). This dissertation presents a novel deep learning-based data minimization algorithm that aims to: 1) minimize the datasets during transfer over the carrier channels; and 2) protect the data from man-in-the-middle (MITM) and other attacks by changing the binary representation (codewords) of the same dataset several times.

This data minimization strategy exploits the limited alphabet of DNA sequences and modifies the binary representation (codewords) of dataset characters using deep learning-based random sampling that builds on convolutional neural networks (CNNs) and Fourier transform theory. The algorithm transmits big genomics datasets with minimal bits and latency, making the process more efficient and expedient. To evaluate this approach, extensive real and simulated tests were conducted on various genomics datasets. Results indicate that the proposed data minimization algorithm is up to 99-fold faster and more secure than the current HTTP data-encoding scheme, and 96-fold faster than FTP, on the tested datasets.
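To make the core idea concrete, below is a minimal Python sketch of fixed-length codeword re-encoding over the four-letter DNA alphabet. It is illustrative only: the dissertation selects codewords through deep learning-based (CNN and Fourier transform) sampling, whereas this sketch substitutes a plain random shuffle for that sampling step, and the function names are hypothetical.

# Illustrative sketch only: a 2-bit re-encoding of DNA characters with a
# per-transfer randomized codeword table. The dissertation's algorithm picks
# codewords via deep learning-based (CNN + Fourier) sampling; random.shuffle
# stands in for that sampling step here.
import random

ALPHABET = ["A", "C", "G", "T"]  # the limited DNA alphabet the scheme exploits

def make_codebook(seed=None):
    """Assign the four 2-bit codewords (00, 01, 10, 11) to A/C/G/T at random.
    Regenerating this table for each transfer changes the binary representation
    of the same dataset, which is what frustrates a MITM observer."""
    rng = random.Random(seed)
    codes = [0b00, 0b01, 0b10, 0b11]
    rng.shuffle(codes)
    return dict(zip(ALPHABET, codes))

def encode(sequence, codebook):
    """Pack 4 bases per byte (2 bits each) instead of 8 bits per ASCII char."""
    out = bytearray()
    acc, nbits = 0, 0
    for base in sequence:
        acc = (acc << 2) | codebook[base]
        nbits += 2
        if nbits == 8:
            out.append(acc)
            acc, nbits = 0, 0
    if nbits:  # pad the final partial byte
        out.append(acc << (8 - nbits))
    return bytes(out)

seq = "ACGTACGTACGTACGT"
book = make_codebook()
packed = encode(seq, book)
print(f"ASCII size: {len(seq)} bytes, packed size: {len(packed)} bytes")
# -> ASCII size: 16 bytes, packed size: 4 bytes

Because four bases fit in one byte, the packed form is one quarter of the ASCII size before any further minimization, and an eavesdropper who intercepts the bytes without the per-transfer codebook cannot map bits back to bases.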

Access Setting

Dissertation-Open Access
