Abstract:
Over the last two decades, a huge amount of data has been produced worldwide by various sources. Genomic data is one of the main sources of this data, termed Big Data.
Next-Generation Sequencing (NGS) machines produce up to six billion base pairs per run in a very cost-effective manner. Currently, the main challenge is to process this huge volume of genomic data to extract relevant information. In extracting relevant information
from this genomic Big Data, alignment and analysis are the two most important tasks. In this thesis, we present accurate and efficient distributed sequence alignment and analysis algorithms.
To tackle the problem of efficient sequence alignment, two distributed sequence alignment algorithms, AVLR-Mapper and StreamAligner, are proposed and implemented using the Big Data framework Apache Spark. AVLR-Mapper is the first sequence aligner with a distributed index generation approach. It uses an efficient partitioning-based search mechanism to reduce computation during read mapping, and it outperforms most of the state-of-the-art sequence alignment algorithms in terms of accuracy and performance. StreamAligner is the first sequence aligner that can align reads directly as a stream as sequencing machines produce them, and output interesting patterns after alignment and analysis. It has great scope for automating the sequencing, alignment, and visualization (or analysis) process in the future. It showed better execution time (speedup of up to 9.97x) due to better load balancing and its stream processing engine. AVLR-Mapper and StreamAligner are implemented on Apache Spark and evaluated on the IIT Mandi local cluster as well as the Amazon EC2 cloud. The source code, written in Java, is available on GitHub.
To analyze large genomic datasets, three distributed association rule mining algorithms, Reduced-Apriori (R-Apriori), Adaptive-Apriori (A-Apriori), and Flink-Apriori (F-Apriori), are proposed and implemented using the Big Data frameworks Apache Spark and Apache Flink. R-Apriori and A-Apriori are implemented on Apache Spark. R-Apriori uses a reduced approach for the second iteration of the Apriori algorithm, which minimizes computation to a great extent, and it outperforms conventional Apriori in terms of accuracy and efficiency. A-Apriori uses an adaptive approach: before every iteration, it decides whether to use the reduced or the conventional Apriori approach based on precomputations. A-Apriori performs better than R-Apriori and conventional Apriori on all datasets. F-Apriori uses Apache Flink to handle the iterative computations in Apriori and outperforms the other association rule mining algorithms in terms of performance. All of these association rule mining algorithms are written in Scala and evaluated on a local cluster as well as the Amazon EC2 cloud. They are used to analyze large genome datasets and extract interesting patterns from them, and they can be used in bioinformatics applications such as cancer detection, SNP discovery, motif discovery, and clustering. In summary, this thesis presents the architecture, algorithms, and implementation of two distributed sequence alignment algorithms and three distributed association rule mining algorithms, aimed at helping bioinformatics scientists reduce the time and cost of alignment and analysis of large genomic datasets.