Distributed algorithms on big data frameworks for alignment and analysis of big data generated by next-generation sequencing (PhD)

Rathee, Sanjay

dc.contributor.advisor	Dr. Arti Kashyap
dc.contributor.author	Rathee, Sanjay
dc.date.accessioned	2020-07-08T09:26:36Z
dc.date.available	2020-07-08T09:26:36Z
dc.date.issued	2018-09-25
dc.identifier.uri	http://hdl.handle.net/123456789/281
dc.description	A dissertation submitted for the award of the degree of Doctor of Philosophy under the guidance of Dr. Arti Kashyap (Faculty, SCEE)	en_US
dc.description.abstract	During the last two decades, a huge amount of data has been produced worldwide by various sources. Genomic data is one of the main sources of this huge data, termed Big Data. Next-Generation Sequencing (NGS) machines produce up to six billion base pairs per run in a very cost-effective manner. Currently, the main challenge is to process this huge genomic data to extract relevant information. In extracting relevant information from this genomic Big Data, alignment and analysis are the two most important tasks. In this thesis, we present accurate and efficient distributed sequence alignment and analysis algorithms. To tackle the problem of efficient sequence alignment, two distributed sequence alignment algorithms, AVLR-Mapper and StreamAligner, are proposed and implemented using the Big Data framework Apache Spark. AVLR-Mapper is the first sequence aligner with a distributed index generation approach. AVLR-Mapper uses an efficient search mechanism based on partitioning to reduce computation during read mapping. It outperforms most of the state-of-the-art sequence alignment algorithms in terms of accuracy and performance. StreamAligner is the first sequence aligner that can directly align the output of sequencing machines as a stream and output interesting patterns after alignment and analysis. It has great future scope for automating the sequencing, alignment, and visualization (or analysis) process. It showed better execution time (speedup of up to 9.97x) owing to better load balancing and its stream processing engine. AVLR-Mapper and StreamAligner are implemented on Apache Spark and evaluated on the IIT Mandi local cluster as well as the Amazon EC2 cloud. The source code, written in Java, is available on GitHub.
To analyze large genomic datasets, three distributed association rule mining algorithms, Reduced-Apriori (R-Apriori), Adaptive-Apriori (A-Apriori), and Flink-Apriori (F-Apriori), are proposed and implemented using the Big Data frameworks Apache Spark and Apache Flink. R-Apriori and A-Apriori are implemented on Apache Spark. R-Apriori uses a reduced approach for the second iteration of the Apriori algorithm, which minimizes computation to a great extent. R-Apriori outperforms conventional Apriori in terms of accuracy and efficiency. A-Apriori uses an adaptive approach: before every iteration, it decides, based on precomputations, whether to use the reduced or the conventional Apriori approach. A-Apriori performs better than R-Apriori and conventional Apriori on all datasets. F-Apriori uses Apache Flink to handle the iterative computations in Apriori and outperforms all the other association rule mining algorithms in terms of performance. All these association rule mining algorithms are written in Scala and evaluated on a local cluster as well as the Amazon EC2 cloud. They are used to analyze large genome datasets and extract interesting patterns from them, and can be applied in Bioinformatics applications such as cancer detection, SNP discovery, motif discovery, and clustering. In summary, this thesis presents the architecture, algorithms, and implementations of two distributed sequence alignment algorithms and three distributed association rule mining algorithms, aimed at helping bioinformatics scientists reduce the time and cost of alignment and analysis of large genomic datasets.
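For orientation, the conventional Apriori baseline that the abstract compares against can be sketched as follows. This is a minimal single-machine sketch in Python, not the thesis's Scala code and not R-Apriori, A-Apriori, or F-Apriori, whose details the abstract does not give. The function name `apriori`, the `min_support` parameter, and the toy transactions are assumptions made for illustration only.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    Hypothetical helper for illustration; min_support is an absolute count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Iteration 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets, then prune any
        # candidate that has an infrequent (k-1)-subset.
        prev = set(frequent)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)

        # Support counting: one pass over the transactions per iteration.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1

    return result
```

Calling `apriori([{"A", "B", "C"}, {"A", "B"}, {"A", "C"}], 2)` returns the support counts of every itemset that occurs in at least two transactions. In this baseline the second iteration joins every pair of frequent single items, which is typically where the candidate set is largest; that is the iteration the abstract says R-Apriori targets.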
dc.language.iso	en_US	en_US
dc.publisher	IIT Mandi	en_US
dc.subject	Bioinformatics	en_US
dc.subject	Reference Preprocessing	en_US
dc.title	Distributed algorithms on big data frameworks for alignment and analysis of big data generated by next-generation sequencing (PhD)	en_US
dc.type	Thesis	en_US