Distributed algorithms on big data frameworks for alignment and analysis of big data generated by next-generation sequencing (PhD)

dc.contributor.advisor Dr. Arti Kashyap
dc.contributor.author Rathee, Sanjay
dc.date.accessioned 2020-07-08T09:26:36Z
dc.date.available 2020-07-08T09:26:36Z
dc.date.issued 2018-09-25
dc.identifier.uri http://hdl.handle.net/123456789/281
dc.description A dissertation submitted for the award of the degree of Doctor of Philosophy under the guidance of Dr. Arti Kashyap (Faculty, SCEE) en_US
dc.description.abstract Over the last two decades, a huge amount of data has been produced worldwide by various sources. Genomic data is one of the main sources of this data, termed Big Data. Next-Generation Sequencing (NGS) machines produce up to six billion base pairs per run in a very cost-effective manner. Currently, the main challenge is to process this huge volume of genomic data to extract relevant information. In extracting relevant information from this genomic Big Data, alignment and analysis are the two most important tasks. In this thesis, we present highly accurate and efficient distributed sequence alignment and analysis algorithms. To tackle the problem of efficient sequence alignment, two distributed sequence alignment algorithms, AVLR-Mapper and StreamAligner, are proposed and implemented using the Big Data framework Apache Spark. AVLR-Mapper is the first sequence aligner with a distributed index generation approach. It uses an efficient partitioning-based search mechanism to reduce computation during read mapping. It outperforms most of the state-of-the-art sequence alignment algorithms in terms of accuracy and performance. StreamAligner is the first sequence aligner that can directly align the output of sequencing machines as a stream and output interesting patterns after alignment and analysis. It has great scope for automating the sequencing, alignment, and visualization (or analysis) process in the future. It showed better execution time (speedup of up to 9.97x) due to better load balancing and its stream processing engine. AVLR-Mapper and StreamAligner are implemented on Apache Spark and evaluated on the IIT Mandi local cluster as well as the Amazon EC2 cloud. Source code written in Java is available on GitHub.
To analyze large genomic datasets, three distributed association rule mining algorithms, Reduced-Apriori (R-Apriori), Adaptive-Apriori (A-Apriori), and Flink-Apriori (F-Apriori), are proposed and implemented using the Big Data frameworks Apache Spark and Apache Flink. R-Apriori and A-Apriori are implemented on Apache Spark. R-Apriori uses a reduced approach for the second iteration of the Apriori algorithm, which minimizes computation to a great extent. R-Apriori outperforms conventional Apriori in terms of accuracy and efficiency. A-Apriori uses an adaptive approach: before every iteration, it decides on the basis of precomputations whether to use the reduced or the conventional Apriori approach. A-Apriori performs better than R-Apriori and conventional Apriori for all datasets. F-Apriori uses Apache Flink to handle the iterative computations in Apriori and outperforms all the other association rule mining algorithms in terms of performance. All these association rule mining algorithms are written in Scala and evaluated on a local cluster as well as the Amazon EC2 cloud. They are used to analyze large genome datasets and extract interesting patterns from them, and they can be used in bioinformatics applications such as cancer detection, SNP discovery, motif discovery, and clustering. In summary, this thesis presents the architecture, algorithms, and implementation of two distributed sequence alignment algorithms and three distributed association rule mining algorithms, aimed at helping bioinformatics scientists reduce the time and cost of alignment and analysis of large genomic datasets.
dc.language.iso en_US en_US
dc.publisher IITMandi en_US
dc.subject Bioinformatics en_US
dc.subject Reference Preprocessing en_US
dc.title Distributed algorithms on big data frameworks for alignment and analysis of big data generated by next-generation sequencing (PhD) en_US
dc.type Thesis en_US
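
For readers unfamiliar with the conventional Apriori algorithm that the abstract uses as its baseline, the following is a minimal single-machine sketch in Python. It is not the thesis code (which the abstract says is written in Scala on Apache Spark and Flink) and it does not implement the reduced or adaptive variants (R-Apriori, A-Apriori). The function name `apriori`, the `min_support` parameter, and the toy transactions are assumptions made for illustration only.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets.

    Illustrative, non-distributed baseline; min_support is an absolute count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Iteration 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: scan all transactions for each surviving candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result


if __name__ == "__main__":
    # Toy transactions (hypothetical item labels, not real genomic data).
    toy = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
    for itemset, support in sorted(apriori(toy, 3).items(), key=lambda kv: sorted(kv[0])):
        print(sorted(itemset), support)
```

In this sketch the count step compares every candidate against every transaction on each iteration. The second iteration typically has the most candidates, and that is the iteration the abstract says R-Apriori targets with its reduced approach.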

