Benchmarking distributed stream processing frameworks for classical machine learning applications (MS Dissertation)

Sundar, Merlin

URI: http://hdl.handle.net/123456789/427

Date: 2021-04

Abstract:

In India, each of the large telecom service providers serves 100 million to 400 million subscribers. Each telecommunications network may contain hundreds or more different types of network devices transmitting data to each other and to the subscriber. In this scenario, a Network Management System (NMS) may collect millions of records per second. Network faults occur continuously in real time, some of low priority and others of high priority. Hence, it becomes imperative to analyze the data in real time so that high-priority network faults can be managed as they occur. In view of the growing complexity of the network and the rapid changes in the demands on it, machine learning (ML) techniques are being used for advanced NMS. ML models are typically computationally intensive, involving training and testing phases. To handle the huge volume of data streaming at high velocity, we require not only powerful machines but also mechanisms to distribute the computation across multiple nodes. There are several open-source distributed stream processing frameworks, such as Apache Storm, Apache Flink, Apache Spark and Confluent Kafka, for building real-time machine learning applications. Prior works benchmarked some of these platforms using low-level operations such as filters, joins and windowed computations. In this thesis, we first survey multiple Distributed Stream Processing Frameworks (DSPFs) qualitatively to choose appropriate frameworks, and also a Message Queuing Application for ordered message delivery. Once the platforms are decided, we benchmark our four chosen DSPFs for their suitability for executing classical machine learning models. For variety in computational complexity, we have chosen three classical machine learning models: Online K-Means, Online Linear Regression and Online Logistic Regression.
We study the following quantitative evaluation metrics: throughput, latency, CPU utilization, memory usage and input/output usage. The experiments were conducted in both standalone and cluster setups to determine the scalability of the models. In this study, we found that all four frameworks are comparable, except that Apache Spark performs marginally better than the others in the standalone setup for all algorithms, whereas in the cluster setup the best-performing framework varies between Apache Storm and Apache Spark. We have also observed speedup across different setups. These results can help system designers choose the right model and the right framework, given a specific configuration of streaming data. Later in the thesis, we also discuss a direct application based on the benchmarking experiment, called the Configuration Planner. This Configuration Planner is designed to recommend a server configuration to a telecom network administrator based on the size of the network to be managed. We describe in detail the parameters involved, the design overview and the data structures of each component of the planner. This thesis covers the design of the planner and a possible user interface for it.

Description:

A thesis submitted for the award of the degree of Doctor of Philosophy under the guidance of Dr. Timothy A. Gonsalves and Dr. Sriram Kailasam (Faculty, SCEE)
