Abstract:
In India, each of the large telecom service providers serves between 100 million
and 400 million subscribers. Each telecommunications network may contain hundreds
or more different types of network devices transmitting data to one another and
to subscribers. In this scenario, a Network Management System (NMS)
may collect millions of records per second. Network faults occur continuously,
some of low priority and others of high priority. It is therefore imperative to
analyze the data in real time so that high-priority faults can be managed as
they occur. In view of the growing complexity of the network and the rapid
changes in the demands placed on it, machine learning (ML) techniques are being
used for advanced NMS. ML models are typically computationally intensive,
involving training and testing phases. To handle the huge volume
of data streaming at high velocity, we not only require powerful machines but also
mechanisms to distribute the computation involved across multiple nodes. There
are several open-source distributed stream processing frameworks, such as Apache
Storm, Apache Flink, Apache Spark and Confluent Kafka, for building real-time
machine learning applications. Prior work has benchmarked some of these platforms
using low-level operations such as filters, joins and windowed computations.
In this thesis, we first survey multiple Distributed Stream Processing Frameworks
(DSPFs) qualitatively in order to choose appropriate frameworks, together with a
message queuing application for ordered message delivery. Once the platforms are
chosen, we benchmark the four selected DSPFs for their suitability for executing
classical machine learning
models. To vary the complexity of the computation, we chose three classical
machine learning models: Online K-Means, Online Linear Regression and Online
Logistic Regression. We study the following quantitative evaluation metrics:
throughput, latency, CPU utilization, memory usage and input/output (I/O) usage. The
experiments were conducted in both standalone and cluster setups to determine
the scalability of the models. We found that all four frameworks are
comparable, except that Apache Spark performs marginally better than the others
in the standalone setup for all algorithms, whereas in the cluster setup the
best-performing framework varies between Apache Storm and Apache Spark. We also
observed speedup across different setups. These results can help system designers
choose the right model and the right framework, given a specific configuration of
streaming data.
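To illustrate the kind of per-record model update that such a benchmark exercises, the following is a minimal sketch of a sequential (online) K-Means step in plain Python. It is not the implementation used in the thesis and is independent of the four frameworks; the class name `OnlineKMeans`, the running-mean step size and the sample records are assumptions made purely for illustration.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class OnlineKMeans:
    """Hypothetical sequential k-means: one centroid update per record."""

    def __init__(self, initial_centroids: Sequence[Sequence[float]]) -> None:
        self.centroids: List[List[float]] = [list(c) for c in initial_centroids]
        # Number of records assigned to each centroid so far.
        self.counts: List[int] = [0] * len(self.centroids)

    def nearest(self, point: Sequence[float]) -> int:
        """Return the index of the centroid closest to `point`."""
        return min(
            range(len(self.centroids)),
            key=lambda i: squared_distance(point, self.centroids[i]),
        )

    def update(self, point: Sequence[float]) -> int:
        """Assign `point` to its nearest centroid and move that centroid."""
        i = self.nearest(point)
        self.counts[i] += 1
        # Assumed step size: 1/n, so each centroid is the running mean
        # of the records assigned to it.
        rate = 1.0 / self.counts[i]
        self.centroids[i] = [
            c + rate * (x - c) for c, x in zip(self.centroids[i], point)
        ]
        return i


def run_demo() -> OnlineKMeans:
    """Feed a tiny, made-up stream of 2-D records through the model."""
    model = OnlineKMeans(initial_centroids=[[0.0, 0.0], [10.0, 10.0]])
    stream = [[1.0, 1.0], [9.0, 9.0], [3.0, 1.0]]
    for record in stream:
        model.update(record)
    return model


if __name__ == "__main__":
    print(run_demo().centroids)
```

Because each record touches only one centroid and requires no pass over past data, an update of this form fits naturally into a streaming operator; in a distributed setting the centroids would additionally have to be shared or merged across nodes, which this sketch does not attempt.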
Later in the thesis, we discuss a direct application of the benchmarking
experiments, called the Configuration Planner. The Configuration Planner is
designed to recommend a server configuration to a telecom network administrator
based on the size of the network to be managed. We describe in detail the
parameters involved, the design overview and the data structures of each
component of the Planner. The thesis covers the design of the Planner and
a possible user interface for it.