Online resource usage prediction and failure-aware system for resource provisioning in cloud data centres (PhD)

Gupta, Shaifu

URI: http://hdl.handle.net/123456789/219

Date: 2020-05-26

Abstract:

The primary goal of cloud service administrators is effective resource provisioning. Prediction of future resource usage and proactive prediction of resource contention failures are two important functions of a resource provisioning system. Towards this goal, this work focuses on improving these two functions by analysing the nature of cloud resource usage workloads and by applying statistical and machine learning techniques. The thesis first uses both experimental and analytical investigations to show the presence of long range dependence in cloud resource usage workloads. We then compare the accuracy of future resource usage prediction using models with and without modelling of long range dependence in resource usage workloads. However, the performance of any resource may depend on other resources as well. Hence, this thesis proposes multivariate extensions of future resource usage prediction models. This work analyses six different multivariate frameworks based on regression models of all available resource metrics and of a subset of relevant metrics. To support the selection of a relevant set of features, we propose to use two desirable characteristics of feature selection techniques, namely prediction performance and stability, to support the dynamic cloud environment. This thesis next proposes online variations of future resource usage prediction models. Here, the prediction models are updated in real time based on the error produced by the model. In this context, we analyse gradient descent and Levenberg-Marquardt parameter update methods. Further, as long short-term memory (LSTM) models for future resource usage prediction have a large number of trainable parameters, we propose sparse variations of these prediction models in which only a few parameters are retained, to support fast online adaptation of prediction models. In addition to future resource usage prediction, this thesis also proposes a resource contention failure prediction system.
Here, the heteroscedastic nature of cloud workloads is used with machine learning for anomaly detection. This work proposes an iterative autoencoder method for anomaly detection. Since detection of anomalies alone is not sufficient, we propose four different classification models for classifying anomalies into different kinds of resource bottlenecks. These are a simple multiclass classifier, a multiclass classifier with fractional differencing, a multiclass classifier using encoded representation, and a multiclass classifier using triplet-loss based representation. To support continuous online learning of the models used for identifying different types of anomalies, this thesis proposes a novel approach for real-time adaptation of anomaly identification models. Here, we analyse two fundamental challenges associated with online adaptation of classification models for anomaly identification: catastrophic forgetting and architectural evolution in models. To avoid catastrophic forgetting, we propose a combination of standard loss and distillation loss in a teacher-student network approach. For architectural evolution, we propose three different alternatives for incremental architectural learning and a column subset selection based teacher-student network. Based on the analysis, we observe that leveraging the presence of long range dependence, the proposed multivariate regression based prediction framework and the proposed online adaptation of prediction models have enhanced the accuracy of CPU usage prediction by 69% over other existing methods for resource usage prediction. In the context of failure prediction, the proposed iterative autoencoder method performs 35% better than existing methods for anomaly detection.
Among the proposed methods for identifying types of anomalies, the LSTM-based multiclass classifier using triplet-loss based representation improves the performance of identification of types of anomalies by 66%, which is further enhanced by the proposed ensemble methods to a total of 76%. Effective resource provisioning has become a necessity rather than a luxury. It is important that resource management decisions by schedulers and resource managers are carried out based on the expected resource usage workload as well as the current state of the servers. Based on this, the work carried out in this thesis achieves its primary goal of improving the performance of prediction of future resource usage and diagnosis of resource contention failures. Insights and outcomes from this thesis can be used by service administrators to achieve optimal resource management, with its far-reaching impact on their revenue, reliability and reputation.
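
As an illustration of the fractional differencing mentioned in the abstract, the following is a minimal sketch of the standard truncated fractional difference operator (1 - B)^d applied to a usage series. It is not taken from the thesis: the function names, the truncation length and the sample CPU usage values are assumptions made for this example only.

```python
def frac_diff_weights(d, n_weights):
    """Return the first n_weights coefficients of the expansion of (1 - B)^d.

    Uses the standard recurrence w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k.
    """
    weights = [1.0]
    for k in range(1, n_weights):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def frac_diff(series, d, n_weights):
    """Apply a fractional difference of order d, truncated to n_weights lags.

    Output starts at the first index where all n_weights lags are available.
    """
    w = frac_diff_weights(d, n_weights)
    out = []
    for t in range(n_weights - 1, len(series)):
        out.append(sum(w[k] * series[t - k] for k in range(n_weights)))
    return out


# Hypothetical CPU usage samples (fractions of capacity), for illustration only.
cpu_usage = [0.42, 0.45, 0.47, 0.44, 0.50, 0.53]

print(frac_diff_weights(0.5, 4))     # slowly decaying weights for d = 0.5
print(frac_diff(cpu_usage, 0.5, 4))  # fractionally differenced series
print(frac_diff(cpu_usage, 1.0, 2))  # d = 1 reduces to the ordinary first difference
```

With a fractional order 0 < d < 1 the weights decay slowly, so the transformed series retains part of the long-memory structure that an ordinary first difference (d = 1) would remove.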

Description:

A thesis submitted for the award of the degree of Doctor of Philosophy under the guidance of Dr. Dileep A.D. and Prof. Timothy A. Gonsalves (Faculty, SCEE)
