Pelaez, Alejandro. Using data analytics for reliability and autonomic management of large-scale systems. Retrieved from https://doi.org/doi:10.7282/T30K2BKG
Description: Large-scale clusters are growing at a rapid pace, and the amount of monitoring data these systems produce is growing with them. The goal of this research is to investigate tools that use this wealth of data to improve the reliability of such systems and help manage them. This is a challenging problem: as these machines scale, so do their complexity, the amount of monitored data, and the number of interactions between nodes, making the systems much harder to manage and resulting in a high failure frequency. In this thesis we focus on online failure prediction and policy-based management as mechanisms that can help address these issues. First, for failure prediction, we aim for an accuracy comparable to that of other algorithms while scaling to thousands of nodes, given that typical centralized solutions suffer from high transmission and processing overheads at very large scales. Our solution is based on a decentralized online clustering algorithm (DOC) that detects anomalies in resource usage logs. We show that we can in fact achieve accuracy similar to that of other algorithms while scaling to thousands of nodes with less than 2% overhead. Second, high-level policies are an attractive option for managing complex systems and ensuring that they run within certain restrictions, as policies can be specified in terms of business goals and do not require low-level knowledge of the machines. To enable this, we need a way of dynamically mapping the state of the system to the high-level policies. We therefore propose a machine learning solution based on monitoring data, in which we predict the high-level indicators of the state of a system in order to determine what actions must be taken to satisfy a given policy.
We evaluate our approach using a sample system and demonstrate that neural networks do an excellent job of predicting the required state, incurring an error of at most 8.78%, 98% of the time.