Bayesian Prognostic Framework for High-Availability Clusters

Abstract

Critical services in domains as diverse as finance, manufacturing and healthcare are often delivered by complex enterprise applications (EAs). High-availability clusters (HACs) are software-managed IT infrastructures that enable these EAs to operate with minimal downtime. To that end, HACs monitor the health of EA layers (e.g., application servers and databases) and resources (i.e., components), and attempt to reinitialise or restart failed resources swiftly. When this is unsuccessful, HACs try to fail over (i.e., relocate) the resource group to which the failed resource belongs to another server. If the resource group failover is also unsuccessful, or when a system-wide critical failure occurs, HACs initiate a complete system failover.
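
The escalation sequence described above (restart, then resource group failover, then system failover) can be sketched as follows. This is a minimal illustration only: the function names, the boolean outcome flags and the `system_wide_critical` parameter are hypothetical and do not correspond to the API of any particular HAC product.

```python
from typing import Callable, Sequence, Tuple

Action = Tuple[str, Callable[[], bool]]


def recover(actions: Sequence[Action]) -> str:
    """Try each recovery action in order; return the name of the first that succeeds."""
    for name, attempt in actions:
        if attempt():
            return name
    return "unrecovered"


def escalation_ladder(
    restart_ok: bool,
    group_failover_ok: bool,
    system_failover_ok: bool,
    system_wide_critical: bool = False,
) -> str:
    """Hypothetical model of the HAC escalation steps named in the abstract.

    The boolean flags stand in for the real outcome of each recovery attempt.
    """
    system_step: Action = ("system_failover", lambda: system_failover_ok)
    if system_wide_critical:
        # A system-wide critical failure skips the local steps entirely.
        return recover([system_step])
    return recover([
        ("restart", lambda: restart_ok),
        ("resource_group_failover", lambda: group_failover_ok),
        system_step,
    ])
```

For example, `escalation_ladder(False, True, True)` returns `"resource_group_failover"`, because the restart fails and the next step in the sequence succeeds.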

Despite the availability of multiple commercial and open-source HAC solutions, these HACs (i) disregard important sources of historical and runtime information, and (ii) have limited reasoning capabilities. Therefore, they may conservatively perform unnecessary resource group or system failovers or delay justified failovers for longer than necessary.

This thesis introduces the first HAC taxonomy, uses it to carry out an extensive survey of current HAC solutions, and develops a novel Bayesian prognostic (BP) framework that addresses the significant HAC limitations that are mentioned above and are identified by the survey. The BP framework comprises four modules. The first module is a technique for modelling high availability using a combination of established and new HAC characteristics. The second is a suite of methods for obtaining and maintaining the information required by the other modules. The third is a HAC-independent Bayesian decision network (BDN) that predicts whether resource failures can be managed locally (i.e., without failovers). The fourth is a method for constructing a HAC-specific Bayesian network for the fast prediction of resource group and system failures. Used together, these modules reduce the downtime of HAC-protected EAs significantly. The experiments presented in this thesis show that the BP framework can deliver downtimes between 5.5 and 7.9 times smaller than those obtained with an established open-source HAC.
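
To illustrate why a prediction of local recoverability can reduce downtime, the sketch below compares the expected downtime of attempting a local restart first against failing over immediately. It is a plain expected-value comparison under stated assumptions, not the BDN developed in the thesis: the function names, the probability `p_local_success` and the durations are all hypothetical, and it assumes that a failed restart is always followed by a successful failover.

```python
def expected_downtime_local_first(
    p_local_success: float, t_restart: float, t_failover: float
) -> float:
    """Expected downtime when a local restart is attempted before any failover.

    The restart time is always spent; the failover time is added only when
    the restart fails (assumption: the failover then succeeds).
    """
    if not 0.0 <= p_local_success <= 1.0:
        raise ValueError("p_local_success must be a probability in [0, 1]")
    return t_restart + (1.0 - p_local_success) * t_failover


def choose_action(p_local_success: float, t_restart: float, t_failover: float) -> str:
    """Pick the action with the smaller expected downtime (hypothetical policy)."""
    local_first = expected_downtime_local_first(p_local_success, t_restart, t_failover)
    return "restart_locally" if local_first <= t_failover else "failover_now"
```

Under these assumptions, a local restart is preferred exactly when `t_restart <= p_local_success * t_failover`, so a more accurate estimate of `p_local_success` avoids both unnecessary failovers and restart attempts that are unlikely to succeed.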

Metadata

Supervisors:	Calinescu, Radu
Keywords:	Bayesian network; dependability; decision network; disaster recovery; failover clustering; fault-tolerance; high-availability; high-availability cluster; probabilistic reasoning; reliability
Awarding institution:	University of York
Academic Units:	The University of York > Computer Science (York)
Identification Number/EthosID:	uk.bl.ethos.852195
Depositing User:	Dr Premathas Somasekaram
Date Deposited:	05 Apr 2022 11:56
Last Modified:	21 May 2022 09:53
Open Archives Initiative ID (OAI ID):	oai:etheses.whiterose.ac.uk:30406
