Title: A hybrid fault tolerance framework for SaaS services based on hidden Markov model
Authors: Feng Ye; Qian Huang; Zhijian Wang; Ling Li
Addresses: College of Computer and Information, Hohai University, Nanjing, China; Nanjing Longyuan Micro-Electronics Company, Nanjing, China | College of Computer and Information, Hohai University, Nanjing, China; Nanjing Huiying Electronics Technology Corporation, Nanjing, China | College of Computer and Information, Hohai University, Nanjing, China | College of Computer and Information, Hohai University, Nanjing, China
Abstract: With the boom in cloud computing, more and more applications rely on cloud services to implement their critical business functions. However, failures in such applications, whether they cause service downtime or produce invalid results, can range from a mere inconvenience to significant monetary penalties or even loss of human life. In critical systems, making cloud services highly dependable is one of the main challenges. Existing research shows that the experimental assessment of fault-tolerance architectures for cloud services through fault injection is still an open problem, because failures in the cloud environment are complex and diverse. We therefore propose a hybrid fault tolerance framework for SaaS services that combines replication and design diversity techniques. To verify the effectiveness of the framework under various realistic failure scenarios, we introduce a mixed fault simulator based on the urn-and-ball model of the hidden Markov model. A series of experiments evaluates the reliability of the SaaS service in three configurations: a single service without replication, a single service with retry or reboot, and a service with spatial replication. The results show that the mixed fault simulator can flexibly simulate various faults in the cloud environment, and that both temporal redundancy (retry or reboot) and spatial redundancy (replication) improve the availability and reliability of the SaaS service.
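The abstract does not give the simulator's parameters, so the following is only a minimal sketch, under stated assumptions, of how an urn-and-ball hidden Markov model can generate mixed fault sequences. Each hidden state is an "urn" (a fault regime of the cloud environment), and each observed fault type is a "ball" drawn from the current urn before the model moves to the next urn. The state names, fault types, all probabilities, and the helpers `simulate`, `availability` and `replicated_availability` are hypothetical and are not taken from the paper. Treating replicas as independent chains is a simplifying assumption; real cloud failures are often correlated.

```python
import random

# Hypothetical hidden states ("urns"): the fault regime the environment is in.
STATES = ["healthy", "degraded", "overloaded"]
# Hypothetical observable outcomes ("balls") of one service call.
SYMBOLS = ["ok", "crash", "timeout", "invalid_result"]

# Illustrative probabilities only; NOT the paper's parameters.
INITIAL = {"healthy": 0.8, "degraded": 0.15, "overloaded": 0.05}
TRANSITION = {
    "healthy":    {"healthy": 0.90, "degraded": 0.08, "overloaded": 0.02},
    "degraded":   {"healthy": 0.30, "degraded": 0.60, "overloaded": 0.10},
    "overloaded": {"healthy": 0.20, "degraded": 0.30, "overloaded": 0.50},
}
EMISSION = {
    "healthy":    {"ok": 0.97, "crash": 0.01, "timeout": 0.01, "invalid_result": 0.01},
    "degraded":   {"ok": 0.80, "crash": 0.05, "timeout": 0.10, "invalid_result": 0.05},
    "overloaded": {"ok": 0.50, "crash": 0.10, "timeout": 0.35, "invalid_result": 0.05},
}


def draw(rng, dist):
    """Draw one key from a {key: probability} dict, i.e. one ball from an urn."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


def simulate(n_steps, seed=None):
    """Generate a list of (hidden_state, observed_outcome) pairs."""
    rng = random.Random(seed)
    state = draw(rng, INITIAL)                      # choose the first urn
    trace = []
    for _ in range(n_steps):
        trace.append((state, draw(rng, EMISSION[state])))  # draw a ball from this urn
        state = draw(rng, TRANSITION[state])                # move to the next urn
    return trace


def availability(trace):
    """Fraction of calls whose observed outcome is 'ok' (single service, no redundancy)."""
    return sum(1 for _, obs in trace if obs == "ok") / len(trace)


def replicated_availability(n_replicas, n_steps, seed):
    """Hypothetical spatial replication: a call succeeds if ANY replica returns 'ok'.

    Each replica runs its own, independent HMM chain (seeded seed, seed+1, ...),
    which is a simplifying assumption.
    """
    traces = [simulate(n_steps, seed + i) for i in range(n_replicas)]
    ok = sum(1 for t in range(n_steps) if any(tr[t][1] == "ok" for tr in traces))
    return ok / n_steps
```

For example, `availability(simulate(500, seed=3))` estimates the success rate of a single unreplicated service, and `replicated_availability(3, 500, seed=3)` estimates it for three replicas. Because replica 0 reuses the same seed as the single service, the replicated estimate can never be lower on the same run, which makes the comparison easy to sanity-check.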
Keywords: hidden Markov model; SaaS; fault tolerance; cloud services.
International Journal of Reliability and Safety, 2019 Vol.13 No.1/2, pp. 138-150
Received: 14 Sep 2017
Accepted: 13 Jun 2018
Published online: 14 Dec 2018