Proactive Fault Tolerance using Heartbeat Strategy for Fault Detection
Shelly Prakash1, Vaibhav Vyas2, Anup Bhola3

1Shelly Prakash, Ph.D. Scholar, Department of Computer Science, Banasthali Vidyapith, Tonk, Rajasthan.
2Dr. Vaibhav Vyas, Associate Professor, Department of Computer Science, Banasthali Vidyapith, Tonk, Rajasthan.
3Dr. Anup Bhola, Assistant Professor, Department of Computer Science, Banasthali Vidyapith, Tonk, Rajasthan.
Manuscript received on September 23, 2019. | Revised Manuscript received on October 15, 2019. | Manuscript published on October 30, 2019. | PP: 4927-4932 | Volume-9 Issue-1, October 2019 | Retrieval Number: A2079109119/2019©BEIESP | DOI: 10.35940/ijeat.A2079.019119
© The Authors. Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)

Abstract: A failure causes services on the cloud to go down for some period of time. Most of the time, instead of recovery and repair, virtual machine migration is used: the failed service is failed over to another running virtual server so that the service is revived. Virtual machine migration and recovery mechanisms consume a lot of energy, and many approaches have been implemented to make them energy efficient. Failure detection is equally important and is part of fault tolerance. Done properly, failure detection can be more effective and save more energy and cost than fault recovery. The heartbeat strategy is one such failure detection approach, in which live processes send an “I am alive” message to the host device at predefined fixed intervals to confirm that the process is running correctly. In this paper, we propose to mark the nodes whose processes have failed to send the heartbeat message and to maintain a count of such misses (the confidence factor, α). In primary testing, if the confidence factor reaches a specific threshold, the node is sent for confidence testing (a second-level failure detection test using a different time sequence of heartbeat message arrival) and, if found faulty, is marked for failure recovery. Fault recovery techniques are then applied so that the node can be corrected and reused, and its current jobs are migrated to a healthier node during the recovery period. If the confidence factor α is below the threshold value, no action is taken beyond optionally rechecking network parameters and connections. This method reinforces trust in the heartbeat strategy for fault detection and protects the device from failure.
Keywords: Proactive, reactive fault tolerance, confidence factor, primary testing, confidence testing, virtual machine.
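
The two-level detection flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `NodeState`, `Action`, `decide`, and `confidence_test`, the majority rule used in confidence testing, and the reset of α after a false alarm are all assumptions introduced for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "no action"
    RECHECK_NETWORK = "recheck network parameters and connections"
    MIGRATE_AND_RECOVER = "migrate jobs and start fault recovery"


@dataclass
class NodeState:
    last_heartbeat: float  # arrival time of the last "I am alive" message
    alpha: int = 0         # confidence factor: count of missed heartbeats


def confidence_test(probe_results):
    """Second-level test on heartbeats collected with a different timing.

    Assumed rule (not from the paper): the node is treated as faulty
    when more than half of the probes received no reply.
    """
    misses = sum(1 for replied in probe_results if not replied)
    return misses > len(probe_results) / 2


def decide(node, now, interval, threshold, probe):
    """Run one check per heartbeat interval and return the action to take.

    `probe` is a hypothetical callable that returns a list of booleans,
    one per second-level heartbeat request (True means a reply arrived).
    """
    if now - node.last_heartbeat <= interval:
        return Action.NONE               # heartbeat arrived on time
    node.alpha += 1                      # primary testing: count the miss
    if node.alpha < threshold:
        return Action.RECHECK_NETWORK    # below threshold: no recovery yet
    if confidence_test(probe()):
        return Action.MIGRATE_AND_RECOVER
    node.alpha = 0                       # assumed: false alarm resets alpha
    return Action.RECHECK_NETWORK


node = NodeState(last_heartbeat=0.0)
for t in (1.5, 2.5, 3.5):
    action = decide(node, t, interval=1.0, threshold=3,
                    probe=lambda: [False, False, True])
    print(t, node.alpha, action.name)
```

In this sketch the caller is expected to invoke `decide` once per heartbeat interval and to update `last_heartbeat` whenever a message arrives. Whether α should decay or reset after a successful heartbeat is left open here, since the abstract does not say.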