VMware HA (High Availability) is a tool included in VMware vSphere software that can restart failed virtual machines (VMs) on alternate host servers to reduce application downtime.
vSphere HA allows the server administrator to group physical servers on the same network into a logical unit called a high-availability cluster.
VMware first implemented vSphere HA in Virtual Infrastructure 3 in 2006 and has continued to develop and improve the feature since. In the event of a server outage, such as a system failure, power failure, or network failure, vSphere HA detects which VMs are affected and restarts them on another healthy host in the cluster.
This process of recovering failed workloads on secondary systems is called failover. For workloads that cannot tolerate even a brief restart, vSphere Fault Tolerance (discussed below) goes a step further.
More generally, high availability describes systems or applications that function as expected a high percentage of the time. In enterprise data centers, availability often exceeds 99% and is usually measured in "nines."
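To make the "nines" concrete, here is a small Python sketch (plain arithmetic, not part of any VMware tooling) that converts an availability percentage into the maximum downtime it allows per year:

```python
# Downtime per year implied by each availability level ("nines").
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the maximum yearly downtime for a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% availability -> {downtime_minutes_per_year(pct):.1f} min/year")
```

Two nines (99%) still allow more than three and a half days of downtime a year, which is why enterprise targets are usually three nines or better.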
VMware HA features
VMware HA enables organizations to enhance availability by automatically detecting failed VMs and restarting them on other physical servers without manual intervention. Restarting these VMs on different physical hardware is possible because the virtual machine disk (VMDK) files are stored on shared storage accessible to all physical servers in the HA cluster.
VMware Distributed Resource Scheduler (DRS) is often used with vSphere HA to rebalance the load when workloads must be resumed on alternate hosts. An organization that uses vSphere HA and DRS together can ensure that restarted VMs do not degrade the performance of VMs already running on the other hosts.
The VMware Fault Tolerance feature can deliver even higher levels of availability. Whereas vSphere HA restarts failed VMs after a short detection and boot window, Fault Tolerance maintains a live, redundant copy of a protected VM that can take over seamlessly if the original fails.
How VMware HA works
VMware vSphere HA uses an agent called the Fault Domain Manager (FDM) to track the availability of each ESXi host and restart failed VMs. When setting up vSphere HA, the administrator defines a group of servers as a high-availability cluster.
The FDM agent runs on each host in the cluster. One of the hosts serves as the master, while all other hosts act as subordinates; the master monitors signals from the other hosts in the cluster and communicates with the vCenter Server. The host servers in the HA cluster communicate via heartbeats, periodic messages indicating that a host is working as expected.
If the master host stops receiving heartbeats from another host or VM in the cluster, it directs vSphere HA to take corrective action. The action taken can vary depending on the type of fault detected and the administrator's preferences.
In the event of a VM failure where the host server continues to run, vSphere HA restarts the VM on the original host. If the entire host fails, the tool restarts all affected VMs on the other hosts in the cluster. The VMware HA utility can also restart VMs if the host continues to operate but loses its network connection to the rest of the cluster.
The master can check whether an unresponsive host is still communicating with shared datastores (datastore heartbeating) to determine whether that host is still operating or has merely lost its network connection. Shared storage, such as a storage area network (SAN), lets the hosts access the VM disk files and restart a VM even on another server in the cluster.
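The detection logic described above can be sketched as a toy simulation. All names here (`Host`, `classify`, `restart_plan`, the three-second timeout) are hypothetical; the real FDM agent is far more sophisticated:

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds without a heartbeat before a host is suspect

class Host:
    def __init__(self, name: str):
        self.name = name
        self.last_heartbeat = time.monotonic()
        self.datastore_heartbeat = True  # still writing to shared storage?

    def beat(self):
        self.last_heartbeat = time.monotonic()

def classify(host: Host, now: float) -> str:
    """Master-side check: distinguish a dead host from an isolated one."""
    if now - host.last_heartbeat <= HEARTBEAT_TIMEOUT:
        return "healthy"
    # No network heartbeat: consult datastore heartbeating to decide
    # between true failure and mere network isolation.
    return "isolated" if host.datastore_heartbeat else "failed"

def restart_plan(hosts: list, now: float) -> dict:
    """VMs on failed hosts get restarted on the remaining healthy hosts."""
    healthy = [h.name for h in hosts if classify(h, now) == "healthy"]
    return {h.name: healthy for h in hosts if classify(h, now) == "failed"}
```

The key design point mirrored here is that a missing network heartbeat alone is ambiguous; only the combination of no network heartbeat and no datastore heartbeat is treated as a genuine host failure.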
How to set up and utilize vSphere HA
The first step in setting up vSphere HA is to create a cluster in the vSphere Web Client (Create Cluster), then add ESXi hosts with access to shared storage to the cluster.
VMware HA clusters must contain at least two hosts, but many organizations maintain larger clusters that pool more resources and can absorb more failures. The administrator can then enable the vSphere HA feature from the web client under Manage > Settings > vSphere HA. Finally, the user can adjust the configuration settings and preferences of vSphere HA from the vSphere Web Client.
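As a rough illustration of those prerequisites, the hypothetical helper below checks the two setup requirements mentioned above: at least two hosts, and shared storage visible to every host. It is a sketch, not VMware's actual validation logic:

```python
def can_enable_ha(hosts: list, shared_datastores: set) -> tuple:
    """Pre-flight check for an HA cluster (simplified sketch):
    - at least two hosts, so there is somewhere to restart failed VMs;
    - every host can see the shared storage holding the VM files."""
    if len(hosts) < 2:
        return False, "HA requires at least two hosts in the cluster"
    for host in hosts:
        missing = shared_datastores - host["datastores"]
        if missing:
            return False, f"{host['name']} cannot access: {sorted(missing)}"
    return True, "ok"
```

If either check fails, enabling HA would give a false sense of protection: a VM whose disk files a surviving host cannot reach can never be restarted there.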
VMware HA benefits
VMware vSphere HA delivers the following benefits:
- Cost-effective: automatically restarts VMs on other vSphere hosts as soon as a server failure or OS failure is detected in the vSphere environment.
- Tracks all VMware vSphere hosts and VMs within the vSphere cluster
- Provides high availability to most applications running on virtual machines, regardless of operating system or application.
- Simple to operate: the beauty of vSphere HA is that high availability for the whole cluster is configured with a few clicks via a wizard-driven interface.
The number one misconception about VMware HA
The most common misconception about vSphere HA is that it uses vMotion to move virtual machines from one host to another. This assumption is incorrect.
By the time HA acts, the affected VMs are already powered off; HA simply powers them back on on another healthy host in the cluster. VMware vSphere HA and vMotion are different technologies with different requirements and benefits. In short, HA does not require a vMotion network to operate. It does, however, need shared storage so that other hosts can access the virtual machine's files and restart the VM after a host failure.
VMware HA Best Practices
Number of VMs per host
Make sure you take advantage of the Distributed Resource Scheduler (DRS) feature in your clusters. Using DRS ensures that the workload stays balanced across all the hosts in your cluster.
Imagine a scenario where a host running most of your VMs fails: it will take some time for all of those virtual machines to restart on other healthy hosts. By keeping the load spread out with DRS, you can minimize the downtime caused by a host failure.
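A DRS-style placement decision can be sketched as a greedy "most free capacity first" rule. This is a hypothetical model using memory alone; real DRS weighs CPU, memory, and many other factors:

```python
def place_failed_vms(failed_vms: list, hosts: list) -> dict:
    """Greedy placement sketch: restart each failed VM (largest first)
    on the host with the most free memory. Mutates hosts' used_gb so
    later placements see earlier ones. Hypothetical model, not DRS."""
    placement = {}
    for vm in sorted(failed_vms, key=lambda v: v["mem_gb"], reverse=True):
        host = max(hosts, key=lambda h: h["capacity_gb"] - h["used_gb"])
        if host["capacity_gb"] - host["used_gb"] < vm["mem_gb"]:
            placement[vm["name"]] = None  # no host can take this VM
            continue
        host["used_gb"] += vm["mem_gb"]
        placement[vm["name"]] = host["name"]
    return placement
```

Note how the sketch mirrors the point above: the better balanced the hosts are before a failure, the more spare room each survivor has and the faster the failed VMs find a home.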
Large hosts vs. small hosts
As with the previous point, do you want hundreds of virtual machines to be affected by a host malfunction or a few dozen? Making sure your hosts are not too big and yet not too small is a balancing act. Cluster resilience and hardware costs are vital points to consider when setting up your clusters.
When setting up HA, using Admission Control is generally a good idea. Enabling Admission Control will prevent you from powering on new virtual machines that would violate your "Host Failures Cluster Tolerates" setting.
Admission control keeps you out of this scenario by warning you that there is not enough spare capacity to power on new virtual machines while still tolerating the configured number of host failures.
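A simplified model of that capacity check is sketched below, under the assumptions that capacity is measured in GB of memory and that tolerating F failures means reserving the F largest hosts' worth of capacity (a hypothetical helper, not VMware's actual admission-control algorithm):

```python
def admission_check(host_capacities_gb: list, reserved_gb: float,
                    new_vm_gb: float, failures_to_tolerate: int = 1) -> bool:
    """Admit a new VM only if, after setting aside enough capacity to
    survive the configured number of host failures, the remaining
    capacity still covers everything already reserved plus the new VM.
    Simplified: the reserve is the sum of the F largest hosts."""
    reserve = sum(sorted(host_capacities_gb, reverse=True)[:failures_to_tolerate])
    usable = sum(host_capacities_gb) - reserve
    return reserved_gb + new_vm_gb <= usable
```

For example, with three 64 GB hosts tolerating one failure, only 128 GB is usable for VM reservations; the last 64 GB stays free so the cluster can absorb the loss of any one host.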
So, VMware HA is a great way to ensure that your vSphere cluster provides resilient, highly available protection against common ESXi host failures.