How to Reduce Downtime with VMware Clusters?

We often hear about clusters, but what does a “cluster” actually mean? How are clusters used, and what are their benefits? In this article, I will take a look at the most frequently used types of clusters and the advantages of using them in a VMware virtual infrastructure.

A cluster is a group of servers that are connected to each other, use special software, and operate together like a single system by performing common tasks. Servers that belong to a cluster are called nodes. There must be at least two nodes within the cluster for it to work.

VMware vSphere supports two types of clusters: Distributed Resource Scheduler (DRS) and High Availability (HA) clusters.

vSphere cluster settings

Distributed Resource Scheduler (DRS) clusters are used to balance workloads for optimal performance of virtual machines running on the ESXi hosts within the cluster.

When a host is added to a DRS cluster, its CPU and memory resources become part of the cluster's pooled resources. DRS monitors resource usage on every host, compares it to the ideal distribution, and, depending on the automation level, either migrates virtual machines automatically or provides migration recommendations for manual approval. Resource pools can be used within the cluster to further partition CPU and memory. Thus, a DRS cluster can help optimize CPU and memory utilization by migrating VMs from an overloaded ESXi host to another ESXi host with spare computing resources.
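
To give you an idea of how this looks in practice, below is a minimal sketch of enabling DRS in fully automated mode with the open-source pyVmomi library. The vCenter address, credentials, and the cluster name “Cluster01” are placeholders for your own values, and error handling is omitted for brevity.

    import ssl
    from pyVim.connect import SmartConnect
    from pyVmomi import vim

    # Connect to vCenter (placeholder address and credentials).
    context = ssl._create_unverified_context()  # lab use only; verify certificates in production
    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="password", sslContext=context)
    content = si.RetrieveContent()

    # Find the cluster by name ("Cluster01" is a placeholder).
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Cluster01")
    view.Destroy()

    # Enable DRS in fully automated mode; other automation levels are
    # "manual" and "partiallyAutomated".
    spec = vim.cluster.ConfigSpecEx()
    spec.drsConfig = vim.cluster.DrsConfigInfo(
        enabled=True,
        defaultVmBehavior="fullyAutomated",
        vmotionRate=3)  # migration threshold, middle (default) setting

    task = cluster.ReconfigureComputeResource_Task(spec, modify=True)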

High Availability (HA) clusters are used to ensure high availability of virtual machines and the services running on them. In the event of a host failure, HA restarts the affected VMs on another host in the cluster with sufficient spare capacity, thereby providing near-continuous operation. If you are looking to minimize downtime for your VMs, HA clusters should be your primary choice.
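
Enabling vSphere HA on the same cluster follows the same pattern as the DRS sketch above. The snippet below reuses the vCenter connection and the cluster object from that example; the host monitoring and admission control settings are illustrative defaults.

    # Turn on vSphere HA for the cluster found in the DRS sketch.
    ha_spec = vim.cluster.ConfigSpecEx()
    ha_spec.dasConfig = vim.cluster.DasConfigInfo(
        enabled=True,                   # enable vSphere HA
        hostMonitoring="enabled",       # restart VMs when a host failure is detected
        admissionControlEnabled=True)   # reserve capacity so failed-over VMs can start

    task = cluster.ReconfigureComputeResource_Task(ha_spec, modify=True)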

A simplified illustration of an HA cluster

There is one more feature available for VMware HA clusters that can help further reduce the VM downtime. It’s called Fault Tolerance.

What is Fault Tolerance?

Fault Tolerance (FT) is a clustering feature that helps ensure continuous availability of VMs in case of ESXi host failure. FT replicates the primary virtual machine's state, including all processor and virtual device inputs, to a secondary VM; however, only the primary VM produces outputs. While the primary VM is running on the first ESXi host, the secondary VM (replica) runs on a second ESXi host with its network connection disabled. Both the primary and the secondary VM share the same virtual disk located on a shared datastore.

When the ESXi host running the primary VM fails, a transparent failover occurs instantly: the secondary VM (replica) becomes the primary and active VM, with networking enabled. The failover is so fast that the only thing you might notice is slightly higher latency for network packets transmitted at the moment of the failover. Once the secondary VM becomes the primary, a new secondary virtual machine is created on another ESXi host (if such a host is available in your infrastructure). With Fault Tolerance, it is possible to achieve zero downtime in case of ESXi host hardware failure.

To enable Fault Tolerance for a virtual machine, the ESXi host on which the VM is running must belong to an HA cluster. Additionally, there are some requirements and limitations for FT usage that we will consider below, along with the requirements for HA clusters.
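
In the vSphere API, Fault Tolerance is enabled per virtual machine through the CreateSecondaryVM_Task method of the VM object. The sketch below reuses the pyVmomi connection from the earlier examples and assumes a hypothetical VM named “app-vm-01” that already satisfies the FT requirements.

    # Enable Fault Tolerance for a single VM.
    vm_view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in vm_view.view if v.name == "app-vm-01")  # hypothetical VM name
    vm_view.Destroy()

    # Passing host=None lets vCenter choose a suitable host for the secondary VM.
    ft_task = vm.CreateSecondaryVM_Task(host=None)

Should you later need to disable FT for the VM, the same object exposes the TurnOffFaultToleranceForVM_Task method.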

Note: Starting from vSphere 6.0, VADP (vStorage APIs for Data Protection) can be used for fault-tolerant VMs as well. vSphere 6.0 also allows the .vmx and .vmdk files of the primary and secondary VMs to be stored on different datastores.

Requirements for Creating a High Availability Cluster

The following requirements must be met to create a High Availability cluster (a short verification sketch follows the list):

  • A DNS server and a vCenter Server instance installed
  • Compatible CPUs across the ESXi hosts; identical memory and processor configurations for all hosts are recommended
  • At least two ESXi servers of the same version and build, with different host names and static IP addresses
  • A shared datastore accessible to all ESXi hosts within the cluster
  • Network access (a redundant network connection is recommended)
  • A 10 Gbit network connection with latency under 10 ms
  • Different subnets for the HA network and the management network
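
Some of these prerequisites can be verified programmatically before you build the cluster. The rough pre-check below reuses the pyVmomi connection from the earlier sketches: it compares ESXi versions and builds across hosts and looks for datastores visible to every host. It is only a sanity check, not a replacement for VMware's compatibility guides.

    # Sanity-check a few HA prerequisites across all ESXi hosts known to vCenter.
    host_view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    hosts = list(host_view.view)
    host_view.Destroy()

    if hosts:
        # All hosts should run the same ESXi version and build.
        builds = {(h.config.product.version, h.config.product.build) for h in hosts}
        if len(builds) > 1:
            print("Warning: hosts run different ESXi versions/builds:", builds)

        # At least one datastore should be visible to every host.
        shared = set.intersection(*({ds.name for ds in h.datastore} for h in hosts))
        print("Datastores visible to all hosts:", shared or "none")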

Requirements & Limitations of Fault Tolerance

The following requirements must be met to enable Fault Tolerance:

  • Configured HA cluster
  • Hardware virtualization enabled in BIOS
  • CPU compatibility

Limits for Fault Tolerance (for the latest vSphere version at the time of writing; a quick per-VM check is sketched after the list):

  • Maximum 4 vCPUs per FT-protected VM
  • Maximum 8 FT-protected vCPUs or 4 FT-protected VMs per host
  • Maximum 16 virtual disks per FT-protected VM
  • Maximum disk size of 2 TB
  • Maximum 64 GB RAM per FT-protected VM
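
If you want to verify a particular VM against the per-VM limits before enabling FT, a quick check might look like the sketch below. It reuses the vm object from the Fault Tolerance example; the numeric limits simply mirror the list above and may change between vSphere versions (the per-host limits would require inspecting all FT-protected VMs on the host).

    # Check the VM's configuration against the FT limits listed above.
    disks = [d for d in vm.config.hardware.device
             if isinstance(d, vim.vm.device.VirtualDisk)]

    checks = {
        "vCPUs <= 4": vm.config.hardware.numCPU <= 4,
        "RAM <= 64 GB": vm.config.hardware.memoryMB <= 64 * 1024,
        "Virtual disks <= 16": len(disks) <= 16,
        "Each disk <= 2 TB": all(d.capacityInKB <= 2 * 1024 ** 3 for d in disks),
    }
    for name, ok in checks.items():
        print(f"{name}: {'OK' if ok else 'exceeds the FT limit'}")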

Achieving Tighter RTOs with HA Clusters

RTO (Recovery Time Objective) is a metric that defines the maximum amount of time your company can afford to spend recovering virtual machines and other data after a failure. If, for example, you set a 1-hour RTO and you can recover your virtual machines in 30 minutes, your RTO goal is met. As described above, a High Availability cluster helps reduce VM downtime in case of host failure: the downtime equals the time it takes to restart the VM on a different host. Even better, Fault Tolerance allows you to achieve zero downtime and thus helps you meet the shortest RTOs possible. For more information about establishing, improving, and troubleshooting your RTOs, download and read the free White Paper “How to Calculate a Recovery Time Objective and Cut Downtime Costs”.

Which Protection Method Is Better: Clustering or VM Backup?

Clustering features such as High Availability and Fault Tolerance are indispensable for business-critical processes, because they can recover VMs after an ESXi host failure within the shortest possible time. What they won't do, however, is recover data after accidental deletion or a disaster. This is because the virtual machine's disk is located on a shared datastore, and both the primary and secondary VMs use this disk. If the files on that virtual disk are deleted by users or corrupted by ransomware, it won't be possible to recover them. Even if the VMs use replicated virtual disks on different datastores, consistent data cannot be recovered once the corrupted data has been replicated, or after a disaster (flood, earthquake, fire, etc.). That's why HA and FT are not a replacement for VM backup. Running backup jobs, saving VM backups to different media, and storing backup copies offsite improve your chances of successful recovery. Combining clustering features with consistent VM backup provides the most reliable protection for your virtual machines.

Choosing the Right Backup Solution

Choosing the right product that will fit your VM backup needs is a difficult task. That’s why in this article, I’d like to review NAKIVO Backup & Replication – a powerful solution that can become a good companion for VMware clustering features.

Overview

NAKIVO Backup & Replication is a product designed for protecting VMware and Hyper-V VMs and AWS EC2 instances. It has a rather simple and user-friendly web interface, where you can navigate to your ESXi hosts, clusters, and virtual machines and run jobs for them.

The interface of NAKIVO Backup & Replication

Key features of NAKIVO Backup & Replication

  • Native, image-based, application-aware backups of VMs with flexible scheduling and reporting.
  • Deduplication, compression, and exclusion of swap files and partitions during VM backup to save storage space.
  • Simple disaster recovery with VM replicas.
  • Automated verification that VM backups and replicas are functional.
  • GFS retention settings to store multiple recovery points for backups and replicas (e.g., if you accidentally deleted a file two months ago, you can restore it from the monthly backup made three months ago, without the need to store all daily or weekly backups for this period).
  • Backup copy offsite or to the cloud (AWS/Azure).
  • Fast recovery of VMs and granular recovery of files and application objects (Microsoft Exchange, Microsoft SQL Server, and Microsoft Active Directory).
  • Instant VM recovery from the compressed and deduplicated backups stored in the backup repository.
  • Network acceleration, LAN-free data transfer, and direct installation on NAS storage devices for faster backup and replication jobs.

By using these and other features, you can protect your virtual machines and ensure consistent and near-instant recovery. You can try them out in your infrastructure by downloading a full-featured free trial of NAKIVO Backup & Replication.

Conclusion

VMware clustering features are indispensable for protecting virtual machines and reducing downtime caused by hardware failure. High Availability clusters and the fault tolerance feature can help reduce downtime virtually to zero and achieve the shortest RTOs possible, which is extremely important for business-critical processes running on virtual machines.

Even more reliable protection can be achieved by adding VM backup. By complementing the advantages of VMware clustering with the comprehensive feature set of NAKIVO Backup & Replication, you can meet even the shortest Recovery Time Objectives (RTOs) set for your virtual environment and improve your business continuity plan.

The following two tabs change content below.
michael.bose@nakivo.com'

Michael Bose

Michael Bose is a VMware administrator at NAKIVO with 10+ years of experience in the virtualization area. He is also an active contributor to the NAKIVO Blog.
michael.bose@nakivo.com'

Latest posts by Michael Bose (see all)

michael.bose@nakivo.com'

About Michael Bose

Michael Bose is a VMware administrator at NAKIVO with 10+ years of experience in the virtualization area. He is also an active contributor to the NAKIVO Blog.