Virtual Machine High Availability (VM HA)
In this document, you will learn about Private Cloud Director Virtual Machine High Availability (VM HA), a feature that automatically detects physical host failures within a cluster and restarts the affected VMs on other healthy hosts in the same cluster.
Introduction
Hardware, firmware, or network issues can cause a virtualization host to go offline with little warning. Without an automated recovery mechanism, every VM on that host remains down until an operator intervenes, an outage that violates most service-level objectives.
Virtual Machine High Availability (VM HA) protects against this risk. The process is designed to be automatic and requires minimal manual intervention during a failure event:
Continuous Host Monitoring: The VM HA service continuously monitors the health and responsiveness of all hypervisor hosts participating in an HA-enabled virtualized cluster.
Failure Detection: If a host stops responding (due to hardware failure, operating system crash, or certain network isolation scenarios), the system detects the failure.
Automatic VM Recovery: Upon confirmation of a host failure, which involves verifying the failure on both the management plane and cluster hosts, VM HA automatically initiates the process of restarting any VMs running on the failed host. These VMs are powered on using available resources on the remaining healthy hosts within the cluster.
After recovery, complementary features such as Dynamic Resource Rebalancing (DRR) can redistribute load to restore optimal balance across the cluster, ensuring sustained performance.
Benefits of VM HA
Enabling VM HA in your Private Cloud Director environment delivers the following benefits:
Minimized Downtime: Automatically restarts or evacuates VMs when a host fails, reducing mean-time-to-recovery from hours to minutes.
Service Continuity: Keeps business-critical applications online even during unexpected infrastructure outages.
Operational Efficiency: Eliminates the need for round-the-clock manual monitoring and intervention, freeing administrators to focus on higher-value tasks.
Policy-Driven Control: Respects host aggregates, affinity/anti-affinity rules, and VM-specific settings, allowing you to decide how each workload is treated during failover.
Seamless Interoperation: Works in concert with DRR and other Private Cloud Director services to maintain both availability and resource efficiency after a failure event.
VM HA Pre-Requisites
VM HA always operates in the context of a virtualized cluster. You need to turn it on at the cluster level. Once turned on, VM HA applies to all virtual machines within the cluster.
A minimum of two hosts with a hypervisor role in a cluster.
Shared storage is required for VM HA. For VM HA to work at the cluster level:
All VMs should be using a block storage volume as the root disk (non-ephemeral root disk), or
If any VMs are using ephemeral storage for root disk, then Ephemeral Shared Storage should be used for all hosts in the virtualized cluster.
When Ephemeral Shared Storage is not used, any VMs using ephemeral root disk will get rebuilt as part of recovery on another host.
VM HA requires a minimum number of healthy hosts in a cluster to function correctly. A minimum of two hosts is required for HA activation.
VM HA uses the VM Evacuation operation behind the scenes. VM Evacuation Prerequisites must be met for the operation to succeed.
All hosts in the cluster must belong to the same host aggregate, or none of the hosts should have host aggregate assignments. Mixed host aggregate configurations are not supported as they prevent VM migration between hosts.
If any VMs in the cluster use block storage, the block storage role must be assigned to at least two hosts in the cluster.
The image library role must be assigned to at least two hosts in the cluster, and the image library must use shared storage.
Operating System Compatibility: All hosts in the VM HA enabled cluster must run the same operating system version.
Mixed Ubuntu versions (22.04 and 24.04) in the same cluster are not supported.
VM evacuation between different OS versions will fail due to KVM version incompatibilities.
We recommend disabling VM HA before upgrading the operating system on any host of the cluster, then re-enabling it after the host(s) have been upgraded.
Configure VM HA for a Cluster
VM HA configuration follows a two-step process to ensure cluster readiness:
Step 1: Create the Cluster
When you create a new cluster via Infrastructure > Clusters > Add Cluster, the VM High Availability toggle will be visible but disabled. You cannot enable VM HA during cluster creation.
Step 2: Enable VM HA After Adding Hosts
After creating the cluster:
Add at least two hosts with the hypervisor role to the cluster.
Verify that all hosts are assigned to the same host aggregate (if using host aggregates).
Navigate to the cluster settings and enable the VM High Availability toggle.
VM HA Enablement Rules
The system enforces these validation rules for VM HA:
Minimum Host Requirement: A minimum of two hypervisor hosts is required in the cluster. If fewer than two hosts are present, the toggle displays: "At least two hosts are required to enable VM HA."
Host Aggregate Consistency: All hosts must either belong to the same host aggregate or have no aggregate assignment. If the hosts are part of different aggregates, the toggle displays the following message: "VM HA cannot be enabled because hosts belong to different host aggregates, which prevents cross-host migration."
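The two validation rules above can be sketched as a single check. This is a minimal illustration, not the product's implementation: the `Host` type and the `vmha_enablement_error` function are hypothetical names, and the sketch assumes each host has at most one host aggregate. Only the two messages are taken from this document.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Host:
    """Hypothetical host record used only for this illustration."""
    name: str
    is_hypervisor: bool
    aggregate: Optional[str] = None  # None means no host aggregate assignment


def vmha_enablement_error(hosts: List[Host]) -> Optional[str]:
    """Return the blocking message, or None if both rules are satisfied."""
    hypervisors = [h for h in hosts if h.is_hypervisor]
    if len(hypervisors) < 2:
        return "At least two hosts are required to enable VM HA."
    # All hosts must share one aggregate, or all must be unassigned.
    # A mix of assigned and unassigned hosts yields two distinct values.
    if len({h.aggregate for h in hosts}) > 1:
        return ("VM HA cannot be enabled because hosts belong to different "
                "host aggregates, which prevents cross-host migration.")
    return None
```

For example, a cluster of two hypervisors with no aggregate assignment passes both rules, while a cluster where one host is in an aggregate and the other is not is rejected by the second rule.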
On the Clusters grid view, you will see one of the following VM HA statuses for each cluster:
Disabled: VM HA is disabled for the cluster.
Waiting: VM HA is enabled but remains inactive for this cluster because fewer than two hypervisor hosts have been assigned. To activate VM HA, assign at least two hosts with the hypervisor role to the cluster. The status will automatically change to Active once this requirement is met.
Active: VM HA is enabled and active for the cluster.
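The relationship between the three statuses can be summarized in a few lines. The function name `vmha_status` and its parameters are hypothetical, chosen for illustration; the status names and the two-host threshold come from the descriptions above.

```python
def vmha_status(enabled: bool, hypervisor_count: int) -> str:
    """Map the cluster's VM HA setting and hypervisor count to a grid status."""
    if not enabled:
        return "Disabled"
    # Enabled clusters stay in Waiting until two hypervisor hosts are assigned.
    return "Active" if hypervisor_count >= 2 else "Waiting"
```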
Upgrading Your Cluster Hosts
When upgrading your cluster hosts from one operating system version to another (for example, Ubuntu 22.04 to 24.04), follow the steps below:
Disable VM HA before starting the upgrade process.
Evacuate all VMs from the hosts that are scheduled for upgrade.
Upgrade hosts one at a time to the new operating system version.
Do not attempt to migrate VMs between upgraded and non-upgraded hosts.
Re-enable VM HA only after all hosts have been upgraded to the same operating system version.
Important
VM evacuation and migration operations require matching operating system and KVM versions between source and destination hosts. Attempting to evacuate VMs between Ubuntu 22.04 and 24.04 hosts will fail.
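Before re-enabling VM HA, it can help to confirm that every host reports the same operating system version. The sketch below assumes you have already collected the versions into a dictionary by some means of your own; the function name `mismatched_os_versions` and the input shape are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def mismatched_os_versions(host_versions: Dict[str, str]) -> List[str]:
    """Return hosts whose OS version differs from the most common one.

    An empty result means all hosts report the same version.
    """
    if not host_versions:
        return []
    most_common_version, _ = Counter(host_versions.values()).most_common(1)[0]
    return sorted(h for h, v in host_versions.items() if v != most_common_version)
```

If the returned list is non-empty, the cluster still mixes versions, and VM HA should stay disabled until the listed hosts are upgraded.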
How VM HA Works
Private Cloud Director Virtual Machine High Availability (VM HA) provides automated protection against host failures by continuously monitoring host health and, when needed, evacuating affected virtual machines (VMs) to healthy hosts with minimal disruption. Three cooperating services enable the automated workflow:
High Availability Manager: Central coordinator that runs in the Private Cloud Director management plane. It watches for clusters with VM HA enabled, gathers health reports, confirms host failures, and issues evacuation requests.
Host Agents: Daemons that run on every hypervisor host. Each agent probes the liveness of its peers and reports its findings to the High Availability Manager at regular intervals.
VM Evacuation Service: Orchestrator that runs in the Private Cloud Director management plane. It receives confirmed host-down events from the High Availability Manager and evacuates each impacted VM to a suitable target host.
End-to-end Flow
Cluster discovery: The High Availability Manager polls the infrastructure to discover clusters where VM HA is enabled. For VM HA-enabled clusters, it verifies the presence of the pf9-ha-slave role on every hypervisor host. Clusters with fewer than two hosts are ignored until additional hosts join.
Peer list distribution: Every 59 seconds, each Host Agent requests a peer list from the High Availability Manager. The list is a randomized subset of hosts in the same cluster, helping to spread traffic and avoid single‑point bias.
Distributed health probing: Using the peer list, the host agent performs a liveness check on its peer hosts via the libvirt exporter. This check is entirely agent‑to‑agent, and no central polling path is involved.
Status aggregation & reporting: Every 127 seconds, the agent posts an aggregated view of peer health to the High Availability Manager.
Failure detection and correlation: When the High Availability Manager receives a report that a host appears to be down, it cross-checks the same host’s status in reports from other agents to avoid false positives while ensuring a fast reaction to true outages. Once a host is confirmed down via peer agent reports, a 150‑second cooldown timer starts:
If the host recovers before the timer expires, the event is cleared, and no action is taken. This is to prevent routine reboots from triggering VM evacuations.
If the host remains down when the timer ends, High Availability Manager confirms the failure and emits a host-down notification to the VM Evacuation Service.
Automated VM evacuation: Upon receiving the host‑down notification, the VM Evacuation Service verifies the host’s state, collects all resident VMs, and evacuates them one VM at a time to a healthy host in the cluster. Target selection honors existing placement constraints (such as aggregates or affinity rules).
Continuous protection loop: After evacuation is complete, normal monitoring resumes. Host Agents continue probing both the recovered host (if it comes back online) and all remaining hosts, enabling VM HA to maintain an always-on protection loop with no manual intervention. Note that if a host that went offline comes back online, VM HA will not automatically migrate the VMs that were originally located on this host back to it. The Dynamic Resource Rebalancing (DRR) service will make VM balancing decisions independently if resource contention occurs on any host.
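The failure detection and cooldown step in the flow above can be illustrated with a small state tracker. This is a sketch under stated assumptions, not the High Availability Manager's actual code: the `FailureTracker` class is hypothetical, and the number of peer reports required to corroborate a failure (`min_reporters`) is an assumed parameter that this document does not specify. Only the 150-second cooldown is taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict

COOLDOWN_SECONDS = 150  # cooldown stated in the flow above


@dataclass
class FailureTracker:
    """Hypothetical tracker for hosts that peers report as down."""
    min_reporters: int = 2  # assumption: peers that must agree a host is down
    _down_since: Dict[str, float] = field(default_factory=dict)

    def observe(self, host: str, down_reports: int, now: float) -> bool:
        """Record one round of peer reports for a host.

        Returns True once the host has stayed down for the whole cooldown,
        which is the point where a host-down notification would be emitted.
        """
        if down_reports < self.min_reporters:
            # Host recovered or the failure is not corroborated: clear the event.
            self._down_since.pop(host, None)
            return False
        started = self._down_since.setdefault(host, now)
        return now - started >= COOLDOWN_SECONDS
```

A host that is reported down at time 0 and again at 127 seconds does not trigger evacuation, because the cooldown has not elapsed. If it is still down at 254 seconds, `observe` returns True. If the host recovers in between, the timer resets, which mirrors how a routine reboot avoids triggering evacuation.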
VM HA Interoperation with Other Services
This section describes how VM HA interoperates with other services configured for your cluster.
Host Aggregates
VM HA will honor Host Aggregates and migrate VMs to another host from the same host aggregate.
Here are some limitations you might need to consider:
VM HA requires all hosts in a cluster to belong to the same host aggregate or have no aggregate assignment at all. Mixed aggregate configurations will prevent VM HA from being enabled.
All hosts from a host aggregate must belong to a single cluster and not span multiple clusters.
If the only host in a host aggregate goes down, VM HA cannot find a suitable migration target, and the affected VMs enter an error state.
DRR
DRR and Virtual Machine High Availability are designed to interoperate well together. A host failure event may occur while DRR is actively rebalancing VMs, either from the same host or from other hosts in the cluster. When this happens:
VM HA will detect the host failure and initiate VM evacuations.
The VM evacuations may result in cluster imbalance.
DRR will then detect the imbalance during either the current run or the next run.
DRR will redistribute the load across the cluster to address the imbalance.
VMs with Hard Affinity or Anti-Affinity Rules
Hard affinity: VM HA will fail to evacuate VMs with hard affinity, and VMs will land in Error state. This will be addressed in an upcoming release.
Soft affinity: VM HA identifies a host with sufficient capacity to host all VMs in the affinity group and migrates all VMs sequentially to the target host. If VM HA cannot find a suitable target host, VM HA will migrate VMs to available hosts, potentially violating the soft affinity policy.
Hard anti-affinity: VM HA will fail to evacuate VMs with hard anti-affinity, and VMs will land in Error state. This will be addressed in an upcoming release.
Soft anti-affinity: VM HA will attempt to find target hosts that satisfy the anti-affinity policy for each VM to be migrated, ensuring that all VMs part of the soft anti-affinity policy are placed on separate hosts. If VM HA cannot find a suitable target host, VM HA will evacuate the VMs to available hosts, potentially violating the soft anti-affinity policy.
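The soft anti-affinity behavior can be pictured as "prefer a host that holds no group member, otherwise fall back to any host with room". The sketch below illustrates that idea only. The function `place_soft_anti_affinity`, the single-number capacity model, and the alphabetical tie-breaking are all assumptions made for the example, not the scheduler's real logic.

```python
from typing import Dict, List, Optional, Set, Tuple


def place_soft_anti_affinity(
    vms: List[Tuple[str, int]],   # (vm name, capacity units needed)
    free: Dict[str, int],         # healthy host -> free capacity units
    occupied: Set[str],           # hosts already running a group member
) -> Dict[str, Optional[str]]:
    """Choose a target host per VM, honoring anti-affinity when possible."""
    free = dict(free)       # work on copies so the caller's data is untouched
    used = set(occupied)
    placement: Dict[str, Optional[str]] = {}
    for name, need in vms:
        fits = [h for h in sorted(free) if free[h] >= need]
        preferred = [h for h in fits if h not in used]
        # Policy-satisfying host first, then any host with room, else no target.
        choice = (preferred or fits or [None])[0]
        placement[name] = choice
        if choice is not None:
            free[choice] -= need
            used.add(choice)
    return placement
```

With three VMs and only two healthy hosts, the first two VMs land on separate hosts and the third shares a host with one of them, which is the policy violation the soft rule permits.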
VM States
VMs in a suspended or paused state will not be evacuated to another host.
VMs with Special Properties
Virtual TPM-enabled VMs: VM HA does not currently support live-migrating VMs with Virtual TPM enabled. This support is coming soon.
VMs with hot-added CPU or memory: VM HA will evacuate VMs that have hot-added CPU or memory resources.
Resized VMs: Evacuation will fail for a VM that has been resized but whose resize has not yet been confirmed.
Supported Scale
VM HA is currently supported for up to 100 hosts per region.
