Virtual Machine High Availability (VM HA)

In this document, you will learn about Private Cloud Director Virtual Machine High Availability (VM HA), a feature that automatically detects physical host failures within a cluster and restarts the affected VMs on other healthy hosts in the same cluster.

Introduction

Hardware, firmware, or network issues can cause a virtualization host to go offline with little warning. Without an automated recovery mechanism, every VM on that host would remain down until an operator intervenes, violating most service-level objectives.

Virtual Machine High Availability (VM HA) protects against this risk. The process is designed to be automatic and requires minimal manual intervention during a failure event:

  • Continuous Host Monitoring: The VM HA service continuously monitors the health and responsiveness of all hypervisor hosts in an HA-enabled virtualized cluster.

  • Failure Detection: If a host stops responding (due to hardware failure, operating system crash, or certain network isolation scenarios), the system detects the failure.

  • Automatic VM Recovery: Upon confirmation of a host failure, which involves verifying the failure on both the management plane and the cluster hosts, VM HA automatically restarts any VMs running on the failed host. These VMs are powered on using available resources on the remaining healthy hosts within the cluster.

After recovery, complementary features such as Dynamic Resource Rebalancing (DRR) can redistribute load to restore optimal balance across the cluster, ensuring sustained performance.
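The monitor-detect-recover cycle above can be pictured as a single pass of a control loop. The sketch below is illustrative only; the function and parameter names (`ha_pass`, `vms_by_host`, `is_healthy`) are hypothetical and do not reflect the product's internal API, and the round-robin placement stands in for the real, policy-aware scheduler:

```python
def ha_pass(hosts, vms_by_host, is_healthy):
    """One hypothetical monitoring pass.

    hosts: list of host names; vms_by_host: {host: [vm, ...]};
    is_healthy: host -> bool. Returns {vm: new_host} recovery placements.
    """
    healthy = [h for h in hosts if is_healthy(h)]
    failed = [h for h in hosts if h not in healthy]
    if not healthy:
        return {}  # no healthy hosts left; nothing can be restarted
    placements = {}
    # Restart each VM from a failed host on one of the remaining healthy hosts.
    for i, vm in enumerate(vm for h in failed for vm in vms_by_host.get(h, [])):
        placements[vm] = healthy[i % len(healthy)]  # naive round-robin stand-in
    return placements
```

In the real system, target selection honors aggregates and affinity rules rather than round-robin, as described later in this document.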

Benefits of VM HA

Enabling VM HA in your Private Cloud Director environment delivers the following benefits:

  • Minimized Downtime: Automatically restarts or evacuates VMs when a host fails, reducing mean time to recovery from hours to minutes.

  • Service Continuity: Keeps business-critical applications online even during unexpected infrastructure outages.

  • Operational Efficiency: Eliminates the need for round-the-clock manual monitoring and intervention, freeing administrators to focus on higher-value tasks.

  • Policy-Driven Control: Respects host aggregates, affinity/anti-affinity rules, and VM-specific settings, enabling you to determine how each workload is handled during failover.

  • Seamless Interoperation: Works in concert with DRR and other Private Cloud Director services to maintain both availability and resource efficiency after a failure event.

VM HA Pre-Requisites

VM HA always operates within a virtualized cluster and must be enabled at the cluster level. Once enabled, VM HA applies to all virtual machines in the cluster.

  • Shared storage is required for VM HA. For VM HA to work at the cluster level:

    • All VMs should be using a block storage volume as the root disk (non-ephemeral root disk), or

    • If any VMs use ephemeral storage for the root disk, Ephemeral Shared Storage should be used for all hosts in the virtualized cluster.

    • When Ephemeral Shared Storage is not used, any VMs using an ephemeral root disk will be rebuilt on another host during recovery.

  • VM HA requires a minimum of two healthy hosts in a cluster to function correctly and to be activated.

  • VM HA uses the VM Evacuation operation behind the scenes. VM Evacuation Prerequisites must be met for the operation to succeed.

  • If any VMs in the cluster use a Flavor that assigns the VM to a host aggregate, then that host aggregate should have at least two hosts in the cluster for VM HA failover redundancy.

  • If any VMs in the cluster use block storage, the block storage role must be assigned to at least two hosts in the cluster.

  • If any VMs in the cluster use vTPM, then

    • The virtual machine ephemeral storage directory and the vTPM state file directory must be on shared storage (e.g., NFS) that is mounted on all hypervisor hosts in the cluster.

    • This is required as long as you are using vTPM as a feature for your VMs, even if the VMs do not otherwise use ephemeral storage for their root disk.

    • These directories must be owned by the pf9 user and the pf9group group.

  • The Image library role must be assigned to at least two hosts in the cluster, and the Image library must use shared storage.

  • Operating System Compatibility: All hosts in the VM HA-enabled cluster must run the same operating system version.

    • Mixed Ubuntu versions (22.04 and 24.04) in the same cluster are not supported.

    • VM evacuation between different OS versions will fail due to KVM version incompatibilities.

    • We recommend disabling VM HA before upgrading the operating system on any host of the cluster, then re-enabling it after the host(s) have been upgraded.

  • Read and understand the current VM HA known issues and limitations before configuring it at the cluster level.

Configure VM HA for a Cluster

VM HA configuration follows a two-step process to ensure cluster readiness:

Step 1: Create the Cluster

When you create a new cluster via Infrastructure > Clusters > Add Cluster, the VM High Availability toggle will appear but be disabled. You cannot enable VM HA during cluster creation.

Step 2: Enable VM HA After Adding Hosts

After creating the cluster:

  1. Add at least two hosts with the hypervisor role to the cluster.

  2. Verify that all hosts are assigned to the same host aggregate (if using host aggregates).

  3. Navigate to the cluster settings and enable the VM High Availability toggle.

NOTE

The VM HA toggle is disabled until the minimum requirements are met. Hovering over the disabled toggle displays the message: "At least two hosts are required to enable VM HA."

VM HA Enablement Rules

The system enforces these validation rules for VM HA:

  • Minimum Host Requirement: At least 2 hypervisor hosts are required in the cluster.

If fewer than two hosts are present, the toggle displays: "At least two hosts are required to enable VM HA."

  • Host Aggregate Consistency: All hosts must either belong to the same host aggregate or have no aggregate assignment.

If the hosts are part of different aggregates, the toggle displays the following message: "VM HA cannot be enabled because hosts belong to different host aggregates, which prevents cross-host migration."
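The two validation rules above can be expressed as a small check. This is a sketch only; `can_enable_vm_ha` and `aggregate_of` are hypothetical names, not the product's API, and the assumption that a mix of assigned and unassigned hosts counts as "different aggregates" is mine:

```python
def can_enable_vm_ha(hosts, aggregate_of):
    """Hypothetical sketch of the VM HA enablement checks.

    hosts: hypervisor host names in the cluster.
    aggregate_of: {host: aggregate_name or None}.
    Returns (ok, message), mirroring the toggle messages above.
    """
    if len(hosts) < 2:
        return False, "At least two hosts are required to enable VM HA."
    # All hosts must share one aggregate, or all must be unassigned (None).
    aggregates = {aggregate_of.get(h) for h in hosts}
    if len(aggregates) > 1:
        return (False, "VM HA cannot be enabled because hosts belong to "
                       "different host aggregates, which prevents cross-host migration.")
    return True, ""
```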

On the Clusters grid view, you will see one of the following VM HA statuses for each cluster:

  • Disabled: VM HA is disabled for the cluster.

  • Waiting: VM HA is enabled but remains inactive for this cluster because fewer than two hypervisor hosts have been assigned. To activate VM HA, assign at least two hosts with the hypervisor role to the cluster. The status will automatically change to Active once this requirement is met.

  • Active: VM HA is enabled and active for the cluster.
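The three grid statuses follow directly from the enablement flag and the host count. A minimal sketch, with an illustrative function name of my own choosing:

```python
def vm_ha_status(ha_enabled, hypervisor_host_count):
    """Derive the Clusters grid VM HA status from the states described above."""
    if not ha_enabled:
        return "Disabled"
    # Enabled but fewer than two hypervisor hosts: HA stays inactive.
    return "Active" if hypervisor_host_count >= 2 else "Waiting"
```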

Upgrading your Cluster Hosts

When upgrading your cluster hosts from one operating system version to another (for example, Ubuntu 22.04 to 24.04), follow the steps below:

  • Disable VM HA before starting the upgrade process.

  • Evacuate all VMs from the hosts that are scheduled for upgrade.

  • Upgrade hosts one at a time to the new operating system version.

  • Do not attempt to migrate VMs between upgraded and non-upgraded hosts.

  • Re-enable VM HA only after all hosts have been upgraded to the same operating system version.

How VM HA Works

Private Cloud Director Virtual Machine High Availability (VM HA) provides automated protection against host failures by continuously monitoring host health and, when needed, evacuating affected virtual machines (VMs) to healthy hosts with minimal disruption. Three cooperating services enable the automated workflow:

  1. High Availability Manager: Central coordinator that runs in the Private Cloud Director management plane. It watches for clusters with VM HA enabled, gathers health reports, confirms host failures, and issues evacuation requests.

  2. Host Agents: Daemons that run on every hypervisor host and probe their peers’ liveness and report findings to the High Availability Manager at regular intervals.

  3. VM Evacuation Service: Orchestrator that runs in the Private Cloud Director management plane. It receives confirmed host-down events from the High Availability Manager and evacuates each impacted VM to a suitable target host.

End-to-end Flow

  1. Cluster discovery: The High Availability Manager polls the infrastructure to discover clusters where VM HA is enabled. For VM HA-enabled clusters, it verifies that the pf9-ha-slave role is present on every hypervisor host. Clusters with fewer than two hosts are ignored until additional hosts join.

  2. Peer list distribution: Every 59 seconds, each Host Agent requests a peer list from the High Availability Manager. The list is a randomized subset of hosts in the same cluster, helping to spread traffic and avoid single‑point bias.

  3. Distributed health probing: Using the peer list, the host agent performs a liveness check on its peer hosts via the libvirt exporter. This check is entirely agent‑to‑agent, with no central polling path.

  4. Status aggregation & reporting: Every 127 seconds, the agent posts an aggregated view of peer health to the High Availability Manager.

  5. Failure detection and correlation: When the High Availability Manager receives a report that a host appears to be down, it cross-checks the host’s status across reports from other agents to avoid false positives while ensuring a rapid response to true outages. Once a host is confirmed down via peer agent reports, a 150‑second cooldown timer starts:

    • If the host recovers before the timer expires, the event is cleared, and no action is taken. This is to prevent routine reboots from triggering VM evacuations.

    • If the host remains down when the timer ends, High Availability Manager confirms the failure and emits a host-down notification to the VM Evacuation Service.

  6. Automated VM evacuation: Upon receiving the host‑down notification, the VM Evacuation Service verifies the host’s state, collects all resident VMs, and evacuates them one VM at a time to a healthy host in the cluster. Target selection honors existing placement constraints (such as aggregates or affinity rules).

  7. Continuous protection loop: After evacuation is complete, normal monitoring resumes. Host Agents continue probing both the recovered host (if it comes back online) and all remaining hosts, enabling VM HA to maintain an always-on protection loop with no manual intervention. Note that if a host that went offline comes back online, VM HA will not automatically migrate the VMs that were originally located on this host back to it. The Dynamic Resource Rebalancing (DRR) service will make VM-balancing decisions independently when resource contention occurs on any host.
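The key reasoning in step 5 — start a cooldown on the first down report, clear it if the host recovers, confirm the failure only if the host stays down for the full window — can be sketched as a small state machine. Everything here is illustrative: the class and method names are hypothetical, and only the 150-second default comes from the description above:

```python
import time

class FailureCorrelator:
    """Hypothetical sketch of the confirm-with-cooldown step (step 5)."""

    def __init__(self, cooldown_s=150, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.pending = {}  # host -> time the host was first reported down

    def report(self, host, is_down):
        """Process one aggregated peer report for a host.

        Returns 'cleared' if the host recovered (event dropped, no action),
        'waiting' while the cooldown timer runs, or 'confirm' once the host
        has stayed down for the full window (evacuation would be triggered).
        """
        now = self.clock()
        if not is_down:
            self.pending.pop(host, None)  # recovery before expiry clears the event
            return "cleared"
        first_seen = self.pending.setdefault(host, now)
        if now - first_seen >= self.cooldown_s:
            del self.pending[host]
            return "confirm"  # emit host-down notification to the evacuation service
        return "waiting"
```

A routine reboot shorter than the cooldown produces a "cleared" outcome and never reaches the evacuation service, which matches the intent described in step 5.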

VM HA Interoperation with Other Services

This section describes how VM HA interoperates with other services configured for your cluster.

Host Aggregates

VM HA honors host aggregates and migrates VMs to another host within the same host aggregate.

Consider the following limitations:

  • VM HA requires that all hosts in a cluster belong to the same host aggregate or have no aggregate assignment. Mixed aggregate configurations will prevent VM HA from being enabled.

  • All hosts from a host aggregate must belong to a single cluster and not span multiple clusters.

  • If a host aggregate contains only one host and that host goes down, VM HA cannot find a suitable migration target, and the affected VMs enter an error state.

DRR

DRR and Virtual Machine High Availability are designed to interoperate well together. A host failure event may occur while DRR is actively rebalancing VMs, either from the same host or from other hosts in the cluster. When this happens:

  1. VM HA will detect the host failure and initiate VM evacuations.

  2. The VM evacuations may result in cluster imbalance.

  3. DRR will then detect the imbalance during either the current run or the next run.

  4. DRR will redistribute the load across the cluster to address the imbalance.

VMs with Hard Affinity or Anti-Affinity Rules

  • Hard affinity: VM HA will fail to evacuate VMs with hard affinity, and VMs will land in Error state. This will be addressed in an upcoming release.

  • Soft affinity: VM HA identifies a host with sufficient capacity to host all VMs in the affinity group and migrates all VMs sequentially to the target host. If VM HA cannot find a suitable target host, VM HA will migrate VMs to available hosts, potentially violating the soft affinity policy.

  • Hard anti-affinity: VM HA will fail to evacuate VMs with hard anti-affinity, and VMs will land in Error state. This will be addressed in an upcoming release.

  • Soft anti-affinity: VM HA will attempt to find target hosts that satisfy the anti-affinity policy for each VM to be migrated, ensuring that all VMs part of the soft anti-affinity policy are placed on separate hosts. If VM HA cannot find a suitable target host, VM HA will evacuate the VMs to available hosts, potentially violating the soft anti-affinity policy.
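The soft anti-affinity behavior above — prefer a distinct host per VM in the group, fall back to any host with capacity if that is impossible — can be sketched as a placement routine. The names (`place_with_soft_anti_affinity`, `capacity_ok`) are hypothetical, and this greedy pass is a simplification of whatever the real scheduler does:

```python
def place_with_soft_anti_affinity(vms, candidate_hosts, capacity_ok):
    """Hypothetical sketch of soft anti-affinity target selection.

    vms: VMs in one anti-affinity group to evacuate.
    candidate_hosts: healthy hosts in the cluster.
    capacity_ok(host, vm) -> bool: assumed capacity check.
    Returns {vm: host}; a VM maps to None if no host has capacity.
    """
    placements, used = {}, set()
    for vm in vms:
        # Prefer a host not yet used by this group (policy satisfied).
        preferred = [h for h in candidate_hosts if h not in used and capacity_ok(h, vm)]
        # Fall back to any host with capacity, possibly violating the policy.
        fallback = [h for h in candidate_hosts if capacity_ok(h, vm)]
        target = (preferred or fallback or [None])[0]
        if target is None:
            placements[vm] = None  # no capacity anywhere: evacuation fails
            continue
        placements[vm] = target
        used.add(target)
    return placements
```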

NOTE

If VM evacuation fails in any of the scenarios listed above, the VM may still appear as active in the UI.

VM States

VMs in a suspended or paused state will not be evacuated to another host.

VMs with Special Properties

  1. Virtual TPM-enabled VMs: VM HA does not currently support live migration of VMs with Virtual TPM enabled. Support will be added in an upcoming release.

  2. VMs with hot-added CPU or memory: VM HA will evacuate VMs with hot-added CPU or memory.

  3. Resized VMs: VM HA will fail for a VM that has been resized but not yet confirmed.

Supported Scale

VM HA is currently supported for up to 100 hosts per region.

Known Issues and Limitations

  • If the down host serves both as a hypervisor host and a block storage host for a VM, evacuating that VM to another host may run into race conditions where volume and VM evacuations conflict, resulting in an error.

    • This issue will be fixed in the January 2026 release of Private Cloud Director.

    • The current recommended workaround is to avoid assigning both the hypervisor and block storage roles to hosts in a cluster with VM HA enabled. You can do this by assigning the persistent storage role to one or two dedicated hosts and keeping them separate from the hosts with the hypervisor role.

  • VM HA is presently not supported for clusters with GPUs enabled. This support will be added in a future release of Private Cloud Director.

  • VM HA relies on connectivity with the management plane. If network connectivity to the management plane is degraded, VM HA performance will be negatively affected. If the management plane experiences an outage, VM HA will not be operational for the duration of the outage.
