Virtual Machine High Availability (VM HA)

In this document, you will learn about Private Cloud Director Virtual Machine High Availability (VM HA), a feature that automatically detects physical host failures within a cluster and restarts the affected VMs on other healthy hosts in the same cluster.

Introduction

Hardware, firmware, or network issues can cause a virtualization host to go offline with little warning. Without an automated recovery mechanism, every VM on that host would remain down until an operator intervenes, violating most service-level objectives.

Virtual Machine High Availability (VM HA) protects against this risk. The process is designed to be automatic and requires minimal manual intervention during a failure event:

  • Continuous Host Monitoring: The VM HA service continuously monitors the health and responsiveness of all hypervisor hosts in an HA-enabled virtualized cluster.

  • Failure Detection: If a host stops responding (due to hardware failure, operating system crash, or certain network isolation scenarios), the system detects the failure.

  • Automatic VM Recovery: Upon confirmation of a host failure, which involves verifying the failure on both the management plane and the cluster hosts, VM HA automatically restarts any VMs running on the failed host. These VMs are powered on using available resources on the remaining healthy hosts within the cluster.

After recovery, complementary features such as Dynamic Resource Rebalancing (DRR) can redistribute load to restore optimal balance across the cluster, ensuring sustained performance.
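The monitor-detect-recover cycle above can be pictured as a single pass of a control loop. The sketch below is illustrative only; the function and parameter names (`ha_pass`, `vms_by_host`, `is_healthy`) are hypothetical and do not reflect the product's internal API, and the round-robin placement stands in for the real, policy-aware scheduler:

```python
def ha_pass(hosts, vms_by_host, is_healthy):
    """One hypothetical monitoring pass.

    hosts: list of host names; vms_by_host: {host: [vm, ...]};
    is_healthy: host -> bool. Returns {vm: new_host} recovery placements.
    """
    healthy = [h for h in hosts if is_healthy(h)]
    failed = [h for h in hosts if h not in healthy]
    if not healthy:
        return {}  # no healthy hosts left; nothing can be restarted
    placements = {}
    # Restart each VM from a failed host on one of the remaining healthy hosts.
    for i, vm in enumerate(vm for h in failed for vm in vms_by_host.get(h, [])):
        placements[vm] = healthy[i % len(healthy)]  # naive round-robin stand-in
    return placements
```

In the real system, target selection honors aggregates and affinity rules rather than round-robin, as described later in this document.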

Benefits of VM HA

Enabling VM HA in your Private Cloud Director environment delivers the following benefits:

  • Minimized Downtime: Automatically restarts or evacuates VMs when a host fails, reducing mean time to recovery from hours to minutes.

  • Service Continuity: Keeps business-critical applications online even during unexpected infrastructure outages.

  • Operational Efficiency: Eliminates the need for round-the-clock manual monitoring and intervention, freeing administrators to focus on higher-value tasks.

  • Policy-Driven Control: Respects host aggregates, affinity/anti-affinity rules, and VM-specific settings, enabling you to determine how each workload is handled during failover.

  • Seamless Interoperation: Works in concert with DRR and other Private Cloud Director services to maintain both availability and resource efficiency after a failure event.

VM HA Pre-Requisites

VM HA always operates within a virtualized cluster and must be enabled at the cluster level. Once enabled, VM HA applies to all virtual machines in the cluster.

  • Shared storage is required for VM HA. For VM HA to work at the cluster level:

    • All VMs should be using a block storage volume as the root disk (non-ephemeral root disk), or

    • If any VMs use ephemeral storage for the root disk, Ephemeral Shared Storage should be used for all hosts in the virtualized cluster.

    • When Ephemeral Shared Storage is not used, any VMs using an ephemeral root disk will be rebuilt on another host during recovery.

  • VM HA requires a minimum of two healthy hosts in a cluster to function correctly and to be activated.

  • VM HA uses the VM Evacuation operation behind the scenes. VM Evacuation Prerequisites must be met for the operation to succeed.

  • If any VMs in the cluster use a Flavor that assigns the VM to a host aggregate, then that host aggregate should have at least two hosts in the cluster for VM HA failover redundancy.

  • If any VMs in the cluster use block storage, the block storage role must be assigned to at least two hosts in the cluster.

  • If any VMs in the cluster use vTPM, then

    • The virtual machine ephemeral storage directory and the vTPM state file directory must be on shared storage (e.g., NFS) that is mounted on all hypervisor hosts in the cluster.

    • This is required as long as you are using vTPM as a feature for your VMs, even if the VMs do not otherwise use ephemeral storage for their root disk.

    • These directories must be owned by the pf9 user and the pf9group group.

  • The Image library role must be assigned to at least two hosts in the cluster, and the Image library must use shared storage.

  • Operating System Compatibility: All hosts in the VM HA-enabled cluster must run the same operating system version.

    • Mixed Ubuntu versions (22.04 and 24.04) in the same cluster are not supported.

    • VM evacuation between different OS versions will fail due to KVM version incompatibilities.

    • We recommend disabling VM HA before upgrading the operating system on any host of the cluster, then re-enabling it after the host(s) have been upgraded.

  • Read and understand the current VM HA known issues and limitations before configuring it at the cluster level.

Configure VM HA for a Cluster

VM HA configuration follows a two-step process to ensure cluster readiness:

Step 1: Create the Cluster

When you create a new cluster via Infrastructure > Clusters > Add Cluster, the VM High Availability toggle will appear but be disabled. You cannot enable VM HA during cluster creation.

Step 2: Enable VM HA After Adding Hosts

After creating the cluster:

  1. Add at least two hosts with the hypervisor role to the cluster.

  2. Verify that all hosts are assigned to the same host aggregate (if using host aggregates).

  3. Navigate to the cluster settings and enable the VM High Availability toggle.

NOTE

The VM HA toggle is disabled until the minimum requirements are met. Hovering over the disabled toggle displays the message: "At least two hosts are required to enable VM HA."

VM HA Enablement Rules

The system enforces these validation rules for VM HA:

  • Minimum Host Requirement: At least 2 hypervisor hosts are required in the cluster.

If fewer than two hosts are present, the toggle displays: "At least two hosts are required to enable VM HA."

  • Host Aggregate Consistency: All hosts must either belong to the same host aggregate or have no aggregate assignment.

If the hosts are part of different aggregates, the toggle displays the following message: "VM HA cannot be enabled because hosts belong to different host aggregates, which prevents cross-host migration."
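The two validation rules above can be expressed as a small check. This is a sketch only; `can_enable_vm_ha` and `aggregate_of` are hypothetical names, not the product's API, and the assumption that a mix of assigned and unassigned hosts counts as "different aggregates" is mine:

```python
def can_enable_vm_ha(hosts, aggregate_of):
    """Hypothetical sketch of the VM HA enablement checks.

    hosts: hypervisor host names in the cluster.
    aggregate_of: {host: aggregate_name or None}.
    Returns (ok, message), mirroring the toggle messages above.
    """
    if len(hosts) < 2:
        return False, "At least two hosts are required to enable VM HA."
    # All hosts must share one aggregate, or all must be unassigned (None).
    aggregates = {aggregate_of.get(h) for h in hosts}
    if len(aggregates) > 1:
        return (False, "VM HA cannot be enabled because hosts belong to "
                       "different host aggregates, which prevents cross-host migration.")
    return True, ""
```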

On the Clusters grid view, you will see one of the following VM HA statuses for each cluster:

  • Disabled: VM HA is disabled for the cluster.

  • Waiting: VM HA is enabled but remains inactive for this cluster because fewer than two hypervisor hosts have been assigned. To activate VM HA, assign at least two hosts with the hypervisor role to the cluster. The status will automatically change to Active once this requirement is met.

  • Active: VM HA is enabled and active for the cluster.
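The three grid statuses follow directly from the enablement flag and the host count. A minimal sketch, with an illustrative function name of my own choosing:

```python
def vm_ha_status(ha_enabled, hypervisor_host_count):
    """Derive the Clusters grid VM HA status from the states described above."""
    if not ha_enabled:
        return "Disabled"
    # Enabled but fewer than two hypervisor hosts: HA stays inactive.
    return "Active" if hypervisor_host_count >= 2 else "Waiting"
```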

Upgrading your Cluster Hosts

When upgrading your cluster hosts from one operating system version to another (for example, Ubuntu 22.04 to 24.04), follow the steps below:

  • Disable VM HA before starting the upgrade process.

  • Evacuate all VMs from the hosts that are scheduled for upgrade.

  • Upgrade hosts one at a time to the new operating system version.

  • Do not attempt to migrate VMs between upgraded and non-upgraded hosts.

  • Re-enable VM HA only after all hosts have been upgraded to the same operating system version.

How VM HA Works

Private Cloud Director Virtual Machine High Availability (VM HA) provides automated protection against host failures by continuously monitoring host health and, when needed, evacuating affected virtual machines (VMs) to healthy hosts with minimal disruption. Three cooperating services enable the automated workflow:

  1. High Availability Manager: Central coordinator that runs in the Private Cloud Director management plane. It watches for clusters with VM HA enabled, gathers health reports, confirms host failures, and issues evacuation requests.

  2. Host Agents: Daemons that run on every hypervisor host and probe their peers’ liveness and report findings to the High Availability Manager at regular intervals.

  3. VM Evacuation Service: Orchestrator that runs in the Private Cloud Director management plane. It receives confirmed host-down events from the High Availability Manager and evacuates each impacted VM to a suitable target host.

End-to-end Flow

  1. Cluster discovery: The High Availability Manager polls the infrastructure to discover clusters where VM HA is enabled. For VM HA-enabled clusters, it verifies that the pf9-ha-slave role is present on every hypervisor host. Clusters with fewer than two hosts are ignored until additional hosts join.

  2. Peer list distribution: Every 59 seconds, each Host Agent requests a peer list from the High Availability Manager. The list is a randomized subset of hosts in the same cluster, helping to spread traffic and avoid single‑point bias.

  3. Distributed health probing: Using the peer list, the host agent performs a liveness check on its peer hosts via the libvirt exporter. This check is entirely agent‑to‑agent, with no central polling path.

  4. Status aggregation & reporting: Every 127 seconds, the agent posts an aggregated view of peer health to the High Availability Manager.

  5. Failure detection and correlation: When the High Availability Manager receives a report that a host appears to be down, it cross-checks the host’s status across reports from other agents to avoid false positives while ensuring a rapid response to true outages. Once a host is confirmed down via peer agent reports, a 150‑second cooldown timer starts:

    • If the host recovers before the timer expires, the event is cleared, and no action is taken. This is to prevent routine reboots from triggering VM evacuations.

    • If the host remains down when the timer ends, High Availability Manager confirms the failure and emits a host-down notification to the VM Evacuation Service.

  6. Automated VM evacuation: Upon receiving the host‑down notification, the VM Evacuation Service verifies the host’s state, collects all resident VMs, and evacuates them one VM at a time to a healthy host in the cluster. Target selection honors existing placement constraints (such as aggregates or affinity rules).

  7. Continuous protection loop: After evacuation is complete, normal monitoring resumes. Host Agents continue probing both the recovered host (if it comes back online) and all remaining hosts, enabling VM HA to maintain an always-on protection loop with no manual intervention. Note that if a host that went offline comes back online, VM HA will not automatically migrate the VMs that were originally located on this host back to it. The Dynamic Resource Rebalancing (DRR) service will make VM-balancing decisions independently when resource contention occurs on any host.
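The key reasoning in step 5 — start a cooldown on the first down report, clear it if the host recovers, confirm the failure only if the host stays down for the full window — can be sketched as a small state machine. Everything here is illustrative: the class and method names are hypothetical, and only the 150-second default comes from the description above:

```python
import time

class FailureCorrelator:
    """Hypothetical sketch of the confirm-with-cooldown step (step 5)."""

    def __init__(self, cooldown_s=150, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.pending = {}  # host -> time the host was first reported down

    def report(self, host, is_down):
        """Process one aggregated peer report for a host.

        Returns 'cleared' if the host recovered (event dropped, no action),
        'waiting' while the cooldown timer runs, or 'confirm' once the host
        has stayed down for the full window (evacuation would be triggered).
        """
        now = self.clock()
        if not is_down:
            self.pending.pop(host, None)  # recovery before expiry clears the event
            return "cleared"
        first_seen = self.pending.setdefault(host, now)
        if now - first_seen >= self.cooldown_s:
            del self.pending[host]
            return "confirm"  # emit host-down notification to the evacuation service
        return "waiting"
```

A routine reboot shorter than the cooldown produces a "cleared" outcome and never reaches the evacuation service, which matches the intent described in step 5.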

VM HA Interoperation with Other Services

This section describes how VM HA interoperates with other services configured for your cluster.

Host Aggregates

VM HA honors host aggregates and migrates VMs to another host within the same host aggregate.

Consider the following limitations:

  • VM HA requires that all hosts in a cluster belong to the same host aggregate or have no aggregate assignment. Mixed aggregate configurations will prevent VM HA from being enabled.

  • All hosts from a host aggregate must belong to a single cluster and not span multiple clusters.

  • If a host aggregate contains only one host and that host goes down, VM HA cannot find a suitable migration target, and the affected VMs enter an error state.

DRR

DRR and Virtual Machine High Availability are designed to interoperate well together. A host failure event may occur while DRR is actively rebalancing VMs, either from the same host or from other hosts in the cluster. When this happens:

  1. VM HA will detect the host failure and initiate VM evacuations.

  2. The VM evacuations may result in cluster imbalance.

  3. DRR will then detect the imbalance during either the current run or the next run.

  4. DRR will redistribute the load across the cluster to address the imbalance.

VMs with Hard Affinity or Anti-Affinity Rules

  • Hard affinity: VM HA will fail to evacuate VMs with hard affinity, and VMs will land in Error state. This will be addressed in an upcoming release.

  • Soft affinity: VM HA identifies a host with sufficient capacity to host all VMs in the affinity group and migrates all VMs sequentially to the target host. If VM HA cannot find a suitable target host, VM HA will migrate VMs to available hosts, potentially violating the soft affinity policy.

  • Hard anti-affinity: VM HA will fail to evacuate VMs with hard anti-affinity, and VMs will land in Error state. This will be addressed in an upcoming release.

  • Soft anti-affinity: VM HA will attempt to find target hosts that satisfy the anti-affinity policy for each VM to be migrated, ensuring that all VMs part of the soft anti-affinity policy are placed on separate hosts. If VM HA cannot find a suitable target host, VM HA will evacuate the VMs to available hosts, potentially violating the soft anti-affinity policy.
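The soft anti-affinity behavior above — prefer a distinct host per VM in the group, fall back to any host with capacity if that is impossible — can be sketched as a placement routine. The names (`place_with_soft_anti_affinity`, `capacity_ok`) are hypothetical, and this greedy pass is a simplification of whatever the real scheduler does:

```python
def place_with_soft_anti_affinity(vms, candidate_hosts, capacity_ok):
    """Hypothetical sketch of soft anti-affinity target selection.

    vms: VMs in one anti-affinity group to evacuate.
    candidate_hosts: healthy hosts in the cluster.
    capacity_ok(host, vm) -> bool: assumed capacity check.
    Returns {vm: host}; a VM maps to None if no host has capacity.
    """
    placements, used = {}, set()
    for vm in vms:
        # Prefer a host not yet used by this group (policy satisfied).
        preferred = [h for h in candidate_hosts if h not in used and capacity_ok(h, vm)]
        # Fall back to any host with capacity, possibly violating the policy.
        fallback = [h for h in candidate_hosts if capacity_ok(h, vm)]
        target = (preferred or fallback or [None])[0]
        if target is None:
            placements[vm] = None  # no capacity anywhere: evacuation fails
            continue
        placements[vm] = target
        used.add(target)
    return placements
```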

NOTE

If VM evacuation fails in any of the scenarios listed above, the VM may still appear as active in the UI.

VM States

VMs in a suspended or paused state will not be evacuated to another host.

VMs with Special Properties

  1. Virtual TPM-enabled VMs: VM HA does not currently support live migration of VMs with Virtual TPM enabled. Support will be added in an upcoming release.

  2. VMs with hot-added CPU or memory: VM HA will evacuate VMs with hot-added CPU or memory.

  3. Resized VMs: VM HA will fail for a VM that has been resized but not yet confirmed.

Supported Scale

VM HA is currently supported for up to 100 hosts per region.

Known Issues and Limitations

  • If the down host serves both as a hypervisor host and a block storage host for a VM, evacuating that VM to another host may run into race conditions where volume and VM evacuations conflict, resulting in an error.

    • This issue will be fixed in the January 2026 release of Private Cloud Director.

    • The current recommended workaround is to avoid assigning both the hypervisor and block storage roles to hosts in a cluster with VM HA enabled. You can do this by assigning the persistent storage role to one or two dedicated hosts and keeping them separate from the hosts with the hypervisor role.

  • VM HA is presently not supported for clusters with GPUs enabled. This support will be added in a future release of Private Cloud Director.

  • VM HA relies on connectivity with the management plane. If network connectivity to the management plane is degraded, VM HA performance will be negatively affected. If the management plane experiences an outage, VM HA will not be operational for the duration of the outage.
