Troubleshoot VM HA

Overview

This runbook helps you diagnose and recover from the most common VM High Availability (VM HA) failures in Private Cloud Director. It covers five scenarios:

  1. A host failed but VMs were not evacuated ("VM HA did not trigger").

  2. Consul health prerequisites are not met, blocking failure detection (Self-Hosted only).

  3. Shared or FC storage is not correctly reachable on all hosts, causing evacuation to fail.

  4. Enabling or disabling VM HA returns an error.

  5. VM HA needs to be re-validated after a host or management-plane upgrade.

For background on how VM HA works, the prerequisites it requires, and the services involved, see Virtual Machine High Availability (VM HA). This runbook assumes you have already read that page.

In this guide, you will identify why VM HA did not trigger or is not functioning correctly, and restore it to a protected state.

Prerequisites

  • Access to the Private Cloud Director UI or pcdctl CLI.

  • SSH access to the hypervisor hosts in the cluster.

  • For Self-Hosted deployments: access to the management cluster node and kubectl credentials for the region namespace.


Diagnose "VM HA Did Not Trigger"

Use this section when a host failed but VMs were not evacuated to other hosts.

Step 1: Confirm VM HA Is Enabled on the Cluster

In the Private Cloud Director UI, navigate to Infrastructure > Clusters and open the affected cluster. Expand the VM High Availability card on the cluster details page. Confirm that:

  • VM HA is toggled on.

  • The cluster status shows Protected. If it shows Degraded or Not Protected, hover over the status indicator to see which prerequisite is not met.

If VM HA is off or the cluster is Not Protected, address the prerequisite failures shown in the UI before continuing.

Step 2: Verify the Host Was Actually Detected as Down

VM HA will not evacuate VMs until the High Availability Manager has confirmed the host is down (after the 150-second cooldown). In the UI:

  1. Navigate to Infrastructure > Cluster Hosts and find the affected host.

  2. Confirm the host shows an offline or failed connection status. If the host shows as online, VM HA correctly did not evacuate — the host was never confirmed down from the management plane's perspective.

  3. On the Host Details page, check the VMHA Past Events table for any evacuation events. If an event is present, the evacuation may have been attempted but failed; see Diagnose Evacuation Failures below.

Step 3: Check the pf9-ha-slave Role on All Hosts

The VM HA agent (installed via the pf9-ha-slave role) must be present on every hypervisor host in the cluster. A host missing this role does not participate in peer health probing, which can delay or prevent failure detection.

For each host in the cluster, SSH in and verify:

The package should be listed. If it is missing, the host has not received the role — re-authorize the host and ensure the pf9-ha-slave role is applied. From the UI, select the host under Infrastructure > Cluster Hosts, click Edit Roles, and confirm the HA slave role is assigned.

Step 4: Verify the libvirt Exporter Is Running

The VM HA agent probes peer hosts using the libvirt exporter. If the libvirt exporter is not running on a host, that host cannot be probed and will appear healthy even if it is not.

On each host:

The service should be active (running). If it is stopped or failed, restart it:

Check for errors in the service journal:

Step 5: Inspect HA Agent Logs

The VM HA agent logs on each host are the primary source of failure information:

Look for:

Log pattern                            Meaning
peer list request failed               The agent could not reach the High Availability Manager to get its peer list.
liveness check failed for host <IP>    The agent detected a peer as down; this is normal if a host actually failed.
report submission failed               The agent could not post its health report to the High Availability Manager.
role not active                        The pf9-ha-slave role is not converged; re-sync the host.

If the agent cannot reach the management plane, resolve connectivity first (see Troubleshooting Offline Or Failed Hosts).
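When a log is long, the patterns listed in this step can be counted mechanically. The following is a minimal Python sketch, not a Platform9 tool. It assumes you have already copied the agent log lines into a list of strings; the function name classify_ha_log_lines and the sample lines are hypothetical, and real log lines may be formatted differently.

```python
from collections import Counter

# Patterns and meanings copied from the list in this step.
HA_LOG_PATTERNS = {
    "peer list request failed": "agent could not get its peer list from the High Availability Manager",
    "liveness check failed for host": "agent detected a peer as down",
    "report submission failed": "agent could not post its health report",
    "role not active": "pf9-ha-slave role is not converged; re-sync the host",
}


def classify_ha_log_lines(lines):
    """Count how many lines contain each known pattern (case-insensitive)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for pattern in HA_LOG_PATTERNS:
            if pattern in lowered:
                counts[pattern] += 1
    return counts


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        "ERROR peer list request failed: timeout",
        "WARN liveness check failed for host 10.0.0.12",
        "INFO heartbeat ok",
    ]
    for pattern, n in classify_ha_log_lines(sample).items():
        print(f"{n:3d}  {pattern}: {HA_LOG_PATTERNS[pattern]}")
```

A high count for peer list request failed or report submission failed points at connectivity to the management plane rather than at a peer host.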

Step 6: Verify High Availability Manager Health

Self-Hosted deployments only

The High Availability Manager runs as a pod in the region namespace of the management cluster. In SaaS deployments, Platform9 operates the High Availability Manager. If you suspect it is unhealthy, contact Platform9 Support.

From the management cluster node, list pods in the region namespace:

The hamgr pod should be in Running state. If it is in CrashLoopBackOff, Error, or Pending, inspect its logs:

Look for connection errors, database timeouts, or Consul-related failures. If the pod is crash-looping, escalate to Platform9 Support with the log output.
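If the namespace contains many pods, the listing can be filtered for the ones that need attention. This sketch assumes the default table output of kubectl get pods (columns NAME, READY, STATUS, RESTARTS, AGE) saved as text, and that the pod name contains hamgr. The function name unhealthy_pods is hypothetical.

```python
def unhealthy_pods(kubectl_output, name_filter="hamgr"):
    """Return (name, status) for matching pods whose STATUS is not Running."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    bad = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= max(name_col, status_col):
            continue  # skip malformed rows
        name, status = fields[name_col], fields[status_col]
        if name_filter in name and status != "Running":
            bad.append((name, status))
    return bad
```

Passing name_filter="consul" applies the same check to the Consul pods, assuming their names contain that string.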

Diagnose Evacuation Failures

If the VMHA Past Events table on the Host Details page shows an evacuation event with failed VMs:

  1. Click View Details on the event to see per-VM evacuation status.

  2. Note the fault message for each failed VM. Common fault messages:

    Fault message
    Likely cause

    No valid host was found

    No destination host met placement constraints or had enough resources.

    Volume not found or connection to storage failed

    Shared storage was not reachable on the destination host. See Validate Shared and FC Storage.

    KVM version mismatch

    Source and destination hosts run different OS versions. Mixed OS versions are not supported for evacuation.

    Flavor with host aggregate requirement not met

    The VM's flavor pins it to a host aggregate with only one host.

  3. For VMs that can be retried, click Retry on the evacuation detail page, or retry from the CLI:

For general VM ERROR-state recovery after evacuation failures, see Recover VMs in ERROR State After Host Reboot or Patching.


Check Consul Health Prerequisites

Consul is used by the High Availability Manager for distributed coordination and service health detection. If Consul is unhealthy, the High Availability Manager cannot reliably detect host failures, and VM HA will not function correctly.

Self-Hosted deployments only

Consul runs as pods inside the management cluster, so the checks and recovery steps in this entire section apply to Self-Hosted deployments only. In SaaS deployments, Platform9 operates the management plane (including Consul). If VM HA is not triggering and you suspect a management-plane problem, contact Platform9 Support.

Identify Consul Symptoms

Symptoms of a Consul problem include:

  • The hamgr pod logs show repeated failed to connect to consul or consul agent not healthy messages.

  • VM HA events are not generated even though hosts are clearly down.

  • The hamgr pod is restarting frequently.

Verify Consul Health

From the management cluster node:

All Consul pods should be in Running state. Consul needs a quorum, that is, a strict majority of its server pods, to stay healthy. If half or more of the Consul pods are down, the Consul cluster loses quorum and all distributed locks and health checks fail.
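The quorum rule is simple arithmetic: a cluster of N Consul servers needs floor(N/2) + 1 of them running. The sketch below shows that calculation. The cluster sizes of 3 and 5 are illustrative only, because this runbook does not state how many Consul pods your deployment runs.

```python
def quorum_size(total_servers):
    """Smallest number of servers that forms a strict majority."""
    return total_servers // 2 + 1


def has_quorum(total_servers, running_servers):
    """True if the running servers still form a majority."""
    return running_servers >= quorum_size(total_servers)


if __name__ == "__main__":
    for total in (3, 5):
        needed = quorum_size(total)
        print(f"{total} servers: quorum={needed}, tolerated failures={total - needed}")
```

Under this rule, a three-pod cluster tolerates one failed pod and a five-pod cluster tolerates two.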

Check Consul cluster health using the Consul CLI from inside one of the running Consul pods:

All nodes should show Status: alive. Nodes in Status: failed or Status: left indicate that part of the Consul cluster is unavailable.

To check leader election status:

There must be exactly one leader. If no leader is elected, the cluster has lost quorum.

Recover a Degraded Consul Cluster

If one Consul pod is down and the remaining pods still have quorum (a majority are running), the cluster is degraded but functional. Restart the failed pod:

Kubernetes will automatically reschedule the pod. Wait for it to rejoin the Consul cluster (consul members shows it as alive).

If the Consul cluster has lost quorum (a majority of the pods is no longer running), recovery is more involved. Escalate to Platform9 Support immediately, because forcing a quorum recovery on a production Consul cluster carries risk.

After Consul is healthy, confirm the hamgr pod is also healthy and restart it if needed:

Wait for the pod to return to Running and confirm VM HA events resume.


Validate Shared and FC Storage

VM HA requires that the storage backing each VM's root disk is reachable from the destination host, not just the source host. Evacuation will fail if a shared NFS mount is not mounted on the destination, or if an FC or iSCSI volume is not accessible on the destination's storage backend.

Confirm Shared Storage Is Mounted on All Hosts

For NFS-backed ephemeral storage or image library storage, verify the mount is present on every host in the cluster:

The expected NFS share should appear. If it is missing on any host, remount it and verify it persists across reboots by checking /etc/fstab.
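On Linux, /proc/mounts lists what is mounted now and /etc/fstab lists what is mounted at boot. Both use the same first three fields: source, mount point, and filesystem type. The sketch below compares the two for one share. It is an illustration under those assumptions; the function names are hypothetical, and mounts managed by autofs or systemd mount units would not appear in /etc/fstab.

```python
def nfs_entries(table_text):
    """Return {mount_point: source} for NFS rows in /proc/mounts- or fstab-style text."""
    entries = {}
    for line in table_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and fstab comments
        fields = line.split()
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            entries[fields[1]] = fields[0]
    return entries


def check_nfs_share(share, mounts_text, fstab_text):
    """Report whether `share` is mounted now and whether fstab will remount it."""
    return {
        "mounted": share in nfs_entries(mounts_text).values(),
        "persistent": share in nfs_entries(fstab_text).values(),
    }
```

On a host, pass the contents of /proc/mounts and /etc/fstab as the two text arguments, and the share in server:/export form exactly as it appears in those files. A result of mounted True and persistent False means the share will be missing after the next reboot.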

For each host, also confirm the mount is writable:

Remove the probe file after confirming:
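The write check and its cleanup can also be scripted so that no probe file is left behind. This is a minimal sketch; the function name mount_is_writable and the probe file prefix are hypothetical.

```python
import os
import tempfile


def mount_is_writable(mount_point):
    """Create, sync, and delete a small probe file; True only if the write succeeds."""
    try:
        fd, probe_path = tempfile.mkstemp(prefix=".vmha-probe-", dir=mount_point)
    except OSError:
        return False  # directory missing, read-only, or permission denied
    try:
        os.write(fd, b"probe")
        os.fsync(fd)  # force the write out to the backing storage
        return True
    except OSError:
        return False
    finally:
        os.close(fd)
        os.remove(probe_path)
```

Run it as the same user that the hypervisor services use, because a mount can be writable for root but not for the service account.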

Confirm Block Storage Volumes Are Accessible on Destination Hosts

For VMs using block storage volumes (iSCSI, FC, or other backends), the volume must be attachable to a host other than the one it is currently attached to. Verify:

  1. List the volume and confirm its type and backend:

    Check the volume_type field. Confirm that at least one other host in the cluster has the same backend configured.

  2. For FC (Fibre Channel) volumes specifically, ensure the destination host has:

    • A physical FC HBA (Host Bus Adapter) installed and enabled.

    • Zoning that allows the destination host to see the same storage array LUNs as the source host.

    On the destination host, scan for visible FC targets:

    If the LUN is not visible, contact your storage administrator to verify FC zoning includes the destination host.

  3. Verify that the Persistent Storage Service role is applied to at least two hosts in the cluster:

    All endpoints should show enabled and up. If only one endpoint is enabled, VM HA may have no valid storage backend to use on the destination host, causing evacuation to fail for block-volume VMs.
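The redundancy check in step 3 can be automated once the service listing is available as JSON. This sketch makes several assumptions that are not stated in this runbook: that the OpenStack CLI is available in your deployment, that the listing comes from openstack volume service list -f json with the keys Binary, Host, Status, and State, and that the Persistent Storage Service appears with the binary name cinder-volume. The function names are hypothetical.

```python
import json


def healthy_volume_hosts(service_list_json, binary="cinder-volume"):
    """Return sorted host names with at least one enabled and up volume service."""
    hosts = set()
    for svc in json.loads(service_list_json):
        if (svc.get("Binary") == binary
                and svc.get("Status") == "enabled"
                and svc.get("State") == "up"):
            # Volume service hosts are often reported as "host@backend".
            hosts.add(svc.get("Host", "").split("@")[0])
    return sorted(hosts)


def storage_is_redundant(service_list_json, minimum=2):
    """True if at least `minimum` distinct hosts run a healthy volume service."""
    return len(healthy_volume_hosts(service_list_json)) >= minimum
```

Counting distinct hosts rather than rows avoids treating two backends on the same host as redundancy.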

Confirm the "This Is Shared Storage" Toggle for Ephemeral Storage

If your cluster uses ephemeral storage (VMs without block volume root disks), ephemeral shared storage must be enabled on the cluster blueprint. In the UI:

  1. Navigate to Infrastructure > Clusters and open the cluster.

  2. Open the Cluster Blueprint and expand Customize Cluster Defaults.

  3. Confirm the This is Shared Storage toggle is enabled.

If the toggle is off, VM HA cannot evacuate VMs that use ephemeral root disks. See Ephemeral Shared Storage for setup details.


Diagnose Enable and Disable VM HA Failures

404 or 503 Error When Enabling VM HA

If the UI shows an error (for example, a 404 or 503) when you toggle VM HA on or off, the High Availability Manager is not reachable or is reporting an internal error.

Verify prerequisites first:

Before trying to enable VM HA, confirm the following are true. The toggle will silently fail or return an error if any prerequisite is not met:

  • At least two hosts with the Hypervisor role are in the cluster and both are online.

  • All hosts in the cluster belong to the same host aggregate, or none belong to any aggregate. Mixed-aggregate clusters cannot enable VM HA.

  • The pf9-ha-slave role is applied and converged on all hosts. Assignment alone is not enough: the role status must be applied, not converging.

  • Consul is healthy (see Check Consul Health Prerequisites).
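The prerequisites above can be expressed as a small pre-flight check. The following sketch is illustrative only. The Host record and its fields are hypothetical, you would populate them from whatever inventory you have, and it simplifies by assuming each host belongs to at most one aggregate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    online: bool
    aggregate: Optional[str]  # None means the host is in no aggregate
    ha_role_status: str       # for example "applied" or "converging"


def failed_prerequisites(hosts: List[Host], consul_healthy: bool = True) -> List[str]:
    """Return a description of every prerequisite that is not met."""
    failures = []
    if sum(1 for h in hosts if h.online) < 2:
        failures.append("fewer than two online hypervisor hosts")
    if len({h.aggregate for h in hosts}) > 1:
        failures.append("hosts are spread across different host aggregates")
    not_applied = [h.name for h in hosts if h.ha_role_status != "applied"]
    if not_applied:
        failures.append("pf9-ha-slave not applied on: " + ", ".join(not_applied))
    if not consul_healthy:
        failures.append("Consul is not healthy")
    return failures
```

An empty list means all four listed prerequisites hold for the data you supplied; it does not replace the status shown in the UI.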

Check the High Availability Manager directly:

Self-Hosted deployments only

In SaaS deployments, contact Platform9 Support if you receive a persistent error when toggling VM HA.

From the management cluster node:

If the hamgr pod is not Running, inspect its logs and restart it:

After the pod is running, retry the toggle.

VM HA Toggle Appears Enabled but Status Is "Not Protected"

After enabling VM HA, if the cluster immediately shows Not Protected or Degraded, hover over the VM HA status indicator in the UI to see which specific prerequisite is failing. Common causes:

Reported prerequisite failure                  Action
Fewer than two hypervisor hosts                Add a second host to the cluster.
Persistent Storage Service not redundant       Add the Persistent Storage Service role to a second host.
Image Library Service not on shared storage    Configure the Image Library Service to use shared storage; see Image Library High Availability.
pf9-ha-slave role not applied                  Re-sync hosts: navigate to the host, click Other > Re-sync Host.

VM HA Remains Enabled but HA Manager Reports No Active Clusters

If VM HA is enabled but the High Availability Manager logs show no HA-enabled clusters being discovered, the management plane may have a stale cluster state. In the UI, disable VM HA on the cluster and re-enable it to trigger a fresh discovery cycle. If the problem persists in a Self-Hosted deployment, check the hamgr pod logs for database connectivity errors.


Validate VM HA After an Upgrade

After upgrading host packages or the management plane, run the following checks before relying on VM HA for workload protection.

Check 1: Confirm pf9-ha-slave Is Re-Applied on All Hosts

After a host OS upgrade, the pf9-ha-slave role must be re-applied. Verify on each host:

The service should be active (running). If it is not present or stopped, re-sync the host from the UI (Other > Re-sync Host) to re-apply the role, then confirm the service starts.

Check 2: Confirm the libvirt Exporter Is Running on All Hosts

If stopped, restart it:

Check 3: Verify High Availability Manager Health

Self-Hosted deployments only

After a management-plane upgrade, confirm the hamgr pod is running and healthy. In SaaS deployments, Platform9 performs management-plane upgrades; confirm VM HA status from the UI after Platform9 notifies you that the upgrade is complete.

If the pod is not Running, inspect its logs and restart the deployment:

Check 4: Re-Enable VM HA and Confirm "Protected" Status

Once all hosts are on the same OS version and all services are healthy:

  1. Navigate to Infrastructure > Clusters.

  2. Select the cluster.

  3. Toggle VM High Availability to on.

  4. Confirm the cluster transitions to Protected status within one to two minutes.

If the cluster shows Degraded or Not Protected after re-enabling, hover over the status indicator and address each failing prerequisite. See Diagnose Enable and Disable VM HA Failures for a reference table of common prerequisite failures.
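If you script this check, poll for the status with a timeout instead of reading it once, because the transition can take one to two minutes. The sketch below shows the polling pattern only. The get_status callable is hypothetical: this runbook does not describe an API for reading the cluster status, so you would supply your own way of fetching it.

```python
import time


def wait_for_status(get_status, wanted="Protected", timeout_s=120,
                    interval_s=10, sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns `wanted` or the timeout expires.

    Returns (reached, last_status).
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == wanted:
            return True, status
        if clock() >= deadline:
            return False, status
        sleep(interval_s)
```

The sleep and clock parameters exist so the loop can be tested without waiting. If the function returns False, the last status it saw (Degraded or Not Protected) tells you which state the cluster is stuck in.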

Check 5: Perform a Safe Validation Test

To confirm VM HA is functional without inducing a real failure, you can use maintenance mode as a proxy test:

  1. Place a non-critical host into Maintenance Mode (see Maintenance Mode). This live-migrates its VMs to other hosts in the cluster.

  2. Confirm the VMs migrate successfully and return to ACTIVE state.

  3. Disable Maintenance Mode to bring the host back into service.

This validates that VMs can be placed and run on the remaining hosts with the current OS version and storage configuration. VM HA evacuation depends on the same destination capacity and shared storage access, so a successful maintenance-mode drain confirms those prerequisites, although it does not exercise evacuation itself.

Maintenance mode uses live migration, whereas VM HA uses evacuation when a host is unresponsive. Live migration requires the source host to be online; evacuation does not. The maintenance-mode test validates the migration path but does not exercise the failure detection path. If you want to validate the full VM HA detection path, contact Platform9 Support for guidance on scheduling a controlled failover test.

