> For the complete documentation index, see [llms.txt](https://docs.platform9.com/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.platform9.com/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/troubleshoot-vm-ha.md).

# Troubleshoot VM HA

## Overview

This runbook helps you diagnose and recover from the most common VM High Availability (VM HA) failures in <code class="expression">space.vars.product\_name</code>. It covers five scenarios:

1. A host failed but VMs were not evacuated ("VM HA did not trigger").
2. Consul health prerequisites are not met, blocking failure detection (Self-Hosted only).
3. Shared or FC storage is not correctly reachable on all hosts, causing evacuation to fail.
4. Enabling or disabling VM HA returns an error.
5. VM HA needs to be re-validated after a host or management-plane upgrade.

For background on how VM HA works, the prerequisites it requires, and the services involved, see [Virtual Machine High Availability (VM HA)](/private-cloud-director/virtualized-clusters/virtualized-cluster/virtual-machine-high-availability-vm-ha.md). This runbook assumes you have already read that page.

In this guide, you will identify why VM HA did not trigger or is not functioning correctly, and restore it to a protected state.

## Prerequisites

* Access to the <code class="expression">space.vars.product\_name</code> UI or `pcdctl` CLI.
* SSH access to the hypervisor hosts in the cluster.
* For Self-Hosted deployments: access to the management cluster node and `kubectl` credentials for the region namespace.

***

## Diagnose "VM HA Did Not Trigger" <a href="#vm-ha-did-not-trigger" id="vm-ha-did-not-trigger"></a>

Use this section when a host failed but VMs were not evacuated to other hosts.

### Step 1: Confirm VM HA Is Enabled on the Cluster

In the <code class="expression">space.vars.product\_name</code> UI, navigate to **Infrastructure > Clusters** and open the affected cluster. Expand the **VM High Availability** card on the cluster details page. Confirm that:

* VM HA is toggled **on**.
* The cluster status shows **Protected**. If it shows **Degraded** or **Not Protected**, hover over the status indicator to see which prerequisite is not met.

If VM HA is off or the cluster is **Not Protected**, address the prerequisite failures shown in the UI before continuing.

### Step 2: Verify the Host Was Actually Detected as Down

VM HA will not evacuate VMs until the High Availability Manager has confirmed the host is down (after the 150-second cooldown). In the UI:

1. Navigate to **Infrastructure > Cluster Hosts** and find the affected host.
2. Confirm the host shows an **offline** or **failed** connection status. If the host shows as **online**, VM HA correctly did not evacuate — the host was never confirmed down from the management plane's perspective.
3. On the **Host Details** page, check the **VMHA Past Events** table for any evacuation events. If an event is present, the evacuation may have been attempted but failed; see [#diagnose-evacuation-failures](#diagnose-evacuation-failures "mention") below.

### Step 3: Check the pf9-ha-slave Role on All Hosts

The VM HA agent (installed via the `pf9-ha-slave` role) must be present on every hypervisor host in the cluster. A host missing this role does not participate in peer health probing, which can delay or prevent failure detection.

For each host in the cluster, SSH in and verify:

```bash
sudo apt list --installed 2>/dev/null | grep pf9-ha-slave
```

The package should be listed. If it is missing, the host has not received the role — re-authorize the host and ensure the `pf9-ha-slave` role is applied. From the UI, select the host under **Infrastructure > Cluster Hosts**, click **Edit Roles**, and confirm the HA slave role is assigned.

### Step 4: Verify the libvirt Exporter Is Running

The VM HA agent probes peer hosts using the libvirt exporter. If the libvirt exporter is not running on a host, that host cannot be probed and will appear healthy even if it is not.

On each host:

```bash
sudo systemctl status pf9-libvirt-exporter
```

The service should be `active (running)`. If it is stopped or failed, restart it:

```bash
sudo systemctl restart pf9-libvirt-exporter
```

Check for errors in the service journal:

```bash
sudo journalctl -u pf9-libvirt-exporter --since "1 hour ago" | tail -50
```

### Step 5: Inspect HA Agent Logs

The VM HA agent logs on each host are the primary source of failure information:

```bash
sudo tail -200 /var/log/pf9/ha-slave/ha-slave.log
```

Look for:

| Log pattern                           | Meaning                                                                       |
| ------------------------------------- | ----------------------------------------------------------------------------- |
| `peer list request failed`            | The agent could not reach the High Availability Manager to get its peer list. |
| `liveness check failed for host <IP>` | The agent detected a peer as down; this is normal if a host actually failed.  |
| `report submission failed`            | The agent could not post its health report to the High Availability Manager.  |
| `role not active`                     | The `pf9-ha-slave` role is not converged; re-sync the host.                   |

If the agent cannot reach the management plane, resolve connectivity first (see [Troubleshooting Offline Or Failed Hosts](/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/troubleshooting-offline-or-failed-hosts.md)).

### Step 6: Verify High Availability Manager Health

{% hint style="info" %}
**Self-Hosted deployments only**

The High Availability Manager runs as a pod in the region namespace of the management cluster. In SaaS deployments, Platform9 operates the High Availability Manager. If you suspect it is unhealthy, contact Platform9 Support.
{% endhint %}

From the management cluster node, list pods in the region namespace:

```bash
kubectl get pods -n <region-fqdn> | grep -i hamgr
```

The `hamgr` pod should be in `Running` state. If it is in `CrashLoopBackOff`, `Error`, or `Pending`, inspect its logs:

```bash
kubectl logs -n <region-fqdn> <hamgr-pod-name> --tail=100
```

Look for connection errors, database timeouts, or Consul-related failures. If the pod is crash-looping, escalate to Platform9 Support with the log output.

### Diagnose Evacuation Failures <a href="#diagnose-evacuation-failures" id="diagnose-evacuation-failures"></a>

If the **VMHA Past Events** table on the Host Details page shows an evacuation event with failed VMs:

1. Click **View Details** on the event to see per-VM evacuation status.
2. Note the fault message for each failed VM. Common fault messages:

   | Fault message                                        | Likely cause                                                                                                                                |
   | ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
   | `No valid host was found`                            | No destination host met placement constraints or had enough resources.                                                                      |
   | `Volume not found` or `connection to storage failed` | Shared storage was not reachable on the destination host. See [#validate-shared-and-fc-storage](#validate-shared-and-fc-storage "mention"). |
   | `KVM version mismatch`                               | Source and destination hosts run different OS versions. Mixed OS versions are not supported for evacuation.                                 |
   | `Flavor with host aggregate requirement not met`     | The VM's flavor pins it to a host aggregate with only one host.                                                                             |
3. For VMs that can be retried, click **Retry** on the evacuation detail page, or retry from the CLI:

   ```bash
   pcdctl server evacuate --host <DESTINATION_HOST_UUID> <VM_UUID>
   ```

For general VM ERROR-state recovery after evacuation failures, see [Recover VMs in ERROR State After Host Reboot or Patching](/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/recover-vms-in-error-state.md).

***

## Check Consul Health Prerequisites <a href="#consul-health" id="consul-health"></a>

Consul is used by the High Availability Manager for distributed coordination and service health detection. If Consul is unhealthy, the High Availability Manager cannot reliably detect host failures, and VM HA will not function correctly.

{% hint style="info" %}
**Self-Hosted deployments only**

Consul runs as pods inside the management cluster, so the checks and recovery steps in this entire section apply to Self-Hosted deployments only. In SaaS deployments, Platform9 operates the management plane (including Consul). If VM HA is not triggering and you suspect a management-plane problem, contact Platform9 Support.
{% endhint %}

### Identify Consul Symptoms

Symptoms of a Consul problem include:

* The `hamgr` pod logs show repeated `failed to connect to consul` or `consul agent not healthy` messages.
* VM HA events are not generated even though hosts are clearly down.
* The `hamgr` pod is restarting frequently.

### Verify Consul Health

From the management cluster node:

```bash
kubectl get pods -n <region-fqdn> | grep consul
```

All Consul pods should be in `Running` state. A quorum of Consul pods is required for the cluster to be healthy — if more than half the Consul pods are down, the Consul cluster loses quorum and all distributed locks and health checks fail.

Check Consul cluster health using the Consul CLI from inside one of the running Consul pods:

```bash
kubectl exec -n <region-fqdn> <consul-pod-name> -- consul members
```

All nodes should show `Status: alive`. Nodes in `Status: failed` or `Status: left` indicate that part of the Consul cluster is unavailable.

To check leader election status:

```bash
kubectl exec -n <region-fqdn> <consul-pod-name> -- consul operator raft list-peers
```

There must be exactly one leader. If no leader is elected, the cluster has lost quorum.

### Recover a Degraded Consul Cluster

If one Consul pod is down and the remaining pods still have quorum (a majority are running), the cluster is degraded but functional. Restart the failed pod:

```bash
kubectl delete pod -n <region-fqdn> <failed-consul-pod-name>
```

Kubernetes will automatically reschedule the pod. Wait for it to rejoin the Consul cluster (`consul members` shows it as `alive`).

If the Consul cluster has lost quorum (fewer than half the pods are running), recovery is more involved. Escalate to Platform9 Support immediately, as forcing a quorum recovery on a production Consul cluster carries risk.

After Consul is healthy, confirm the `hamgr` pod is also healthy and restart it if needed:

```bash
kubectl rollout restart deployment/hamgr -n <region-fqdn>
```

Wait for the pod to return to `Running` and confirm VM HA events resume.

***

## Validate Shared and FC Storage <a href="#validate-shared-and-fc-storage" id="validate-shared-and-fc-storage"></a>

VM HA requires that the storage backing each VM's root disk is reachable from the destination host, not just the source host. Evacuation will fail if a shared NFS mount is not mounted on the destination, or if an FC or iSCSI volume is not accessible on the destination's storage backend.

### Confirm Shared Storage Is Mounted on All Hosts

For NFS-backed ephemeral storage or image library storage, verify the mount is present on every host in the cluster:

```bash
mount | grep nfs
```

The expected NFS share should appear. If it is missing on any host, remount it and verify it persists across reboots by checking `/etc/fstab`.

For each host, also confirm the mount is writable:

```bash
touch /mnt/<shared-storage-path>/.ha_probe_$(hostname) && echo "writable" || echo "not writable"
```

Remove the probe file after confirming:

```bash
rm /mnt/<shared-storage-path>/.ha_probe_$(hostname)
```

### Confirm Block Storage Volumes Are Accessible on Destination Hosts

For VMs using block storage volumes (iSCSI, FC, or other backends), the volume must be attachable to a host other than the one it is currently attached to. Verify:

1. List the volume and confirm its type and backend:

   ```bash
   pcdctl volume show <VOLUME_UUID>
   ```

   Check the `volume_type` field. Confirm that at least one other host in the cluster has the same backend configured.
2. For FC (Fibre Channel) volumes specifically, ensure the destination host has:

   * A physical FC HBA (Host Bus Adapter) installed and enabled.
   * Zoning that allows the destination host to see the same storage array LUNs as the source host.

   On the destination host, scan for visible FC targets:

   ```bash
   sudo rescan-scsi-bus.sh
   lsscsi | grep -i fc
   ```

   If the LUN is not visible, contact your storage administrator to verify FC zoning includes the destination host.
3. Verify that the Persistent Storage Service role is applied to at least two hosts in the cluster:

   ```bash
   pcdctl volume service list
   ```

   All endpoints should show `enabled` and `up`. If only one endpoint is enabled, VM HA may have no valid storage backend to use on the destination host, causing evacuation to fail for block-volume VMs.

### Confirm the "This Is Shared Storage" Toggle for Ephemeral Storage

If your cluster uses ephemeral storage (VMs without block volume root disks), ephemeral shared storage must be enabled on the cluster blueprint. In the UI:

1. Navigate to **Infrastructure > Clusters** and open the cluster.
2. Open the **Cluster Blueprint** and expand **Customize Cluster Defaults**.
3. Confirm the **This is Shared Storage** toggle is enabled.

If the toggle is off, VM HA cannot evacuate VMs that use ephemeral root disks. See [Ephemeral Shared Storage](/private-cloud-director/storage/ephemeral-storage.md#ephemeral-shared-storage) for setup details.

***

## Diagnose Enable and Disable VM HA Failures <a href="#enable-disable-failures" id="enable-disable-failures"></a>

### 404 or 503 Error When Enabling VM HA

If the UI shows an error (for example, a 404 or 503) when you toggle VM HA on or off, the High Availability Manager is not reachable or is reporting an internal error.

**Verify prerequisites first:**

Before trying to enable VM HA, confirm the following are true. The toggle will silently fail or return an error if any prerequisite is not met:

* At least two hosts with the Hypervisor role are in the cluster and both are **online**.
* All hosts in the cluster belong to the same host aggregate, or none belong to any aggregate. Mixed-aggregate clusters cannot enable VM HA.
* The `pf9-ha-slave` role is applied and **converged** on all hosts (not just assigned — the role status must be `applied`, not `converging`).
* Consul is healthy (see [#consul-health](#consul-health "mention")).

**Check the High Availability Manager directly:**

{% hint style="info" %}
**Self-Hosted deployments only**

In SaaS deployments, contact Platform9 Support if you receive a persistent error when toggling VM HA.
{% endhint %}

From the management cluster node:

```bash
kubectl get pods -n <region-fqdn> | grep hamgr
```

If the `hamgr` pod is not `Running`, inspect its logs and restart it:

```bash
kubectl logs -n <region-fqdn> <hamgr-pod-name> --tail=100
kubectl rollout restart deployment/hamgr -n <region-fqdn>
```

After the pod is running, retry the toggle.

### VM HA Toggle Appears Enabled but Status Is "Not Protected"

After enabling VM HA, if the cluster immediately shows **Not Protected** or **Degraded**, hover over the VM HA status indicator in the UI to see which specific prerequisite is failing. Common causes:

| Reported prerequisite failure               | Action                                                                                                                                                                                 |
| ------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Fewer than two hypervisor hosts             | Add a second host to the cluster.                                                                                                                                                      |
| Persistent Storage Service not redundant    | Add the Persistent Storage Service role to a second host.                                                                                                                              |
| Image Library Service not on shared storage | Configure the Image Library Service to use shared storage; see [Image Library High Availability](/private-cloud-director/images-and-image-library/image-library-high-availability.md). |
| `pf9-ha-slave` role not applied             | Re-sync hosts: navigate to the host, click **Other > Re-sync Host**.                                                                                                                   |

### VM HA Remains Enabled but HA Manager Reports No Active Clusters

If VM HA is enabled but the High Availability Manager logs show no HA-enabled clusters being discovered, the management plane may have a stale cluster state. In the UI, disable VM HA on the cluster and re-enable it to trigger a fresh discovery cycle. If the problem persists in a Self-Hosted deployment, check the `hamgr` pod logs for database connectivity errors.

***

## Validate VM HA After an Upgrade <a href="#post-upgrade-validation" id="post-upgrade-validation"></a>

After upgrading host packages or the management plane, run the following checks before relying on VM HA for workload protection.

{% hint style="warning" %}
**Re-enable VM HA only after all hosts are on the same OS version.** If you disabled VM HA before a host OS upgrade, do not re-enable it while any hosts in the cluster are still running the old OS version. VM evacuation between hosts with different KVM versions will fail. See [Host Upgrade Runbook](/private-cloud-director/upgrade/host-upgrade-runbook.md) for the full upgrade sequence, and [Post-Upgrade Verification](/private-cloud-director/upgrade/post-upgrade-verification.md#re-enable-vm-ha-and-drr) for the re-enablement steps.
{% endhint %}

### Check 1: Confirm pf9-ha-slave Is Re-Applied on All Hosts

After a host OS upgrade, the `pf9-ha-slave` role must be re-applied. Verify on each host:

```bash
sudo apt list --installed 2>/dev/null | grep pf9-ha-slave
sudo systemctl status pf9-ha-slave
```

The service should be `active (running)`. If it is not present or stopped, re-sync the host from the UI (**Other > Re-sync Host**) to re-apply the role, then confirm the service starts.

### Check 2: Confirm the libvirt Exporter Is Running on All Hosts

```bash
sudo systemctl status pf9-libvirt-exporter
```

If stopped, restart it:

```bash
sudo systemctl restart pf9-libvirt-exporter
```

### Check 3: Verify High Availability Manager Health

{% hint style="info" %}
**Self-Hosted deployments only**

After a management-plane upgrade, confirm the `hamgr` pod is running and healthy. In SaaS deployments, Platform9 performs management-plane upgrades; confirm VM HA status from the UI after Platform9 notifies you that the upgrade is complete.
{% endhint %}

```bash
kubectl get pods -n <region-fqdn> | grep hamgr
```

If the pod is not `Running`, inspect its logs and restart the deployment:

```bash
kubectl logs -n <region-fqdn> <hamgr-pod-name> --tail=100
kubectl rollout restart deployment/hamgr -n <region-fqdn>
```

### Check 4: Re-Enable VM HA and Confirm "Protected" Status

Once all hosts are on the same OS version and all services are healthy:

1. Navigate to **Infrastructure > Clusters**.
2. Select the cluster.
3. Toggle **VM High Availability** to on.
4. Confirm the cluster transitions to **Protected** status within one to two minutes.

If the cluster shows **Degraded** or **Not Protected** after re-enabling, hover over the status indicator and address each failing prerequisite. See [#enable-disable-failures](#enable-disable-failures "mention") for a reference table of common prerequisite failures.

### Check 5: Perform a Safe Validation Test

To confirm VM HA is functional without inducing a real failure, you can use maintenance mode as a proxy test:

1. Place a non-critical host into **Maintenance Mode** (see [Maintenance Mode](/private-cloud-director/virtualized-clusters/add-hosts-virtualized-cluster/maintenance-mode.md)). This live-migrates its VMs to other hosts in the cluster.
2. Confirm the VMs migrate successfully and return to **ACTIVE** state.
3. Disable Maintenance Mode to bring the host back into service.

This validates that live migration works correctly on the current OS version and storage configuration. VM HA evacuation uses the same migration path, so a successful maintenance-mode drain confirms the underlying machinery is working.

{% hint style="info" %}
Maintenance mode uses **live migration**, whereas VM HA uses **evacuation** when a host is unresponsive. Live migration requires the source host to be online; evacuation does not. The maintenance-mode test validates the migration path but does not exercise the failure detection path. If you want to validate the full VM HA detection path, contact Platform9 Support for guidance on scheduling a controlled failover test.
{% endhint %}

***

## Related Pages

* [Virtual Machine High Availability (VM HA)](/private-cloud-director/virtualized-clusters/virtualized-cluster/virtual-machine-high-availability-vm-ha.md)
* [Recover VMs in ERROR State After Host Reboot or Patching](/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/recover-vms-in-error-state.md)
* [Troubleshooting Offline or Failed Hosts](/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/troubleshooting-offline-or-failed-hosts.md)
* [Maintenance Mode](/private-cloud-director/virtualized-clusters/add-hosts-virtualized-cluster/maintenance-mode.md)
* [Block Storage High Availability](/private-cloud-director/storage/block-storage/block-storage-high-availability.md)
* [Host Upgrade Runbook](/private-cloud-director/upgrade/host-upgrade-runbook.md)
* [Post-Upgrade Verification](/private-cloud-director/upgrade/post-upgrade-verification.md)


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.platform9.com/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/troubleshoot-vm-ha.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
