Recover VMs in ERROR State After Host Reboot or Patching

Overview

After a hypervisor host is rebooted or patched, virtual machines that were running on that host may land in ERROR state instead of returning to ACTIVE. This guide explains why that happens, how to safely diagnose the cause, and how to bring VMs back into service — including when to attempt a hard reboot, when to use evacuation, and when to escalate.

In this guide, you will restore ERROR-state VMs to a running state following a host reboot or maintenance event.

Why VMs Land in ERROR After a Host Reboot

When a host is rebooted outside of Maintenance Mode, running VMs are not migrated first. The Compute Service treats an unexpected host disappearance as a failure. Several conditions lead to VMs landing in ERROR:

  • The host took too long to come back, and the Compute Service timed out while the VM was still in a pending state.

  • The VM's state on disk was inconsistent at the time of the reboot (for example, a write was in progress).

  • The Compute Service on the host did not restart cleanly after the reboot, so it reported VMs as failed during reconciliation.

  • VM HA attempted to evacuate the VM to another host but the evacuation failed (for example, the destination host had insufficient resources or the storage was unavailable at that moment), leaving the VM in ERROR.

Prerequisites

  • You can reach the Private Cloud Director UI or have pcdctl access.

  • The hypervisor host is back online and shows as Active in the UI under Infrastructure > Cluster Hosts.

  • Storage (ephemeral shared storage or block storage volumes) is accessible.

Diagnose the ERROR State

Step 1: Identify Affected VMs

List all VMs currently in ERROR state. Use the Private Cloud Director UI or the CLI:
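The CLI form of this step can be sketched as follows. This is a minimal example, assuming the standard `openstack` client is installed and admin credentials are loaded in your shell; the function name `list_error_vms` is illustrative and not part of the product.

```shell
# List every VM in ERROR state across all projects.
# ASSUMPTION: standard OpenStack client with admin credentials in the environment.
list_error_vms() {
  openstack server list --all-projects --status ERROR -c ID -c Name -c Status
}

# Example: list_error_vms
```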

For each ERROR-state VM, retrieve its details and fault message:
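One way to retrieve the details is sketched below, again assuming the standard `openstack` client. The helper name `show_vm_fault` is hypothetical; the `fault` column is normally only populated while the VM is in ERROR.

```shell
# Print the identity, status, and fault for one VM.
# $1 is the VM name or UUID.
show_vm_fault() {
  openstack server show "$1" -c id -c name -c status -c fault
}

# Example: show_vm_fault <vm-uuid>
```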

Look for the fault field in the output. Common fault messages and their meanings:

| Fault message | Likely cause |
| --- | --- |
| No valid host was found | Evacuation found no suitable destination host |
| Build of instance ... aborted: Instance failed to spawn | Compute Service on the host could not start the VM |
| Connection to libvirt failed | libvirtd was not running when the Compute Service tried to reconcile |
| Exceeded maximum number of retries | Repeated provisioning attempts all failed |
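If you are triaging many VMs, the fault-to-cause mapping listed above can be scripted. The helper below is only an illustrative sketch: the function name `likely_cause` and the fallback text are assumptions, and the matching is a simple substring check on the fault message.

```shell
# Map a fault message to the likely cause listed above.
# $1 is the fault message text copied from the VM details.
likely_cause() {
  case "$1" in
    *"No valid host was found"*)
      echo "Evacuation found no suitable destination host" ;;
    *"Instance failed to spawn"*)
      echo "Compute Service on the host could not start the VM" ;;
    *"Connection to libvirt failed"*)
      echo "libvirtd was not running when the Compute Service tried to reconcile" ;;
    *"Exceeded maximum number of retries"*)
      echo "Repeated provisioning attempts all failed" ;;
    *)
      # Fallback wording is illustrative, not from the product documentation.
      echo "Unrecognized fault; review the Compute Service log" ;;
  esac
}

# Example: likely_cause "Connection to libvirt failed"
```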

Step 2: Check the Host Is Healthy

Before touching the VM, confirm the host is fully recovered:
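A minimal sketch of the host check, assuming the standard `openstack` client; the helper name `check_compute_service` is illustrative. The output includes the State and Status columns for the host's nova-compute service.

```shell
# Show the nova-compute service record for one host.
# $1 is the hypervisor hostname as registered with the Compute Service.
check_compute_service() {
  openstack compute service list --service nova-compute --host "$1"
}

# Example: check_compute_service <hostname>
```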

The host's nova-compute entry should show state: up and status: enabled. If it shows state: down, the Compute Service on that host has not recovered yet. Address the host first — see Troubleshoot libvirt and Compute Service Failures and Troubleshooting Offline or Failed Hosts.

Step 3: Review the Compute Service Log on the Host

Log in to the hypervisor and inspect the Compute Service log for errors around the time of the reboot:
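The search can be sketched as below. The log path is an assumption (a location commonly used on Platform9 hosts) and may differ on your installation; override `COMPUTE_LOG` if so. The helper name `vm_log_errors` and the keyword list are illustrative.

```shell
# ASSUMPTION: default location of the Compute Service log; adjust for your host.
COMPUTE_LOG="${COMPUTE_LOG:-/var/log/pf9/ostackhost.log}"

# Print log lines that mention the VM's UUID and look like errors.
# $1 is the VM UUID.
vm_log_errors() {
  grep -F -- "$1" "$COMPUTE_LOG" | grep -i -E 'error|libvirt|traceback'
}

# Example: vm_log_errors <vm-uuid>
```

Narrow the result further by timestamp if the log covers more than the reboot window.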

Look for libvirt errors, storage attachment failures, or reconciliation failures referencing the VM's UUID.

Recovery Procedures

Choose the appropriate procedure based on the fault and your storage configuration.

Procedure A: Hard Reboot (VM Disk on Shared Storage or Block Volume)

A hard reboot instructs the Compute Service to reset and restart the VM in place on the same host. Use this when:

  • The host is healthy and the Compute Service is up.

  • The VM's root disk is on ephemeral shared storage or a block storage volume (not ephemeral local storage).

  • The fault indicates a transient failure (libvirt timeout, failed reconciliation) rather than a missing resource.

  1. From the Private Cloud Director UI, navigate to Virtual Machines, select the VM, and choose Hard Reboot from the Actions menu.

    Alternatively, from the CLI:

  2. Wait for the VM to transition to ACTIVE. This typically takes one to three minutes.

  3. If the VM returns to ERROR after the hard reboot, proceed to Procedure B or C.
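The CLI alternative mentioned in step 1 can be sketched as follows, assuming the standard `openstack` client. The helper name `hard_reboot_vm` is illustrative; `--wait` blocks until the reboot operation completes.

```shell
# Hard reboot one VM in place and wait for the operation to finish.
# $1 is the VM name or UUID.
hard_reboot_vm() {
  openstack server reboot --hard --wait "$1"
}

# Example: hard_reboot_vm <vm-uuid>
```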

Procedure B: Reset VM State and Retry (When the VM Is Stuck Transitioning)

If the VM is stuck in a transitioning task state (for example, rebooting, rebuilding, or powering-off) and does not complete, you can reset the state to ERROR explicitly and then attempt recovery:
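A minimal sketch of the state reset, assuming the standard `openstack` client and admin credentials. Whether your deployment exposes this through `openstack server set --state` or a different reset-state tool is an assumption to verify; the helper name `reset_vm_to_error` is illustrative.

```shell
# Force a stuck VM into ERROR state so recovery actions are accepted (admin only).
# $1 is the VM name or UUID.
reset_vm_to_error() {
  openstack server set --state error "$1"
}

# Example: reset_vm_to_error <vm-uuid>
```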

Note (Self-Hosted deployments only): The reset-state command requires access to the region management plane. In SaaS deployments, contact Platform9 Support to perform this operation on your behalf.

After the state is reset to ERROR, attempt a hard reboot (Procedure A).

Procedure C: Evacuate to Another Host

Use evacuation when:

  • The host that originally ran the VM is still offline or unhealthy.

  • The VM's root disk is on a block storage volume or ephemeral shared storage (evacuation requires the disk to be accessible from a new host).

  • A hard reboot has failed and the host cannot be recovered quickly.

  1. Confirm the VM's storage type. In the Private Cloud Director UI, look at the VM's volume attachments. If the root disk is a named volume, evacuation will work. If the root disk shows as ephemeral and the VM's cluster does not use shared storage, evacuation will not work.

  2. Run the evacuation command, targeting a specific healthy host:

    Or, to evacuate all ERROR-state VMs from a specific host (for example, after a partial failure):

    See Virtual Machine Migration for full evacuation prerequisites.

  3. Monitor the VM status until it reaches ACTIVE.
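Steps 2 and 3 can be sketched as shell helpers, assuming the standard `openstack` client with admin credentials. All function names and the `POLL_INTERVAL` variable are illustrative. Depending on the client version, `--host` on `server evacuate` may require a newer compute API microversion.

```shell
# Evacuate one VM to a named healthy host. $1 = VM UUID, $2 = target hostname.
evacuate_vm() {
  openstack server evacuate --host "$2" "$1"
}

# Evacuate every ERROR-state VM recorded on a failed host.
# $1 = failed source hostname, $2 = healthy target hostname.
evacuate_error_vms_from_host() {
  openstack server list --all-projects --host "$1" --status ERROR -f value -c ID |
    while read -r vm_id; do
      if [ -n "$vm_id" ]; then
        # Keep the CLI from consuming the list of IDs on stdin.
        evacuate_vm "$vm_id" "$2" </dev/null
      fi
    done
}

# Poll until the VM reports ACTIVE. Returns 1 if it reports ERROR
# or if $2 attempts (default 30) pass without reaching ACTIVE.
wait_for_active() {
  attempts="${2:-30}"
  while [ "$attempts" -gt 0 ]; do
    status="$(openstack server show "$1" -f value -c status)"
    if [ "$status" = "ACTIVE" ]; then return 0; fi
    if [ "$status" = "ERROR" ]; then return 1; fi
    attempts=$((attempts - 1))
    sleep "${POLL_INTERVAL:-10}"
  done
  return 1
}

# Example: evacuate_vm <vm-uuid> <target-host> && wait_for_active <vm-uuid>
```

Note that a VM can still report ERROR briefly right after the evacuation request is accepted, so treat an early failure from `wait_for_active` as a prompt to re-check the VM rather than as a final result.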

Procedure D: Rebuild from Recovery (Last Resort)

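If the VM cannot be hard rebooted or evacuated and the disk is accessible, a rebuild re-creates the VM on its existing volume using the original image. A rebuild discards the VM's in-memory and CPU state and preserves attached volumes. Depending on how the VM is backed, a rebuild can also overwrite the root disk with the image contents, so confirm that any data you need from the root disk is backed up before you proceed.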
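A rebuild can be sketched as follows, assuming the standard `openstack` client. The helper name `rebuild_vm` is illustrative, and you need to supply the original image yourself (it is shown in the VM's details).

```shell
# Rebuild a VM from an image and wait for the operation to finish.
# $1 = VM name or UUID, $2 = image name or ID (normally the VM's original image).
rebuild_vm() {
  openstack server rebuild --image "$2" --wait "$1"
}

# Example: rebuild_vm <vm-uuid> <image-id>
```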

After Recovery: Verify VM Health

After the VM returns to ACTIVE:

  1. Confirm the VM is reachable on the network (SSH or ICMP ping).

  2. Check guest OS logs for application errors that may have resulted from the abrupt shutdown.

  3. If the host was rebooted for patching and the Compute Service is still disabled on that host, re-enable it:

  4. Review whether Maintenance Mode should be used for future planned maintenance to avoid ERROR-state VMs.
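Re-enabling the Compute Service in step 3 can be sketched as below, assuming the standard `openstack` client with admin credentials; the helper name `enable_compute_service` is illustrative.

```shell
# Allow the scheduler to place VMs on the host again.
# $1 is the hypervisor hostname as registered with the Compute Service.
enable_compute_service() {
  openstack compute service set --enable "$1" nova-compute
}

# Example: enable_compute_service <hostname>
```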

Prevent ERROR States During Planned Maintenance

The most reliable way to avoid ERROR-state VMs during host reboots is to use Maintenance Mode:

  1. Enable Maintenance Mode on the host before rebooting. The Compute Service will live-migrate all running VMs to other hosts first.

  2. Reboot or patch the host.

  3. Verify the host is healthy after it comes back online.

  4. Disable Maintenance Mode to allow new VMs to schedule on the host again.

See Maintenance Mode for the full procedure.
