Recover VMs in ERROR State After Host Reboot or Patching
Overview
After a hypervisor host is rebooted or patched, virtual machines that were running on that host may land in ERROR state instead of returning to ACTIVE. This guide explains why that happens, how to safely diagnose the cause, and how to bring VMs back into service, including when to attempt a hard reboot, when to use evacuation, and when to escalate.
In this guide, you will restore ERROR-state VMs to a running state following a host reboot or maintenance event.
Why VMs Land in ERROR After a Host Reboot
When a host is rebooted outside of Maintenance Mode, running VMs are not migrated first. The Compute Service treats an unexpected host disappearance as a failure. Several conditions lead to VMs landing in ERROR:
The host took too long to come back and the Compute Service timed out the pending state.
The VM's state on disk was inconsistent at the time of the reboot (for example, a write was in progress).
The Compute Service on the host did not restart cleanly after the reboot, so it reported VMs as failed during reconciliation.
VM HA attempted to evacuate the VM to another host but the evacuation failed (for example, the destination host had insufficient resources or the storage was unavailable at that moment), leaving the VM in ERROR.
Data safety caution: Before resetting a VM's state, confirm the host is back online and healthy. Performing a hard reboot or rebuild on a VM whose underlying disk is still inaccessible may corrupt the guest OS. Verify storage health before proceeding.
Prerequisites
You can reach the Private Cloud Director UI or have pcdctl access.
The hypervisor host is back online and shows as Active in the UI under Infrastructure > Cluster Hosts.
Storage (ephemeral shared storage or block storage volumes) is accessible.
Diagnose the ERROR State
Step 1: Identify Affected VMs
List all VMs currently in ERROR state. Use the Private Cloud Director UI or the CLI:
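The CLI command for this step is not reproduced in this guide. As a minimal sketch, assuming your deployment is driven through the OpenStack-compatible `openstack` client (an assumption, not something this guide confirms), the following Python builds a list command and filters its JSON output. The function names are illustrative, and the `ID`, `Name`, and `Status` keys assume the client's default JSON columns.

```python
import json

def list_error_servers_cmd(all_projects=True):
    """Build argv for listing ERROR-state servers.

    Assumption: the `openstack` CLI is installed and configured for this region.
    """
    cmd = ["openstack", "server", "list", "--status", "ERROR", "-f", "json"]
    if all_projects:
        cmd.append("--all-projects")
    return cmd

def error_servers(json_text):
    """Return (id, name) pairs for every record whose Status is ERROR."""
    records = json.loads(json_text)
    return [
        (r["ID"], r["Name"])
        for r in records
        if str(r.get("Status", "")).upper() == "ERROR"
    ]
```

Pass the command to `subprocess.run(..., capture_output=True, text=True)` and feed its `stdout` to `error_servers` to get the list of VMs to inspect in the next step.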
For each ERROR-state VM, retrieve its details and fault message:
Look for the fault field in the output. Common fault messages and their meanings:
Fault message | Meaning
No valid host was found | Evacuation found no suitable destination host
Build of instance ... aborted: Instance failed to spawn | Compute Service on the host could not start the VM
Connection to libvirt failed | libvirtd was not running when the Compute Service tried to reconcile
Exceeded maximum number of retries | Repeated provisioning attempts all failed
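When many VMs are affected, matching fault text by hand gets tedious. The sketch below maps the fault messages listed above to their meanings. It assumes the VM details are available as a dict with a nested `fault.message` field, as `openstack server show -f json` is expected to return for an ERROR-state server; that layout and the function name are assumptions.

```python
# Substrings taken from the fault messages listed above, with their meanings.
KNOWN_FAULTS = [
    ("No valid host was found",
     "Evacuation found no suitable destination host"),
    ("Instance failed to spawn",
     "Compute Service on the host could not start the VM"),
    ("Connection to libvirt failed",
     "libvirtd was not running when the Compute Service tried to reconcile"),
    ("Exceeded maximum number of retries",
     "Repeated provisioning attempts all failed"),
]

def explain_fault(server):
    """Return the meaning of a server's fault message, or None if unrecognized.

    `server` is a dict of VM details; the `fault` entry may be a dict with a
    `message` key (assumed layout) or a plain string.
    """
    fault = server.get("fault") or {}
    message = fault.get("message", "") if isinstance(fault, dict) else str(fault)
    for needle, meaning in KNOWN_FAULTS:
        if needle.lower() in message.lower():
            return meaning
    return None
```

A `None` result means the fault is not one of the common cases, which is a good signal to read the Compute Service log before attempting recovery.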
Step 2: Check the Host Is Healthy
Before touching the VM, confirm the host is fully recovered:
The host's nova-compute entry should show state: up and status: enabled. If it shows state: down, the Compute Service on that host has not recovered yet. Address the host first; see Troubleshoot libvirt and Compute Service Failures and Troubleshooting Offline or Failed Hosts.
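The health rule above can be expressed as a small check. This sketch assumes the service list is available as JSON records with `Binary`, `Host`, `State`, and `Status` keys, as produced by `openstack compute service list --service nova-compute -f json` on an OpenStack-compatible client; whether your deployment exposes that client is an assumption.

```python
def compute_service_healthy(services, host):
    """True only if the host's nova-compute entry is state 'up' and status 'enabled'.

    `services` is a list of dicts with Binary, Host, State, and Status keys
    (assumed column names).
    """
    for svc in services:
        if svc.get("Host") == host and svc.get("Binary") == "nova-compute":
            state_up = str(svc.get("State", "")).lower() == "up"
            enabled = str(svc.get("Status", "")).lower() == "enabled"
            return state_up and enabled
    return False  # no nova-compute entry for this host: treat as unhealthy
```

Treating a missing entry as unhealthy is a deliberately conservative choice: it keeps you from running recovery actions against a host the Compute Service does not know about.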
Step 3: Review the Compute Service Log on the Host
Log in to the hypervisor and inspect the Compute Service log for errors around the time of the reboot:
Look for libvirt errors, storage attachment failures, or reconciliation failures referencing the VM's UUID.
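To narrow a large log down to the entries described above, a filter along the following lines may help. It is a sketch only: the log location is not stated in this guide, so the function takes an iterable of lines, and the keyword list is an assumption about how libvirt, storage attachment, and reconciliation failures are typically worded.

```python
import re

# Assumed keywords for libvirt, storage attachment, and reconciliation failures.
KEYWORDS = re.compile(r"libvirt|volume|attach|reconcil", re.IGNORECASE)

def relevant_log_lines(lines, vm_uuid):
    """Keep lines that mention the VM's UUID, or ERROR lines matching KEYWORDS."""
    hits = []
    for line in lines:
        text = line.rstrip("\n")
        mentions_vm = bool(vm_uuid) and vm_uuid in text
        keyword_error = "ERROR" in text and KEYWORDS.search(text) is not None
        if mentions_vm or keyword_error:
            hits.append(text)
    return hits
```

Open the log file with `open(path)` and pass the file object directly; it is read line by line, so large logs do not need to fit in memory.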
Recovery Procedures
Choose the appropriate procedure based on the fault and your storage configuration.
Procedure A: Hard Reboot (VM Disk on Shared Storage or Block Volume)
A hard reboot instructs the Compute Service to reset and restart the VM in place on the same host. Use this when:
The host is healthy and the Compute Service is up.
The VM's root disk is on ephemeral shared storage or a block storage volume (not ephemeral local storage).
The fault indicates a transient failure (libvirt timeout, failed reconciliation) rather than a missing resource.
From the Private Cloud Director UI, navigate to Virtual Machines, select the VM, and choose Hard Reboot from the Actions menu.
Alternatively, from the CLI:
Wait for the VM to transition to ACTIVE. This typically takes one to three minutes.
If the VM returns to ERROR after the hard reboot, proceed to Procedure B or C.
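The hard reboot and the wait for ACTIVE can be scripted roughly as follows. This is a sketch under assumptions: it uses the OpenStack-compatible `openstack server reboot --hard` command, which may not be how your deployment exposes the operation, and `get_status` is a hypothetical callable you supply that returns the VM's current status string. The default timeout of 180 seconds reflects the one to three minutes mentioned above.

```python
import time

def hard_reboot_cmd(server_id):
    """Build argv for a hard reboot (assumes the `openstack` CLI)."""
    return ["openstack", "server", "reboot", "--hard", server_id]

def wait_for_active(get_status, timeout=180.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns ACTIVE or ERROR, or the timeout expires.

    Call this only after the reboot request has been accepted, so the VM has
    already left its initial ERROR status. Returns the last status seen.
    """
    deadline = clock() + timeout
    status = get_status()
    while status not in ("ACTIVE", "ERROR") and clock() < deadline:
        sleep(interval)
        status = get_status()
    return status
```

If the function returns anything other than `ACTIVE`, treat the hard reboot as failed and move on to the next procedure.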
Procedure B: Reset VM State and Retry (When the VM Is Stuck Transitioning)
If the VM is stuck in a transitioning task state (for example, rebooting, rebuilding, or powering-off) and does not complete, you can reset the state to ERROR explicitly and then attempt recovery:
Self-Hosted deployments only
The reset-state command requires access to the region management plane. In SaaS deployments, contact Platform9 Support to perform this operation on your behalf.
After the state is reset to ERROR, attempt a hard reboot (Procedure A).
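A guard like the one below can keep a script from resetting VMs that are not actually stuck. It is a sketch that assumes the OpenStack-compatible `openstack server set --state error` command and assumes you read the VM's task state yourself (for example, from the `OS-EXT-STS:task_state` field of the VM details). Neither is confirmed by this guide, and the operation still needs the management-plane access described above.

```python
def reset_state_cmd(server_id, task_state):
    """Return argv that resets a stuck server to ERROR, or None if no task is pending.

    `task_state` is the VM's current task state (for example 'rebooting');
    None or an empty string means the VM is not mid-transition.
    """
    if not task_state:
        return None
    return ["openstack", "server", "set", "--state", "error", server_id]
```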
Procedure C: Evacuate to Another Host
Use evacuation when:
The host that originally ran the VM is still offline or unhealthy.
The VM's root disk is on a block storage volume or ephemeral shared storage (evacuation requires the disk to be accessible from a new host).
A hard reboot has failed and the host cannot be recovered quickly.
Evacuation is not possible for VMs using ephemeral local storage. If the VM was using local (non-shared) ephemeral storage and the host's disk is unavailable, the VM data may be unrecoverable. Contact Platform9 Support for guidance.
Confirm the VM's storage type. In the Private Cloud Director UI, look at the VM's volume attachments. If the root disk is a named volume, evacuation will work. If the root disk shows as ephemeral and the VM's cluster does not use shared storage, evacuation will not work.
Run the evacuation command, targeting a specific healthy host:
Or, to evacuate all ERROR-state VMs from a specific host (for example, after a partial failure):
See Virtual Machine Migration for full evacuation prerequisites.
Monitor the VM status until it reaches ACTIVE.
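The storage check and the evacuation commands from the steps above can be sketched as follows. The example assumes the OpenStack-compatible `openstack server evacuate` command with its `--host` option; the function names and inputs are illustrative, and you determine the two storage flags yourself from the VM's volume attachments and the cluster configuration.

```python
def can_evacuate(root_disk_is_volume, cluster_uses_shared_storage):
    """Evacuation needs the root disk to be reachable from another host."""
    return root_disk_is_volume or cluster_uses_shared_storage

def evacuate_cmds(server_ids, target_host=None):
    """Build one evacuate argv per server (assumes the `openstack` CLI).

    If target_host is None, no --host option is passed and destination
    selection is left to the scheduler.
    """
    cmds = []
    for server_id in server_ids:
        cmd = ["openstack", "server", "evacuate"]
        if target_host:
            cmd += ["--host", target_host]
        cmd.append(server_id)
        cmds.append(cmd)
    return cmds
```

To evacuate every ERROR-state VM from one host, pass the IDs of the affected VMs on that host as `server_ids`, and run the commands only for VMs where `can_evacuate` is true.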
Procedure D: Rebuild from Recovery (Last Resort)
If the VM cannot be hard rebooted or evacuated and the disk is accessible, a rebuild re-creates the VM on its existing volume using the original image. This replaces the VM's in-memory and CPU state but preserves attached volumes.
Rebuild rewrites the root disk from the named image. Any data written to the ephemeral root disk after VM creation will be lost. Only use rebuild if the root disk is a block volume and you have confirmed its data integrity, or if the VM is stateless and re-initialization from the base image is acceptable.
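Because a rebuild is destructive, it is worth encoding the conditions above as an explicit guard. The sketch below assumes the OpenStack-compatible `openstack server rebuild --image` command. The flags are answers you supply after checking the VM yourself, and the function and parameter names are hypothetical.

```python
def rebuild_cmd(server_id, image, root_is_volume,
                integrity_confirmed=False, stateless=False):
    """Build a rebuild argv only when the conditions stated above are met.

    Allowed when the root disk is a block volume with confirmed data
    integrity, or when the VM is stateless. Raises ValueError otherwise.
    """
    allowed = (root_is_volume and integrity_confirmed) or stateless
    if not allowed:
        raise ValueError(
            "rebuild refused: confirm volume integrity or that the VM is stateless"
        )
    return ["openstack", "server", "rebuild", "--image", image, server_id]
```

Raising an error rather than returning a command forces whoever runs the script to confirm the data-loss conditions explicitly.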
After Recovery: Verify VM Health
After the VM returns to ACTIVE:
Confirm the VM is reachable on the network (SSH or ICMP ping).
Check guest OS logs for application errors that may have resulted from the abrupt shutdown.
If the host was rebooted for patching and the Compute Service is still disabled on that host, re-enable it:
Review whether Maintenance Mode should be used for future planned maintenance to avoid ERROR-state VMs.
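Two of the checks above lend themselves to a short script: testing whether the VM's SSH port answers, and building the command that re-enables the Compute Service. This is a sketch that assumes the OpenStack-compatible `openstack compute service set --enable` command; the function names, the default port 22, and the timeout are illustrative choices.

```python
import socket

def tcp_port_open(address, port=22, timeout=3.0):
    """Return True if a TCP connection to address:port succeeds within timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False

def enable_compute_cmd(host):
    """Build argv that re-enables nova-compute on a host (assumes the `openstack` CLI)."""
    return ["openstack", "compute", "service", "set", "--enable", host, "nova-compute"]
```

An open port only shows that something is listening. Still log in and check the guest OS logs as described above.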
Prevent ERROR States During Planned Maintenance
The most reliable way to avoid ERROR-state VMs during host reboots is to use Maintenance Mode:
Enable Maintenance Mode on the host before rebooting. The Compute Service will live-migrate all running VMs to other hosts first.
Reboot or patch the host.
Verify the host is healthy after it comes back online.
Disable Maintenance Mode to allow new VMs to schedule on the host again.
See Maintenance Mode for the full procedure.
