# Recover VMs in ERROR State After Host Reboot or Patching

## Overview

After a hypervisor host is rebooted or patched, virtual machines that were running on that host may land in **ERROR** state instead of returning to **ACTIVE**. This guide explains why that happens, how to safely diagnose the cause, and how to bring VMs back into service. It also covers when to attempt a hard reboot, when to use evacuation, and when to escalate.

In this guide, you will restore ERROR-state VMs to a running state following a host reboot or maintenance event.

## Why VMs Land in ERROR After a Host Reboot

When a host is rebooted outside of [Maintenance Mode](/private-cloud-director/virtualized-clusters/add-hosts-virtualized-cluster/maintenance-mode.md), running VMs are not migrated first. The Compute Service treats an unexpected host disappearance as a failure. Several conditions lead to VMs landing in ERROR:

* The host took too long to come back, and the Compute Service timed out while the VM was still in a pending state.
* The VM's state on disk was inconsistent at the time of the reboot (for example, a write was in progress).
* The Compute Service on the host did not restart cleanly after the reboot, so it reported VMs as failed during reconciliation.
* [VM HA](/private-cloud-director/virtualized-clusters/virtualized-cluster/virtual-machine-high-availability-vm-ha.md) attempted to evacuate the VM to another host but the evacuation failed (for example, the destination host had insufficient resources or the storage was unavailable at that moment), leaving the VM in ERROR.

{% hint style="warning" %}
**Data safety caution:** Before resetting a VM's state, confirm the host is back online and healthy. Performing a hard reboot or rebuild on a VM whose underlying disk is still inaccessible may corrupt the guest OS. Verify storage health before proceeding.
{% endhint %}

## Prerequisites

* You can reach the <code class="expression">space.vars.product\_name</code> UI or have `pcdctl` access.
* The hypervisor host is back online and shows as **Active** in the UI under **Infrastructure > Cluster Hosts**.
* Storage (ephemeral shared storage or block storage volumes) is accessible.

## Diagnose the ERROR State

### Step 1: Identify Affected VMs

List all VMs currently in ERROR state. Use the <code class="expression">space.vars.product\_name</code> UI or the CLI:

```bash
pcdctl server list --status ERROR
```

For each ERROR-state VM, retrieve its details and fault message:

```bash
pcdctl server show <VM_UUID>
```

Look for the `fault` field in the output. Common fault messages and their meanings:

| Fault message                                             | Likely cause                                                           |
| --------------------------------------------------------- | ---------------------------------------------------------------------- |
| `No valid host was found`                                 | Evacuation found no suitable destination host                          |
| `Build of instance ... aborted: Instance failed to spawn` | Compute Service on the host could not start the VM                     |
| `Connection to libvirt failed`                            | `libvirtd` was not running when the Compute Service tried to reconcile |
| `Exceeded maximum number of retries`                      | Repeated provisioning attempts all failed                              |
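
When triaging many VMs, the table above can be turned into a small lookup. The following is a minimal sketch, not part of `pcdctl`: `classify_fault` is a hypothetical helper name, and it assumes you pass in the fault text copied from the `fault` field of `pcdctl server show <VM_UUID>`.

```bash
# classify_fault is a hypothetical helper (not a pcdctl command).
# It maps a fault message to the likely cause listed in the table above.
classify_fault() {
  local msg="$1"
  case "$msg" in
    *"No valid host was found"*)
      echo "Evacuation found no suitable destination host" ;;
    *"Instance failed to spawn"*)
      echo "Compute Service on the host could not start the VM" ;;
    *"Connection to libvirt failed"*)
      echo "libvirtd was not running during reconciliation" ;;
    *"Exceeded maximum number of retries"*)
      echo "Repeated provisioning attempts all failed" ;;
    *)
      # Anything else is outside the table; fall back to the host log.
      echo "Unrecognized fault; review the Compute Service log" ;;
  esac
}

# Example with a fault string that contains extra text around the key phrase.
classify_fault "Build of instance 1234 aborted: Instance failed to spawn"
```

The substring match is deliberate: fault messages usually carry instance IDs or trailing detail, so matching on the key phrase is more forgiving than an exact comparison.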

### Step 2: Check the Host Is Healthy

Before touching the VM, confirm the host is fully recovered:

```bash
pcdctl compute service list
```

The host's `nova-compute` entry should show **state: up** and **status: enabled**. If it shows **state: down**, the Compute Service on that host has not recovered yet. Address the host first — see [Troubleshoot libvirt and Compute Service Failures](/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/recover-libvirt-and-compute-service.md) and [Troubleshooting Offline or Failed Hosts](/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/troubleshooting-offline-or-failed-hosts.md).

### Step 3: Review the Compute Service Log on the Host

Log in to the hypervisor and inspect the Compute Service log for errors around the time of the reboot:

```bash
# Show the last 100 log lines that mention an error, exception, or failure.
sudo grep -iE "error|exception|fail" /var/log/pf9/ostackhost.log | tail -n 100
```

Look for libvirt errors, storage attachment failures, or reconciliation failures referencing the VM's UUID.
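
To narrow the log to a single VM, you can filter on its UUID before matching error keywords. This is a minimal sketch: `vm_log_errors` is a hypothetical helper name, and the default log path is the one used in the command above. Run it as a user who can read the log file.

```bash
# vm_log_errors is a hypothetical helper (not a pcdctl command).
# It prints the most recent error-like log lines that mention one VM.
#   $1 VM UUID (matched as a literal string)
#   $2 log file (defaults to the Compute Service log path used above)
#   $3 maximum number of lines to print (defaults to 20)
vm_log_errors() {
  local uuid="$1"
  local log_file="${2:-/var/log/pf9/ostackhost.log}"
  local max_lines="${3:-20}"
  # First grep: keep only lines containing the UUID (-F means no regex).
  # Second grep: keep lines with error, exception, or fail, any case.
  grep -F -- "$uuid" "$log_file" \
    | grep -iE 'error|exception|fail' \
    | tail -n "$max_lines"
}

# Usage:
#   vm_log_errors <VM_UUID>
#   vm_log_errors <VM_UUID> /path/to/copied/ostackhost.log 50
```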

## Recovery Procedures

Choose the appropriate procedure based on the fault and your storage configuration.

### Procedure A: Hard Reboot (VM Disk on Shared Storage or Block Volume)

A hard reboot instructs the Compute Service to reset and restart the VM in place on the same host. Use this when:

* The host is healthy and the Compute Service is up.
* The VM's root disk is on ephemeral shared storage or a block storage volume (not ephemeral local storage).
* The fault indicates a transient failure (libvirt timeout, failed reconciliation) rather than a missing resource.

1. From the <code class="expression">space.vars.product\_name</code> UI, navigate to **Virtual Machines**, select the VM, and choose **Hard Reboot** from the **Actions** menu.

   Alternatively, from the CLI:

   ```bash
   pcdctl server reboot --hard <VM_UUID>
   ```
2. Wait for the VM to transition to **ACTIVE**. This typically takes one to three minutes.
3. If the VM returns to ERROR after the hard reboot, proceed to Procedure B or C.
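
If you script the wait in step 2, a polling loop with a timeout avoids waiting indefinitely. The sketch below is an illustration under stated assumptions: `wait_for_active` is a hypothetical helper, and it expects you to supply your own command that prints the VM's current status as a single word (for example, a function that extracts the status from `pcdctl server show <VM_UUID>`). How you extract that field is left to you, because the exact output format is not specified here.

```bash
# wait_for_active is a hypothetical helper (not a pcdctl command).
#   $1 timeout in seconds
#   $2 polling interval in seconds (must be greater than 0)
#   $3... command that prints the VM's current status on stdout
# Prints ACTIVE, ERROR, or TIMEOUT and returns 0, 1, or 2 respectively.
wait_for_active() {
  local timeout="$1" interval="$2"
  shift 2
  local elapsed=0 status
  while [ "$elapsed" -le "$timeout" ]; do
    status="$("$@")"
    case "$status" in
      ACTIVE) echo "ACTIVE"; return 0 ;;
      ERROR)  echo "ERROR";  return 1 ;;
    esac
    # Any other status means the VM is still transitioning; keep polling.
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "TIMEOUT"
  return 2
}

# Usage, where get_vm_status is a function you define yourself:
#   wait_for_active 180 10 get_vm_status <VM_UUID>
```

A 180-second timeout matches the one-to-three-minute window given in step 2. A return code of 1 or 2 is the signal to move on to Procedure B or C.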

### Procedure B: Reset VM State and Retry (When the VM Is Stuck Transitioning)

If the VM is stuck in a transitioning task state (for example, `rebooting`, `rebuilding`, or `powering-off`) and does not complete, you can reset the state to **ERROR** explicitly and then attempt recovery:

{% hint style="info" %}
**Self-Hosted deployments only**

The `reset-state` command requires access to the region management plane. In SaaS deployments, contact Platform9 Support to perform this operation on your behalf.
{% endhint %}

```bash
pcdctl server reset-state <VM_UUID>
```

After the state is reset to ERROR, attempt a hard reboot (Procedure A).

### Procedure C: Evacuate to Another Host

Use evacuation when:

* The host that originally ran the VM is still offline or unhealthy.
* The VM's root disk is on a block storage volume or ephemeral shared storage (evacuation requires the disk to be accessible from a new host).
* A hard reboot has failed and the host cannot be recovered quickly.

{% hint style="warning" %}
**Evacuation is not possible for VMs using ephemeral local storage.** If the VM was using local (non-shared) ephemeral storage and the host's disk is unavailable, the VM data may be unrecoverable. Contact Platform9 Support for guidance.
{% endhint %}

1. Confirm the VM's storage type. In the <code class="expression">space.vars.product\_name</code> UI, look at the VM's volume attachments. If the root disk is a named volume, evacuation will work. If the root disk shows as ephemeral and the VM's cluster does not use shared storage, evacuation will not work.
2. Run the evacuation command, targeting a specific healthy host:

   ```bash
   pcdctl server evacuate --host <DESTINATION_HOST_UUID> <VM_UUID>
   ```

   Or, to evacuate all ERROR-state VMs from a specific host (for example, after a partial failure):

   ```bash
   pcdctl server evacuate --host <DESTINATION_HOST_UUID> --on-shared-storage
   ```

   See [Virtual Machine Migration](/private-cloud-director/virtualized-clusters/virtualmachine/vm-migration.md) for full evacuation prerequisites.
3. Monitor the VM status until it reaches **ACTIVE**.

### Procedure D: Rebuild from Image (Last Resort)

If the VM cannot be hard rebooted or evacuated and the disk is accessible, a rebuild re-creates the VM from the image you specify. Rebuild discards the VM's in-memory state and rewrites the root disk, but preserves attached data volumes.

```bash
pcdctl server rebuild <VM_UUID> --image <IMAGE_UUID>
```

{% hint style="warning" %}
Rebuild rewrites the root disk from the named image. Any data written to the ephemeral root disk after VM creation will be lost. Only use rebuild if the root disk is a block volume and you have confirmed its data integrity, or if the VM is stateless and re-initialization from the base image is acceptable.
{% endhint %}
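
The selection criteria across the procedures above can be summarized as a small decision function. This is a sketch only: `choose_procedure` and its argument values are hypothetical, and the `SUPPORT` result for local ephemeral disks on a healthy host is an assumption, since Procedure A is described only for shared or volume-backed root disks. Procedure D is left out because it is a last resort after the others fail.

```bash
# choose_procedure is a hypothetical helper (not a pcdctl command).
# It prints the first procedure to try.
#   $1 host state:      up | down
#   $2 root disk type:  shared | volume | local
#   $3 task stuck:      yes | no
choose_procedure() {
  local host_state="$1" root_disk="$2" task_stuck="$3"

  # A VM stuck in a transitioning task state needs a state reset first.
  if [ "$task_stuck" = "yes" ]; then
    echo "B"; return 0
  fi

  # Local ephemeral disks cannot be evacuated, and Procedure A is
  # described only for shared or volume-backed root disks.
  if [ "$root_disk" = "local" ]; then
    echo "SUPPORT"; return 0
  fi

  # Healthy host: restart in place. Unhealthy host: move the VM.
  if [ "$host_state" = "up" ]; then
    echo "A"
  else
    echo "C"
  fi
}

choose_procedure up volume no     # prints A
choose_procedure down shared no   # prints C
```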

## After Recovery: Verify VM Health

After the VM returns to ACTIVE:

1. Confirm the VM is reachable on the network (SSH or ICMP ping).
2. Check guest OS logs for application errors that may have resulted from the abrupt shutdown.
3. If the host was rebooted for patching and the Compute Service is still disabled on that host, re-enable it:

   ```bash
   pcdctl compute service enable nova-compute <HOST_FQDN>
   ```
4. Review whether [Maintenance Mode](/private-cloud-director/virtualized-clusters/add-hosts-virtualized-cluster/maintenance-mode.md) should be used for future planned maintenance to avoid ERROR-state VMs.

## Prevent ERROR States During Planned Maintenance

The most reliable way to avoid ERROR-state VMs during host reboots is to use Maintenance Mode:

1. Enable Maintenance Mode on the host before rebooting. The Compute Service will live-migrate all running VMs to other hosts first.
2. Reboot or patch the host.
3. Verify the host is healthy after it comes back online.
4. Disable Maintenance Mode to allow new VMs to schedule on the host again.

See [Maintenance Mode](/private-cloud-director/virtualized-clusters/add-hosts-virtualized-cluster/maintenance-mode.md) for the full procedure.

## Related Pages

* [Maintenance Mode](/private-cloud-director/virtualized-clusters/add-hosts-virtualized-cluster/maintenance-mode.md)
* [Virtual Machine Migration](/private-cloud-director/virtualized-clusters/virtualmachine/vm-migration.md)
* [Virtual Machine High Availability (VM HA)](/private-cloud-director/virtualized-clusters/virtualized-cluster/virtual-machine-high-availability-vm-ha.md)
* [Recover libvirt and Compute Service Failures](/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/recover-libvirt-and-compute-service.md)
* [Diagnose VM Scheduling Failures](/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/diagnose-vm-scheduling-failures.md)

