# Recover libvirt and Compute Service Failures

## Overview

<code class="expression">space.vars.product\_name</code> relies on `libvirtd` (the libvirt daemon) to manage the virtual machine lifecycle on each hypervisor host. If `libvirtd` fails or becomes unresponsive, the Platform9 Compute Service (`pf9-ostackhost`) cannot communicate with the hypervisor, so hosts appear offline and VMs become unreachable.

This guide shows you how to restore a hypervisor host with a failing or unresponsive `libvirtd` or Compute Service to a healthy, operational state. It covers how to diagnose `libvirtd` failures, when and how to safely restart the affected services, and how to confirm that the host and its VMs have fully recovered.

## Prerequisites

* SSH access to the affected hypervisor host.
* Access to the <code class="expression">space.vars.product\_name</code> UI or `pcdctl` CLI to monitor host and VM state.
* Sufficient free resources on other cluster hosts if VM migration is needed during recovery.

## Symptoms

You may be experiencing a `libvirtd` or Compute Service failure if:

* The <code class="expression">space.vars.product\_name</code> Service Health dashboard shows the host as **Offline** or the Compute Service as **Unhealthy**.
* Running `virsh list` on the host hangs or returns an error such as `error: failed to connect to the hypervisor` or `error: unable to connect to server at 'qemu:///system'`.
* VMs show as **ERROR** or **Unknown** in the UI, even though the host itself is reachable over SSH.
* The Compute Service log (`/var/log/pf9/ostackhost.log`) contains repeated lines like `libvirt connection refused`, `Connection reset by peer`, or `Timeout waiting for response`.

## Diagnose the Failure

### Step 1: Check libvirtd Status

Log in to the hypervisor host and check whether `libvirtd` is running:

```bash
sudo systemctl status libvirtd
```

* **Active (running):** The daemon is up. The issue may be a socket or permissions problem rather than a crash. Check the libvirt log (Step 3).
* **Failed or inactive:** `libvirtd` has crashed or was stopped. Proceed to the restart procedure.
* **Activating (start):** The daemon is stuck starting. Check for a hung process (Step 2).
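
As a minimal sketch, the decision in the bullets above can be scripted. The helper name `next_step_for_state` and its output labels are hypothetical; the input is the single-word state printed by `systemctl is-active`.

```bash
# Hypothetical helper: map a systemd unit state to the next step in this guide.
next_step_for_state() {
  case "$1" in
    active)          echo "review-logs" ;;           # daemon is up: see Step 3
    failed|inactive) echo "restart" ;;               # crashed or stopped: see Restart Procedure
    activating)      echo "check-hung-processes" ;;  # stuck starting: see Step 2
    *)               echo "unknown" ;;               # any other state: inspect manually
  esac
}

# Example usage on a hypervisor host (not executed here):
#   next_step_for_state "$(systemctl is-active libvirtd)"
```

Keeping the mapping in a pure function makes it easy to reuse in a health-check script without changing how the state is collected.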

### Step 2: Check for Hung Processes

If `libvirtd` appears to be running but `virsh list` hangs, a qemu process or socket may be blocking the daemon:

```bash
# Check whether virsh can connect at all (10-second timeout)
timeout 10 virsh list 2>&1 || echo "virsh timed out or failed"

# List qemu and libvirt processes; in the STAT column, Z marks a zombie
# and D marks uninterruptible sleep
ps aux | grep -E "qemu|libvirt" | grep -v grep
```

A large number of `D` (uninterruptible sleep) processes under `qemu-system-x86_64` usually indicates a storage I/O hang. Resolve the underlying storage issue before restarting `libvirtd`, or the daemon will hang again.
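
To put a number on "a large number of `D` processes", the following sketch counts them. The helper name `count_dstate_qemu` is hypothetical, and it assumes input lines of the form `STAT COMMAND`, where the command name starts with `qemu`.

```bash
# Hypothetical helper: count processes in uninterruptible sleep (state D)
# whose command name starts with "qemu". Reads "STAT COMMAND" lines on stdin.
count_dstate_qemu() {
  awk '$1 ~ /^D/ && $2 ~ /^qemu/ { n++ } END { print n + 0 }'
}

# Example usage on a hypervisor host (not executed here):
#   ps -eo stat= -o comm= | count_dstate_qemu
```

A count that stays above zero across several runs suggests the processes are genuinely blocked on I/O rather than briefly waiting.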

### Step 3: Review libvirt Logs

Examine the libvirt daemon log for errors leading up to the failure:

```bash
sudo tail -200 /var/log/libvirt/libvirtd.log | grep -i "error\|crit\|fail"
```
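
When the log is noisy, grouping identical messages can make a recurring error stand out. This is a sketch only: the helper name `top_log_messages` is hypothetical, and it assumes each log line begins with a timestamp and a thread ID followed by `: `. Lines that do not match that prefix are counted unchanged.

```bash
# Hypothetical helper: keep lines that mention error, crit, or fail, strip the
# leading timestamp and thread ID, and print the most frequent messages first.
# Optional argument: how many distinct messages to show (default 10).
top_log_messages() {
  grep -iE 'error|crit|fail' \
    | sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9:.+]+: [0-9]+: //' \
    | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Example usage on a hypervisor host (not executed here):
#   sudo tail -200 /var/log/libvirt/libvirtd.log | top_log_messages 5
```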

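For per-VM issues, check the QEMU log for the affected VM. libvirt names each log file after the domain name reported by `virsh list --all`, so substitute that name if it differs from the VM UUID: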

```bash
sudo tail -100 /var/log/libvirt/qemu/<VM_UUID>.log
```

### Step 4: Check the Compute Service Status

```bash
sudo systemctl status pf9-ostackhost
```

Also check the Compute Service log:

```bash
sudo grep -i "error\|exception\|libvirt" /var/log/pf9/ostackhost.log | tail -100
```

## Restart Procedure

{% hint style="warning" %}
Restarting `libvirtd` briefly interrupts management of the running VMs on the host while the connection is re-established. The restart does not terminate running VMs, but they may be temporarily unresponsive to management operations. If your workloads cannot tolerate any interruption, use [Maintenance Mode](/private-cloud-director/virtualized-clusters/add-hosts-virtualized-cluster/maintenance-mode.md) to migrate VMs off the host first.
{% endhint %}

Restart services in the following order. Wait for each service to reach `active (running)` before starting the next.
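
One way to enforce the "wait before starting the next" rule is a small polling helper. This is a sketch under stated assumptions: the name `wait_until` and the 60-second timeout in the usage example are hypothetical, not part of any Platform9 tooling.

```bash
# Hypothetical helper: retry a command once per second until it succeeds
# or the timeout (in seconds) expires. Returns 0 on success, 1 on timeout.
wait_until() {
  _timeout="$1"; shift
  _waited=0
  until "$@"; do
    if [ "$_waited" -ge "$_timeout" ]; then
      return 1
    fi
    sleep 1
    _waited=$((_waited + 1))
  done
}

# Example usage on a hypervisor host (not executed here):
#   sudo systemctl restart libvirtd
#   wait_until 60 systemctl is-active --quiet libvirtd \
#     && sudo systemctl restart pf9-ostackhost
```

Because the helper only wraps an arbitrary command, the same function works for any check, for example waiting until `virsh list` returns.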

### Step 1: Restart libvirtd

```bash
sudo systemctl restart libvirtd
sudo systemctl status libvirtd
```

After the restart, confirm that `virsh` can connect:

```bash
virsh list
```

This should return without hanging and display any VMs currently defined on the host (running or shut off).

### Step 2: Restart the Platform9 Compute Service

Once `libvirtd` is healthy, restart the Platform9 Compute Service:

```bash
sudo systemctl restart pf9-ostackhost
sudo systemctl status pf9-ostackhost
```

### Step 3: Restart the Platform9 Host Agent (If the Host Is Still Showing Offline)

If the host is still shown as **Offline** in the <code class="expression">space.vars.product\_name</code> UI after the Compute Service restart, restart the host agent as well:

```bash
sudo systemctl restart pf9-hostagent
sudo systemctl status pf9-hostagent
```

The host agent is responsible for reporting host health to the management plane. After restarting it, wait two to three minutes and then check the UI.

## Validate Recovery

### Confirm Host Is Online

In the <code class="expression">space.vars.product\_name</code> UI, navigate to **Infrastructure > Cluster Hosts** and verify the host status returns to **Active**.

Using the CLI:

```bash
pcdctl compute service list
```

The `nova-compute` entry for the host should show **state: up** and **status: enabled**.

### Confirm VMs Have Reconciled

After the Compute Service restarts, it re-queries `libvirtd` to reconcile the state of all VMs on the host. Most VMs will transition from **ERROR** or **Unknown** back to **ACTIVE** automatically within two to three minutes.

```bash
pcdctl server list --host <HOST_UUID>
```

Any VMs that remain in **ERROR** after five minutes may require additional recovery steps. See [Recover VMs in ERROR State After Host Reboot or Patching](/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/recover-vms-in-error-state.md).

### Confirm virsh Reflects Running VMs

```bash
virsh list --all
```

VMs that are **ACTIVE** in <code class="expression">space.vars.product\_name</code> should appear as `running` here. If a VM is **ACTIVE** in the UI but `shut off` in `virsh list`, restart the VM using a hard reboot:

```bash
pcdctl server reboot --hard <VM_UUID>
```
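
To find candidates for that comparison quickly, the following sketch extracts the domains that libvirt reports as `shut off`. The helper name `shutoff_domains` is hypothetical, and it assumes the usual three-column `Id Name State` table that `virsh list --all` prints, with domain names that contain no spaces.

```bash
# Hypothetical helper: print the names of domains reported as "shut off".
# Reads `virsh list --all` output on stdin and skips the two header lines.
shutoff_domains() {
  awk 'NR > 2 && NF >= 4 && $(NF-1) == "shut" && $NF == "off" { print $2 }'
}

# Example usage on a hypervisor host (not executed here):
#   virsh list --all | shutoff_domains
```

Compare the resulting names against the VMs shown as **ACTIVE** in the UI before issuing any hard reboot.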

## Additional Checks

### Verify All Platform9 Services Are Running

The Platform9 stack requires multiple services to be healthy on each hypervisor:

```bash
sudo systemctl status pf9-hostagent
sudo systemctl status pf9-ostackhost
sudo systemctl status pf9-comms
sudo systemctl status pf9-sidekick
```

If any service is in a `failed` state, restart it individually:

```bash
sudo systemctl restart <service-name>
```
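
Checking four services one at a time is easy to get wrong under pressure. As a sketch, the loop below reports only the units that need attention; the helper name `units_needing_attention` is hypothetical, and it expects `unit state` pairs such as those built from `systemctl is-active`.

```bash
# Hypothetical helper: read "unit state" pairs on stdin and print the name of
# every unit whose state is anything other than "active".
units_needing_attention() {
  awk '$2 != "active" { print $1 }'
}

# Example usage on a hypervisor host (not executed here):
#   for unit in pf9-hostagent pf9-ostackhost pf9-comms pf9-sidekick; do
#     echo "$unit $(systemctl is-active "$unit")"
#   done | units_needing_attention
```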

### Check for Disk Space Issues

A full disk is a common cause of `libvirtd` failures and Compute Service crashes:

```bash
df -h /var/log
df -h /var/lib/libvirt
```

If disk usage is above 90%, free space before restarting services.
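
The 90% check can be automated with a short filter. This is a sketch: the helper name `disk_over_threshold` is hypothetical, and it assumes the POSIX output format of `df -P`, where the fifth column of the first data row holds the usage percentage.

```bash
# Hypothetical helper: read `df -P <path>` output on stdin and exit 0 when the
# usage percentage of the first data row is above the threshold given as $1.
# Exits 1 when usage is at or below the threshold, or when no data row exists.
disk_over_threshold() {
  awk -v limit="$1" '
    NR == 2 { gsub(/%/, "", $5); over = ($5 + 0 > limit + 0) }
    END     { exit !over }
  '
}

# Example usage on a hypervisor host (not executed here):
#   df -P /var/lib/libvirt | disk_over_threshold 90 \
#     && echo "Free space before restarting services"
```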

### Check for Socket File Issues

If `libvirtd` is running but `virsh` cannot connect, the UNIX socket may be in a bad state:

```bash
ls -la /var/run/libvirt/libvirt-sock
```

Restarting `libvirtd` normally recreates the socket. If `virsh` still cannot connect after a restart, the socket file may be stale; remove it manually and restart the daemon:

```bash
sudo rm /var/run/libvirt/libvirt-sock
sudo systemctl restart libvirtd
```

## When to Contact Support

Escalate to Platform9 Support if:

* `libvirtd` crashes immediately after each restart (check for recurring errors in `/var/log/libvirt/libvirtd.log`).
* VMs do not reconcile to **ACTIVE** after the host and Compute Service are healthy.
* You see storage I/O errors in the libvirt or QEMU logs that indicate underlying storage hardware issues.
* Multiple hosts in the cluster are affected simultaneously.

Contact [Platform9 Support](https://support.platform9.com/) and provide the libvirt log, the Compute Service log, and the output of `pcdctl compute service list`.

## Related Pages

* [Troubleshooting Offline or Failed Hosts](/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/troubleshooting-offline-or-failed-hosts.md)
* [Recover VMs in ERROR State After Host Reboot or Patching](/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/recover-vms-in-error-state.md)
* [Maintenance Mode](/private-cloud-director/virtualized-clusters/add-hosts-virtualized-cluster/maintenance-mode.md)
* [Compute Service Advanced Configuration](/private-cloud-director/virtualized-clusters/nova-override.md)

