Recover libvirt and Compute Service Failures
Overview
Private Cloud Director relies on libvirtd (the libvirt daemon) to manage virtual machine lifecycle on each hypervisor host. If libvirtd becomes unresponsive or fails, the Platform9 Compute Service (pf9-ostackhost) cannot communicate with the hypervisor, causing hosts to appear offline and VMs to become unreachable.
This guide covers how to diagnose libvirtd failures, when and how to safely restart the affected services, and how to confirm that the host and its VMs have fully recovered.
In this guide, you will restore a hypervisor host that has a failing or unresponsive libvirtd or Compute Service to a healthy, operational state.
Prerequisites
SSH access to the affected hypervisor host.
Access to the Private Cloud Director UI or pcdctl CLI to monitor host and VM state.
Sufficient free resources on other cluster hosts if VM migration is needed during recovery.
Symptoms
You may be experiencing a libvirtd or Compute Service failure if:
The Private Cloud Director Service Health dashboard shows the host as Offline or the Compute Service as Unhealthy.
Running virsh list on the host hangs or returns an error such as error: failed to connect to the hypervisor or error: unable to connect to server at 'qemu:///system'.
VMs show as ERROR or Unknown in the UI, even though the host itself is reachable over SSH.
The Compute Service log (/var/log/pf9/ostackhost.log) contains repeated lines like libvirt connection refused, Connection reset by peer, or Timeout waiting for response.
Diagnose the Failure
Step 1: Check libvirtd Status
Log in to the hypervisor host and check whether libvirtd is running:
Active (running): The daemon is up. The issue may be a socket or permissions problem rather than a crash. Check the libvirt log (Step 3).
Failed or inactive: libvirtd has crashed or was stopped. Proceed to the restart procedure.
Activating (start): The daemon is stuck starting. Check for a hung process (Step 2).
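The status check in this step can be sketched as a small shell helper. This is an illustration, not a Platform9 tool: the function name libvirtd_next_step is hypothetical, and it assumes the systemd unit is named libvirtd. It maps the one-word state printed by systemctl is-active onto the three cases listed above.

```shell
# Sketch: map the libvirtd unit state to the next diagnostic step.
# ASSUMPTION: the systemd unit is named "libvirtd"; adjust if your host differs.
# Usage: libvirtd_next_step   (run "systemctl status libvirtd" for full detail)
libvirtd_next_step() {
    local state
    # is-active prints one word: active, inactive, failed, activating, ...
    state="$(systemctl is-active libvirtd 2>/dev/null || true)"
    case "$state" in
        active)          echo "running: review the libvirt log (Step 3)" ;;
        failed|inactive) echo "stopped: proceed to the restart procedure" ;;
        activating)      echo "stuck starting: check for hung processes (Step 2)" ;;
        *)               echo "unexpected state '${state}': run systemctl status libvirtd" ;;
    esac
}
```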
Step 2: Check for Hung Processes
If libvirtd appears to be running but virsh list hangs, a qemu process or socket may be blocking the daemon:
A large number of D (uninterruptible sleep) processes under qemu-system-x86_64 usually indicates a storage I/O hang. Resolve the underlying storage issue before restarting libvirtd, or the daemon will hang again.
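One way to spot the D-state condition described above is to count qemu processes whose state begins with D. The helper below is a sketch with a hypothetical name, count_dstate_qemu. It reads process state and command name pairs on stdin so it can be fed from ps. Note that ps truncates the command name, so qemu-system-x86_64 typically appears as qemu-system-x86; the match is on the qemu prefix for that reason.

```shell
# Sketch: count qemu processes in uninterruptible sleep (state D).
# Reads "STAT COMMAND" pairs on stdin.
# Usage: ps -e -o stat= -o comm= | count_dstate_qemu
count_dstate_qemu() {
    # $1 is the process state (D, D+, Ssl, ...); $2 is the command name
    awk '$1 ~ /^D/ && $2 ~ /^qemu/ { n++ } END { print n + 0 }'
}
```

A result of zero suggests the hang is not storage-related; what counts as "a large number" depends on how many VMs the host runs.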
Step 3: Review libvirt Logs
Examine the libvirt daemon log for errors leading up to the failure:
For per-VM issues, check the QEMU log for the affected VM:
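A minimal sketch for pulling the most recent error and warning lines out of either log follows. The function name recent_log_errors is hypothetical. The daemon log path /var/log/libvirt/libvirtd.log is the one named later in this guide; the per-VM path /var/log/libvirt/qemu/ is libvirt's usual default and is an assumption here, so verify it on your host.

```shell
# Sketch: show the most recent error or warning lines from a libvirt log.
# Usage: recent_log_errors <logfile> [max_lines]
#   recent_log_errors /var/log/libvirt/libvirtd.log
#   recent_log_errors /var/log/libvirt/qemu/<domain-name>.log 50   # ASSUMED path
recent_log_errors() {
    local file="$1" max="${2:-20}"
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
    # Case-insensitive match, then keep only the newest matches
    grep -iE 'error|warning' "$file" | tail -n "$max"
}
```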
Step 4: Check the Compute Service Status
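Check whether the Compute Service (pf9-ostackhost) is running, and review its log (/var/log/pf9/ostackhost.log) for libvirt connection errors.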
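As a sketch of the log check, the helper below counts lines in the Compute Service log that match the three messages listed under Symptoms. The function name count_libvirt_conn_errors is hypothetical; the default log path is the one given in this guide.

```shell
# Sketch: count libvirt connection errors in the Compute Service log.
# The three patterns are the messages listed under Symptoms.
# Usage: count_libvirt_conn_errors [logfile]
# Check the service itself with: systemctl status pf9-ostackhost
count_libvirt_conn_errors() {
    local log="${1:-/var/log/pf9/ostackhost.log}"
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 1; }
    # grep -c exits non-zero when the count is 0, so mask that status
    grep -cE 'libvirt connection refused|Connection reset by peer|Timeout waiting for response' "$log" || true
}
```

A count that keeps rising between two runs indicates the Compute Service still cannot reach libvirtd.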
Restart Procedure
Restarting libvirtd will cause a brief interruption to running VMs on the host while the connection is re-established. Running VMs themselves are not terminated by a libvirtd restart, but they will be temporarily unresponsive. If your workloads cannot tolerate any interruption, use Maintenance Mode to migrate VMs off the host first.
Restart services in the following order. Wait for each service to reach active (running) before starting the next.
Step 1: Restart libvirtd
After the restart, confirm that virsh can connect:
This should return without hanging and display any VMs currently defined on the host (running or shut off).
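The restart and the connectivity check can be combined into one sketch. The function name restart_libvirtd_and_wait, the retry count, and the 15-second limit are all illustrative choices, and the unit name libvirtd is assumed. Wrapping virsh in timeout keeps a hung daemon from blocking the loop.

```shell
# Sketch: restart libvirtd, then wait until virsh can connect.
# ASSUMPTION: the systemd unit is named "libvirtd".
# Usage: restart_libvirtd_and_wait [max_attempts]
restart_libvirtd_and_wait() {
    local attempts="${1:-10}" i=1
    systemctl restart libvirtd || return 1
    while [ "$i" -le "$attempts" ]; do
        # timeout(1) stops a hung virsh after 15 seconds
        if timeout 15 virsh -c qemu:///system list --all >/dev/null 2>&1; then
            echo "virsh connected after $i attempt(s)"
            return 0
        fi
        sleep 3
        i=$((i + 1))
    done
    echo "virsh still cannot connect after $attempts attempts" >&2
    return 1
}
```

Once it succeeds, run virsh list --all by hand to review the domains on the host.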
Step 2: Restart the Platform9 Compute Service
Once libvirtd is healthy, restart the Platform9 Compute Service:
Step 3: Restart the Platform9 Host Agent (If the Host Is Still Showing Offline)
If the host is still shown as Offline in the Private Cloud Director UI after the Compute Service restart, restart the host agent as well:
The host agent is responsible for reporting host health to the management plane. After restarting it, wait two to three minutes and then check the UI.
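The rule of restarting services in order and waiting for each to reach active (running) can be sketched as a loop. The function name restart_in_order and the 60-second limit are illustrative. The unit name pf9-ostackhost comes from this guide; pf9-hostagent is an assumed unit name for the host agent, so confirm it on your host before use.

```shell
# Sketch: restart units in order, waiting for each to become active.
# Usage: restart_in_order <unit> [<unit> ...]
#   restart_in_order pf9-ostackhost
#   restart_in_order pf9-hostagent      # ASSUMED unit name; only if still Offline
restart_in_order() {
    local unit waited
    for unit in "$@"; do
        systemctl restart "$unit" || { echo "restart failed: $unit" >&2; return 1; }
        waited=0
        until [ "$(systemctl is-active "$unit" 2>/dev/null || true)" = "active" ]; do
            if [ "$waited" -ge 60 ]; then
                echo "timed out waiting for $unit" >&2
                return 1
            fi
            sleep 2
            waited=$((waited + 2))
        done
        echo "$unit: active"
    done
}
```

The loop stops at the first unit that fails to start, so a later service is never restarted on top of an unhealthy one.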
Validate Recovery
Confirm Host Is Online
In the Private Cloud Director UI, navigate to Infrastructure > Cluster Hosts and verify the host status returns to Active.
Using the CLI:
The nova-compute entry for the host should show state: up and status: enabled.
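The CLI check can be scripted along the following lines. This sketch makes a significant assumption: that pcdctl compute service list prints a pipe-delimited table with Binary, Host, Status, and State columns. That format is not confirmed by this guide, so compare it with the real output first. The function name compute_service_ok is hypothetical.

```shell
# Sketch: check that nova-compute on a host is enabled and up.
# ASSUMPTION: input is a pipe-delimited table with Binary, Host,
# Status, and State columns; columns are located by header name.
# Usage: pcdctl compute service list | compute_service_ok <hostname>
compute_service_ok() {
    awk -F'|' -v host="$1" '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        !found_header {
            for (i = 1; i <= NF; i++) {
                name = trim($i)
                if (name == "Binary") cb = i
                if (name == "Host")   ch = i
                if (name == "Status") cst = i
                if (name == "State")  cs = i
            }
            if (cb && ch && cst && cs) found_header = 1
            next
        }
        trim($cb) == "nova-compute" && trim($ch) == host {
            seen = 1
            if (trim($cst) == "enabled" && trim($cs) == "up") ok = 1
        }
        END {
            if (!seen) { print "no nova-compute entry for " host; exit 1 }
            if (ok)    { print host ": enabled/up"; exit 0 }
            print host ": not healthy"; exit 1
        }'
}
```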
Confirm VMs Have Reconciled
After the Compute Service restarts, it re-queries libvirtd to reconcile the state of all VMs on the host. Most VMs will transition from ERROR or Unknown back to ACTIVE automatically within two to three minutes.
Any VMs that remain in ERROR after five minutes may require additional recovery steps. See Recover VMs in ERROR State After Host Reboot or Patching.
Confirm virsh Reflects Running VMs
Run virsh list --all on the host. VMs that are ACTIVE in Private Cloud Director should appear as running in its output. If a VM is ACTIVE in the UI but shut off in virsh list, restart the VM using a hard reboot.
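To find candidates for a hard reboot, the sketch below extracts the names of domains that libvirt reports as shut off. The function name shutoff_domains is hypothetical, and the domain name in the comment is only an example. Compare the resulting names with the VMs shown as ACTIVE in the UI; the hard-reboot command itself is not reproduced here because its exact syntax is not given in this guide.

```shell
# Sketch: print domains that libvirt reports as "shut off".
# Usage: virsh list --all | shutoff_domains
shutoff_domains() {
    # Skip the two header lines; data rows look like:
    #   " -    instance-0000002a   shut off"   (example name)
    awk 'NR > 2 && NF >= 4 && $(NF-1) == "shut" && $NF == "off" { print $2 }'
}
```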
Additional Checks
Verify All Platform9 Services Are Running
The Platform9 stack requires multiple services to be healthy on each hypervisor:
If any service is in a failed state, restart it individually:
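A quick way to find the services that need an individual restart is to ask systemd for failed units matching the pf9- prefix. The sketch below assumes all Platform9 units on the host share that prefix, which this guide suggests but does not state; the function name failed_pf9_units is hypothetical.

```shell
# Sketch: list Platform9 units that systemd reports as failed.
# ASSUMPTION: Platform9 service units are named pf9-*.
# Usage: failed_pf9_units
#   then, for each unit printed: systemctl restart <unit>
failed_pf9_units() {
    # --plain and --no-legend give one unit per line with no decoration
    systemctl list-units --type=service --state=failed --no-legend --plain 'pf9-*' |
        awk '{ print $1 }'
}
```

To see every Platform9 unit regardless of state, systemctl list-units --type=service --all 'pf9-*' gives the full list.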
Check for Disk Space Issues
A full disk is a common cause of libvirtd failures and Compute Service crashes:
If disk usage is above 90%, free space before restarting services.
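The 90% rule can be checked with a short filter over df output. This is a sketch with a hypothetical function name, fs_over_threshold; it relies on the POSIX output format of df -P, where the fifth column is the usage percentage and the sixth is the mount point.

```shell
# Sketch: print filesystems whose usage exceeds a threshold (default 90%).
# Usage: df -P | fs_over_threshold [percent]
fs_over_threshold() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)               # strip the trailing percent sign
        if (use + 0 > limit + 0) print $6 " " $5
    }'
}
```

No output means every filesystem is at or below the threshold. Mount points containing spaces are not handled by this sketch.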
Check for Socket File Issues
If libvirtd is running but virsh cannot connect, the UNIX socket may be in a bad state:
Restarting libvirtd normally recreates the socket. If the socket file persists after a restart and virsh still cannot connect, remove it manually and restart:
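The socket check can be sketched as follows. The default path /var/run/libvirt/libvirt-sock is libvirt's usual location and is an assumption here, since this guide does not name the path; the function name libvirt_socket_state is hypothetical. The removal step appears only as a comment because it should be run deliberately, after a restart has already failed to fix the connection.

```shell
# Sketch: report the state of the libvirt UNIX socket.
# ASSUMPTION: default path is libvirt's usual location; verify on your host.
# Usage: libvirt_socket_state [socket_path]
libvirt_socket_state() {
    local sock="${1:-/var/run/libvirt/libvirt-sock}"
    if [ -S "$sock" ]; then
        echo "socket present: $sock"
    elif [ -e "$sock" ]; then
        echo "not a socket: $sock"
        return 1
    else
        echo "missing: $sock"
        return 1
    fi
}
# If virsh still cannot connect after a restart, remove and restart:
#   rm -f /var/run/libvirt/libvirt-sock && systemctl restart libvirtd
```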
When to Contact Support
Escalate to Platform9 Support if:
libvirtd crashes immediately after each restart (check for recurring errors in /var/log/libvirt/libvirtd.log).
VMs do not reconcile to ACTIVE after the host and Compute Service are healthy.
You see storage I/O errors in the libvirt or QEMU logs that indicate underlying storage hardware issues.
Multiple hosts in the cluster are affected simultaneously.
Contact Platform9 Support and provide the libvirt log, the Compute Service log, and the output of pcdctl compute service list.
