Recover libvirt and Compute Service Failures

Overview

Private Cloud Director relies on libvirtd (the libvirt daemon) to manage the virtual machine lifecycle on each hypervisor host. If libvirtd fails or becomes unresponsive, the Platform9 Compute Service (pf9-ostackhost) cannot communicate with the hypervisor, so the host appears offline and its VMs become unreachable.

This guide covers how to diagnose libvirtd failures, when and how to safely restart the affected services, and how to confirm that the host and its VMs have fully recovered.

By the end of this guide, a hypervisor host with a failing or unresponsive libvirtd or Compute Service should be back in a healthy, operational state.

Prerequisites

  • SSH access to the affected hypervisor host.

  • Access to the Private Cloud Director UI or pcdctl CLI to monitor host and VM state.

  • Sufficient free resources on other cluster hosts if VM migration is needed during recovery.

Symptoms

You may be experiencing a libvirtd or Compute Service failure if:

  • The Private Cloud Director Service Health dashboard shows the host as Offline or the Compute Service as Unhealthy.

  • Running virsh list on the host hangs or returns an error such as "error: failed to connect to the hypervisor" or "error: unable to connect to server at 'qemu:///system'".

  • VMs show as ERROR or Unknown in the UI, even though the host itself is reachable over SSH.

  • The Compute Service log (/var/log/pf9/ostackhost.log) contains repeated lines such as "libvirt connection refused", "Connection reset by peer", or "Timeout waiting for response".

Diagnose the Failure

Step 1: Check libvirtd Status

Log in to the hypervisor host and check whether libvirtd is running:
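A minimal sketch of that check, assuming a systemd-based host where libvirt runs as the monolithic libvirtd unit (hosts that use libvirt's modular daemons have different unit names). The helper name libvirtd_state is hypothetical and not part of any Platform9 tooling.

```shell
# Assumption: libvirt runs as the "libvirtd" systemd unit.
# Prints one state keyword: active, inactive, failed, activating, ...
libvirtd_state() {
  # is-active exits non-zero for any state other than active,
  # so "|| true" keeps the helper usable in strict scripts.
  systemctl is-active "${1:-libvirtd}" 2>/dev/null || true
}

# For the full status view, including recent log lines, run:
#   sudo systemctl status libvirtd --no-pager
```

Run libvirtd_state on the affected host and compare the keyword it prints with the cases that follow.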

  • Active (running): The daemon is up. The issue may be a socket or permissions problem rather than a crash. Check the libvirt log (Step 3).

  • Failed or inactive: libvirtd has crashed or was stopped. Proceed to the restart procedure.

  • Activating (start): The daemon is stuck starting. Check for a hung process (Step 2).

Step 2: Check for Hung Processes

If libvirtd appears to be running but virsh list hangs, a qemu process or socket may be blocking the daemon:
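One way to look for blocked processes, sketched with standard ps and awk. The helper names filter_blocked and list_blocked are hypothetical; the filter reads ps output on stdin so it can also be applied to saved output.

```shell
# Keep processes in uninterruptible sleep (STAT begins with "D") whose
# command name mentions qemu or libvirt. Expects "pid stat comm" columns.
filter_blocked() {
  awk 'NR > 1 && $2 ~ /^D/ && $3 ~ /qemu|libvirt/ { print $1, $2, $3 }'
}

# Apply the filter to the live process table.
list_blocked() {
  ps -eo pid,stat,comm | filter_blocked
}
```

An empty result from list_blocked suggests that no qemu or libvirt process is currently stuck in state D.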

A large number of qemu-system-x86_64 processes in state D (uninterruptible sleep) usually indicates a storage I/O hang. Resolve the underlying storage issue before restarting libvirtd; otherwise the daemon will hang again.

Step 3: Review libvirt Logs

Examine the libvirt daemon log for errors leading up to the failure:
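A small sketch for pulling recent problems out of the daemon log. It uses the log path named later in this guide (/var/log/libvirt/libvirtd.log); the helper name recent_libvirt_errors is hypothetical, and reading the file may require sudo on your host.

```shell
# Log path taken from this guide; override LIBVIRTD_LOG if yours differs.
LIBVIRTD_LOG="${LIBVIRTD_LOG:-/var/log/libvirt/libvirtd.log}"

# Print the most recent error and warning lines from a libvirt log file.
#   $1 = log file (default: $LIBVIRTD_LOG), $2 = max lines (default: 50)
recent_libvirt_errors() {
  grep -iE 'error|warning' "${1:-$LIBVIRTD_LOG}" | tail -n "${2:-50}"
}

# If libvirtd logs to the journal instead, an equivalent view is:
#   sudo journalctl -u libvirtd --since "1 hour ago" --no-pager
```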

For per-VM issues, check the QEMU log for the affected VM:
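A sketch for reading a single VM's QEMU log, assuming libvirt's default per-domain log directory (/var/log/libvirt/qemu). The helper name qemu_log_tail is hypothetical, and the argument is the libvirt domain name as shown by virsh list --all.

```shell
# Assumption: per-domain QEMU logs are in libvirt's default directory.
QEMU_LOG_DIR="${QEMU_LOG_DIR:-/var/log/libvirt/qemu}"

# Print the last lines of the QEMU log for one libvirt domain.
#   $1 = domain name, $2 = number of lines (default: 100)
qemu_log_tail() {
  tail -n "${2:-100}" "$QEMU_LOG_DIR/$1.log"
}

# Usage: qemu_log_tail <domain-name>
```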

Step 4: Check the Compute Service Status

Check that the Compute Service (pf9-ostackhost) is running, then review its log (/var/log/pf9/ostackhost.log):
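Both checks can be sketched as follows. The log path and the error strings come from the Symptoms section; the assumption is that the systemd unit carries the same name as the service, pf9-ostackhost. The helper names are hypothetical.

```shell
# Log path as given under Symptoms; override OSTACKHOST_LOG if needed.
OSTACKHOST_LOG="${OSTACKHOST_LOG:-/var/log/pf9/ostackhost.log}"

# Print the state keyword for the Compute Service.
# Assumption: the systemd unit is named pf9-ostackhost.
compute_service_state() {
  systemctl is-active pf9-ostackhost 2>/dev/null || true
}

# Show the latest log lines matching the connection errors from Symptoms.
compute_conn_errors() {
  grep -E 'libvirt connection refused|Connection reset by peer|Timeout waiting for response' \
    "${1:-$OSTACKHOST_LOG}" | tail -n 20
}
```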

Restart Procedure

Restart services in the following order. Wait for each service to reach active (running) before starting the next.

Step 1: Restart libvirtd
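The restart-then-wait pattern described above can be sketched as a small helper. It assumes a systemd host and the libvirtd unit name; restart_and_wait is a hypothetical name, and the 60-second default timeout is an arbitrary choice rather than a documented value.

```shell
# Restart a unit, then poll until it reports "active" or the timeout expires.
#   $1 = unit name, $2 = timeout in seconds (default: 60)
restart_and_wait() {
  local unit="$1" limit="${2:-60}" waited=0
  sudo systemctl restart "$unit" || return 1
  while [ "$waited" -lt "$limit" ]; do
    if [ "$(systemctl is-active "$unit" 2>/dev/null)" = "active" ]; then
      echo "$unit is active"
      return 0
    fi
    sleep 2
    waited=$((waited + 2))
  done
  echo "$unit did not become active within ${limit}s" >&2
  return 1
}

# Usage: restart_and_wait libvirtd
```

The same helper can be reused for the later steps so that each service is confirmed active before the next one is restarted.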

After the restart, confirm that virsh can connect:
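A sketch of the connectivity check. Wrapping virsh in timeout (GNU coreutils) is an addition here so that a still-hung daemon cannot block the shell; virsh_check is a hypothetical helper name.

```shell
# List every domain defined on the host, running or shut off.
# The call is abandoned after $1 seconds (default: 15) if libvirtd hangs.
virsh_check() {
  sudo timeout "${1:-15}" virsh list --all
}
```

If the call is cut off by the timeout, libvirtd is still unresponsive.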

This should return without hanging and display any VMs currently defined on the host (running or shut off).

Step 2: Restart the Platform9 Compute Service

Once libvirtd is healthy, restart the Platform9 Compute Service:
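A minimal sketch, assuming the Compute Service runs as a systemd unit named pf9-ostackhost (the service name used throughout this guide). The helper name is hypothetical.

```shell
# Restart the Compute Service and print its resulting state keyword.
# Assumption: the systemd unit is named pf9-ostackhost.
restart_compute_service() {
  sudo systemctl restart pf9-ostackhost && systemctl is-active pf9-ostackhost
}
```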

Step 3: Restart the Platform9 Host Agent (If the Host Is Still Showing Offline)

If the host is still shown as Offline in the Private Cloud Director UI after the Compute Service restart, restart the host agent as well:
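A minimal sketch of the host agent restart. The unit name pf9-hostagent is an assumption, since this guide does not name the unit; confirm it on your host (for example with systemctl list-units 'pf9-*') and pass the real name as the first argument if it differs.

```shell
# Restart the host agent and print its resulting state keyword.
# Assumption: the unit is named pf9-hostagent; pass another name as $1 if not.
restart_host_agent() {
  local unit="${1:-pf9-hostagent}"
  sudo systemctl restart "$unit" && systemctl is-active "$unit"
}
```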

The host agent is responsible for reporting host health to the management plane. After restarting it, wait two to three minutes and then check the UI.

Validate Recovery

Confirm Host Is Online

In the Private Cloud Director UI, navigate to Infrastructure > Cluster Hosts and verify the host status returns to Active.

Using the CLI:
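A sketch built on pcdctl compute service list, the command this guide asks you to capture for Support. Filtering by host name with grep assumes the host name appears on the same output row as its nova-compute entry; compute_service_rows is a hypothetical helper name.

```shell
# List compute services and keep only the rows that mention the given host.
# Assumption: the host name is printed on the row for its nova-compute entry.
compute_service_rows() {
  pcdctl compute service list | grep -F -- "$1"
}

# Usage: compute_service_rows <host-name>
```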

The nova-compute entry for the host should show state: up and status: enabled.

Confirm VMs Have Reconciled

After the Compute Service restarts, it re-queries libvirtd to reconcile the state of all VMs on the host. Most VMs will transition from ERROR or Unknown back to ACTIVE automatically within two to three minutes.

Any VMs that remain in ERROR after five minutes may require additional recovery steps. See Recover VMs in ERROR State After Host Reboot or Patching.

Confirm virsh Reflects Running VMs
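A sketch for listing domains by state on the hypervisor, using virsh list state filters. The helper names are hypothetical.

```shell
# Names of domains that libvirt reports as running.
running_domains() {
  sudo virsh list --state-running --name
}

# Names of domains that libvirt reports as shut off.
# --all is needed because virsh lists only active domains by default.
shutoff_domains() {
  sudo virsh list --all --state-shutoff --name
}
```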

VMs that are ACTIVE in Private Cloud Director should appear as running here. If a VM is ACTIVE in the UI but shut off in virsh list, restart the VM using a hard reboot:
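A hedged sketch of the hard reboot. It assumes pcdctl accepts an OpenStack-style server reboot --hard subcommand, which this guide does not confirm; check pcdctl's built-in help before relying on it. hard_reboot_vm is a hypothetical helper name.

```shell
# Hard-reboot one VM by name or ID.
# Assumption: pcdctl supports "server reboot --hard" like the OpenStack client.
hard_reboot_vm() {
  if [ -z "$1" ]; then
    echo "usage: hard_reboot_vm <vm-name-or-id>" >&2
    return 2
  fi
  pcdctl server reboot --hard "$1"
}
```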

Additional Checks

Verify All Platform9 Services Are Running

The Platform9 stack requires multiple services to be healthy on each hypervisor:
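Rather than naming each service, the sketch below lists units by prefix. It assumes Platform9 units share the pf9- prefix, as pf9-ostackhost does; the helper names are hypothetical.

```shell
# Show every Platform9 service unit, whatever its state.
# Assumption: Platform9 units share the "pf9-" prefix.
list_pf9_units() {
  systemctl list-units --type=service --all --no-pager 'pf9-*'
}

# Print only the names of pf9 units currently in the failed state.
failed_pf9_units() {
  systemctl list-units --type=service --state=failed --no-legend --plain 'pf9-*' |
    awk '{ print $1 }'
}
```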

If any service is in a failed state, restart it individually:
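A minimal sketch for restarting one unit and confirming the result. restart_unit is a hypothetical helper name; the unit name is whatever the service check reported as failed.

```shell
# Restart a single systemd unit and print its resulting state keyword.
restart_unit() {
  sudo systemctl restart "$1" && systemctl is-active "$1"
}

# Usage: restart_unit pf9-ostackhost
```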

Check for Disk Space Issues

A full disk is a common cause of libvirtd failures and Compute Service crashes:
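A sketch that flags file systems above a usage threshold by parsing POSIX-format df output. The helper names over_threshold and check_disk are hypothetical, and the parsing assumes mount points without spaces.

```shell
# Print "mountpoint usage%" for file systems above a threshold (default 90).
# Reads `df -P` output on stdin; column 5 is capacity, column 6 the mount point.
over_threshold() {
  awk -v limit="${1:-90}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 > limit) print $6, $5
  }'
}

# Apply the filter to the live df output.
check_disk() {
  df -P | over_threshold "${1:-90}"
}
```

For a human-readable overview of every mount, df -h shows the same data.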

If disk usage is above 90%, free space before restarting services.

Check for Socket File Issues

If libvirtd is running but virsh cannot connect, the UNIX socket may be in a bad state:
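A sketch for inspecting the socket, assuming libvirt's default system socket path (/var/run/libvirt/libvirt-sock). The helper name socket_status is hypothetical.

```shell
# Assumption: libvirtd uses its default system socket path.
LIBVIRT_SOCK="${LIBVIRT_SOCK:-/var/run/libvirt/libvirt-sock}"

# Report whether the path exists and is really a UNIX socket.
socket_status() {
  local path="${1:-$LIBVIRT_SOCK}"
  if [ -S "$path" ]; then
    echo "socket present: $path"
  elif [ -e "$path" ]; then
    echo "not a socket: $path"
  else
    echo "missing: $path"
  fi
}

# Ownership and permissions can be checked with: ls -l "$LIBVIRT_SOCK"
```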

Restarting libvirtd normally recreates the socket. If the socket file persists after a restart and virsh still cannot connect, remove it manually and restart:
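A sketch of the manual cleanup, assuming the default socket path and the libvirtd unit name. On hosts where systemd socket activation owns the socket, the socket unit may also need a restart, so treat this as a starting point. reset_libvirt_socket is a hypothetical helper name.

```shell
# Remove a stale libvirt socket file, then restart the daemon so it is recreated.
# Assumption: default socket path; pass another path as $1 if yours differs.
reset_libvirt_socket() {
  local sock="${1:-/var/run/libvirt/libvirt-sock}"
  sudo rm -f -- "$sock" && sudo systemctl restart libvirtd
}
```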

When to Contact Support

Escalate to Platform9 Support if:

  • libvirtd crashes immediately after each restart (check for recurring errors in /var/log/libvirt/libvirtd.log).

  • VMs do not reconcile to ACTIVE after the host and Compute Service are healthy.

  • You see storage I/O errors in the libvirt or QEMU logs that indicate underlying storage hardware issues.

  • Multiple hosts in the cluster are affected simultaneously.

When you open the case, include the libvirt log, the Compute Service log, and the output of pcdctl compute service list.
