Recover from Messaging Layer Failures

Overview

Private Cloud Director's Compute Service components communicate with each other through a messaging layer. When the messaging layer becomes unhealthy, VM create requests appear to hang or fail silently: a VM stays in BUILD state indefinitely, or a burst of VM creates produces a large number of failures.

This guide explains how to recognize a messaging layer problem, what symptoms distinguish it from other failure modes, and how to recover. Management-plane remediation steps (such as restarting messaging layer pods) apply only to Self-Hosted deployments; SaaS customers should contact Platform9 Support.

In this guide, you will identify a messaging layer failure and restore normal VM creation behavior.

Prerequisites

  • Access to the Private Cloud Director UI or pcdctl CLI.

  • For Self-Hosted deployments: kubectl access to the region namespace.

Recognize a Messaging Layer Failure

Messaging layer failures produce a characteristic pattern that distinguishes them from resource exhaustion or host failures:

| Symptom | Messaging layer failure | Resource exhaustion | Host failure |
|---|---|---|---|
| VMs stuck in BUILD indefinitely | Yes | No (fails quickly) | Partial (depends on timing) |
| Error message | None, or generic timeout | "No valid host was found" | "Failed to spawn" or libvirt error |
| Affects all new VM creates simultaneously | Yes (all fail at the same time) | Yes (all fail at the same time) | No (only VMs on the affected host) |
| Existing running VMs affected | No | No | Yes (VMs on failed host) |
| pcdctl compute service list shows hosts up | Yes | Yes | No (affected host shows down) |

If you see VMs stuck in BUILD for more than ten minutes with no fault message, and pcdctl compute service list shows all hosts as state: up, a messaging layer issue is the most likely cause.
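The rule of thumb above can be written down as a small check. This is only a sketch of the stated heuristic, not a Platform9 tool: the function name and its inputs are hypothetical, and you would fill them in by hand from the UI or from pcdctl output.

```python
def likely_messaging_failure(minutes_in_build, fault_message, host_states):
    """Return True when observations match the messaging-layer pattern.

    minutes_in_build: how long the VM has been in BUILD.
    fault_message:    the VM's fault text, or None/"" if there is none.
    host_states:      state strings as read from `pcdctl compute service list`.
    All three inputs are assumptions about how you collect the data.
    """
    stuck_too_long = minutes_in_build > 10
    no_fault = not fault_message  # None or empty string
    all_hosts_up = bool(host_states) and all(s == "up" for s in host_states)
    return stuck_too_long and no_fault and all_hosts_up
```

A result of True means only that the messaging layer is the most likely cause; the diagnostic steps that follow are still needed to confirm it.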

Diagnose the Failure

Step 1: Confirm VMs Are Stuck in BUILD

List the VMs that are currently in BUILD state by running pcdctl server list --status BUILD. If this returns a large number of VMs, or if VMs have been in BUILD for an unusually long time (more than ten minutes for a standard VM), proceed to the next step.
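When the list is long, the ten-minute threshold can be applied mechanically. The sketch below assumes you have already turned the server list into Python dictionaries; the keys "name", "status", and "created" are hypothetical and may not match the field names your CLI output uses.

```python
from datetime import datetime, timedelta, timezone


def stuck_in_build(servers, now, threshold=timedelta(minutes=10)):
    """Return the names of servers that have been in BUILD longer than threshold.

    servers: list of dicts with assumed keys "name", "status", "created".
    "created" must be an ISO 8601 string with an explicit offset such as
    "+00:00" so that it can be compared with a timezone-aware `now`.
    """
    stuck = []
    for server in servers:
        if server["status"] != "BUILD":
            continue
        created = datetime.fromisoformat(server["created"])
        if now - created > threshold:
            stuck.append(server["name"])
    return stuck
```

Pass `datetime.now(timezone.utc)` as `now` for a live check, or a fixed time when reviewing an incident after the fact.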

Step 2: Check for Timeout Errors in the Compute Service Log

On an affected hypervisor host, inspect the Compute Service log, /var/log/pf9/ostackhost.log, for messaging errors.

Errors referencing MessagingTimeout, AMQP, or rabbit confirm that the Compute Service cannot reach the messaging layer.
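A quick way to pull out only the relevant lines is to filter the log for the three markers named above. This is a minimal sketch; the function name is hypothetical, and matching case-insensitively is a choice made here (so that "RabbitMQ" is caught as well), not something the product prescribes.

```python
import re

# The three markers named in this step; IGNORECASE is an assumption of this sketch.
MESSAGING_ERROR = re.compile(r"MessagingTimeout|AMQP|rabbit", re.IGNORECASE)


def find_messaging_errors(lines):
    """Return the lines that mention a messaging-layer error marker."""
    return [line.rstrip("\n") for line in lines if MESSAGING_ERROR.search(line)]
```

On a hypervisor you could pass the open log file directly, for example `find_messaging_errors(open("/var/log/pf9/ostackhost.log"))`, and attach the result to a support ticket.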

In Self-Hosted deployments, also check the nova-conductor or nova-scheduler pod logs (for example, with kubectl logs) for the same messaging errors.

Self-Hosted deployments only

The following steps require kubectl access to the region namespace. In SaaS deployments, contact Platform9 Support to inspect management-plane component health and perform any pod-level remediation.

Step 3: Check Messaging Layer Pod Health (Self-Hosted Only)

List the pods in the region namespace (for example, with kubectl get pods -n followed by your region namespace) and check whether the messaging layer pods are healthy.

Look for pods in CrashLoopBackOff, OOMKilled, Pending, or Error state. A CrashLoopBackOff messaging layer pod is a strong indicator of the root cause.
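In a namespace with many pods, the unhealthy ones can be filtered out of the listing automatically. The sketch below assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE); the function name is hypothetical.

```python
# Pod states called out in this step as signs of trouble.
UNHEALTHY = {"CrashLoopBackOff", "OOMKilled", "Pending", "Error"}


def unhealthy_pods(get_pods_output):
    """Return (pod_name, status) pairs for pods in an unhealthy state.

    get_pods_output: the text printed by `kubectl get pods`, assuming the
    default columns where STATUS is the third whitespace-separated field.
    """
    results = []
    for line in get_pods_output.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        if fields[2] in UNHEALTHY:
            results.append((fields[0], fields[2]))
    return results
```

Feed it the captured command output as a string; any messaging layer pod that appears in the result is the first candidate to describe and inspect.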

Describe the affected pod (kubectl describe pod) to see its recent events, then check its logs (kubectl logs). In both, look for memory pressure events, disk quota errors, or authentication failures.

Recovery Procedure

For Self-Hosted Deployments

If the messaging layer pod is in a crash state, attempt a pod restart, then wait for the pod to return to Running state.

After the messaging layer pod is healthy, the Compute Service components reconnect automatically. Allow two to three minutes for reconnection, then verify that new VM creates succeed.
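If you script the recovery, the wait can be expressed as a polling loop with a deadline instead of a fixed sleep. This is a generic sketch: `wait_until` and its parameters are hypothetical, and `check` stands for whatever health test you supply (for example, a function that confirms a test VM reached ACTIVE). The 180-second default mirrors the upper end of the two-to-three-minute window above.

```python
import time


def wait_until(check, timeout_s=180, interval_s=10,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or the timeout passes.

    Returns True if check() succeeded, False if the deadline was reached.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A False result after the full window is a reasonable trigger for the escalation path described next.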

If restarting the pod does not resolve the issue, or if the pod crashes again immediately, escalate to Platform9 Support. The underlying cause may be resource exhaustion (memory or disk), a misconfiguration, or a persistent authentication issue that requires deeper investigation.

For SaaS Deployments

You do not have access to manage messaging layer infrastructure in a SaaS deployment. Contact Platform9 Support and provide:

  • The time range when VM creates began failing.

  • The output of pcdctl server list --status BUILD.

  • The output of pcdctl compute service list.

  • The relevant lines from /var/log/pf9/ostackhost.log on an affected hypervisor.

After Recovery: Clean Up Stuck VMs

After the messaging layer is healthy, VMs that were stuck in BUILD may not automatically recover. Check their status by running pcdctl server list --status BUILD again.

For each VM that is still stuck, delete it and then re-create it.

If the delete also hangs, reset the VM state first and then retry the delete. Resetting VM state requires management-plane access and is available in Self-Hosted deployments only; in SaaS deployments, contact Platform9 Support.

Prevent Messaging Layer Failures at Scale

When creating a large number of VMs in a burst, the following practices reduce the risk of overwhelming the messaging layer:

  • Stage VM creation in batches. Rather than creating hundreds of VMs simultaneously, create them in groups of 20–50 and wait for each batch to reach ACTIVE before proceeding.

  • Monitor VM creation rate against available messaging layer capacity. In Self-Hosted deployments, review messaging layer resource allocation (CPU, memory) in the region namespace and increase limits if burst workloads regularly hit capacity.

  • Use tenant quotas to limit simultaneous creation. See Tenant Quotas, User Quotas and VM Leases to set per-tenant instance and core limits.
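The batching practice above can be sketched as follows. This is an illustration under stated assumptions, not a Platform9 API: `create` and `wait_for_active` are hypothetical callables that you would implement with your own tooling, and the default batch size of 20 is the low end of the 20–50 range suggested above.

```python
def batches(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def create_in_batches(names, create, wait_for_active, batch_size=20):
    """Create VMs batch by batch, stopping at the first batch that fails.

    create(name):           hypothetical callable that submits one VM create.
    wait_for_active(batch): hypothetical callable that returns True once
                            every VM in the batch has reached ACTIVE.
    Returns the names from the batches that completed.
    """
    completed = []
    for batch in batches(names, batch_size):
        for name in batch:
            create(name)
        if not wait_for_active(batch):
            break  # do not pile more requests onto an unhealthy system
        completed.extend(batch)
    return completed
```

Stopping at the first failed batch keeps a messaging layer problem from turning into hundreds of stuck VMs that need cleanup afterwards.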
