
# Recover from Messaging Layer Failures

## Overview

<code class="expression">space.vars.product\_name</code>'s Compute Service components communicate with each other through a messaging layer. When the messaging layer becomes unhealthy, VM create requests hang or fail silently: a VM stays in **BUILD** state indefinitely, or a burst of VM creates produces a large number of failures.

This guide explains how to recognize a messaging layer problem, how its symptoms differ from those of other failure modes, and how to restore normal VM creation behavior. Management-plane remediation steps (such as restarting messaging layer pods) apply only to Self-Hosted deployments; SaaS customers should contact Platform9 Support.

## Prerequisites

* Access to the <code class="expression">space.vars.product\_name</code> UI or `pcdctl` CLI.
* For Self-Hosted deployments: `kubectl` access to the region namespace.

## Recognize a Messaging Layer Failure

Messaging layer failures produce a characteristic pattern that distinguishes them from resource exhaustion or host failures:

| Symptom                                      | Messaging layer failure         | Resource exhaustion             | Host failure                       |
| -------------------------------------------- | ------------------------------- | ------------------------------- | ---------------------------------- |
| VMs stuck in BUILD indefinitely              | Yes                             | No (fails quickly)              | Partial (depends on timing)        |
| Error message                                | None, or generic timeout        | "No valid host was found"       | "Failed to spawn" or libvirt error |
| Affects all new VM creates simultaneously    | Yes (all fail at the same time) | Yes (all fail at the same time) | No (only VMs on the affected host) |
| Existing running VMs affected                | No                              | No                              | Yes (VMs on failed host)           |
| `pcdctl compute service list` shows hosts up | Yes                             | Yes                             | No (affected host shows down)      |

If you see VMs stuck in **BUILD** for more than ten minutes with no fault message, and `pcdctl compute service list` shows all hosts as `state: up`, a messaging layer issue is the most likely cause.
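
The table can also be read as a small decision rule. The following is a minimal sketch of that rule, not a definitive diagnostic: the function name `likely_cause` and its two arguments are hypothetical, and it only encodes the fault-message and host-state rows of the table above.

```bash
# Map two observations to the most likely failure mode in the table above.
#   $1: fault message reported for the VM (empty string if there is none)
#   $2: "up" if every host is up in `pcdctl compute service list`, else "down"
likely_cause() {
    fault="$1"
    hosts="$2"
    # A host that is down points to a host failure regardless of the message.
    if [ "$hosts" != "up" ]; then
        echo "host failure"
        return
    fi
    case "$fault" in
        *"No valid host was found"*)       echo "resource exhaustion" ;;
        *"Failed to spawn"*|*libvirt*)     echo "host failure" ;;
        ""|*[Tt]imeout*|*[Tt]"imed out"*)  echo "messaging layer failure" ;;
        *)                                 echo "undetermined" ;;
    esac
}
```

For example, `likely_cause "" up` prints `messaging layer failure`. Treat the result as a starting point for the diagnosis steps, not as a conclusion.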

## Diagnose the Failure

### Step 1: Confirm VMs Are Stuck in BUILD

```bash
pcdctl server list --status BUILD
```

If this returns a large number of VMs, or if VMs have been in BUILD for an unusually long time (more than ten minutes for a standard VM), proceed to the next step.

### Step 2: Check for Timeout Errors in the Compute Service Log

On an affected hypervisor host, inspect the Compute Service log:

```bash
sudo grep -i "timeout\|connection refused\|AMQP\|rabbit\|MessagingTimeout" \
    /var/log/pf9/ostackhost.log | tail -50
```

Errors referencing `MessagingTimeout`, `AMQP`, or `rabbit` confirm that the Compute Service cannot reach the messaging layer.
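
To see which kind of error dominates, the same search can be split into per-keyword counts. This is a minimal sketch: the function name `count_messaging_errors` is hypothetical, and the keyword list simply mirrors the patterns in the `grep` command above.

```bash
# Print how many lines of a log file mention each messaging-related keyword.
# Usage: count_messaging_errors /var/log/pf9/ostackhost.log
count_messaging_errors() {
    log_file="$1"
    for keyword in "MessagingTimeout" "AMQP" "rabbit" "connection refused"; do
        # -c counts matching lines, -i ignores case,
        # -F treats the keyword as a literal string rather than a regex.
        count=$(grep -ciF -- "$keyword" "$log_file")
        printf '%-20s %s\n' "$keyword" "$count"
    done
}
```

The function needs read access to the log file, so on a hypervisor host you may need to run it from a root shell or against a copy of the log.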

On Self-Hosted deployments, also check the `nova-conductor` and `nova-scheduler` pod logs for messaging errors.

{% hint style="info" %}
**Self-Hosted deployments only**

The following steps require `kubectl` access to the region namespace. In SaaS deployments, contact Platform9 Support to inspect management-plane component health and perform any pod-level remediation.
{% endhint %}

```bash
kubectl logs deployment/nova-conductor -n <WORKLOAD_REGION> | \
    grep -i "timeout\|AMQP\|rabbit\|MessagingTimeout" | tail -50

kubectl logs deployment/nova-scheduler -n <WORKLOAD_REGION> | \
    grep -i "timeout\|AMQP\|rabbit\|MessagingTimeout" | tail -50
```

### Step 3: Check Messaging Layer Pod Health (Self-Hosted Only)

Check whether the messaging layer pods in the region namespace are healthy:

```bash
kubectl get pods -n <WORKLOAD_REGION> | grep -i rabbit
```

Look for pods in `CrashLoopBackOff`, `OOMKilled`, `Pending`, or `Error` state. A messaging layer pod in `CrashLoopBackOff` strongly suggests that the pod itself is the root cause.
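
When the namespace contains many pods, a small filter can surface only the unhealthy ones. The sketch below assumes the default `kubectl get pods` column layout (`NAME READY STATUS RESTARTS AGE`); the function name `unhealthy_pods` is hypothetical.

```bash
# Read `kubectl get pods` output on stdin and print NAME, READY, and STATUS
# for every pod that is not fully healthy.
unhealthy_pods() {
    awk '
        $1 == "NAME"      { next }   # skip the header line if present
        $3 == "Completed" { next }   # finished jobs are not failures
        {
            # READY looks like "1/1": ready containers / total containers.
            split($2, ready, "/")
            if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
        }
    '
}
```

A possible invocation is `kubectl get pods -n <WORKLOAD_REGION> | grep -i rabbit | unhealthy_pods`. Empty output means every matching pod is `Running` with all of its containers ready.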

Describe the pod to see recent events:

```bash
kubectl describe pod <RABBITMQ_POD_NAME> -n <WORKLOAD_REGION>
```

Check the pod logs:

```bash
kubectl logs <RABBITMQ_POD_NAME> -n <WORKLOAD_REGION> --tail=100
```

Look for memory pressure events, disk quota errors, or authentication failures.

## Recovery Procedure

### For Self-Hosted Deployments

If the messaging layer pod is in a crash state, attempt a pod restart:

```bash
kubectl rollout restart deployment/<RABBITMQ_DEPLOYMENT_NAME> -n <WORKLOAD_REGION>
```

Wait for the pod to return to `Running` state:

```bash
kubectl rollout status deployment/<RABBITMQ_DEPLOYMENT_NAME> -n <WORKLOAD_REGION>
```

After the messaging layer pod is healthy, the Compute Service components reconnect automatically. Allow two to three minutes for reconnection, then verify that new VM creates succeed.
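
Rather than waiting a fixed interval, you can poll a health check until it passes. The following is a generic retry helper, offered as a sketch: `retry_until` is a hypothetical name, and the check command you pass to it depends on what signals recovery in your environment.

```bash
# Re-run a command until it succeeds or the attempt budget is used up.
# Usage: retry_until <max_attempts> <sleep_seconds> <command> [args...]
# Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
    max_attempts="$1"
    sleep_seconds="$2"
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Only sleep when another attempt is still to come.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$sleep_seconds"
        fi
    done
    return 1
}
```

For instance, `retry_until 6 30 test_vm_is_active` would try a check up to six times, 30 seconds apart, which covers roughly the two to three minutes mentioned above. Here `test_vm_is_active` stands for a check you write yourself, such as a function that succeeds once a test VM reaches **ACTIVE**.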

If restarting the pod does not resolve the issue, or if the pod crashes again immediately, escalate to Platform9 Support. The underlying cause may be resource exhaustion (memory or disk), a misconfiguration, or a persistent authentication issue that requires deeper investigation.

### For SaaS Deployments

You do not have access to manage messaging layer infrastructure in a SaaS deployment. Contact [Platform9 Support](https://support.platform9.com/) and provide:

* The time range when VM creates began failing.
* The output of `pcdctl server list --status BUILD`.
* The output of `pcdctl compute service list`.
* The relevant lines from `/var/log/pf9/ostackhost.log` on an affected hypervisor.
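
Collecting these outputs into one directory makes them easier to attach to a support case. The sketch below is one way to do that; the `capture` function, the `OUT_DIR` variable, the file labels, and the 500-line tail are all illustrative choices, not a Platform9 tool.

```bash
# Directory that will hold the collected files; the default name is arbitrary.
OUT_DIR="${OUT_DIR:-./pf9-support-$(date +%Y%m%d-%H%M%S)}"

# Run a command and save its stdout and stderr to $OUT_DIR/<label>.txt.
# A failing command is recorded in the file instead of stopping the collection.
capture() {
    label="$1"
    shift
    mkdir -p "$OUT_DIR"
    if ! "$@" > "$OUT_DIR/$label.txt" 2>&1; then
        echo "[command exited with an error]" >> "$OUT_DIR/$label.txt"
    fi
}

# Example calls, using the commands listed above:
#   capture build-vms        pcdctl server list --status BUILD
#   capture compute-services pcdctl compute service list
#   capture ostackhost-tail  sudo tail -n 500 /var/log/pf9/ostackhost.log
```

Note the time range of the failures separately, since none of these commands records it.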

### After Recovery: Clean Up Stuck VMs

After the messaging layer is healthy, VMs that were stuck in BUILD may not automatically recover. Check their status:

```bash
pcdctl server list --status BUILD
```

For each stuck VM, delete it and then re-create it:

```bash
pcdctl server delete <VM_UUID>
```

If the delete also hangs, reset the VM state first and then retry the delete. This step applies to Self-Hosted deployments only and requires management-plane access; in SaaS deployments, contact Platform9 Support:

```bash
pcdctl server reset-state <VM_UUID>
pcdctl server delete <VM_UUID>
```
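
When many VMs are stuck, the delete can be applied to a list of UUIDs. The following sketch assumes you have already saved the UUIDs, one per line, to a file; the function name `delete_stuck_vms` and the `DRY_RUN` variable are hypothetical. It defaults to printing the commands so that you can review them before anything is deleted.

```bash
# Read one VM UUID per line on stdin and delete each VM.
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
delete_stuck_vms() {
    while IFS= read -r vm_uuid; do
        # Skip blank lines in the input.
        [ -n "$vm_uuid" ] || continue
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "pcdctl server delete $vm_uuid"
        else
            # </dev/null keeps the command from consuming the UUID list.
            pcdctl server delete "$vm_uuid" </dev/null \
                || echo "delete failed for $vm_uuid" >&2
        fi
    done
}
```

Run `delete_stuck_vms < stuck_vm_uuids.txt` to preview, then repeat with `DRY_RUN=0` set in the environment to perform the deletes. The file name `stuck_vm_uuids.txt` is only an example.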

## Prevent Messaging Layer Failures at Scale

When creating a large number of VMs in a burst, the following practices reduce the risk of overwhelming the messaging layer:

* **Stage VM creation in batches.** Rather than creating hundreds of VMs simultaneously, create them in groups of 20–50 and wait for each batch to reach **ACTIVE** before proceeding.
* **Monitor VM creation rate against available messaging layer capacity.** In Self-Hosted deployments, review messaging layer resource allocation (CPU, memory) in the region namespace and increase limits if burst workloads regularly hit capacity.
* **Use tenant quotas to limit simultaneous creation.** See [Tenant Quotas, User Quotas and VM Leases](/private-cloud-director/identity-and-multi-tenancy/tenant-quotas-user-quotas-and-vm-leases.md) to set per-tenant instance and core limits.
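
The batching practice above can be sketched as a loop that pauses after every fixed-size group. In this example `create_vm` and `wait_for_batch` are placeholders that only print what they would do; replace their bodies with the create command and the **ACTIVE** check used in your environment. All function names here are hypothetical.

```bash
# Placeholders: swap in your real create command and readiness check.
create_vm()      { echo "create $1"; }
wait_for_batch() { echo "wait for batch of $1"; }

# Read one VM name per line on stdin and create the VMs in batches.
# Usage: create_in_batches <batch_size> < vm_names.txt
create_in_batches() {
    batch_size="$1"
    in_batch=0
    while IFS= read -r vm_name; do
        [ -n "$vm_name" ] || continue
        # </dev/null keeps the create step from consuming the name list.
        create_vm "$vm_name" </dev/null
        in_batch=$((in_batch + 1))
        if [ "$in_batch" -eq "$batch_size" ]; then
            wait_for_batch "$in_batch"
            in_batch=0
        fi
    done
    # Flush the final, partially filled batch.
    if [ "$in_batch" -gt 0 ]; then
        wait_for_batch "$in_batch"
    fi
}
```

With five names and a batch size of 2, the loop waits after the second, fourth, and fifth VM. A batch size in the 20–50 range matches the guidance above.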

## Related Pages

* [Failed to Deploy Virtual Machine](/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/failed-to-deploy-virtual-machine.md)
* [Diagnose VM Scheduling Failures](/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/diagnose-vm-scheduling-failures.md)
* [Recover VMs in ERROR State After Host Reboot or Patching](/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/recover-vms-in-error-state.md)
* [Tenant Quotas, User Quotas and VM Leases](/private-cloud-director/identity-and-multi-tenancy/tenant-quotas-user-quotas-and-vm-leases.md)

