
# Host Upgrade Runbook

## Overview

This runbook guides you through upgrading the hosts in a <code class="expression">space.vars.product\_name</code> virtualized cluster. It covers the pre-upgrade environment checks you must complete before upgrading any host, the recommended ordering when hosts carry multiple roles, Ubuntu 22.04-to-24.04 host OS upgrade considerations, and how to recover from common upgrade failures.

{% hint style="info" %}
**Applies to both deployment models.** <code class="expression">space.vars.product\_name</code> supports two deployment models — **SaaS** (Platform9 hosts and operates the management plane) and **Self-Hosted** (you operate the management plane on-premise). The host-level guidance on this page applies to both models. Steps that act on the on-premise management plane — for example, the `airctl` commands and management-plane health checks — apply to **Self-Hosted deployments only** and are called out in boxes like the ones below. In SaaS deployments, Platform9 operates and upgrades the management plane.
{% endhint %}

In this guide, you will prepare your environment for a host upgrade, sequence the upgrade safely across multi-role hosts, handle Ubuntu OS upgrades without losing network connectivity, and recover if an upgrade stops partway through.

## Pre-Upgrade Checklist <a href="#pre-upgrade-checklist" id="pre-upgrade-checklist"></a>

Complete every item in this checklist before upgrading any host. Skipping checks is the most common cause of upgrade failures and extended maintenance windows.

### Check Region Health

The management plane must be healthy before any host upgrade begins. A degraded management plane cannot coordinate the host upgrade process reliably.

From the <code class="expression">space.vars.product\_name</code> UI, navigate to **Infrastructure > Regions** and confirm that all regions show a healthy status. Resolve any failing services before continuing.

{% hint style="info" %}
**Self-Hosted deployments only.** From the management cluster node, confirm region health from the command line:

```bash
airctl status
```

Confirm that every region shows `region health: ✅ Ready` and that `desired services` matches `ready services`. Resolve any failing pods or services before continuing.
{% endhint %}

### Verify Per-Host Free Disk Space

Insufficient disk space on a host causes the upgrade package installation to fail partway through, leaving the host in a partially upgraded state that is difficult to recover.

SSH to each host you plan to upgrade and check the following mount points:

```bash
df -h /var /var/log /opt /tmp
```

Recommended minimums before starting a host upgrade:

| Mount point | Recommended free space |
| ----------- | ---------------------- |
| `/var`      | At least 5 GB          |
| `/var/log`  | At least 2 GB          |
| `/opt`      | At least 3 GB          |
| `/tmp`      | At least 1 GB          |
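The thresholds in the table can be checked in one pass. The following is a minimal sketch, not a Platform9-provided tool: the function names `check_mount` and `check_all_mounts` are illustrative, and it assumes each GB in the table means 1024 × 1024 KiB as reported by `df -Pk`.

```bash
# check_mount MOUNT MIN_GB [AVAIL_KB]
# Prints "<mount> OK", "<mount> LOW", or "<mount> UNKNOWN".
# AVAIL_KB is optional; when omitted, it is read from df.
check_mount() {
  local mount="$1" min_gb="$2" avail_kb="${3:-}"
  if [ -z "$avail_kb" ]; then
    # df -Pk prints POSIX-format output in 1 KiB blocks; field 4 is "Available".
    avail_kb=$(df -Pk "$mount" 2>/dev/null | awk 'NR==2 {print $4}')
  fi
  if [ -z "$avail_kb" ]; then
    echo "$mount UNKNOWN"
    return 2
  fi
  if [ "$avail_kb" -ge $(( min_gb * 1024 * 1024 )) ]; then
    echo "$mount OK"
  else
    echo "$mount LOW"
    return 1
  fi
}

# Check the four mount points from the table.
# The exit status is non-zero if any check does not pass.
check_all_mounts() {
  local rc=0
  check_mount /var 5     || rc=1
  check_mount /var/log 2 || rc=1
  check_mount /opt 3     || rc=1
  check_mount /tmp 1     || rc=1
  return "$rc"
}
```

After defining the functions, run `check_all_mounts` on each host. If a path is not a separate mount point, `df` reports the filesystem that contains it, so several rows may share the same free space.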

If `/var/log` is nearly full, rotate or archive old logs before proceeding:

```bash
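# Shrink the systemd journal until it occupies at most 1 GB.
journalctl --vacuum-size=1G
# Delete compressed log archives last modified more than 30 days ago.
find /var/log -name "*.gz" -mtime +30 -delete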
```
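Because `find ... -delete` is irreversible, it can help to preview what would be removed first. A small sketch, assuming the same 30-day cutoff; the helper name `list_old_logs` is illustrative:

```bash
# list_old_logs DIR DAYS
# Prints compressed log files under DIR that were last modified more than
# DAYS days ago. Nothing is deleted.
list_old_logs() {
  local dir="$1" days="$2"
  find "$dir" -type f -name "*.gz" -mtime +"$days" -print
}
```

Run `list_old_logs /var/log 30` and review the list before running the delete command.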

### Confirm All Hosts Are Authorized and Healthy

Every host you intend to upgrade must show an `applied` role status and an `online` connection status. Resolve any host in a `failed`, `error`, `converging`, or `unknown` state before you start the upgrade.

From the UI, navigate to **Infrastructure > Cluster Hosts** and confirm that no host shows a warning or error badge.

{% hint style="info" %}
**Self-Hosted deployments only.** You can also check host status from the management cluster node:

```bash
airctl host-status --config /opt/pf9/airctl/conf/airctl-config.yaml
```

Look for any host where `Status` is not `ok` or `Agent Status` is not `running`. Investigate and resolve those hosts first.
{% endhint %}

### Verify Storage Backend Connectivity

If any host carries the Persistent Storage Service role (block storage) or the Image Library Service role, verify that those services are reachable and healthy before upgrading those hosts.

**Persistent Storage Service (block storage):**

```bash
pcdctl volume service list
```

All volume service endpoints should show `enabled` and `up`. If any endpoint is `down` or `disabled`, investigate before upgrading the host that carries that role.

**Image Library Service:**

```bash
pcdctl image-service list
```

Confirm that the Image Library Service endpoints are `enabled` and `up`.

You can also verify storage and image library health from the UI: navigate to **Infrastructure > Storage** and **Infrastructure > Image Library** respectively.

### Verify Network Connectivity

Confirm that all hosts are reachable over the management network. From any host (or, for Self-Hosted deployments, the management cluster node) that can reach the management network:

```bash
for ip in <host-ip-1> <host-ip-2> ...; do
  ping -c 2 "$ip" && echo "$ip OK" || echo "$ip UNREACHABLE"
done
```

Replace `<host-ip-1>`, `<host-ip-2>`, and so on with the management IP addresses of your hosts. If any host is unreachable, resolve the connectivity issue before upgrading.
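For larger environments, the same check can read addresses from a file and report an overall result. This is a sketch only: the function names and the one-address-per-line input format are assumptions, and the `-W 2` reply timeout is the Linux iputils `ping` option.

```bash
# probe IP: succeeds if the host answers ICMP echo requests.
# stdin is redirected so ping cannot consume the address list.
probe() {
  ping -c 2 -W 2 "$1" >/dev/null 2>&1 </dev/null
}

# check_hosts: reads one IP address per line on stdin, skipping blank lines
# and lines that start with "#". Returns non-zero if any host is unreachable.
check_hosts() {
  local ip failed=0
  while IFS= read -r ip; do
    case "$ip" in ''|'#'*) continue ;; esac
    if probe "$ip"; then
      echo "$ip OK"
    else
      echo "$ip UNREACHABLE"
      failed=1
    fi
  done
  return "$failed"
}
```

For example, `check_hosts < hosts.txt`, where `hosts.txt` is a hypothetical file that lists your management IP addresses. The non-zero exit status makes the check usable as a gate in a larger pre-upgrade script.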

### Disable VM HA and DRR Before Host OS Upgrades

{% hint style="warning" %}
**Important: disable VM HA and DRR before upgrading host operating systems.**

VM evacuation and live migration require matching operating system and KVM versions between source and destination hosts. If VM HA or DRR is active while some hosts run Ubuntu 22.04 and others run Ubuntu 24.04, evacuation attempts between those hosts will fail.
{% endhint %}

If you are upgrading the host OS (for example, from Ubuntu 22.04 to 24.04) — as distinct from upgrading the <code class="expression">space.vars.product\_name</code> host agent packages only — disable VM HA and DRR at the cluster level before starting:

1. Navigate to **Infrastructure > Clusters** in the UI.
2. Select the cluster whose hosts you are upgrading.
3. Toggle **VM High Availability** to off.
4. Toggle **Dynamic Resource Rebalancing** to off.

Re-enable both only after all hosts in the cluster have been upgraded to the same OS version. See [Post-Upgrade Verification](/private-cloud-director/upgrade/post-upgrade-verification.md#re-enable-vm-ha-and-drr) for the re-enablement steps.

For full pre-conditions and behavior details, see [Virtual Machine High Availability](/private-cloud-director/virtualized-clusters/virtualized-cluster/virtual-machine-high-availability-vm-ha.md) and [Dynamic Resource Rebalancing (DRR)](/private-cloud-director/virtualized-clusters/virtualized-cluster/dynamic-resource-rebalancing-drr.md).

### Plan Your Maintenance Window

Estimate the upgrade window before you begin:

* Host agent-only upgrades (no OS change) typically complete in 5–10 minutes per host.
* Host OS upgrades (Ubuntu 22.04 to 24.04) require an additional 20–40 minutes per host plus reboot time.

Schedule the window to avoid peak workload hours. Communicate the window to application owners, because workloads on hosts entering maintenance mode will be live-migrated to other cluster hosts during the upgrade.

If any VMs remain on a host while its OVN controller is upgraded — for example, when maintenance mode is skipped or VM HA is disabled — those VMs briefly lose network packets while the controller restarts and re-establishes flows. Platform9 testing has observed roughly **4–6 seconds** of packet loss per host (4 seconds with 47 VMs on Ubuntu 24.04, 6 seconds with 15 VMs on Ubuntu 22.04). Existing TCP connections recover automatically. Following the maintenance-mode procedure below avoids this disruption by draining the host before its OVN controller restarts.
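The per-host durations listed earlier in this section can be turned into a rough window estimate. The sketch below makes several assumptions that may not match your environment: hosts are upgraded one at a time, VM drain time is not included, a host OS upgrade also includes the agent upgrade time, and the reboot time is a value you supply. The function name `estimate_window` is illustrative.

```bash
# estimate_window AGENT_ONLY_HOSTS OS_UPGRADE_HOSTS [REBOOT_MIN]
# Prints a low-high range in minutes for a sequential upgrade.
estimate_window() {
  local agent_hosts="$1" os_hosts="$2" reboot_min="${3:-0}"
  local low high
  # Agent-only upgrade: 5-10 minutes per host.
  # OS upgrade: agent time plus an additional 20-40 minutes, plus reboot time.
  low=$(( agent_hosts * 5 + os_hosts * (5 + 20 + reboot_min) ))
  high=$(( agent_hosts * 10 + os_hosts * (10 + 40 + reboot_min) ))
  echo "${low}-${high} minutes"
}
```

For example, `estimate_window 4 2 5` (four agent-only hosts, two OS upgrades, 5 minutes per reboot) prints `80-150 minutes`. Add margin for VM migration and troubleshooting on top of this figure.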

## Host Upgrade Ordering and Role Dependencies <a href="#host-upgrade-ordering" id="host-upgrade-ordering"></a>

When hosts carry multiple roles, the order in which you upgrade them affects service availability. Follow these sequencing rules. They apply to both deployment models.

### Recommended Upgrade Order

1. **Hosts with only the Hypervisor role**: upgrade these first. They carry no shared service responsibility, so their upgrade has the narrowest blast radius.
2. **Hosts with the Networking Service role**: upgrade these next. Put each host into maintenance mode first so that its VMs migrate off before the upgrade starts.
3. **Hosts with the Image Library Service role**: upgrade these one at a time. Confirm the Image Library Service is healthy on the remaining hosts before upgrading the next one.
4. **Hosts with the Persistent Storage Service role**: upgrade these one at a time. Confirm that volumes are accessible from other hosts before proceeding to the next block storage host.
5. **Multi-role hosts**: hosts that carry more than one of the hypervisor, image library, or block storage roles are the highest-risk to upgrade. Treat each one like the most sensitive single role it carries (block storage > image library > hypervisor-only).
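If you keep a host inventory, the ordering above can be derived mechanically. The following sketch assumes a hypothetical inventory format of one host per line, `<hostname> <comma-separated-roles>`, with the illustrative role labels `hypervisor`, `networking`, `image-library`, and `block-storage`. It ranks each host by the most sensitive role it carries; placing `networking` between hypervisor-only and image library follows the list order above and is an assumption.

```bash
# upgrade_order: reads "<hostname> <role1,role2,...>" lines on stdin and
# prints hostnames in a suggested upgrade order, least sensitive first.
upgrade_order() {
  awk '
    # Map a role label to a sensitivity rank (higher = upgrade later).
    function rank(role) {
      if (role == "block-storage") return 4
      if (role == "image-library") return 3
      if (role == "networking")    return 2
      return 1
    }
    NF >= 2 {
      n = split($2, roles, ",")
      max = 1
      for (i = 1; i <= n; i++) {
        r = rank(roles[i])
        if (r > max) max = r
      }
      print max, $1
    }
  ' | sort -k1,1n -k2,2 | awk '{ print $2 }'
}
```

For example, `upgrade_order < inventory.txt`, where `inventory.txt` is your own file. Hosts with the same rank are listed alphabetically; within a rank, still upgrade one host at a time.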

### Put Each Host Into Maintenance Mode Before Upgrading

Before upgrading any host, use maintenance mode to drain its VMs:

1. Navigate to **Infrastructure > Cluster Hosts**.
2. Select the host.
3. Click **Other > Enable Maintenance Mode** and follow the prompts.
4. Wait for the **Migration Status** banner to show all VMs migrated.

Then proceed with the upgrade for that host. After the upgrade is complete and the host is back online, disable maintenance mode before moving to the next host.

See [Maintenance Mode](/private-cloud-director/virtualized-clusters/add-hosts-virtualized-cluster/maintenance-mode.md) for full details on the enable and disable process.

### Persistent Storage Service Host Considerations

If your region has only one host with the Persistent Storage Service role, upgrading that host briefly makes block storage unavailable. Schedule that upgrade during a maintenance window when no block storage volume attach or detach operations are expected. If possible, add a second block storage host before the upgrade window so there is no single point of failure.

### Image Library Service Host Considerations

If your region has only one host with the Image Library Service role, new VM provisioning from images will fail during that host's upgrade. All existing running VMs continue operating normally. Schedule that upgrade during a window when no new VMs need to be deployed.

## Ubuntu 22.04 to 24.04 Host OS Upgrade Caveats <a href="#ubuntu-os-upgrade-caveats" id="ubuntu-os-upgrade-caveats"></a>

{% hint style="warning" %}
**Network connectivity risk on hypervisor hosts.**

Upgrading a host with the Hypervisor role from Ubuntu 22.04 to 24.04 can leave the OVS (Open vSwitch) bridge misconfigured. If the bridge loses its interface binding after the reboot, the host loses network connectivity and cannot reconnect to the management plane. Arrange out-of-band console access (IPMI, iDRAC, or equivalent) before starting the OS upgrade on hypervisor hosts.
{% endhint %}

### Before the OS Upgrade

1. Record the current OVS bridge configuration on the host:

```bash
ovs-vsctl show
ip addr show
```

Save the output. You will need it to verify the configuration is intact after the reboot.

2. Confirm the host is in maintenance mode and all its VMs have been migrated off.
3. If the host has the Hypervisor role, ensure you have out-of-band console access in case network connectivity is lost after the OS upgrade.
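One way to save the output from step 1 is to write each command's output to a file. This is a minimal sketch; the function name `snapshot_network` and the output file names are illustrative, not part of the product.

```bash
# snapshot_network OUTDIR
# Saves the output of the two commands from step 1 into OUTDIR.
snapshot_network() {
  local outdir="$1"
  mkdir -p "$outdir" || return 1
  ovs-vsctl show > "$outdir/ovs-vsctl-show.txt" || return 1
  ip addr show   > "$outdir/ip-addr-show.txt"   || return 1
  echo "Saved network snapshot to $outdir"
}
```

For example, `snapshot_network "/root/pre-upgrade-$(hostname)"`, where the directory is a hypothetical location. Consider copying the files off the host as well, so they remain available if the host loses connectivity after the reboot.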

### After the OS Upgrade and Reboot

Perform network re-validation immediately after the host comes back online:

1. Confirm the OVS bridge is present and the physical interface is still a member:

```bash
ovs-vsctl show
```

The output should show your bridge (for example, `br-ex` or `br-data`) with the physical interface listed as a port. If the physical interface is missing from the bridge, re-add it:

```bash
ovs-vsctl add-port <bridge-name> <physical-interface>
```

Replace `<bridge-name>` and `<physical-interface>` with the values from your pre-upgrade recording.

2. Confirm the host can reach the management plane:

```bash
curl -sk https://<management-plane-fqdn>/resmgr/v1/hosts | head -c 200
```

3. Confirm the host agent is running and connected:

```bash
systemctl status pf9-hostagent
```

The service should be `active (running)`. If it is stopped or failed, restart it:

```bash
systemctl restart pf9-hostagent
```

4. From the management plane, confirm the host shows `online` connection status in **Infrastructure > Cluster Hosts**.
5. After confirming network and host agent health, disable maintenance mode and allow the host to rejoin scheduling before proceeding to the next host.
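The comparison in step 1 can be done mechanically if you saved the pre-upgrade `ovs-vsctl show` output to a file. The sketch below extracts bridge and port names from two saved outputs and prints the pairs that disappeared. The function names are illustrative, and the parsing assumes the usual `Bridge <name>` / `Port <name>` layout of `ovs-vsctl show`.

```bash
# ports_from_show: reads `ovs-vsctl show` output on stdin and prints one
# "<bridge> <port>" pair per line, sorted. Quotes around names are stripped
# because some OVS versions quote them.
ports_from_show() {
  awk '
    $1 == "Bridge" { gsub(/"/, "", $2); bridge = $2 }
    $1 == "Port"   { gsub(/"/, "", $2); print bridge, $2 }
  ' | sort
}

# missing_ports BEFORE_FILE AFTER_FILE
# Prints bridge/port pairs present in BEFORE_FILE but absent from AFTER_FILE.
missing_ports() {
  local before after
  before=$(mktemp) && after=$(mktemp) || return 1
  ports_from_show < "$1" > "$before"
  ports_from_show < "$2" > "$after"
  # comm -23 keeps lines that appear only in the first sorted input.
  comm -23 "$before" "$after"
  rm -f "$before" "$after"
}
```

For example, save the post-reboot output with `ovs-vsctl show > after.txt` and run `missing_ports before.txt after.txt` (both file names are hypothetical). Review the result before re-adding anything: ports that are created dynamically, such as VM or tunnel ports, can legitimately differ, whereas a missing physical interface is the case described in step 1.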

### Mixed OS Versions in a Cluster

Do not leave a cluster in a mixed OS state longer than necessary. A cluster where some hosts run Ubuntu 22.04 and others run Ubuntu 24.04 has the following limitations:

* VM HA evacuations between mixed-OS hosts will fail (KVM version mismatch).
* DRR live migrations between mixed-OS hosts will fail.
* Manual VM migrations between mixed-OS hosts are not supported.

Keep VM HA and DRR disabled for the entire cluster until all hosts are on the same OS version.

## Host Upgrade Failure Recovery <a href="#failure-recovery" id="failure-recovery"></a>

The host agent and package-level recovery steps below apply to both deployment models. Steps that re-submit an upgrade through `airctl` or inspect the region's Kubernetes namespace apply to Self-Hosted deployments only and are marked. In SaaS deployments, if a host upgrade fails and the host-level steps below do not recover it, contact Platform9 Support.

### 409 Conflict Error During Host Upgrade

A `409 Conflict` response during a host upgrade typically means the management plane has a stale lock or an in-progress record for that host from a previous attempt. This prevents the upgrade from being re-submitted.

**Recovery steps:**

1. Verify the host agent is running and reports back to the management plane:

```bash
# On the affected host
systemctl status pf9-hostagent
```

{% hint style="info" %}
**Self-Hosted deployments only.** Check whether the host upgrade pod from the previous attempt is still running, and clear it before retrying:

```bash
kubectl get pods -n <region-fqdn> | grep host-upgrade
```

If a `host-upgrade-*` pod is in `Running` or `Pending` state from a prior attempt, wait for it to complete or delete it:

```bash
kubectl delete pod <host-upgrade-pod-name> -n <region-fqdn>
```

{% endhint %}

Once the host agent is confirmed running (and, for Self-Hosted deployments, the stale pod is cleared), retry the host upgrade. If the conflict persists, contact Platform9 Support.
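For Self-Hosted deployments, the pod check above can be narrowed to only the pods that block a retry. This sketch filters the standard `kubectl get pods --no-headers` columns (NAME, READY, STATUS, RESTARTS, AGE); the function name `stale_upgrade_pods` is illustrative.

```bash
# stale_upgrade_pods: reads `kubectl get pods --no-headers` output on stdin
# and prints the names of host-upgrade pods that are Running or Pending.
stale_upgrade_pods() {
  awk '$1 ~ /^host-upgrade-/ && ($3 == "Running" || $3 == "Pending") { print $1 }'
}
```

For example, `kubectl get pods -n <region-fqdn> --no-headers | stale_upgrade_pods`. An empty result suggests that no earlier upgrade pod is still active in that namespace.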

### "Cluster Name Removed from Host" Error

This error appears when the host's local configuration no longer contains the cluster association record. It typically occurs if the host was deauthorized or had its roles removed while an upgrade was in progress.

**Recovery steps:**

1. Check the host's role status. From the UI, navigate to **Infrastructure > Cluster Hosts** and inspect the affected host.
2. If the host shows `unauthorized` or missing roles, re-authorize it and re-assign roles from the UI: select the host, click **Edit Roles**, and re-assign the appropriate roles.
3. Wait for the host to reach `applied` role status (the status transitions through `converging`).
4. Once the host is back to `applied` status, retry the host upgrade.

### Partial or Failed Upgrade Cleanup

If a host upgrade fails midway, the host may be in an inconsistent state with mixed package versions. Before retrying, perform the following cleanup on the affected host:

1. Check which <code class="expression">space.vars.product\_name</code> packages are installed on the host and their versions:

```bash
dpkg -l | grep pf9
```

2. If you see a mix of old and new package versions, force-reinstall the current target packages to bring the host to a consistent state:

```bash
apt-get install --reinstall pf9-ostackhost pf9-neutron-base pf9-neutron-ovn-controller
```

Add any other `pf9-*` packages shown by the `dpkg -l` output that are at the wrong version.

3. Restart the host agent after reinstalling:

```bash
systemctl restart pf9-hostagent
```

4. Wait for the host to return to `applied` status in the management plane, then retry the upgrade.
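To spot mixed versions more easily than by reading the `dpkg -l` output, you can group packages by version. This sketch assumes that packages from one release share a version string, which you should confirm for your release; the function name `summarize_versions` is illustrative.

```bash
# summarize_versions: reads "<package> <version>" lines on stdin and prints
# one line per distinct version, followed by the packages at that version.
summarize_versions() {
  awk '
    NF == 2 {
      if ($2 in pkgs) pkgs[$2] = pkgs[$2] " " $1
      else            pkgs[$2] = $1
    }
    END { for (v in pkgs) print v ": " pkgs[v] }
  ' | sort
}
```

For example, `dpkg-query -W -f='${Package} ${Version}\n' 'pf9-*' | summarize_versions`. Under the assumption above, more than one output line indicates a mix of versions that the reinstall in step 2 should resolve.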

### Host Re-Sync and Retry

If a host is stuck in `converging` state for more than 10 minutes after a failed upgrade:

1. From the UI, navigate to **Infrastructure > Cluster Hosts**, select the host, and use **Other > Re-sync Host** (if available) to trigger a re-convergence.
2. Alternatively, restart the host agent on the host directly:

```bash
systemctl restart pf9-hostagent
```

3. Monitor the hostagent log for convergence progress:

```bash
tail -f /var/log/pf9/hostagent.log
```

Look for `converge successful` or `role application complete` messages.

4. If the host does not converge within another 10 minutes, contact Platform9 Support with the hostagent log from the affected host (and, for Self-Hosted deployments, the output of `airctl host-status`).
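Instead of watching the log by hand, the wait in steps 3 and 4 can be scripted. The sketch below polls the host agent log for the two messages named in step 3 and gives up after a timeout. The function names are illustrative, and only lines appended after the script starts are considered, so start it before or immediately after restarting the host agent.

```bash
# new_lines_match LOGFILE START_LINE
# Succeeds if a convergence message appears after line START_LINE.
new_lines_match() {
  tail -n "+$(( $2 + 1 ))" "$1" |
    grep -Eq 'converge successful|role application complete'
}

# wait_for_converge LOGFILE TIMEOUT_SEC [INTERVAL_SEC]
# Polls LOGFILE until a convergence message is appended or the timeout passes.
wait_for_converge() {
  local logfile="$1" timeout="$2" interval="${3:-15}" waited=0 start
  # Remember the current length so older messages are ignored.
  start=$(wc -l < "$logfile") || return 2
  while [ "$waited" -lt "$timeout" ]; do
    if new_lines_match "$logfile" "$start"; then
      echo "converged"
      return 0
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  echo "not converged after ${timeout}s"
  return 1
}
```

For example, `wait_for_converge /var/log/pf9/hostagent.log 600` waits up to 10 minutes. If the log is rotated while the script runs, the line offset no longer applies, so fall back to `tail -f` in that case.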

## Next Steps

After all hosts are upgraded, complete the [Post-Upgrade Verification](/private-cloud-director/upgrade/post-upgrade-verification.md) checklist to confirm that all services are healthy and re-enable VM HA and DRR.

