
Host Upgrade Runbook

Overview

This runbook guides you through upgrading the hosts in a Private Cloud Director virtualized cluster. It covers the pre-upgrade environment checks you must complete before upgrading any host, the recommended ordering when hosts carry multiple roles, Ubuntu 22.04-to-24.04 host OS upgrade considerations, and how to recover from common upgrade failures.

Applies to both deployment models. Private Cloud Director supports two deployment models: SaaS (Platform9 hosts and operates the management plane) and Self-Hosted (you operate the management plane on-premises). The host-level guidance on this page applies to both models. Steps that act on the on-premises management plane (for example, the airctl commands and management-plane health checks) apply to Self-Hosted deployments only and are marked "Self-Hosted deployments only." In SaaS deployments, Platform9 operates and upgrades the management plane.

In this guide, you will prepare your environment for a host upgrade, sequence the upgrade safely across multi-role hosts, handle Ubuntu OS upgrades without losing network connectivity, and recover if an upgrade stops partway through.

Pre-Upgrade Checklist

Complete every item in this checklist before upgrading any host. Skipping checks is the most common cause of upgrade failures and extended maintenance windows.

Check Region Health

The management plane must be healthy before any host upgrade begins. A degraded management plane cannot coordinate the host upgrade process reliably.

From the Private Cloud Director UI, navigate to Infrastructure > Regions and confirm that all regions show a healthy status. Resolve any failing services before continuing.

Self-Hosted deployments only. From the management cluster node, confirm region health from the command line:

airctl status

Confirm that every region shows region health: ✅ Ready and that desired services matches ready services. Resolve any failing pods or services before continuing.

Verify Per-Host Free Disk Space

Insufficient disk space on a host causes the upgrade package installation to fail partway through, leaving the host in a partially upgraded state that is difficult to recover.

SSH to each host you plan to upgrade and check the free space on the following mount points (for example, with df -h). Recommended minimums before starting a host upgrade:

| Mount point | Recommended free space |
| --- | --- |
| /var | At least 5 GB |
| /var/log | At least 2 GB |
| /opt | At least 3 GB |
| /tmp | At least 1 GB |
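
The recommended minimums can also be checked with a short script. The following is a minimal sketch, assuming a POSIX shell and the POSIX output format of df -Pk. The function names check_free_space and check_all_mounts are hypothetical helpers written for this example and are not part of Private Cloud Director.

```shell
# Hypothetical helper (not part of Private Cloud Director).
# Usage: check_free_space <path> <minimum-GB>
# Returns 0 when the filesystem holding <path> has at least <minimum-GB> free.
check_free_space() {
    path="$1"
    min_gb="$2"
    # df -P prints one line per filesystem; column 4 is available space in KB.
    avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
    if [ -z "$avail_kb" ]; then
        echo "ERR  $path: could not read free space"
        return 2
    fi
    min_kb=$((min_gb * 1024 * 1024))
    if [ "$avail_kb" -ge "$min_kb" ]; then
        echo "OK   $path: $((avail_kb / 1024 / 1024)) GB free (need $min_gb GB)"
    else
        echo "LOW  $path: $((avail_kb / 1024 / 1024)) GB free (need $min_gb GB)"
        return 1
    fi
}

# Check every mount point against the recommended minimums listed above.
check_all_mounts() {
    rc=0
    check_free_space /var 5     || rc=1
    check_free_space /var/log 2 || rc=1
    check_free_space /opt 3     || rc=1
    check_free_space /tmp 1     || rc=1
    return "$rc"
}
```

Running check_all_mounts on a host prints one line per mount point. A non-zero exit status means at least one mount point is below its recommended minimum or could not be read.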

If /var/log is nearly full, rotate or archive old logs before proceeding:
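
One way to reclaim space is to compress rotated logs that are no longer being written. The sketch below assumes find and gzip are available on the host. The function name, the 7-day default, and the *.log.N filename pattern are assumptions made for illustration, so adjust them to match how logs are named on your hosts.

```shell
# Hypothetical helper: compress rotated-but-uncompressed logs older than N days.
# Usage: compress_old_logs <directory> [days]
compress_old_logs() {
    log_dir="$1"
    days="${2:-7}"
    # Match rotated files such as app.log.1; skip files that are already compressed
    # and never touch the live log (app.log), which a service may still hold open.
    find "$log_dir" -type f -name '*.log.[0-9]*' ! -name '*.gz' \
        -mtime +"$days" -exec gzip -f {} +
}
```

For example, compress_old_logs /var/log 7 compresses matching files last modified more than seven days ago. Alternatively, logrotate --force /etc/logrotate.conf triggers an immediate rotation using the rotation rules already configured on the system.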

Confirm All Hosts Are Authorized and Healthy

Every host you intend to upgrade must show an applied role status and an online connection status. Hosts in failed, error, converging, or unknown states should be resolved before the upgrade.

From the UI, navigate to Infrastructure > Cluster Hosts and confirm that no host shows a warning or error badge.

Self-Hosted deployments only. You can also check host status from the management cluster node by running airctl host-status.

Look for any host where Status is not ok or Agent Status is not running. Investigate and resolve those hosts first.

Verify Storage Backend Connectivity

If any host carries the Persistent Storage Service role (block storage) or the Image Library Service role, verify that those services are reachable and healthy before upgrading those hosts.

Persistent Storage Service (block storage):

All volume service endpoints should show enabled and up. If any endpoint is down or disabled, investigate before upgrading the host that carries that role.
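
The enabled and up check can be scripted. The sketch below assumes the OpenStack command-line client is installed and configured with credentials for your region, which this page does not state. The function name find_unhealthy_volume_services is hypothetical.

```shell
# Hypothetical helper: read "binary host status state" rows on stdin and
# print every volume service that is not both enabled and up.
# Exits non-zero when at least one unhealthy service is found.
find_unhealthy_volume_services() {
    awk '$3 != "enabled" || $4 != "up" { print; bad = 1 } END { exit bad + 0 }'
}

# Assumed usage (requires the OpenStack client and region credentials):
#   openstack volume service list -f value -c Binary -c Host -c Status -c State \
#       | find_unhealthy_volume_services
```

No output and a zero exit status mean every volume service reported enabled and up. Any printed row identifies a service to investigate before upgrading the host that carries it.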

Image Library Service:

Confirm that the Image Library Service endpoints are enabled and up.

You can also verify storage and image library health from the UI: navigate to Infrastructure > Storage and Infrastructure > Image Library respectively.

Verify Network Connectivity

Confirm that all hosts are reachable over the management network. From any host (or, for Self-Hosted deployments, the management cluster node) that can reach the management network, test each host's management IP address (for example, with ping). If any host is unreachable, resolve the connectivity issue before upgrading.
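
A reachability sweep over several hosts can be sketched as follows. This assumes the Linux iputils version of ping (for the -W option). The function name check_hosts_reachable is hypothetical.

```shell
# Hypothetical helper: ping each management IP and report unreachable hosts.
# Usage: check_hosts_reachable <host-ip-1> <host-ip-2> ...
check_hosts_reachable() {
    failed=0
    for host in "$@"; do
        # -c 3: send three probes; -W 2: wait up to two seconds for each reply.
        if ping -c 3 -W 2 "$host" >/dev/null 2>&1; then
            echo "reachable:   $host"
        else
            echo "UNREACHABLE: $host"
            failed=1
        fi
    done
    return "$failed"
}
```

Pass the management IP addresses of your hosts as arguments. The function returns a non-zero status if any host does not answer.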

Disable VM HA and DRR Before Host OS Upgrades

If you are upgrading the host OS (for example, from Ubuntu 22.04 to 24.04) — as distinct from upgrading the Private Cloud Director host agent packages only — disable VM HA and DRR at the cluster level before starting:

  1. Navigate to Infrastructure > Clusters in the UI.

  2. Select the cluster whose hosts you are upgrading.

  3. Toggle VM High Availability to off.

  4. Toggle Dynamic Resource Rebalancing to off.

Re-enable both only after all hosts in the cluster have been upgraded to the same OS version. See Post-Upgrade Verification for the re-enablement steps.

For full pre-conditions and behavior details, see Virtual Machine High Availability and Dynamic Resource Rebalancing (DRR).

Plan Your Maintenance Window

Estimate the upgrade window before you begin:

  • Host agent-only upgrades (no OS change) typically complete in 5–10 minutes per host.

  • Host OS upgrades (Ubuntu 22.04 to 24.04) require an additional 20–40 minutes per host plus reboot time.

Schedule the window to avoid peak workload hours. Communicate the window to application owners, because workloads on hosts entering maintenance mode will be live-migrated to other cluster hosts during the upgrade.
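
The per-host figures above can be turned into a rough window estimate. The sketch below assumes hosts are upgraded one at a time and that the OS upgrade time is added on top of the host agent upgrade time. It does not include reboot or VM migration time. The function name estimate_window is hypothetical.

```shell
# Hypothetical helper: estimate a sequential upgrade window in minutes.
# Usage: estimate_window <total-hosts> <hosts-also-getting-an-os-upgrade>
# Per-host figures: 5-10 minutes for the host agent upgrade, plus an
# additional 20-40 minutes when the host OS is upgraded as well.
# Reboot time and VM migration time are NOT included.
estimate_window() {
    total="$1"
    os_hosts="$2"
    min=$((total * 5 + os_hosts * 20))
    max=$((total * 10 + os_hosts * 40))
    echo "$min-$max minutes"
}
```

For example, estimate_window 10 4 (ten hosts, four of which also get an OS upgrade) prints 130-260 minutes. Add reboot time and a safety margin on top of that figure.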

If any VMs remain on a host while its OVN controller is upgraded — for example, when maintenance mode is skipped or VM HA is disabled — those VMs briefly lose network packets while the controller restarts and re-establishes flows. Platform9 testing has observed roughly 4–6 seconds of packet loss per host (4 seconds with 47 VMs on Ubuntu 24.04, 6 seconds with 15 VMs on Ubuntu 22.04). Existing TCP connections recover automatically. Following the maintenance-mode procedure below avoids this disruption by draining the host before its OVN controller restarts.

Host Upgrade Ordering and Role Dependencies

When hosts carry multiple roles, the order in which you upgrade them affects service availability. Follow these sequencing rules. They apply to both deployment models.

  1. Hosts with only the Hypervisor role — upgrade these first. They carry no shared service responsibility, so their upgrade has the narrowest blast radius.

  2. Hosts with the Networking Service role — upgrade these next. Put each host into maintenance mode before upgrading to migrate its VMs first.

  3. Hosts with the Image Library Service role — upgrade image library hosts one at a time. Confirm the Image Library Service is healthy on the remaining hosts before upgrading the next one.

  4. Hosts with the Persistent Storage Service role — upgrade block storage hosts one at a time. Confirm that volumes are accessible from other hosts before proceeding to the next block storage host.

  5. Multi-role hosts — hosts that carry hypervisor, image library, or block storage roles simultaneously are the highest-risk to upgrade. Treat them like the most sensitive single role they carry (block storage > image library > hypervisor-only).
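
The sequencing rules above can be expressed as a small sort. In this sketch, each host is ranked by the most sensitive role it carries and the list is sorted from lowest to highest risk. The input format and the role labels (hypervisor, networking, image-library, block-storage) are illustrative names chosen for this example, not product identifiers.

```shell
# Hypothetical helper: order hosts by the most sensitive role they carry.
# Input (stdin): one host per line as "<hostname> <role>[,<role>...]".
# Output: the same lines, lowest-risk hosts first.
plan_upgrade_order() {
    awk '
        BEGIN {
            rank["hypervisor"] = 1; rank["networking"] = 2
            rank["image-library"] = 3; rank["block-storage"] = 4
        }
        {
            # A multi-role host takes the rank of its most sensitive role.
            n = split($2, roles, ",")
            top = 0
            for (i = 1; i <= n; i++)
                if (rank[roles[i]] > top) top = rank[roles[i]]
            print top, $1, $2
        }
    ' | sort -k1,1n -k2,2 | cut -d" " -f2-
}
```

Feeding it a host inventory produces a suggested order. Hosts that share a rank are sorted by name, and hosts with the same sensitive role should still be upgraded one at a time.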

Put Each Host Into Maintenance Mode Before Upgrading

Before upgrading any host, use maintenance mode to drain its VMs:

  1. Navigate to Infrastructure > Cluster Hosts.

  2. Select the host.

  3. Click Other > Enable Maintenance Mode and follow the prompts.

  4. Wait for the Migration Status banner to show all VMs migrated.

Then proceed with the upgrade for that host. After the upgrade is complete and the host is back online, disable maintenance mode before moving to the next host.

See Maintenance Mode for full details on the enable and disable process.

Persistent Storage Service Host Considerations

If your region has only one host with the Persistent Storage Service role, upgrading that host briefly makes block storage unavailable. Schedule that upgrade during a maintenance window when no block storage volume attach or detach operations are expected. If possible, add a second block storage host before the upgrade window so there is no single point of failure.

Image Library Service Host Considerations

If your region has only one host with the Image Library Service role, new VM provisioning from images will fail during that host's upgrade. All existing running VMs continue operating normally. Schedule that upgrade during a window when no new VMs need to be deployed.

Ubuntu 22.04 to 24.04 Host OS Upgrade Caveats

Before the OS Upgrade

  1. Record the current OVS bridge configuration on the host (for example, by saving the output of ovs-vsctl show). You will need it to verify that the configuration is intact after the reboot.

  2. Confirm the host is in maintenance mode and all its VMs have been migrated off.

  3. If the host has the Hypervisor role, ensure you have out-of-band console access in case network connectivity is lost after the OS upgrade.
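
Recording the bridge configuration can be sketched as follows. This assumes ovs-vsctl is installed on the host and that you run it with sufficient privileges (typically root). The function name record_ovs_config is hypothetical.

```shell
# Hypothetical helper: save the OVS bridge layout before the OS upgrade.
# Usage: record_ovs_config <output-file>
record_ovs_config() {
    out_file="$1"
    {
        echo "# ovs-vsctl show"
        ovs-vsctl show
        # Also list each bridge with its ports in a compact, diff-friendly form.
        for bridge in $(ovs-vsctl list-br); do
            echo "# ports on $bridge"
            ovs-vsctl list-ports "$bridge"
        done
    } > "$out_file"
}
```

Store the output file somewhere that survives the OS upgrade, such as another machine, so that you can compare it with the post-reboot state.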

After the OS Upgrade and Reboot

Perform network re-validation immediately after the host comes back online:

  1. Confirm the OVS bridge is present and the physical interface is still a member (for example, by inspecting the output of ovs-vsctl show). The output should show your bridge (for example, br-ex or br-data) with the physical interface listed as a port. If the physical interface is missing from the bridge, re-add it with ovs-vsctl add-port <bridge-name> <physical-interface>, replacing <bridge-name> and <physical-interface> with the values from your pre-upgrade recording.

  2. Confirm the host can reach the management plane.

  3. Confirm the host agent is running and connected. The service should be active (running). If it is stopped or failed, restart it.

  4. From the management plane, confirm the host shows online connection status in Infrastructure > Cluster Hosts.

  5. After confirming network and host agent health, disable maintenance mode and allow the host to rejoin scheduling before proceeding to the next host.
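
The bridge membership check from the first step can be sketched as a small function. It assumes ovs-vsctl is available and run with sufficient privileges. The function name ensure_port_on_bridge is hypothetical.

```shell
# Hypothetical helper: make sure a physical interface is still a port on a bridge.
# Usage: ensure_port_on_bridge <bridge-name> <physical-interface>
ensure_port_on_bridge() {
    bridge="$1"
    iface="$2"
    # -x matches the whole line, -F treats the interface name literally.
    if ovs-vsctl list-ports "$bridge" | grep -qxF "$iface"; then
        echo "$iface is attached to $bridge"
    else
        echo "$iface missing from $bridge; re-adding"
        # --may-exist makes add-port a no-op if the port is already present.
        ovs-vsctl --may-exist add-port "$bridge" "$iface"
    fi
}
```

Use the bridge and interface names from your pre-upgrade recording, for example ensure_port_on_bridge br-ex eth1, where eth1 is a placeholder interface name. Re-adding a port can briefly interrupt connectivity on that interface, so keep out-of-band console access available.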

Mixed OS Versions in a Cluster

Do not leave a cluster in a mixed OS state longer than necessary. A cluster where some hosts run Ubuntu 22.04 and others run Ubuntu 24.04 has the following limitations:

  • VM HA evacuations between mixed-OS hosts will fail (KVM version mismatch).

  • DRR live migrations between mixed-OS hosts will fail.

  • Manual VM migrations between mixed-OS hosts are not supported.

Keep VM HA and DRR disabled for the entire cluster until all hosts are on the same OS version.

Host Upgrade Failure Recovery

The host agent and package-level recovery steps below apply to both deployment models. Steps that re-submit an upgrade through airctl or inspect the region's Kubernetes namespace apply to Self-Hosted deployments only and are marked. In SaaS deployments, if a host upgrade fails and the host-level steps below do not recover it, contact Platform9 Support.

409 Conflict Error During Host Upgrade

A 409 Conflict response during a host upgrade typically means the management plane has a stale lock or an in-progress record for that host from a previous attempt. This prevents the upgrade from being re-submitted.

Recovery steps:

  1. Verify the host agent is running and reports back to the management plane.

  2. Self-Hosted deployments only. Check whether the host upgrade pod from the previous attempt is still running, and clear it before retrying. If a host-upgrade-* pod is in Running or Pending state from a prior attempt, wait for it to complete or delete it.

  3. Once the host agent is confirmed running (and, for Self-Hosted deployments, the stale pod is cleared), retry the host upgrade. If the conflict persists, contact Platform9 Support.
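
For Self-Hosted deployments, leftover upgrade pods can be picked out of a pod listing. The sketch below assumes kubectl access to the management cluster. The function name find_stale_upgrade_pods is hypothetical, and <region-namespace> is a placeholder for your region's Kubernetes namespace, which this page does not name.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE) and print the names of
# host-upgrade-* pods that are still Running or Pending.
find_stale_upgrade_pods() {
    awk '$1 ~ /^host-upgrade-/ && ($3 == "Running" || $3 == "Pending") { print $1 }'
}

# Assumed usage on a Self-Hosted management cluster node:
#   kubectl get pods -n <region-namespace> --no-headers | find_stale_upgrade_pods
#   kubectl delete pod -n <region-namespace> <pod-name>
```

Delete a pod only after confirming it belongs to the earlier failed attempt and not to an upgrade that is still making progress.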

"Cluster Name Removed from Host" Error

This error appears when the host's local configuration no longer contains the cluster association record. It typically occurs if the host was deauthorized or had its roles removed while an upgrade was in progress.

Recovery steps:

  1. Check the host's role status. From the UI, navigate to Infrastructure > Cluster Hosts and inspect the affected host.

  2. If the host shows unauthorized or missing roles, re-authorize it and re-assign roles from the UI: select the host, click Edit Roles, and re-assign the appropriate roles.

  3. Wait for the host to reach applied role status (the status transitions through converging).

  4. Once the host is back to applied status, retry the host upgrade.

Partial or Failed Upgrade Cleanup

If a host upgrade fails midway, the host may be in an inconsistent state with mixed package versions. Before retrying, perform the following cleanup on the affected host:

  1. Check which Private Cloud Director packages are installed on the host and their versions (for example, with dpkg -l | grep pf9).

  2. If you see a mix of old and new package versions, force-reinstall the current target packages to bring the host to a consistent state. Include every pf9-* package that the dpkg -l output shows at the wrong version.

  3. Restart the host agent after reinstalling.

  4. Wait for the host to return to applied status in the management plane, then retry the upgrade.
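
A quick way to spot mixed versions is to count the installed pf9-* packages per version string. The sketch below assumes a Debian-based host with dpkg-query. The function name summarize_pf9_versions is hypothetical. Whether all pf9-* packages are expected to share one version depends on your release, so treat the output as a prompt for inspection and not as a verdict.

```shell
# Hypothetical helper: read "<package> <version>" lines on stdin and print
# how many pf9-* packages are installed at each version.
summarize_pf9_versions() {
    awk 'NF == 2 && $1 ~ /^pf9-/ { count[$2]++ }
         END { for (v in count) print v, count[v] }' | sort
}

# Assumed usage on the affected host:
#   dpkg-query -W -f='${Package} ${Version}\n' 'pf9-*' | summarize_pf9_versions
```

More than one line of output suggests a mix of versions worth investigating. A specific package version can be reinstalled with apt-get install --reinstall <package>=<version>, where both values are placeholders for the package name and your target version.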

Host Re-Sync and Retry

If a host is stuck in converging state for more than 10 minutes after a failed upgrade:

  1. From the UI, navigate to Infrastructure > Cluster Hosts, select the host, and use Other > Re-sync Host (if available) to trigger a re-convergence.

  2. Alternatively, restart the host agent on the host directly.

  3. Monitor the hostagent log for convergence progress. Look for converge successful or role application complete messages.

  4. If the host does not converge within another 10 minutes, contact Platform9 Support with the hostagent log from the affected host (and, for Self-Hosted deployments, the output of airctl host-status).
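
Watching the log for the convergence messages can be automated with a simple polling loop. The sketch below only considers lines written after it starts, so messages from earlier runs are ignored. It does not handle log rotation during the wait. The function name wait_for_converge and the 2-second polling interval are assumptions made for this example.

```shell
# Hypothetical helper: poll a log file until a convergence message appears.
# Usage: wait_for_converge <log-file> [timeout-seconds]
wait_for_converge() {
    log_file="$1"
    timeout="${2:-600}"
    # Only consider lines written after this function starts.
    start_line=$(($(wc -l < "$log_file") + 1))
    waited=0
    while [ "$waited" -le "$timeout" ]; do
        # Messages named in this runbook; matching is case-insensitive.
        if tail -n +"$start_line" "$log_file" |
            grep -Eiq 'converge successful|role application complete'; then
            echo "converged after about ${waited}s"
            return 0
        fi
        sleep 2
        waited=$((waited + 2))
    done
    echo "no convergence message within ${timeout}s"
    return 1
}
```

Pass the path of the hostagent log on your host. On Platform9 hosts this is commonly /var/log/pf9/hostagent.log, but that path is an assumption here, so confirm it before relying on it. The default timeout of 600 seconds matches the 10-minute threshold used in this section.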

Next Steps

After all hosts are upgraded, complete the Post-Upgrade Verification checklist to confirm that all services are healthy and re-enable VM HA and DRR.
