> For the complete documentation index, see [llms.txt](https://docs.platform9.com/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.platform9.com/private-cloud-director/gpu/gpu-support-pcd/gpu-post-upgrade-recovery.md).

# GPU VM Recovery After Host Upgrade

## Overview

After upgrading a GPU host's operating system or the Platform9 agent packages, GPU passthrough and vGPU VMs may fail to power on or may lose GPU access. This happens because the upgrade process can alter kernel modules, reset GRUB parameters, or change driver binding state — undoing the configuration that the GPU setup script applied before the upgrade.

This guide walks you through re-validating and restoring GPU host configuration after an upgrade so that GPU VMs can run again.

{% hint style="info" %}
**Applies to both deployment models.** Host-level steps (driver checks, vfio-pci binding, mdev recreation, service restarts) apply equally to SaaS and Self-Hosted deployments. Steps that interact with the management plane directly — such as inspecting region namespace pods — are marked **Self-Hosted deployments only**.
{% endhint %}

Before running through this guide, complete the [Post-Upgrade Verification](/private-cloud-director/upgrade/post-upgrade-verification.md) checklist to confirm the region and host are healthy. Then return here to restore GPU-specific configuration.

In this guide, you will re-validate NVIDIA driver state, confirm vfio-pci binding for passthrough hosts, re-create mdev devices for vGPU hosts, and bring GPU VMs back online.

## Prerequisites

* The host upgrade has completed and the host is back online and visible in the UI.
* You have SSH access to the host with administrator privileges.
* You have reviewed [Host Upgrade Runbook](/private-cloud-director/upgrade/host-upgrade-runbook.md) and confirmed no outstanding host-health issues.

## Step 1: Verify NVIDIA Driver State

The first thing to check after an upgrade is whether the NVIDIA kernel module loaded successfully.

```bash
nvidia-smi
```

If `nvidia-smi` returns an error such as `NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver`, the module did not load. This is a common result of a kernel upgrade — the NVIDIA driver must match the running kernel version.

**Reinstall or recompile the NVIDIA driver for the new kernel:**

If you upgraded the host OS kernel (for example, Ubuntu 22.04 to 24.04, or a kernel patch), reinstall the NVIDIA vGPU manager package:

```bash
sudo apt-get install --reinstall nvidia-vgpu-ubuntu-<version>
```

Replace `<version>` with the NVIDIA vGPU manager version used in your deployment. After reinstallation, reboot the host and confirm `nvidia-smi` returns expected GPU information.

**Verify the NVIDIA vGPU manager service (vGPU hosts only):**

```bash
sudo systemctl status nvidia-vgpu-mgr
```

If the service is not running, start it:

```bash
sudo systemctl start nvidia-vgpu-mgr
sudo systemctl enable nvidia-vgpu-mgr
```

## Step 2: Confirm vfio-pci Binding (Passthrough Hosts)

For passthrough GPU hosts, the GPU must be bound to the `vfio-pci` driver — not the NVIDIA driver — before passthrough VMs can use it. Kernel and initramfs updates can reset this binding.

**Check the current driver binding for each GPU PCI device:**

```bash
lspci -k | grep -A 3 "NVIDIA"
```

Look for the `Kernel driver in use` line for each GPU. For passthrough hosts, it must show `vfio-pci`. If it shows `nvidia` or is absent, the GPU is not bound correctly for passthrough.

**Verify GRUB parameters include the vfio-pci device IDs:**

```bash
cat /etc/default/grub | grep GRUB_CMDLINE_LINUX_DEFAULT
```

The line must contain `vfio-pci.ids=<vendor:device>` with your GPU's PCI vendor and device IDs. If this parameter is missing, the GPU setup script's GRUB configuration was overwritten by the upgrade.

**Re-run the GPU passthrough configuration script to restore GRUB and driver binding:**

```bash
cd /opt/pf9/gpu
sudo ./pf9-gpu-configure.sh
```

Select option **1** (`PCI Passthrough`) and complete the prompts. The script will re-add the required GRUB parameters.

After the script completes, update GRUB and reboot:

```bash
sudo update-grub
sudo reboot
```

After the reboot, confirm the binding:

```bash
lspci -k | grep -A 3 "NVIDIA"
```

The GPU should show `vfio-pci` as the kernel driver in use.

**Run the validation option to confirm passthrough readiness:**

```bash
cd /opt/pf9/gpu
sudo ./pf9-gpu-configure.sh
```

Select option **4** (`Validate Passthrough`).

## Step 3: Re-Create mdev Devices (vGPU Hosts)

mdev devices are not persistent across reboots. After an upgrade that involves a reboot, all mdev devices on vGPU hosts must be re-created before vGPU VMs can run.

**Confirm SR-IOV virtual functions are present:**

```bash
cat /sys/bus/pci/devices/<GPU_PCI_ADDR>/sriov_numvfs
```

If this returns `0`, the VF count was lost. Re-run the SR-IOV configuration:

```bash
cd /opt/pf9/gpu
sudo ./pf9-gpu-configure.sh
```

Select option **3** (`vGPU SR-IOV configure`) and complete the prompts.

**Re-create mdev devices for each vGPU VM that was running before the upgrade.**

If you saved the `mdevctl list` output before the upgrade (as recommended in the [pre-upgrade checklist](/private-cloud-director/upgrade/host-upgrade-runbook.md)), use those UUIDs and PCI addresses to recreate each device:

```bash
echo <UUID> > /sys/class/mdev_bus/<VF_PCI_ADDR>/mdev_supported_types/<PROFILE>/create
```

For example:

```bash
echo 295ba0f1-20d3-4a94-9d24-ad29edc5ac15 > /sys/class/mdev_bus/0000:c1:01.1/mdev_supported_types/nvidia-924/create
echo 916542a1-f966-4195-97d6-7cf2a10c850d > /sys/class/mdev_bus/0000:c1:03.2/mdev_supported_types/nvidia-924/create
```

Replace the UUIDs, PCI addresses, and profile name (`nvidia-924`) with your deployment's values.

**Verify the mdev devices were created:**

```bash
mdevctl list
```

The output should match the devices that were running before the upgrade.

{% hint style="info" %}
**Tip.** To avoid having to manually re-create mdev devices after every reboot, define them using `mdevctl define --auto` so that the mdev manager recreates them automatically on startup:

```bash
mdevctl define --uuid <UUID> --parent <VF_PCI_ADDR> --type <PROFILE>
mdevctl modify --uuid <UUID> --auto
```

Run this for each mdev UUID you want to persist across reboots.
{% endhint %}

## Step 4: Restart the Compute Service Agent

After restoring driver binding and mdev state, restart the Compute Service agent on the host so it re-discovers the GPU resources and reports updated inventory.

```bash
sudo systemctl restart pf9-ostackhost
```

Allow two to three minutes for the agent to re-register and report the host's GPU resources.

## Step 5: Verify GPU Host Status in the UI

1. Navigate to **Infrastructure > GPU Hosts** in the UI.
2. Confirm the host appears in the list with the expected GPU mode (Passthrough or vGPU) and model.
3. For vGPU hosts, confirm the expected number of available vGPU slots.

If the host does not appear or shows incorrect inventory, see [vGPU Inventory Troubleshooting and Multiple vGPU per VM](/private-cloud-director/gpu/gpu-support-pcd/vgpu-inventory-and-multi-vgpu.md) for detailed inventory diagnosis steps.

## Step 6: Power On and Verify GPU VMs

Attempt to power on a GPU VM that was running before the upgrade:

```bash
pcdctl server start <VM_UUID>
```

Monitor the server state:

```bash
pcdctl server show <VM_UUID>
```

If the VM transitions to `ACTIVE`, GPU access is restored. Verify inside the VM:

```bash
nvidia-smi
```

**If the VM fails to start:**

* Check the Compute Service agent log: `grep -i "error\|pci\|vgpu" /var/log/pf9/ostackhost.log | tail -50`
* Confirm the PCI device is listed as available: `lspci | grep -i nvidia`
* For passthrough VMs, confirm the vfio-pci binding is correct (Step 2 above).
* For vGPU VMs, confirm the mdev UUID used in the VM's libvirt domain XML matches an active mdev in `mdevctl list`.

{% hint style="info" %}
**Self-Hosted deployments only.** If VMs continue to fail after host-side remediation, inspect the Compute Service pod logs in the region namespace:

```bash
kubectl -n <region-namespace> logs deployment/nova-compute-<host> | grep -i "pci\|vgpu\|error" | tail -50
```

In SaaS deployments, contact Platform9 Support if VMs do not recover after completing all steps in this guide.
{% endhint %}

## Common Post-Upgrade GPU Failure Causes

| Symptom                                               | Most Likely Cause                                       | Resolution                                            |
| ----------------------------------------------------- | ------------------------------------------------------- | ----------------------------------------------------- |
| VM fails to start, fault: `PCI device not found`      | GPU not bound to vfio-pci after kernel upgrade          | Re-run passthrough config script; update GRUB; reboot |
| vGPU VM fails to start, fault: `No resource provider` | mdev devices not re-created after reboot                | Re-run SR-IOV config; recreate mdev devices manually  |
| `nvidia-smi` fails on host                            | NVIDIA module not loaded for new kernel                 | Reinstall NVIDIA vGPU manager; reboot                 |
| Host not visible in GPU Hosts list                    | Compute Service agent not restarted after config change | Restart `pf9-ostackhost`                              |
| VM starts but GPU not visible inside guest            | Guest NVIDIA driver version mismatch                    | Reinstall matching NVIDIA guest driver inside VM      |

## Related Pages

* [Host Upgrade Runbook](/private-cloud-director/upgrade/host-upgrade-runbook.md) — pre-upgrade steps including capturing mdev state
* [Post-Upgrade Verification](/private-cloud-director/upgrade/post-upgrade-verification.md) — post-upgrade checklist including GPU validation
* [Troubleshooting GPU Support](/private-cloud-director/gpu/gpu-support-pcd/troubleshooting-gpu-support.md) — general GPU troubleshooting scenarios
* [vGPU Inventory Troubleshooting and Multiple vGPU per VM](/private-cloud-director/gpu/gpu-support-pcd/vgpu-inventory-and-multi-vgpu.md) — detailed vGPU inventory diagnosis


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.platform9.com/private-cloud-director/gpu/gpu-support-pcd/gpu-post-upgrade-recovery.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
