Troubleshooting GPU Support

When you configure GPU support or deploy GPU-enabled VMs, you might encounter configuration errors or issues with GPU resource availability. The following troubleshooting information addresses specific problems that have been identified during GPU setup and operations.

NOTE

Before troubleshooting GPU issues, verify that your GPU model is supported and that you have completed all infrastructure setup steps. Most GPU errors result from incomplete configuration or mismatched settings between hosts and flavors.

vGPU script fails with SR-IOV unbindLock error

When running the vGPU configuration script, you might encounter the following error during SR-IOV configuration.

Step 3: Enable SR-IOV for NVIDIA GPUs
Detecting PCI devices from /sys/bus/pci/devices...
Found the following NVIDIA PCI devices:
Found the following NVIDIA devices: 0000:c1:00.0
Enter the full PCI device IDs (e.g., 0000:17:00.0 0000:18:00.0) to enable sriov, separated by spaces.
Press Enter without input to configure ALL listed NVIDIA GPUs:
No PCI device IDs provided. Configuring all NVIDIA GPUs...
Enabling SR-IOV for 0000:c1:00.0...
Enabling VFs on 0000:c1:00.0
Cannot obtain unbindLock for 0000:c1:00.0

This error occurs when NVIDIA services are holding a lock on the PCI device, preventing SR-IOV configuration.

Prerequisites for this resolution

  • NVIDIA license and drivers are installed successfully

  • NVIDIA license server is created and configured

Resolution steps:

  1. Stop all NVIDIA-related services that might be holding the device lock:

systemctl stop nvidia-vgpu-mgr
systemctl stop nvidia-persistenced || echo "nvidia-persistenced not running"
systemctl stop nvidia-dcgm || echo "nvidia-dcgm not running"
killall -q nv-hostengine || echo "nv-hostengine not running"
  2. Remove and rescan the PCI device to reset its state:

echo 1 > /sys/bus/pci/devices/0000:c1:00.0/remove 
echo 1 > /sys/bus/pci/rescan

Replace 0000:c1:00.0 with your actual PCI device ID from the error message.

  3. Manually enable SR-IOV for the GPU device:

/usr/lib/nvidia/sriov-manage -e 0000:c1:00.0
  4. Restart the NVIDIA vGPU manager:

systemctl start nvidia-vgpu-mgr
  5. Verify SR-IOV configuration by checking for virtual functions:

nvidia-smi vgpu 
mdevctl list

Expected output after successful resolution:

# nvidia-smi vgpu output
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.06    Driver Version: 570.148.06                       |
|---------------------------------+------------------------------+------------|
| GPU  Name                       | Bus-Id                       | GPU-Util   |
| vGPU ID  Name                   | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA L4                  | 00000000:C1:00.0             |      0%    |
| 3251634321  NVIDIA L4-1A        | ef2a...   ef2acb59-4400-4... |      0%    |
+---------------------------------+------------------------------+------------+

# mdevctl list output
8bfe3867-7918-4da6-b6f9-1a3ff4db030c 0000:c1:02.6 nvidia-918
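The resolution steps above can be combined into a small helper, sketched below. This is illustrative only: the `run` wrapper and `reset_gpu_sriov` function are not part of the product, and the `sriov-manage` path assumes a standard NVIDIA vGPU driver installation. Run as root on the GPU host; set `DRY_RUN=1` to preview the commands without executing them.

```shell
#!/bin/bash
# Sketch of the resolution steps above (illustrative helper, not part of PCD).

# Execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reset_gpu_sriov() {
    local pci="$1"   # e.g. 0000:c1:00.0

    # 1. Stop services that may be holding the unbindLock.
    run systemctl stop nvidia-vgpu-mgr
    run systemctl stop nvidia-persistenced
    run systemctl stop nvidia-dcgm

    # 2. Remove and rescan the PCI device to reset its state.
    run sh -c "echo 1 > /sys/bus/pci/devices/$pci/remove"
    run sh -c "echo 1 > /sys/bus/pci/rescan"

    # 3. Manually enable SR-IOV on the device.
    run /usr/lib/nvidia/sriov-manage -e "$pci"

    # 4. Restart the vGPU manager.
    run systemctl start nvidia-vgpu-mgr
}

# Preview the command sequence without touching the host:
#   DRY_RUN=1 reset_gpu_sriov 0000:c1:00.0
```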

vGPU VMs fail to recover after GPU host reboot

When a GPU host with vGPU VMs is rebooted, the VMs scheduled on that host do not automatically recover. The system fails to recreate the required mediated device (mdev) configurations after the host restarts.

This occurs because mdev devices are not persistent across reboots. When the host restarts, the SR-IOV configuration and mdev device mappings are lost, preventing vGPU VMs from accessing their assigned GPU resources.

Resolution steps:

  1. Before rebooting the host, capture the current mdev device configuration:

mdevctl list
  2. Save this output.

You will need the device UUIDs and PCI addresses. Your output will look similar to the following:

295ba0f1-20d3-4a94-9d24-ad29edc5ac15 0000:c1:01.1 nvidia-924
916542a1-f966-4195-97d6-7cf2a10c850d 0000:c1:03.2 nvidia-924
  3. After the host reboots, reconfigure SR-IOV by running the GPU configuration script:

./pf9-gpu-configure.sh
  4. Select option 3 when prompted:

Select configuration type:
1) PCI Passthrough
2) vGPU pre configure
3) vGPU SR-IOV configure
4) Validate Passthrough
5) Validate vGPU
Enter choice (1 - 5): 3
...
...
  5. Recreate the mdev devices for each vGPU VM using the UUIDs and PCI addresses you saved in step 1:

echo 916542a1-f966-4195-97d6-7cf2a10c850d > /sys/class/mdev_bus/0000:c1:03.2/mdev_supported_types/nvidia-924/create
echo 295ba0f1-20d3-4a94-9d24-ad29edc5ac15 > /sys/class/mdev_bus/0000:c1:01.1/mdev_supported_types/nvidia-924/create

Replace the UUIDs, PCI addresses, and nvidia-924 profile names with the values from your saved output.

  6. Verify the mdev devices are recreated:

mdevctl list

The output should match your saved configuration from step 1.

Your vGPU VMs should now be able to start and access the assigned GPU resources.
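If you saved the `mdevctl list` output to a file before the reboot, the per-device recreation shown above can be scripted instead of typed by hand. Below is a sketch: `recreate_mdevs` is an illustrative helper (not part of the product) that reads `<uuid> <pci-address> <profile>` lines on stdin; set `DRY_RUN=1` to print the commands instead of executing them.

```shell
#!/bin/bash
# Sketch: recreate mdev devices from saved "mdevctl list" output.
# Illustrative helper, not part of PCD. Run as root after SR-IOV is
# reconfigured; DRY_RUN=1 prints the commands instead of executing them.

recreate_mdevs() {
    local uuid pci profile path
    while read -r uuid pci profile; do
        [ -z "$uuid" ] && continue   # skip blank lines
        path="/sys/class/mdev_bus/$pci/mdev_supported_types/$profile/create"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "echo $uuid > $path"
        else
            echo "$uuid" > "$path"
        fi
    done
}

# Example workflow:
#   mdevctl list > /root/mdev-backup.txt      # before the reboot
#   recreate_mdevs < /root/mdev-backup.txt    # after the reboot
```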

vGPU profiles missing during host configuration

When configuring vGPU on a GPU host, certain vGPU profiles may be missing from the available options, even though your GPU model supports them.

This occurs when existing mediated devices (stale or active) are present in /sys/bus/mdev/devices/. The system hides any vGPU profile whose slice count is less than or equal to the number of existing devices.

For example, if 2 devices are present at /sys/bus/mdev/devices/, any vGPU profile that allows 2 or fewer GPU slices will not be listed under GPU hosts for configuration.
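The filtering rule can be expressed as a simple comparison. The `profile_visible` function below is a hypothetical sketch that only illustrates the check; it is not part of the product.

```shell
#!/bin/bash
# Sketch of the filtering rule described above: a vGPU profile is listed only
# when its slice count exceeds the number of existing mdev devices.
# (profile_visible is illustrative, not part of the product.)
profile_visible() {
    local profile_slices="$1" existing_devices="$2"
    [ "$profile_slices" -gt "$existing_devices" ]
}

# With 2 devices present, a 4-slice profile is listed, but a 2-slice
# (or smaller) profile is hidden:
#   profile_visible 4 2   # succeeds -> profile listed
#   profile_visible 2 2   # fails    -> profile hidden
```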

Resolution steps:

  1. Check for existing mediated devices:

ls /sys/bus/mdev/devices/
  2. Count the number of devices present:

ls /sys/bus/mdev/devices/ | wc -l
  3. Stop NVIDIA vGPU services:

systemctl stop nvidia-vgpu-mgr
  4. Remove existing mediated devices:

for device in /sys/bus/mdev/devices/*; do
    if [ -d "$device" ]; then
        echo 1 > "$device/remove"
    fi
done
  5. Restart NVIDIA vGPU services:

systemctl start nvidia-vgpu-mgr
  6. Verify the devices are cleared:

ls /sys/bus/mdev/devices/
  7. Return to the GPU host configuration in PCD to see the full list of available vGPU profiles.

VM creation fails with GPU model validation error

If you select a GPU model in the VM flavor that does not match the GPU configured on the host, the system returns an error during VM creation. This validation ensures that the GPU requested by the flavor aligns with the underlying hardware.

To resolve this issue, verify that the GPU model specified in your flavor matches the GPU model configured on the host.

GPU host authorization and visibility issues

GPU host does not appear as a listed GPU host

After running the GPU configuration script and authorizing, the GPU host should appear in the GPU host list showing compatibility (passthrough or vGPU), GPU model, and device ID.

If your GPU host does not appear:

  • Verify the GPU configuration script ran successfully on the host.

  • Confirm you rebooted the host after running the script.

  • Check that host authorization completed.

  • Ensure both host configuration and cluster have GPU enabled.

Script execution requirements

The GPU configuration script requires administrator privileges to run. Only administrators with access to the host and script should execute it. End users or developers requesting GPU resources do not need to run the script themselves.

Incorrect host configuration causes cold migration and flavor resize failures for GPU-enabled VMs

  1. Cold migration of a GPU-enabled VM fails.

The ostackhost logs show the following error after the cold migration attempt:

Insufficient compute resources: Claim pci failed.
  2. Flavor resize of a GPU-enabled VM fails.

Resizing a VM's flavor fails for both flavor upgrades and downgrades with the following error:

Invalid PCI alias definition: Device type mismatch for alias 'nvidia-14'
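For context, the nvidia-14 alias named in the error is defined in the Nova [pci] configuration section. A hypothetical alias entry is shown below; the vendor_id and product_id values are placeholders. The mismatch error arises when the alias's device_type (type-PCI, type-PF, or type-VF) does not match the device actually exposed by the host.

```ini
# Hypothetical Nova PCI alias entry; the IDs shown are placeholders.
[pci]
alias = { "vendor_id": "10de", "product_id": "27b8", "device_type": "type-PF", "name": "nvidia-14" }
```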

To resolve issues caused by an incorrect host configuration:

  1. Deauthorize the existing GPU-enabled hosts that are part of the host configuration that needs to be corrected. Reference: Remove all roles and deauthorize a Host

$ pcdctl deauthorize-node
  2. Run the following GPU passthrough revert script to cleanly remove the existing GPU configuration and avoid future conflicts:

#!/bin/bash

echo "### Reverting GPU PCI passthrough configuration... ###"

# Remove VFIO modprobe config
if [ -f /etc/modprobe.d/vfio.conf ]; then
    echo "Removing /etc/modprobe.d/vfio.conf"
    rm -f /etc/modprobe.d/vfio.conf
fi

# Remove blacklist files
if [ -f /etc/modprobe.d/blacklist-nvidia.conf ]; then
    echo "Removing /etc/modprobe.d/blacklist-nvidia.conf"
    rm -f /etc/modprobe.d/blacklist-nvidia.conf
fi

if [ -f /etc/modprobe.d/blacklist-nouveau.conf ]; then
    echo "Removing /etc/modprobe.d/blacklist-nouveau.conf"
    rm -f /etc/modprobe.d/blacklist-nouveau.conf
fi

# Remove KVM config
if [ -f /etc/modprobe.d/kvm.conf ]; then
    echo "Removing /etc/modprobe.d/kvm.conf"
    rm -f /etc/modprobe.d/kvm.conf
fi

# Clean initramfs-tools/modules
if [ -f /etc/initramfs-tools/modules ]; then
    echo "Cleaning VFIO modules from /etc/initramfs-tools/modules"
    sed -i '/^vfio$/d' /etc/initramfs-tools/modules
    sed -i '/^vfio_pci$/d' /etc/initramfs-tools/modules
    sed -i '/^vfio_iommu_type1$/d' /etc/initramfs-tools/modules
fi

echo "Updating initramfs..."
update-initramfs -u

# Restore GRUB config
GRUB_FILE="/etc/default/grub"
if [ -f "$GRUB_FILE" ]; then
    echo "Cleaning GRUB_CMDLINE_LINUX_DEFAULT in $GRUB_FILE"
    sed -i 's/vfio-pci\.ids=[^ ]*//g' "$GRUB_FILE"
    sed -i 's/vfio_iommu_type1\.allow_unsafe_interrupts=1//g' "$GRUB_FILE"
    sed -i 's/modprobe\.blacklist=[^ ]*//g' "$GRUB_FILE"
    sed -i 's/kvm\.ignore_msrs=1//g' "$GRUB_FILE"
    sed -i 's/  / /g' "$GRUB_FILE"
fi

echo "Updating GRUB..."
update-grub

# Unbind GPU from vfio-pci
echo "Attempting to unbind NVIDIA GPUs from vfio-pci and rebind to nvidia driver..."
for PCI in $(lspci -nn | grep -i nvidia | awk '{print $1}'); do
    echo "Processing $PCI"
    if [ -e /sys/bus/pci/devices/0000:$PCI/driver ]; then
        DRIVER=$(basename "$(readlink /sys/bus/pci/devices/0000:$PCI/driver)")
        if [ "$DRIVER" = "vfio-pci" ]; then
            echo "Unbinding $PCI from vfio-pci"
            echo 0000:$PCI > /sys/bus/pci/devices/0000:$PCI/driver/unbind
            echo "Rebinding $PCI to nvidia driver"
            echo 0000:$PCI > /sys/bus/pci/drivers/nvidia/bind 2>/dev/null || echo "Could not bind $PCI to nvidia driver now (may require reboot)"
        fi
    fi
done

echo "### Revert complete. Please reboot the system to fully apply changes. ###"
  3. Decommission the GPU host. Reference: Decommission a Host

$ pcdctl decommission-node
  4. Add the required GPU configuration in the Host config section of the Cluster blueprint in the Management Plane UI. Reference: Authorize your GPU Hosts

  5. Onboard the GPU host to the Management Plane. Reference: Authorize Host

  6. From the GPU host, run the GPU configuration script /opt/pf9/gpu/pf9-gpu-configure.sh to configure the GPU settings on the host. Reference: Run the GPU configuration script

sudo /opt/pf9/gpu/pf9-gpu-configure.sh
  7. Re-authorize the GPU host with the correct host configuration. Reference: Authorize your GPU Hosts
