Troubleshooting GPU Support

When you configure GPU support or deploy GPU-enabled VMs, you might encounter configuration errors or issues with GPU resource availability. The following troubleshooting information addresses specific problems that have been identified during GPU setup and operations.

NOTE

Before troubleshooting GPU issues, verify that your GPU model is supported and that you have completed all infrastructure setup steps. Most GPU errors result from incomplete configuration or mismatched settings between hosts and flavors.

vGPU script fails with SR-IOV unbindLock error

When running the vGPU configuration script, you may encounter the following error during SR-IOV configuration:

Step 3: Enable SR-IOV for NVIDIA GPUs
Detecting PCI devices from /sys/bus/pci/devices...
Found the following NVIDIA PCI devices:
Found the following NVIDIA devices: 0000:c1:00.0
Enter the full PCI device IDs (e.g., 0000:17:00.0 0000:18:00.0) to enable sriov, separated by spaces.
Press Enter without input to configure ALL listed NVIDIA GPUs:
No PCI device IDs provided. Configuring all NVIDIA GPUs...
Enabling SR-IOV for 0000:c1:00.0...
Enabling VFs on 0000:c1:00.0
Cannot obtain unbindLock for 0000:c1:00.0

This error occurs when NVIDIA services are holding a lock on the PCI device, preventing SR-IOV configuration.

Prerequisites for this resolution

  • NVIDIA license and drivers are installed successfully

  • NVIDIA license server is created and configured

Resolution steps:

  1. Stop all NVIDIA-related services that might be holding the device lock:

  2. Remove and rescan the PCI device to reset its state:

Replace 0000:c1:00.0 with your actual PCI device ID from the error message.

  3. Manually enable SR-IOV for the GPU device:

  4. Restart the NVIDIA vGPU manager:

  5. Verify SR-IOV configuration by checking for virtual functions:

After a successful resolution, the verification step lists the virtual functions created for the GPU.
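The resolution steps above can be sketched as a single shell function. This is an illustration under assumptions, not the product's documented commands: the service names nvidia-vgpu-mgr and nvidia-vgpud and the helper /usr/lib/nvidia/sriov-manage are what a standard NVIDIA vGPU host driver installation provides and may differ on your host. The names fix_unbind_lock, run, write_sysfs, and the DRY_RUN switch are hypothetical and introduced here only for the example.

```shell
#!/usr/bin/env bash
# Sketch: release the device lock and re-enable SR-IOV for one GPU.
# Assumption: service names and the sriov-manage path match a standard
# NVIDIA vGPU host driver installation. Adjust them for your host.

SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
SRIOV_MANAGE="${SRIOV_MANAGE:-/usr/lib/nvidia/sriov-manage}"

# Run a command, or only print it when DRY_RUN=1 (hypothetical preview switch).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Write a value to a sysfs file, honoring DRY_RUN.
write_sysfs() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ echo $1 > $2"; else echo "$1" > "$2"; fi
}

fix_unbind_lock() {
  local dev="$1"   # full PCI address, for example 0000:c1:00.0
  # 1. Stop services that may hold the device lock.
  run systemctl stop nvidia-vgpu-mgr nvidia-vgpud
  # 2. Remove the device and rescan the bus to reset its state.
  write_sysfs 1 "$SYSFS_ROOT/bus/pci/devices/$dev/remove"
  write_sysfs 1 "$SYSFS_ROOT/bus/pci/rescan"
  # 3. Enable SR-IOV virtual functions on the device.
  run "$SRIOV_MANAGE" -e "$dev"
  # 4. Restart the vGPU manager.
  run systemctl start nvidia-vgpu-mgr
  # 5. Print how many virtual functions now exist.
  run cat "$SYSFS_ROOT/bus/pci/devices/$dev/sriov_numvfs"
}
```

To preview the commands without changing anything, set DRY_RUN=1 and call fix_unbind_lock 0000:c1:00.0. Run it as root without DRY_RUN to apply the steps.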

vGPU VMs fail to recover after GPU host reboot

When a GPU host with vGPU VMs is rebooted, the VMs scheduled on that host do not automatically recover. The system fails to recreate the required mediated device (mdev) configurations after the host restarts.

This occurs because mdev devices are not persistent across reboots. When the host restarts, the SR-IOV configuration and mdev device mappings are lost, preventing vGPU VMs from accessing their assigned GPU resources.

Resolution steps:

  1. Before rebooting the host, capture the current mdev device configuration:

  2. Save this output. You will need the device UUIDs and PCI addresses, along with the profile name of each device (for example, nvidia-924), to recreate the devices after the reboot.

  3. After the host reboots, reconfigure SR-IOV by running the GPU configuration script:

  4. Select option 3 when prompted:

  5. Recreate the mdev devices for each vGPU VM using the UUIDs and PCI addresses you saved in step 1:

Replace the UUIDs, PCI addresses, and nvidia-924 profile names with the values from your saved output.

  6. Verify the mdev devices are recreated:

The output should match your saved configuration from step 1.

Your vGPU VMs should now be able to start and access the assigned GPU resources.
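The capture and recreate steps above can be sketched with the kernel's standard mdev sysfs interface. This is a minimal sketch under assumptions, not the exact commands from this procedure: save_mdevs, restore_mdevs, and the SYSFS_ROOT override are hypothetical names introduced for the example, and the sketch assumes each mdev device is created on the PCI address recorded as its parent.

```shell
#!/usr/bin/env bash
# Sketch: record mdev devices before a reboot and recreate them afterwards.
# SYSFS_ROOT is a hypothetical override so the functions can be tried
# against a copy of the sysfs tree instead of the live host.

SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print one line per mdev device: "<uuid> <parent PCI address> <mdev type>".
save_mdevs() {
  local dev uuid parent mtype
  for dev in "$SYSFS_ROOT"/bus/mdev/devices/*; do
    [ -e "$dev" ] || continue                      # no mdev devices present
    uuid="$(basename "$dev")"
    # The bus entry is a symlink into the parent PCI device's directory.
    parent="$(basename "$(dirname "$(readlink -f "$dev")")")"
    # mdev_type is a symlink to the profile directory, for example nvidia-924.
    mtype="$(basename "$(readlink -f "$dev/mdev_type")")"
    echo "$uuid $parent $mtype"
  done
}

# Read lines produced by save_mdevs on stdin and recreate each device
# by writing its UUID to the profile's "create" attribute.
restore_mdevs() {
  local uuid parent mtype
  while read -r uuid parent mtype; do
    [ -n "$uuid" ] || continue
    echo "$uuid" > "$SYSFS_ROOT/bus/pci/devices/$parent/mdev_supported_types/$mtype/create"
  done
}
```

Before the reboot, run save_mdevs and redirect its output to a file that survives the reboot. After the host is back and SR-IOV has been reconfigured with option 3, feed that file to restore_mdevs as root.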

vGPU Profile Selection Restricted by Existing vGPU Allocations

When configuring vGPU on a GPU host, some vGPU profiles may not appear as selectable options even though the GPU supports them.

This typically happens when mediated devices (mdev) or vGPU-backed VMs already exist in the vGPU cluster. When any host in the cluster has active or stale vGPU allocations, the system prevents selecting a vGPU profile that changes the slice configuration of the GPU.

For example:

  • If a host in the cluster already has a vGPU VM using a profile that divides the GPU into 4 slices, the cluster is effectively locked to that slice layout.

  • Attempting to switch to a profile that uses a different slice configuration (for example 8 slices) will not be allowed.

  • In this case, those profiles may not appear in the available options during GPU host configuration.

This restriction ensures consistent vGPU resource partitioning across all hosts in the vGPU cluster.

Resolution steps:

  1. Remove all vGPU VMs: Delete or power off and remove any VMs using vGPU profiles across all hosts in the cluster.

  2. Check for existing mediated devices:

  3. Count the number of devices present:

  4. Stop NVIDIA vGPU services:

  5. Remove existing mediated devices:

  6. Restart NVIDIA vGPU services:

  7. Verify the devices are cleared:

  8. Return to the GPU host configuration in PCD to see the full list of available vGPU profiles.
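The counting and removal steps above can be sketched with the kernel's mdev sysfs attributes. This is a minimal sketch, not the original commands from this procedure: count_mdevs, remove_all_mdevs, and the SYSFS_ROOT override are hypothetical names introduced for the example. Stop the NVIDIA vGPU services before removing devices, as the steps describe.

```shell
#!/usr/bin/env bash
# Sketch: count and remove mediated devices through sysfs.
# SYSFS_ROOT is a hypothetical override for trying this on a copy of the tree.

SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the number of mdev devices currently defined on this host.
count_mdevs() {
  local n=0 dev
  for dev in "$SYSFS_ROOT"/bus/mdev/devices/*; do
    [ -e "$dev" ] && n=$((n + 1))
  done
  echo "$n"
}

# Remove every mdev device by writing 1 to its "remove" attribute.
remove_all_mdevs() {
  local dev
  for dev in "$SYSFS_ROOT"/bus/mdev/devices/*; do
    [ -e "$dev/remove" ] || continue
    echo 1 > "$dev/remove"
  done
}
```

Run count_mdevs before and after remove_all_mdevs (as root). The second count should be 0 once every device has been removed.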

VM creation fails with GPU model validation error

If the GPU model selected in the VM flavor doesn't match the GPU configured on the host, VM creation fails with a validation error. This check ensures that the GPU selected in the flavor matches the underlying GPU hardware.

To resolve this issue, verify that the GPU model specified in your flavor matches the GPU model configured on the host.

GPU host authorization and visibility issues

GPU host does not appear as a listed GPU host

After you run the GPU configuration script and authorize the host, the GPU host should appear in the GPU host list with its compatibility (passthrough or vGPU), GPU model, and device ID.

If your GPU host does not appear:

  • Verify the GPU configuration script ran successfully on the host.

  • Confirm you rebooted the host after running the script.

  • Check that host authorization completed.

  • Ensure GPU is enabled in both the host configuration and the cluster.

Script execution requirements

The GPU configuration script requires administrator privileges to run. Only administrators with access to the host and script should execute it. End users or developers requesting GPU resources do not need to run the script themselves.

Incorrect host configuration resulting in Cold Migration failure and Flavor resize of GPU enabled VMs

  1. Cold migration of a GPU-enabled VM fails.

The following error appears in the pf9-ostackhost logs after the cold migration attempt:

  2. Flavor resize of a GPU-enabled VM fails.

Resizing the flavor of a VM fails for both flavor upgrades and flavor downgrades with the following error:

Steps to resolve issues caused by an incorrect host configuration:

  1. Deauthorize the existing GPU-enabled hosts that use the host configuration you need to correct. Reference: Remove all roles and deauthorize a Host

  2. Run the following GPU passthrough script to cleanly remove the existing GPU configuration and avoid future conflicts:

  3. Decommission the GPU host. Reference: Decommission a Host

  4. Add the required GPU configuration in the Host config section of the Cluster blueprint in the Management Plane UI. Reference: Authorize your GPU Hosts

  5. Onboard the GPU host to the Management Plane. Reference: Authorize Host

  6. On the GPU host, run the GPU configuration script /opt/pf9/gpu/pf9-gpu-configure.sh to configure the GPU settings. Reference: Run the GPU configuration script

  7. Re-authorize the GPU host with the correct host configuration. Reference: Authorize your GPU Hosts

GPU Passthrough VM Creation Fails with PciPassthroughFilter

Symptom

A VM using a GPU passthrough flavor fails to create and lands in ERROR state. The fault message reads:

Retrieving the server events with pcdctl server event show <VM_UUID> <REQUEST_ID> shows a line similar to:

Why This Happens

The Compute Service applies a PCI passthrough filter when scheduling GPU passthrough VMs. This filter verifies three conditions for every candidate host:

  1. The host has available PCI devices matching the type specified in the flavor.

  2. The PCI device is reported in the Placement service inventory as an available resource (via a CUSTOM_PCI_<alias> resource class or a CUSTOM_VGPU_* trait, depending on configuration).

  3. The PCI alias in the host configuration matches the alias defined in the flavor's extra specs.

If no host satisfies all three conditions, every host is filtered out and VM creation fails.

Step 1: Confirm the Host's PCI Device Is Visible

On the GPU host, confirm the PCI device is detected by the OS:
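One way to perform this check, sketched under assumptions rather than taken from this procedure, is to scan sysfs for devices with the NVIDIA PCI vendor ID (0x10de). The function name list_nvidia_pci and the SYSFS_ROOT override are hypothetical and introduced only for the example. Running lspci -nn -d 10de: gives the same list with device names.

```shell
#!/usr/bin/env bash
# Sketch: list PCI devices whose vendor ID is NVIDIA (0x10de).
# SYSFS_ROOT is a hypothetical override for trying this on a copy of the tree.

SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the PCI address of every NVIDIA device visible to the OS.
list_nvidia_pci() {
  local dev
  for dev in "$SYSFS_ROOT"/bus/pci/devices/*; do
    [ -r "$dev/vendor" ] || continue
    if [ "$(cat "$dev/vendor")" = "0x10de" ]; then
      basename "$dev"
    fi
  done
}
```

An empty result from list_nvidia_pci means the OS does not see any NVIDIA device.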

If no NVIDIA device appears, the GPU may not be physically present or the PCI slot is not recognized. Check the physical hardware and BIOS settings (VT-d and SR-IOV must be enabled).

Step 2: Verify vfio-pci Binding

The GPU must be bound to the vfio-pci driver for passthrough to work. Check the driver binding:

The Kernel driver in use line must show vfio-pci. If it shows nvidia or is blank, the GPU is not configured for passthrough. Run the GPU passthrough configuration script:

Select option 1 (PCI Passthrough), then update GRUB and reboot. Confirm the binding after the reboot.
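The binding check in this step can also be read directly from sysfs, where each PCI device has a driver symlink pointing at the bound kernel driver. The following is a minimal sketch. The names pci_driver, is_bound_to_vfio, and the SYSFS_ROOT override are hypothetical and introduced for the example. It reports the same driver that lspci -k shows on its Kernel driver in use line.

```shell
#!/usr/bin/env bash
# Sketch: report which kernel driver a PCI device is bound to.
# SYSFS_ROOT is a hypothetical override for trying this on a copy of the tree.

SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the driver bound to the given PCI address, or "none" if unbound.
pci_driver() {
  local link="$SYSFS_ROOT/bus/pci/devices/$1/driver"
  if [ -L "$link" ]; then
    basename "$(readlink "$link")"
  else
    echo "none"
  fi
}

# Succeed only when the device is bound to vfio-pci.
is_bound_to_vfio() {
  [ "$(pci_driver "$1")" = "vfio-pci" ]
}
```

For example, run pci_driver 0000:c1:00.0 with your own PCI address after the reboot. Any value other than vfio-pci means the GPU is not ready for passthrough.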

Step 3: Validate the Passthrough Configuration

Run the validation option in the GPU configuration script to confirm the host is in a consistent passthrough state:

Select option 4 (Validate Passthrough). The script confirms IOMMU groups, vfio-pci binding, and the GRUB configuration are correct.

Step 4: Confirm the PCI Alias in the Host Configuration

The GPU model selected in the host configuration creates a PCI alias entry. The flavor must reference the same alias. Mismatches between the GPU model in the host configuration and the GPU model in the flavor cause the PCI passthrough filter to reject the host.

  1. Navigate to Infrastructure > Cluster Blueprint > Host Configurations and confirm the GPU model is correctly set.

  2. Navigate to Virtual Machines > Flavors, open the GPU flavor, and confirm the GPU model matches the host configuration exactly.

If they differ, either update the host configuration to match the GPU model in the flavor, or recreate the flavor with the correct GPU model. See Incorrect host configuration resulting in Cold Migration failure and Flavor resize of GPU enabled VMs for the full reset procedure when the host configuration must change.

Step 5: Confirm the Host Is Authorized and Scheduling Is Enabled

Confirm the host shows Status: ok and Scheduling: enabled. A host in Scheduling: disabled state is excluded from all placement decisions regardless of available resources.

If scheduling is disabled, navigate to Infrastructure > Cluster Hosts, select the host, and re-enable scheduling.

Step 6: Check Placement Service Inventory

Self-Hosted deployments only. Inspect the Placement service to confirm the host is registered as a resource provider with the expected PCI resource class:

Then verify the resource provider inventory directly:

Look for a resource class matching the GPU PCI alias (for example, CUSTOM_PCI_NVIDIA_L4_PASSTHROUGH or the alias name configured in your deployment). If it is missing, restart pf9-ostackhost on the GPU host to trigger a re-sync.
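As a hedged illustration of the inventory check, the sketch below filters an inventory listing for PCI resource classes. It assumes the OpenStack CLI with the Placement plugin (osc-placement) and admin credentials are available in your self-hosted deployment. Your environment may expose these commands through a different client. The function name pci_resource_classes is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: extract PCI-related custom resource classes from an inventory listing.
# Assumed usage (requires the osc-placement plugin and admin credentials):
#   openstack resource provider list --name <hostname>
#   openstack resource provider inventory list <provider_uuid> | pci_resource_classes

# Print each distinct CUSTOM_PCI_* resource class found on stdin.
pci_resource_classes() {
  grep -o 'CUSTOM_PCI_[A-Z0-9_]*' | sort -u
}
```

No output from pci_resource_classes means the resource provider has no PCI resource class in its inventory.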

In SaaS deployments, contact Platform9 Support if the passthrough filter continues to reject all hosts after completing the steps above.

vGPU Profile Configuration Is Blank or Fails to Display

Symptom

When you navigate to Infrastructure > GPU Hosts and select a vGPU host to configure its profile, the profile dropdown is empty or shows no available profiles. Alternatively, a profile was previously set but the UI now shows the host in an unconfigured state.

Why This Happens

The profile list is populated from the mdev types that the NVIDIA vGPU manager reports as available on the host. If the vGPU manager is not running, or if the SR-IOV VFs are not present, no profiles appear. A stale or failed previous profile-change attempt can also leave the host in an inconsistent state where the UI cannot determine the current profile.

Step 1: Verify the NVIDIA vGPU Manager Is Running

On the vGPU host:

If the service is not running (inactive or failed), start it and check for errors:

Common startup failures include a missing NVIDIA license (the vGPU manager will not start without a valid license) and a kernel module mismatch. Resolve the underlying error before continuing.
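The service check in this step can be sketched as follows. The sketch assumes the vGPU manager runs as the systemd unit nvidia-vgpu-mgr, which is the name used by a standard NVIDIA vGPU host driver installation. Verify the unit name on your host. The function name check_vgpu_mgr is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report the state of the NVIDIA vGPU manager service.
# Assumption: the systemd unit is named nvidia-vgpu-mgr.

# Print the service state; on failure, show recent log entries and return 1.
check_vgpu_mgr() {
  local state
  state="$(systemctl is-active nvidia-vgpu-mgr 2>/dev/null || true)"
  if [ "$state" = "active" ]; then
    echo "nvidia-vgpu-mgr is active"
  else
    echo "nvidia-vgpu-mgr is ${state:-unknown}; recent log entries:"
    journalctl -u nvidia-vgpu-mgr -n 50 --no-pager
    return 1
  fi
}
```

If check_vgpu_mgr reports a failure, start the service with systemctl start nvidia-vgpu-mgr and run the check again to see whether the logged error persists.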

Step 2: Confirm SR-IOV Virtual Functions Are Present

A value of 0 means no VFs are present and no mdev types will be enumerable. Re-run the SR-IOV configuration:

Select option 3 (vGPU SR-IOV configure).
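The virtual function count in this step comes from the sriov_numvfs attribute of the physical GPU in sysfs. The following is a minimal sketch. The function name sriov_vf_status and the SYSFS_ROOT override are hypothetical and introduced for the example. It also prints sriov_totalvfs, the maximum the device supports, for comparison.

```shell
#!/usr/bin/env bash
# Sketch: show enabled and maximum SR-IOV virtual functions for one GPU.
# SYSFS_ROOT is a hypothetical override for trying this on a copy of the tree.

SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print "<enabled VFs>/<maximum VFs>" for the given physical PCI address.
sriov_vf_status() {
  local dev="$SYSFS_ROOT/bus/pci/devices/$1"
  if [ ! -r "$dev/sriov_numvfs" ]; then
    echo "no SR-IOV attributes found for $1" >&2
    return 1
  fi
  echo "$(cat "$dev/sriov_numvfs")/$(cat "$dev/sriov_totalvfs")"
}
```

Call sriov_vf_status with the physical GPU's PCI address. A first number of 0 is the case described above, where the SR-IOV configuration must be re-run.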

Step 3: Enumerate Available mdev Types

Confirm that the NVIDIA vGPU manager is advertising profiles for the GPU:

Each directory listed (for example, nvidia-924) is an available profile. If this directory is empty or the path does not exist, the vGPU manager is not running correctly or the VFs are not initialized. Resolve steps 1 and 2 first.
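The enumeration in this step can be sketched with the kernel's standard mdev sysfs layout, where each mdev-capable PCI device exposes one directory per supported type under mdev_supported_types. The function name list_mdev_types and the SYSFS_ROOT override are hypothetical and introduced for the example.

```shell
#!/usr/bin/env bash
# Sketch: list the mdev types advertised by every PCI device on the host.
# SYSFS_ROOT is a hypothetical override for trying this on a copy of the tree.

SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print one line per advertised type:
#   <PCI address> <type id> <available instances> <profile name>
list_mdev_types() {
  local mtype dev
  for mtype in "$SYSFS_ROOT"/bus/pci/devices/*/mdev_supported_types/*; do
    [ -d "$mtype" ] || continue
    dev="$(basename "$(dirname "$(dirname "$mtype")")")"
    echo "$dev $(basename "$mtype") $(cat "$mtype/available_instances") $(cat "$mtype/name")"
  done
}
```

No output from list_mdev_types corresponds to the empty or missing directory described above.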

Step 4: Recover from a Stale or Failed Profile Change

If the UI shows the host in an error state after a profile-change attempt, or if a previously selected profile is no longer reflected on the host, perform a clean reset:

  1. Remove all vGPU VMs from this host (power off and delete, or live-migrate them to another host).

  2. Stop the NVIDIA vGPU manager:

  3. Remove all existing mdev devices:

  4. Confirm all mdev devices are cleared:

The directory should be empty.

  5. Restart the vGPU manager:

  6. Re-run the vGPU SR-IOV configuration:

Select option 3 (vGPU SR-IOV configure).

  7. Return to Infrastructure > GPU Hosts in the UI and select the host. The profile dropdown should now list the available profiles. Select the desired profile and save.

  8. Restart the Compute Service agent to update inventory:

Step 5: Re-Authorize the Host if Needed

If the profile change requires deauthorizing and re-authorizing the host (for example, when switching to a profile that uses a different slice configuration), follow the full re-authorization procedure:

  1. Deauthorize the host: see Remove all roles and deauthorize a Host.

  2. Apply the new profile via the GPU Hosts configuration page.

  3. Re-authorize the host into the vGPU cluster: see Authorize vGPU Hosts.
