vGPU Inventory Troubleshooting and Multiple vGPU per VM

Overview

This guide covers two related operational topics for vGPU hosts:

  1. vGPU inventory problems — diagnosing why a vGPU host is missing from or shows incorrect inventory in the Compute Service resource-tracking layer.

  2. Multiple vGPU per VM — understanding the constraints when a VM requests more than one vGPU and how to verify that all requested vGPU devices attach correctly.

Both topics involve the host's mediated device (mdev) state and the SR-IOV virtual-function (VF) configuration. If a host is missing from inventory or reports the wrong vGPU count, work through the inventory section first. If the VM is running but a vGPU device is not visible inside the guest, go to the multi-vGPU section.

Diagnose and Repair vGPU Inventory Problems

Understand How vGPU Inventory Is Reported

When a vGPU host is authorized and the vGPU profile is configured, the Compute Service agent on that host reports resource providers and inventory to the Placement service. Each vGPU slot appears as a resource class of the form VGPU_<profile> (for example, VGPU_NVIDIA_924). If the host does not appear as a resource provider, or if the available inventory count is wrong, the Compute Service cannot schedule vGPU VMs to that host.

Common causes of incorrect inventory:

  • The mdev devices are not created or are in an inconsistent state (for example, some VFs have mdevs and some do not).

  • The host has enabled_mdev_types configured with duplicate or conflicting profile names.

  • Mixed vGPU profiles are applied to different virtual functions on the same physical GPU — only a single profile is supported per physical GPU at a time.

  • SR-IOV virtual functions were not created (/sys/bus/pci/devices/<GPU_PCI_ADDR>/sriov_numvfs reads 0 for the physical GPU).

  • The Compute Service agent was not restarted after mdev state changed.

Step 1: Verify SR-IOV Virtual Functions

SSH to the host and confirm that the physical GPU has virtual functions enabled by reading /sys/bus/pci/devices/<GPU_PCI_ADDR>/sriov_numvfs.

Replace <GPU_PCI_ADDR> with the GPU PCI address (for example, 0000:c1:00.0). A value of 0 means no VFs are active and no mdev devices can exist. If you see 0, re-run the GPU configuration script and select option 3 (vGPU SR-IOV configure) when prompted.
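The VF check in this step can also be scripted. The following is a minimal Python sketch, not part of the product tooling: the function name read_sriov_numvfs and the sysfs_root parameter are illustrative, and the sketch assumes the standard Linux sysfs layout named earlier in this guide.

```python
from pathlib import Path


def read_sriov_numvfs(gpu_pci_addr: str, sysfs_root: str = "/sys") -> int:
    """Return the number of active VFs for a physical GPU.

    Returns 0 when the sriov_numvfs file does not exist, which usually
    means the PCI address is wrong or the device has no SR-IOV support.
    """
    path = (
        Path(sysfs_root) / "bus" / "pci" / "devices" / gpu_pci_addr / "sriov_numvfs"
    )
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        return 0
```

On a host you would call read_sriov_numvfs("0000:c1:00.0") and treat a result of 0 as the signal to re-run the SR-IOV configuration step.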

Step 2: List Mediated Devices and Check Consistency

After confirming VFs are present, list the active mdev devices (for example, with mdevctl list).

When the profile is configured correctly, the output contains one mdev entry per VF that has been assigned this profile.

Signs of inconsistency to watch for:

  • No output — no mdev devices are created at all. The inventory will be zero or absent.

  • Mixed profile names (for example, nvidia-918 on one VF and nvidia-924 on another VF on the same physical GPU) — only one profile is allowed per physical GPU; remove all mdevs and re-create with a single profile.

  • Fewer mdevs than expected VFs — partial allocation; the inventory count will be lower than the GPU's maximum capacity.
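The three signs above can be checked mechanically. This Python sketch assumes that each line of mdevctl list output starts with three whitespace-separated fields (UUID, parent PCI address, mdev type), which matches the upstream mdevctl tool but should be confirmed on your host. The vf_to_gpu mapping from VF address to physical GPU address is a hypothetical input that you would build yourself, for example from each VF's physfn link in sysfs.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Mdev(NamedTuple):
    uuid: str
    parent: str   # PCI address of the VF that backs the mdev
    profile: str  # mdev type, for example nvidia-924


def parse_mdevctl_list(text: str) -> List[Mdev]:
    """Parse lines of the assumed form '<uuid> <parent> <type> [...]'."""
    mdevs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mdevs.append(Mdev(fields[0], fields[1], fields[2]))
    return mdevs


def find_inconsistencies(
    mdevs: List[Mdev], expected_vfs: int, vf_to_gpu: Dict[str, str]
) -> List[str]:
    """Report missing mdevs, partial allocation, and mixed profiles per GPU."""
    problems = []
    if not mdevs:
        problems.append("no mdev devices found")
    elif len(mdevs) < expected_vfs:
        problems.append(f"only {len(mdevs)} of {expected_vfs} expected mdevs exist")

    profiles_by_gpu = defaultdict(set)
    for mdev in mdevs:
        # Fall back to the VF address itself when the mapping has no entry.
        gpu = vf_to_gpu.get(mdev.parent, mdev.parent)
        profiles_by_gpu[gpu].add(mdev.profile)
    for gpu, profiles in sorted(profiles_by_gpu.items()):
        if len(profiles) > 1:
            problems.append(f"mixed profiles on {gpu}: {', '.join(sorted(profiles))}")
    return problems
```

An empty result means none of the three signs was detected. It does not prove that the Compute Service agent has reported the inventory.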

Step 3: Check the enabled_mdev_types Configuration

Duplicate or conflicting entries in the enabled_mdev_types configuration can cause the Compute Service agent to report inventory incorrectly.
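As an illustration, duplicate entries can be detected with a short Python sketch. It assumes the setting is stored as a comma-separated list in an INI-style file under a [devices] section, as in upstream OpenStack Nova. Where and how your deployment stores enabled_mdev_types is not covered here, so treat the section name and file format as assumptions.

```python
import configparser
from collections import Counter
from typing import List


def duplicate_mdev_types(conf_text: str) -> List[str]:
    """Return profile names listed more than once in enabled_mdev_types.

    Assumes an INI-style file with a [devices] section (an assumption
    taken from upstream OpenStack Nova, not from this guide).
    """
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    raw = parser.get("devices", "enabled_mdev_types", fallback="")
    names = [name.strip() for name in raw.split(",") if name.strip()]
    return sorted(name for name, count in Counter(names).items() if count > 1)
```

A non-empty result lists the profile names to de-duplicate before restarting the agent.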

On the host, inspect the supported types for each VF. On Linux these are exposed under /sys/bus/pci/devices/<VF_PCI_ADDR>/mdev_supported_types.

The output lists the profiles that VF supports (for example, nvidia-924). Only one profile should be used across all VFs of a given physical GPU. If you see entries from a previous profile configuration, remove all mdev devices, reset the VF state, and re-create using the single intended profile.
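A minimal Python sketch of this inspection follows. It assumes the standard Linux mdev sysfs layout, in which each supported type is a directory under mdev_supported_types containing an available_instances file. The function name and the sysfs_root parameter are illustrative.

```python
from pathlib import Path
from typing import Dict


def supported_mdev_types(vf_pci_addr: str, sysfs_root: str = "/sys") -> Dict[str, int]:
    """Map each mdev type a VF supports to its available_instances value.

    Returns an empty dict when the VF exposes no mdev_supported_types
    directory. A type without an available_instances file is reported as 0.
    """
    types_dir = (
        Path(sysfs_root) / "bus" / "pci" / "devices" / vf_pci_addr
        / "mdev_supported_types"
    )
    result: Dict[str, int] = {}
    if not types_dir.is_dir():
        return result
    for type_dir in sorted(types_dir.iterdir()):
        available = type_dir / "available_instances"
        if available.is_file():
            result[type_dir.name] = int(available.read_text().strip())
        else:
            result[type_dir.name] = 0
    return result
```

Running this for every VF of one physical GPU and comparing the keys against the profile you intend to use gives a quick view of leftover state from an earlier configuration.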

Step 4: Verify the Compute Service Agent Reflects the Inventory

After confirming that mdev devices are present and consistent, verify that the Compute Service agent on the host has picked up the current inventory.

Check the Compute Service agent log for resource-provider update lines:

Look for lines confirming that the agent reported the resource class and the correct total count. If the log shows errors such as "Failed to update inventory" or "No resource provider found", restart the Compute Service agent on the host:

Wait two to three minutes and then recheck the log.
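When you recheck the log, a small filter helps isolate the relevant lines. The Python sketch below searches for the two error strings named in this step. It takes any iterable of lines, so the log file location is left to you and is not assumed here.

```python
from typing import Iterable, List

# Error strings quoted in this guide; extend the tuple if you see others.
ERROR_MARKERS = ("Failed to update inventory", "No resource provider found")


def inventory_errors(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain a known inventory error string."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if any(marker in line for marker in ERROR_MARKERS)
    ]
```

You could pass an open file object directly, for example inventory_errors(open(path)), where path is the agent log on your host.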

Self-Hosted deployments only. If host-side restarts do not resolve the inventory mismatch, inspect the Placement service pod in the region namespace from the management cluster:

In SaaS deployments, contact Platform9 Support if host-side remediation does not resolve the inventory discrepancy.

Step 5: Force a Full Inventory Re-sync

If inventory remains stale after the agent restart, force the Compute Service agent to re-discover and re-report the host's resources:

Allow three to five minutes for the agent to re-register the host as a resource provider and report inventory. Then check the GPU Hosts view in the UI and confirm the host shows the expected vGPU count.

Step 6: Re-validate with the GPU Configuration Script

After resolving inventory, run the GPU configuration script and select option 5 (Validate vGPU) to confirm the host is in a consistent state. A clean validation output confirms that the mdev state, SR-IOV VFs, and NVIDIA vGPU manager service are all operating correctly.

Multiple vGPU per VM

Overview

A VM flavor can request more than one vGPU by setting Number of vGPUs to a value greater than one (for example, 2) when creating the flavor. This is useful for workloads that can distribute computation across multiple GPU slices.

However, several constraints govern whether a multi-vGPU VM works as expected. If the VM is allocated and powered on but only one vGPU device appears inside the guest, use the following checklist to identify the cause.

Constraints for Multi-vGPU VMs

  • Flavor request count: The flavor must request the exact number of vGPUs you want. Setting Number of vGPUs to 2 instructs the Compute Service to allocate two VGPU resource units from the same host.

  • Host capacity per profile: The host must have at least as many available mdev slots of the requested profile as the flavor requests. For example, if the flavor requests 2 vGPUs of profile nvidia-924 and the host has only 1 slot remaining, VM creation fails.

  • Profile consistency: All vGPU devices on a host must use the same profile. Mixing profiles on one physical GPU is not supported. Requesting vGPUs from mixed-profile hosts causes unpredictable allocation.

  • Single physical GPU only: Multiple vGPU slots allocated to one VM must come from the same physical GPU on a single host. Cross-physical-GPU vGPU allocation within a single VM is not supported.

  • NVIDIA driver inside the VM: The guest OS must have the correct NVIDIA vGPU guest driver installed. A missing or mismatched guest driver causes the VM to see fewer devices than allocated.
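The capacity and single-physical-GPU constraints combine into one rule: a request fits only if a single physical GPU has enough free slots on its own. The Python sketch below illustrates that rule and is not the scheduler's actual logic. The free_slots_by_gpu mapping is a hypothetical input keyed by physical GPU PCI address.

```python
from typing import Dict, Optional


def pick_gpu(requested: int, free_slots_by_gpu: Dict[str, int]) -> Optional[str]:
    """Return one physical GPU that can satisfy the whole request, else None.

    Free slots on different physical GPUs are never added together,
    because one VM cannot draw vGPUs from more than one physical GPU.
    """
    for gpu, free in sorted(free_slots_by_gpu.items()):
        if free >= requested:
            return gpu
    return None
```

For example, a request for 2 vGPUs does not fit a host with one free slot on each of two GPUs, even though two slots are free in total.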

Diagnose: VM Is Allocated Multiple vGPUs but Only One Device Appears in the Guest

Step 1: Confirm the Compute Service allocated the requested count.

The output should show a number of allocated VGPU resources that matches the flavor's request count. If the allocation count is lower than expected, the host did not have sufficient capacity. See the capacity verification steps in the inventory section above.
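If you retrieve the VM's allocations as JSON, the count can be summed programmatically. This Python sketch assumes the response shape used by the upstream OpenStack Placement API for a consumer's allocations (an "allocations" object keyed by resource provider, each with a "resources" map). Whether your deployment exposes the same shape is an assumption to verify.

```python
from typing import Any, Dict


def allocated_vgpus(allocations_body: Dict[str, Any]) -> int:
    """Sum every VGPU or VGPU_<profile> amount across all resource providers."""
    total = 0
    for provider in allocations_body.get("allocations", {}).values():
        for resource_class, amount in provider.get("resources", {}).items():
            if resource_class == "VGPU" or resource_class.startswith("VGPU_"):
                total += amount
    return total
```

Compare the result with the flavor's Number of vGPUs value. A lower number points back to host capacity.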

Step 2: Confirm the mdev devices are attached to the VM.

On the host running the VM, list the mdev devices with mdevctl list and inspect the VM's libvirt domain XML with virsh to identify which mdev devices are attached to the VM.

The virsh output should show one <hostdev> entry for each allocated vGPU. If fewer entries appear than expected, the libvirt domain XML did not include all allocated vGPU devices. This can happen if the mdev was not available when the VM started — verify that the mdev devices listed in mdevctl list match the UUIDs referenced in the domain XML.
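The comparison between the domain XML and the host's mdev list can be sketched in Python. The sketch assumes the usual libvirt representation of a mediated device, a hostdev element with type='mdev' whose source/address element carries the UUID. The function names are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List


def mdev_uuids_in_domain(domain_xml: str) -> List[str]:
    """Return the mdev UUIDs referenced by hostdev entries in a domain XML."""
    root = ET.fromstring(domain_xml)
    uuids = []
    for hostdev in root.iter("hostdev"):
        if hostdev.get("type") != "mdev":
            continue  # skip plain PCI or USB passthrough devices
        address = hostdev.find("./source/address")
        if address is not None and address.get("uuid"):
            uuids.append(address.get("uuid"))
    return uuids


def missing_on_host(domain_uuids: Iterable[str], host_uuids: Iterable[str]) -> List[str]:
    """Return UUIDs the domain references that the host does not list."""
    return sorted(set(domain_uuids) - set(host_uuids))
```

The length of the first result should equal the allocated vGPU count, and the second result should be empty when every referenced mdev exists on the host.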

Step 3: Check the NVIDIA guest driver inside the VM.

From inside the VM (or via the console), run:

The output should list one entry per allocated vGPU. If only one GPU appears but two were allocated, the guest driver may not be installed or may be outdated. Install or update the NVIDIA vGPU guest driver package that matches your host's NVIDIA vGPU manager version.
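One way to count the devices the guest driver sees is to parse the output of nvidia-smi -L, which prints one "GPU <index>: ..." line per visible device. Using that command here is an assumption, since this guide does not name the exact command. The Python sketch below only parses text that you supply.

```python
import re

# Matches the start of a device line such as "GPU 0: <name> (UUID: ...)".
_GPU_LINE = re.compile(r"^GPU \d+: ")


def count_visible_gpus(nvidia_smi_list_output: str) -> int:
    """Count the device lines in text captured from `nvidia-smi -L`."""
    return sum(
        1 for line in nvidia_smi_list_output.splitlines() if _GPU_LINE.match(line)
    )
```

A count lower than the number of allocated vGPUs suggests a guest driver problem rather than a host-side attachment problem.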

Step 4: Verify the host has sufficient profile capacity.

On the host, count the mdev slots of the requested profile that are already in use.

Compare this count against the total number of mdev slots that the GPU supports at the requested profile. The difference is the remaining capacity. If the remaining capacity is less than the flavor's vGPU request count, the VM will not be placed on that host.

Profile Mixing Rules

Only one vGPU profile may be active on a single physical GPU at any time. All virtual functions on that GPU must use the same profile. If a VM requests a profile that differs from the profile currently active on the host, VM creation fails with a resource provider inventory error.

To change the active profile on a host, complete these steps in order:

  1. Remove all VMs using the current profile on that host.

  2. Remove all existing mdev devices (see the resolution steps in Troubleshooting GPU Support).

  3. Re-run the vGPU SR-IOV configuration with the new profile.

  4. Authorize the host in the cluster so the new inventory is reported.
