
# vGPU Inventory Troubleshooting and Multiple vGPU per VM

## Overview

This guide covers two related operational topics for vGPU hosts:

1. **vGPU inventory problems**: diagnosing why a vGPU host is missing from the Compute Service resource-tracking layer or reports incorrect inventory.
2. **Multiple vGPU per VM**: understanding the constraints that apply when a VM requests more than one vGPU, and verifying that all requested vGPU devices attach correctly.

Both topics involve the host's mediated device (mdev) state and the SR-IOV virtual-function configuration. Work through the inventory section first if you see unexpected behavior; then consult the multi-vGPU section if the VM itself is running but a device is not visible inside the guest.

## Diagnose and Repair vGPU Inventory Problems

### Understand How vGPU Inventory Is Reported

When a vGPU host is authorized and the vGPU profile is configured, the Compute Service agent on that host reports resource providers and inventory to the Placement service. Each vGPU slot appears as a resource class of the form `VGPU_<profile>` (for example, `VGPU_NVIDIA_924`). If the host does not appear as a resource provider, or if the available inventory count is wrong, the Compute Service cannot schedule vGPU VMs to that host.

Common causes of incorrect inventory:

* The mdev devices are not created or are in an inconsistent state (for example, some VFs have mdevs and some do not).
* The host has `enabled_mdev_types` configured with duplicate or conflicting profile names.
* Mixed vGPU profiles are applied to different virtual functions on the same physical GPU — only a single profile is supported per physical GPU at a time.
* SR-IOV virtual functions were not created (`/sys/bus/pci/devices/<GPU_PCI_ADDR>/sriov_numvfs` reads `0` for the physical GPU).
* The Compute Service agent was not restarted after mdev state changed.

### Step 1: Verify SR-IOV Virtual Functions

SSH to the host and confirm that the physical GPU has virtual functions enabled.

```bash
cat /sys/bus/pci/devices/<GPU_PCI_ADDR>/sriov_numvfs
```

Replace `<GPU_PCI_ADDR>` with the GPU PCI address (for example, `0000:c1:00.0`). A value of `0` means no VFs are active and no mdev devices can exist. If you see `0`, re-run the vGPU SR-IOV configuration step:

```bash
cd /opt/pf9/gpu
sudo ./pf9-gpu-configure.sh
```

Select option **3** (`vGPU SR-IOV configure`) when prompted.
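
On hosts with several GPUs it can be convenient to check every SR-IOV-capable device in one pass. The following is a minimal sketch, not part of the Platform9 tooling: the function name `list_sriov_vfs` and its optional sysfs-root argument are hypothetical, and it relies only on the standard `sriov_numvfs` and `sriov_totalvfs` sysfs attributes.

```bash
# Print the active and maximum VF counts for every SR-IOV-capable PCI device.
# $1 (optional, hypothetical): alternate sysfs root, useful for dry runs.
list_sriov_vfs() {
  local root="${1:-/sys/bus/pci/devices}"
  local dev addr active total
  for dev in "$root"/*/; do
    # Only SR-IOV-capable physical functions expose sriov_numvfs.
    [ -r "${dev}sriov_numvfs" ] || continue
    addr="$(basename "$dev")"
    active="$(cat "${dev}sriov_numvfs")"
    total="$(cat "${dev}sriov_totalvfs" 2>/dev/null || echo "?")"
    echo "$addr active=$active total=$total"
  done
}

# Usage on the host:
#   list_sriov_vfs
```

The output includes any SR-IOV-capable device, such as network adapters, so match the address against your GPU PCI address. A GPU line showing `active=0` corresponds to the `0` case described above.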

### Step 2: List Mediated Devices and Check Consistency

After confirming VFs are present, list the active mdev devices:

```bash
mdevctl list
```

Expected output when the profile is configured correctly — one mdev entry per VF that has been assigned this profile:

```
295ba0f1-20d3-4a94-9d24-ad29edc5ac15 0000:c1:01.1 nvidia-924
916542a1-f966-4195-97d6-7cf2a10c850d 0000:c1:03.2 nvidia-924
```

Signs of inconsistency to watch for:

* No output — no mdev devices are created at all. The inventory will be zero or absent.
* Mixed profile names (for example, `nvidia-918` on one VF and `nvidia-924` on another VF on the same physical GPU) — only one profile is allowed per physical GPU; remove all mdevs and re-create with a single profile.
* Fewer mdevs than expected VFs — partial allocation; the inventory count will be lower than the GPU's maximum capacity.
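
The checks in the list above can be approximated by summarizing the `mdevctl list` output. This is a rough sketch under one loud assumption: it groups VFs by their PCI `domain:bus` prefix and treats that prefix as a stand-in for the physical GPU. That heuristic may not hold on every platform; resolving `/sys/bus/pci/devices/<VF_ADDR>/physfn` is the authoritative way to find the parent device. The function name `summarize_mdevs` is hypothetical.

```bash
# Summarize mdev devices per (domain:bus, profile) pair.
# Reads `mdevctl list` output on stdin: "<uuid> <parent_addr> <type> ...".
summarize_mdevs() {
  awk '
    NF >= 3 {
      split($2, parts, ":")            # parent address is domain:bus:slot.function
      key = parts[1] ":" parts[2]      # ASSUMPTION: domain:bus identifies the physical GPU
      count[key " " $3]++
    }
    END { for (k in count) print k, count[k] }
  ' | sort
}

# Usage on the host:
#   mdevctl list | summarize_mdevs
```

For the sample output above, this prints `0000:c1 nvidia-924 2`. No output matches the first sign of inconsistency, two lines with the same prefix but different profile names point to mixed profiles, and a count below the number of VFs suggests partial allocation.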

### Step 3: Check the `enabled_mdev_types` Configuration

Duplicate or conflicting entries in the `enabled_mdev_types` configuration can cause the Compute Service agent to report inventory incorrectly.

On the host, inspect the supported types for each VF:

```bash
ls /sys/class/mdev_bus/0000:<VF_ADDR>/mdev_supported_types/
```

The output lists the profiles that VF supports (for example, `nvidia-924`). Only one profile should be used across all VFs of a given physical GPU. If you see entries from a previous profile configuration, remove all mdev devices, reset the VF state, and re-create using the single intended profile.
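
To inspect every VF rather than one at a time, a loop over the mdev sysfs tree can help. The sketch below assumes the standard kernel layout, in which each supported type directory contains an `available_instances` file; the function name `list_supported_types` and its optional root argument are hypothetical.

```bash
# List each mdev-capable device, the profiles it supports, and how many
# instances of each profile can still be created on it.
# $1 (optional, hypothetical): alternate root, useful for dry runs.
list_supported_types() {
  local root="${1:-/sys/class/mdev_bus}"
  local type_dir vf_dir avail
  for type_dir in "$root"/*/mdev_supported_types/*/; do
    [ -d "$type_dir" ] || continue
    vf_dir="$(dirname "$(dirname "$type_dir")")"
    avail="$(cat "${type_dir}available_instances" 2>/dev/null || echo "?")"
    echo "$(basename "$vf_dir") $(basename "$type_dir") available=$avail"
  done
}

# Usage on the host:
#   list_supported_types
```

Each output line has the form `<VF_ADDR> <profile> available=<n>`, which makes leftover entries from a previous profile configuration easier to spot across many VFs.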

### Step 4: Verify the Compute Service Agent Reflects the Inventory

After confirming that mdev devices are present and consistent, verify that the Compute Service agent on the host has picked up the current inventory.

Check the Compute Service agent log for resource-provider update lines:

```bash
grep -i "vgpu\|resource_provider\|inventory" /var/log/pf9/ostackhost.log | tail -30
```

Look for lines confirming that the agent reported the resource class and the correct total count. If the log shows errors such as `Failed to update inventory` or `No resource provider found`, restart the Compute Service agent on the host:

```bash
sudo systemctl restart pf9-ostackhost
```

Wait two to three minutes and then recheck the log.
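
When rechecking the log, counting the two error strings mentioned above gives a quick before-and-after comparison. This is only a sketch: the function name `count_inventory_errors` is hypothetical, and the exact wording of log messages may differ between releases.

```bash
# Count log lines that contain either inventory error string.
# $1 (optional): log file path; defaults to the agent log used above.
count_inventory_errors() {
  local log="${1:-/var/log/pf9/ostackhost.log}"
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c -E 'Failed to update inventory|No resource provider found' "$log" || true
}

# Usage on the host:
#   count_inventory_errors
```

Older entries stay in the log after a restart, so a count that stops increasing is the useful signal, not a count of zero.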

{% hint style="info" %}
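**Self-Hosted deployments only.** If host-side restarts do not resolve the inventory mismatch, check the Compute Service scheduler logs for Placement-related errors and re-sync host aggregates to the Placement service. Run the following from the management cluster against the region namespace: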

```bash
kubectl -n <region-namespace> logs deployment/nova-scheduler | grep -i "placement\|inventory\|vgpu"
kubectl -n <region-namespace> exec deployment/nova-scheduler -- nova-manage placement sync_aggregates
```

In SaaS deployments, contact Platform9 Support if host-side remediation does not resolve the inventory discrepancy.
{% endhint %}

### Step 5: Force a Full Inventory Re-sync

If inventory remains stale after the agent restart, force the Compute Service agent to re-discover and re-report the host's resources:

```bash
sudo systemctl stop pf9-ostackhost
sudo rm -f /var/lib/nova/nova.sqlite  # clears local resource-provider cache
sudo systemctl start pf9-ostackhost
```

Allow three to five minutes for the agent to re-register the host as a resource provider and report inventory. Then check the GPU Hosts view in the UI and confirm the host shows the expected vGPU count.

### Step 6: Re-validate with the GPU Configuration Script

After resolving inventory, run the GPU configuration script's validation option to confirm the host is in a consistent state:

```bash
cd /opt/pf9/gpu
sudo ./pf9-gpu-configure.sh
```

Select option **5** (`Validate vGPU`). A clean validation output confirms the mdev state, SR-IOV VFs, and NVIDIA vGPU manager service are all operating correctly.

## Multiple vGPU per VM

### Overview

A VM flavor can request more than one vGPU by setting **Number of vGPUs** to a value greater than one (for example, 2) when creating the flavor. This is useful for workloads that can distribute computation across multiple GPU slices.

However, several constraints govern whether a multi-vGPU VM works as expected. If the VM is allocated and powered on but only one vGPU device appears inside the guest, use the following checklist to identify the cause.

### Constraints for Multi-vGPU VMs

| Constraint                      | Detail                                                                                                                                                                                                                                   |
| ------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Flavor request count**        | The flavor must request the exact number of vGPUs you want. Setting **Number of vGPUs** to `2` instructs the Compute Service to allocate two VGPU resource units from the same host.                                                     |
| **Host capacity per profile**   | The host must have at least as many available mdev slots of the requested profile as the flavor requests. For example, if the flavor requests 2 vGPUs of profile `nvidia-924` and the host has only 1 slot remaining, VM creation fails. |
| **Profile consistency**         | All vGPU devices on a host must use the same profile. Mixing profiles on one physical GPU is not supported. Requesting vGPUs from mixed-profile hosts causes unpredictable allocation.                                                   |
| **Single physical GPU only**    | Multiple vGPU slots allocated to one VM must come from the same physical GPU on a single host. Cross-physical-GPU vGPU allocation within a single VM is not supported.                                                                   |
| **NVIDIA driver inside the VM** | The guest OS must have the correct NVIDIA vGPU guest driver installed. A missing or mismatched guest driver causes the VM to see fewer devices than allocated.                                                                           |

### Diagnose: VM Is Allocated Multiple vGPUs but Only One Device Appears in the Guest

**Step 1: Confirm the Compute Service allocated the requested count.**

```bash
pcdctl server show <VM_UUID> | grep -i vgpu
```

The output should show a VGPU allocation count that matches the flavor's request. If the count is lower than expected, the host did not have sufficient capacity. Check the host's capacity as described in Step 4 below, and its reported inventory as described in the inventory section above.

**Step 2: Confirm the mdev devices are attached to the VM.**

On the host running the VM, list the mdev devices and identify which ones are associated with the VM:

```bash
mdevctl list
virsh dumpxml <VM_UUID> | grep -i mdev
```

The `virsh` output should show one `<hostdev>` entry for each allocated vGPU. If fewer entries appear than expected, the libvirt domain XML did not include all allocated vGPU devices. This can happen if the mdev was not available when the VM started — verify that the mdev devices listed in `mdevctl list` match the UUIDs referenced in the domain XML.
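
The UUID comparison can be scripted. The sketch below assumes libvirt's usual representation of an mdev host device, an `<address uuid='...'/>` element inside the `<hostdev>` source, and it works on saved copies of both outputs. The function names `xml_mdev_uuids` and `missing_mdevs` are hypothetical.

```bash
# Extract mdev UUIDs from libvirt domain XML read on stdin.
xml_mdev_uuids() {
  grep -oE "<address uuid=['\"][0-9a-fA-F-]{36}['\"]" \
    | grep -oE '[0-9a-fA-F-]{36}' \
    | sort -u
}

# Print every UUID referenced in the domain XML ($1) that does not
# appear at the start of a line in the saved `mdevctl list` output ($2).
missing_mdevs() {
  local xml="$1" list="$2" uuid
  xml_mdev_uuids < "$xml" | while read -r uuid; do
    grep -q "^$uuid " "$list" || echo "$uuid"
  done
}

# Usage on the host:
#   virsh dumpxml <VM_UUID> > /tmp/domain.xml
#   mdevctl list > /tmp/mdevs.txt
#   missing_mdevs /tmp/domain.xml /tmp/mdevs.txt
```

Empty output means every mdev referenced by the domain exists on the host. Counting the lines from `xml_mdev_uuids` also gives the number of vGPU devices libvirt attached to the VM.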

**Step 3: Check the NVIDIA guest driver inside the VM.**

From inside the VM (or via the console), run:

```bash
nvidia-smi -L
```

The output should list one entry per allocated vGPU. If only one GPU appears but two were allocated, the guest driver may not be installed or may be outdated. Install or update the NVIDIA vGPU guest driver package that matches your host's NVIDIA vGPU manager version.
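
For repeated checks across many VMs, the comparison can be wrapped in a small helper. This sketch assumes that `nvidia-smi -L` prints one line per device beginning with `GPU <index>:`; the function name `check_guest_gpu_count` is hypothetical.

```bash
# Compare the number of GPUs listed on stdin with the expected count.
# $1: number of vGPUs the flavor requests.
check_guest_gpu_count() {
  local expected="$1" found
  # Count lines such as "GPU 0: <name> (UUID: ...)".
  found="$(grep -c '^GPU [0-9]' || true)"
  if [ "$found" -eq "$expected" ]; then
    echo "ok: $found"
  else
    echo "mismatch: expected $expected, found $found"
    return 1
  fi
}

# Usage inside the guest:
#   nvidia-smi -L | check_guest_gpu_count 2
```

The non-zero return status on a mismatch lets the check run from a provisioning or health-check script.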

**Step 4: Verify the host has sufficient profile capacity.**

On the host:

```bash
mdevctl list | grep -c <profile_name>
```

Compare this count against the total number of mdev slots that the GPU supports at the requested profile. The difference is the remaining capacity. If available capacity is less than the flavor's vGPU request count, the VM will not be placed on that host.
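
The comparison described above is simple arithmetic, sketched here with a hypothetical helper named `can_place`. The total slot count is an input you supply from the GPU's specification for the chosen profile; nothing in this sketch discovers it automatically.

```bash
# Decide whether a request fits in the remaining capacity.
# $1: total mdev slots the GPU supports at the profile
# $2: count reported by the mdevctl command above
# $3: number of vGPUs the flavor requests
can_place() {
  local total="$1" used="$2" requested="$3"
  local remaining=$(( total - used ))
  if [ "$remaining" -ge "$requested" ]; then
    echo "fits: $remaining slot(s) remaining"
  else
    echo "insufficient: $remaining slot(s) remaining, $requested requested"
    return 1
  fi
}

# Usage:
#   can_place 4 3 2
```

With a total of 4, a count of 3, and a request for 2 vGPUs, the helper reports that only 1 slot remains, which matches the case where the VM is not placed on that host.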

### Profile Mixing Rules

Only one vGPU profile may be active on a single physical GPU at any time. All virtual functions on that GPU must use the same profile. If a VM requests a profile that differs from the profile currently active on the host, VM creation fails with a resource provider inventory error.

If you need to change the active profile on a host, you must first:

1. Remove all VMs using the current profile on that host.
2. Remove all existing mdev devices (see the resolution steps in [Troubleshooting GPU Support](/private-cloud-director/gpu/gpu-support-pcd/troubleshooting-gpu-support.md#vgpu-profile-selection-restricted-by-existing-vgpu-allocations)).
3. Re-run the vGPU SR-IOV configuration with the new profile.
4. Authorize the host in the cluster so the new inventory is reported.

## Related Pages

* [Set up vGPU](/private-cloud-director/gpu/gpu-support-pcd/setup-vgpu.md) — full vGPU infrastructure setup guide
* [Create GPU Enabled Flavors](/private-cloud-director/gpu/gpu-support-pcd/create-gpu-enabled-flavors.md) — configure multi-vGPU flavors
* [Troubleshooting GPU Support](/private-cloud-director/gpu/gpu-support-pcd/troubleshooting-gpu-support.md) — additional GPU troubleshooting scenarios
* [Diagnose VM Scheduling Failures](/private-cloud-director/virtualized-clusters/troubleshooting-and-log-files/diagnose-vm-scheduling-failures.md) — general scheduling diagnostics that complement vGPU inventory issues

