Create Virtualized Cluster with GPU support

This section guides you through the complete process of setting up, configuring, and managing GPU-enabled Kubernetes clusters. You'll learn how to create clusters with GPU support, select the appropriate partitioning strategy for your workloads, monitor resource utilization, and modify configurations as your requirements change.

Step 1: Set up VM infrastructure for GPU hypervisors

To create a GPU-enabled Kubernetes cluster, first set up your GPU passthrough VM infrastructure: configure GPU hypervisor hosts and infrastructure clusters with GPU capabilities.

Learn more about Set up GPU Passthrough.

Step 2: Set up GPU Flavor and Image

Now that your GPU hypervisor host and GPU infrastructure cluster are ready, you will need to set up a GPU flavor and GPU image with specific image properties:

  1. Create a GPU Passthrough flavor. For details, see Create GPU Enabled Flavors.

NOTE

Only VM flavors with GPU Passthrough mode enabled are supported for creating a virtualized Kubernetes cluster.

  2. Upload a Cluster API compliant operating system image, specific to the Kubernetes version you want to deploy, to the Image Library. For details, see Operating System Image Management.

  3. To use an image for GPU virtualized cluster creation, set the gpu=true image property along with k8s_version.
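The image requirement above can be checked before you start cluster creation. The following is a minimal sketch, assuming the image properties are available as a plain dictionary (for example, copied from the Image Library). The function name `missing_gpu_image_properties` and the sample values are hypothetical; only the `gpu` and `k8s_version` keys come from this guide.

```python
def missing_gpu_image_properties(properties):
    """Return a list of problems that would keep an image from being used
    for a GPU virtualized cluster, based on the two properties named above."""
    problems = []
    # The guide requires gpu=true. Comparing case-insensitively is an assumption.
    gpu_value = properties.get("gpu") or ""
    if str(gpu_value).strip().lower() != "true":
        problems.append("gpu must be set to true")
    # k8s_version must be present and non-empty; its format is not validated here.
    version_value = properties.get("k8s_version") or ""
    if not str(version_value).strip():
        problems.append("k8s_version must be set")
    return problems


if __name__ == "__main__":
    # Hypothetical property sets, for illustration only.
    ready = {"gpu": "true", "k8s_version": "v1.30.0"}
    not_ready = {"k8s_version": "v1.30.0"}
    print(missing_gpu_image_properties(ready))      # []
    print(missing_gpu_image_properties(not_ready))  # ['gpu must be set to true']
```

An empty list means both properties are present; anything else names what to fix on the image before selecting it in Step 3.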

Step 3: Create a GPU enabled Kubernetes cluster

To create a GPU enabled Kubernetes cluster:

  1. Navigate to Kubernetes > Infrastructure > Clusters.

  2. Select Deploy New Cluster with Virtualized Nodes.

  3. On the Cluster Configuration page, enter a unique name for your cluster.

  4. Select a GPU-enabled infrastructure cluster with GPU mode (passthrough) from the virtualized clusters.

  5. Select an SSH key for your GPU virtualized Kubernetes nodes.

  6. Select Next to proceed to Node Pool configurations.

  7. Configure Node Pool settings:

    • VM Flavor: Select a GPU-enabled flavor and provide the number of VMs required for your virtualized Kubernetes cluster. To list only GPU-enabled flavors, enable Show GPU flavors only.

    • Network: Configure network settings as needed.

    • Subnet: Configure subnet settings as needed.

  8. Select Next to proceed to configure your Kubernetes Cluster.

  9. Select the required Kubernetes version and GPU-enabled image for your cluster.

  10. Enable the NVIDIA GPU Operator add-on to configure your GPU virtualized Kubernetes nodes.

  11. Select Submit to deploy your GPU-enabled Kubernetes virtualized node cluster.
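Once the cluster is deployed, one common way to confirm that the GPU Operator is exposing GPUs to workloads is to schedule a pod that requests the `nvidia.com/gpu` resource. The sketch below only builds such a manifest and prints it as JSON, which `kubectl apply -f` accepts. It is not part of the documented procedure: the function name, pod name, and container image are assumptions, so substitute a CUDA base image you trust.

```python
import json


def gpu_smoke_test_pod(name, image, gpu_count=1):
    """Build a Pod manifest that requests `gpu_count` GPUs and runs nvidia-smi once."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "cuda",
                    "image": image,
                    "command": ["nvidia-smi"],
                    # Extended resources such as nvidia.com/gpu are requested under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Pod name and image are placeholders, not values from this guide.
    manifest = gpu_smoke_test_pod("gpu-smoke-test", "nvidia/cuda:12.4.1-base-ubuntu22.04")
    print(json.dumps(manifest, indent=2))
```

If the pod stays in Pending with an insufficient `nvidia.com/gpu` event, the GPU Operator pods may still be starting on the node.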

Step 4: Configure GPU partitioning

Update GPU partitioning strategies as your business needs change.

  1. Navigate to Kubernetes > Infrastructure > Clusters.

  2. Navigate to Capacity and Health.

  3. Select the GPU Nodes Group to configure the GPU partition.

  4. Select Edit GPU Configuration to update GPU partitioning from the default Passthrough mode.

  5. Select a new partitioning strategy:

    • Switch from Passthrough to MIG or Time Slicing

    • Change MIG profiles

    • Adjust Time Slicing replica counts

  6. Select Save Configuration to apply the changes.

The GPU operator pods on your GPU Kubernetes cluster restart and take a couple of minutes to update the GPU partitioning status. Once complete, the new configuration displays with updated GPU instances and memory allocation.
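To make the effect of a Time Slicing replica count concrete: with the NVIDIA device plugin, each physical GPU is advertised as `replicas` schedulable `nvidia.com/gpu` resources. The sketch below does that arithmetic and builds a configuration in the shape the NVIDIA GPU Operator documents for time slicing. The platform applies the configuration for you when you save, so this is shown only to illustrate what the setting means; the function names are hypothetical.

```python
import json


def advertised_gpus(physical_gpus, replicas):
    """Number of nvidia.com/gpu resources a node advertises under time slicing."""
    if physical_gpus < 0 or replicas < 1:
        raise ValueError("physical_gpus must be >= 0 and replicas must be >= 1")
    return physical_gpus * replicas


def time_slicing_config(replicas):
    """Device plugin configuration in the shape documented by the NVIDIA GPU Operator."""
    return {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [{"name": "nvidia.com/gpu", "replicas": replicas}]
            }
        },
    }


if __name__ == "__main__":
    # Example: a node with 2 physical GPUs and 4 replicas per GPU.
    print(advertised_gpus(2, 4))  # 8
    print(json.dumps(time_slicing_config(4), indent=2))
```

Time-sliced replicas share the memory of the underlying GPU without isolation, whereas MIG instances each receive a dedicated memory slice. That difference is usually the deciding factor between the two strategies.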

Info

GPUs must not be running any workloads before reconfiguration. GPU Passthrough is the default GPU mode.

MIG is supported only on GPUs from the NVIDIA Ampere generation onward. Learn more about MIG supported GPUs.
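Because reconfiguration requires idle GPUs, it can help to list the pods that still request GPU resources first. The following is a minimal sketch, assuming the input is the parsed output of `kubectl get pods --all-namespaces -o json`. The function name is hypothetical, and any limit whose resource name begins with `nvidia.com/` (for example `nvidia.com/gpu`, or a MIG resource) is treated as a GPU request. Init containers are not inspected.

```python
def pods_using_gpus(pod_list):
    """Return (namespace, name) for each unfinished pod that has a container
    with an nvidia.com/* resource limit."""
    busy = []
    for pod in pod_list.get("items", []):
        phase = (pod.get("status") or {}).get("phase")
        if phase in ("Succeeded", "Failed"):
            continue  # finished pods no longer hold GPUs
        for container in (pod.get("spec") or {}).get("containers", []):
            limits = (container.get("resources") or {}).get("limits") or {}
            if any(key.startswith("nvidia.com/") for key in limits):
                meta = pod.get("metadata") or {}
                busy.append((meta.get("namespace"), meta.get("name")))
                break  # one GPU container is enough to mark the pod
    return busy


if __name__ == "__main__":
    # Hypothetical sample in the shape returned by kubectl.
    sample = {
        "items": [
            {
                "metadata": {"namespace": "ml", "name": "trainer"},
                "status": {"phase": "Running"},
                "spec": {"containers": [
                    {"resources": {"limits": {"nvidia.com/gpu": "1"}}}
                ]},
            },
            {
                "metadata": {"namespace": "web", "name": "frontend"},
                "status": {"phase": "Running"},
                "spec": {"containers": [
                    {"resources": {"limits": {"cpu": "1"}}}
                ]},
            },
        ]
    }
    print(pods_using_gpus(sample))  # [('ml', 'trainer')]
```

An empty result suggests no pod is holding a GPU, which is the state the note above asks for before you edit the GPU configuration.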

View GPU metrics

To view GPU metrics for GPU enabled virtualized worker nodes:

  1. Navigate to Kubernetes > Infrastructure > Clusters.

  2. Navigate to Capacity and Health.

  3. Select Manage Columns for Worker Nodes to enable the GPU columns you need, such as GPU Model, GPU Strategy, GPU details, GPU count, and GPU memory.

  4. Select Show GPU Nodes only to view only GPU enabled nodes in your cluster.

Now you can view the required GPU information for each of your GPU-enabled virtualized nodes.
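The same per-node information can be approximated outside the UI. The following is a minimal sketch, assuming the input is the parsed output of `kubectl get nodes -o json` and that the GPU Operator has labeled the nodes with `nvidia.com/gpu.product` (a label set by NVIDIA GPU Feature Discovery). The function name and sample data are hypothetical, and only the `nvidia.com/gpu` allocatable resource is counted, so nodes that expose only MIG-specific resources are not listed.

```python
def gpu_node_summary(node_list):
    """Return one row per node that advertises at least one nvidia.com/gpu resource."""
    rows = []
    for node in node_list.get("items", []):
        allocatable = (node.get("status") or {}).get("allocatable") or {}
        # Kubernetes reports allocatable quantities as strings, e.g. "2".
        count = int(allocatable.get("nvidia.com/gpu", 0))
        if count == 0:
            continue  # mirrors the "Show GPU Nodes only" filter
        metadata = node.get("metadata") or {}
        labels = metadata.get("labels") or {}
        rows.append({
            "node": metadata.get("name"),
            "gpu_model": labels.get("nvidia.com/gpu.product", "unknown"),
            "gpu_count": count,
        })
    return rows


if __name__ == "__main__":
    # Hypothetical sample in the shape returned by kubectl.
    sample = {
        "items": [
            {
                "metadata": {
                    "name": "gpu-node-1",
                    "labels": {"nvidia.com/gpu.product": "NVIDIA-A100-SXM4-40GB"},
                },
                "status": {"allocatable": {"cpu": "16", "nvidia.com/gpu": "2"}},
            },
            {
                "metadata": {"name": "cpu-node-1", "labels": {}},
                "status": {"allocatable": {"cpu": "8"}},
            },
        ]
    }
    for row in gpu_node_summary(sample):
        print(row)
```

Under Time Slicing, `gpu_count` reflects the advertised replicas rather than the number of physical GPUs.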

Known issues

  • GPU node onboarding doesn't show explicit success messages. Wait 2-3 minutes after onboarding to see the updated status.

  • Cluster creation might occasionally fail, requiring a containerd restart on Kubernetes nodes. These issues are tracked and will be resolved in subsequent releases (BYOH).
