# Failed to Deploy Virtual Machine

This guide provides step-by-step instructions for troubleshooting and resolving issues when creating a virtual machine (VM) fails in <code class="expression">space.vars.product\_name</code>.

## VM Deployment Methods

* Launch an instance from an Image to quickly deploy a pre-configured environment.
* Launch an instance from a New Volume to create a fresh setup with dedicated storage.
* Launch an instance from an Existing Volume to utilize previously used storage for seamless continuation.
* Launch an instance from a VM Snapshot to restore a previously captured VM state exactly as it was.
* Launch an instance from a Volume Snapshot to ensure data integrity by reverting to a specific point in time.
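
For reference, launching from an image and from an existing bootable volume via the OpenStack CLI looks roughly like the following; all names in angle brackets are placeholders:

{% tabs %}
{% tab title="Command" %}

```bash
# Boot from an image
$ openstack server create --image <IMAGE_NAME_OR_ID> --flavor <FLAVOR> \
    --network <NETWORK> --security-group <SECURITY_GROUP> \
    --key-name <KEY_PAIR> <VM_NAME>

# Boot from an existing bootable volume
$ openstack server create --volume <VOLUME_NAME_OR_ID> --flavor <FLAVOR> \
    --network <NETWORK> <VM_NAME>
```

{% endtab %}
{% endtabs %}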

## Most Common Causes

* Insufficient resources (CPU, RAM, storage).
* Incorrect network configurations or security groups.
* Unavailable or corrupted images.
* Issues with the scheduler or compute nodes.
* Permission or quota restrictions.
* Virtualised cluster mismatches.

## Deep Dive

The <code class="expression">space.vars.product\_name</code> VM creation process is similar for all VM deployment methods mentioned earlier, and the workflow is orchestrated primarily by the **Compute service** (Nova). This flow involves a series of steps with critical validations at each stage to ensure the request is valid, resources are available, and the VM is provisioned correctly.

{% hint style="info" %}
The logs below can only be reviewed in Self-Hosted Private Cloud Director. For the SaaS model, contact the Platform9 Support Team.
{% endhint %}

{% stepper %}
{% step %}

### User Request & API Validation

This is the initial stage where the user's request is received and authenticated.

* **User Request:** A user submits a request to create a VM (also called an instance) via the OpenStack CLI, <code class="expression">space.vars.product\_name</code> dashboard, or direct API call. Key parameters are specified, including the **image**, **flavor**, **network**, **security group**, and **key pair**.
* **Keystone Authentication:** The request is sent to the `nova-api-osapi` Pod, which immediately validates the user's authentication token with the **Identity service** (keystone). This ensures the user is who they claim to be. The output below shows the initial VM creation request was successfully received by the Nova API and was accepted with a 202 status code.

{% tabs %}
{% tab title="Sample Logs" %}

```bash
$ kubectl logs deployment/nova-api-osapi -n <WORKLOAD_REGION> | grep "POST /v2.1"
INFO nova.osapi_compute.wsgi.server [None [REQ_ID] [USER_ID] [TENANT_ID] - - default default] [IP] "POST /v2.1/[tenant_id]/servers HTTP/1.1" status: 202 len: [.] time: [.]
```

{% endtab %}
{% endtabs %}

{% hint style="info" %}
A unique `REQ_ID` is generated here; it is used to track the request in the logs of the other components.
{% endhint %}

* **Authorization & Quota Checks:** The Nova API performs two key validations:
  * **Authorization:** It verifies that the user has the necessary permissions to create a VM within the specified project.
  * **Quota Check:** It confirms the project has enough available resources (vCPUs, RAM, instances, etc.) to fulfil the request based on the chosen flavor (a sample quota check is shown after this list).
* **Initial Database Entry:** The `nova-conductor` service is the only service that writes to the `nova` database; the other Compute services access it through `nova-conductor`. If all checks pass, `nova-conductor` creates a database record for the new VM and sets its status to **BUILDING(None)**.
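
A quota check from the CLI looks roughly like the following; the project name is a placeholder, and `openstack limits show --absolute` reports current usage against the project's limits:

{% tabs %}
{% tab title="Command" %}

```bash
$ openstack quota show <PROJECT_NAME_OR_ID>
$ openstack limits show --absolute
```

{% endtab %}
{% endtabs %}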
  {% endstep %}

{% step %}

### Scheduling & Resource Selection

After the initial validation, the request is sent to the Nova Scheduler, which decides where to place the VM.

* **Message Queue:** The Nova API sends a message to the **Nova Scheduler** via a message queue (RabbitMQ), containing all the VM's requirements.
* The **Nova scheduler** queries the **Placement API** for candidate **resource providers (compute nodes)** with sufficient resources, then narrows them down through host filtering and host weighting.
* **Host Filtering:** The `nova-scheduler` begins by filtering out unsuitable hosts. This process checks for:
  * **Resource availability:** It ensures the host has sufficient free RAM, disk space, and vCPUs.
  * **Compatibility:** It verifies the host is compatible with the image properties and any specific requirements.
  * **Availability Zones:** It confirms the host is in the requested availability zone.
  * **Image Metadata:** It checks the image metadata when a metadata-specific filter applies, e.g. images flagged for SR-IOV, vTPM, etc.
  * Several other filters, as noted in the hint below.

{% hint style="info" %}
Details on the other Nova scheduler filters are available at [Scheduler filters](https://docs.openstack.org/nova/rocky/user/filter-scheduler.html).
{% endhint %}

* **Host Weighting:** The remaining hosts are then ranked based on a weighting system. This can be configured to prioritise hosts with the least load, or those least recently used, to ensure balanced resource distribution.

{% hint style="warning" %}
At this stage, if the scheduler doesn’t find any suitable host to deploy the instance, it returns a **“No Valid Host Found”** error.
{% endhint %}
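
Host capacity can be sanity-checked from the CLI with the Placement commands below; they require the `osc-placement` client plugin, and the resource amounts shown are purely illustrative:

{% tabs %}
{% tab title="Command" %}

```bash
$ openstack resource provider list
$ openstack resource provider usage show <PROVIDER_UUID>
$ openstack allocation candidate list --resource VCPU=2 --resource MEMORY_MB=4096 --resource DISK_GB=20
```

{% endtab %}
{% endtabs %}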

* **Placement Reservation:** Once a host is selected, the scheduler **makes a provisional allocation** by creating a **"claim"** via `PUT /allocations/[VM_UUID]`. The Placement API logs for this `PUT` request, containing the VM allocation ID, look like the following (the resulting allocation can also be queried directly, as shown after the log snippet):

{% tabs %}
{% tab title="Sample Logs" %}

```bash
$ kubectl logs deployment/placement-api -n <WORKLOAD_REGION> | grep <req_id>
INFO placement.requestlog [[REQ_ID] [REQ_ID] [USER_ID] [TENANT_ID] - - default default] [IP] "PUT /allocations/[VM_UUID]" status: 204 len: 0 microversion: 1.36
```

{% endtab %}
{% endtabs %}
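
The allocation claimed for the VM can also be queried directly (this too requires the `osc-placement` client plugin):

{% tabs %}
{% tab title="Command" %}

```bash
$ openstack resource provider allocation show <VM_UUID>
```

{% endtab %}
{% endtabs %}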

The `nova-scheduler` pod logs can be reviewed against the request ID captured from the `nova-api-osapi` pod. The snippet below shows the scheduler evaluating hosts for the VM deployment request.

{% tabs %}
{% tab title="Sample Logs" %}

```bash
$ kubectl logs deployment/nova-scheduler -n <WORKLOAD_REGION> | grep <req_id>
WARNING nova.scheduler.filters.aggregate_image_properties_isolation [None [REQ_ID] [USER_ID] [TENANT_ID] - - default default] Host '[HOST_UUID]' has a metadata key 'availability_zone' that is not present in the image metadata.
```

{% endtab %}
{% endtabs %}

* The **Nova-scheduler** sends the database update request, including the selected host information, to **Nova-conductor**, which updates the database and sets the VM status to **BUILDING (Scheduling)**. The request is then passed to the **Nova-compute** service.
  {% endstep %}

{% step %}

### Compute & Final Service-Level Validation

The **Nova-compute** service on the selected host performs the final provisioning steps.

* **Resource Allocation:** The **Nova Compute** service receives the scheduling decision and begins allocating resources. It interacts with:
  * **Glance:** It requests the VM image. **Validation occurs here**, as the `glance-api` pod can perform a signature check to ensure the image's integrity. If the image is not available, the request errors out. Below is an example of a `GET` image request.

{% tabs %}
{% tab title="Sample Logs" %}

```bash
INFO eventlet.wsgi.server [None [REQ_ID] [USER_ID] [TENANT_ID] - - default default] 127.0.0.1 - - [...] "GET /v2/images/[IMAGE_UUID] HTTP/1.0" 200 1132 0.032435
```

{% endtab %}
{% endtabs %}

* **Neutron:** It requests network resources, and the `neutron-server` pod **validates** that the specified network and security groups exist and are accessible to the user. It then allocates a virtual network interface and an IP address. The example below shows the IP and network interface port information (commands to verify the network, security group, and port are sketched after the log snippet). **Nova-conductor** further updates the database and sets the VM status to **BUILDING(Networking)**.

{% tabs %}
{% tab title="Sample Logs" %}

```bash
$ kubectl logs deployment/neutron-server -n <WORKLOAD_REGION>
INFO neutron.wsgi [[REQ_ID] [REQ_ID] [USER_ID] [TENANT_ID] - - default default] 127.0.0.1 "GET /v2.0/floatingips?fixed_ip_address=[VM_IP_Address]&port_id=[VM_Interface_Port_ID] HTTP/1.1" status: 200  len: [.] time: [.]
```

{% endtab %}
{% endtabs %}
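
The network-side objects referenced in the request can be verified with the following commands; the identifiers are placeholders:

{% tabs %}
{% tab title="Command" %}

```bash
$ openstack network show <NETWORK_NAME_OR_ID>
$ openstack security group show <SECURITY_GROUP_NAME_OR_ID>
$ openstack port list --server <VM_UUID>
```

{% endtab %}
{% endtabs %}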

* **Cinder (if applicable):** If a persistent boot volume is requested, **Cinder validates** that the volume is available and attaches it to the VM. **Nova-conductor** further updates the database and sets VM status to **BUILDING(Block\_Device\_Mapping)**.
* **Hypervisor Instruction:** Once all resources are confirmed, `nova-compute` instructs the `pf9-ostackhost` service on the hypervisor (libvirt/KVM) to create the VM using the image, flavor, and other parameters. The VM then boots. The `pf9-ostackhost` logs look like the following, outlining details such as a successful resource claim, device path, network information, and the time taken to spawn the instance (a quick libvirt-level check is sketched after the log snippet).

{% tabs %}
{% tab title="Sample Logs" %}

```bash
INFO nova.compute.claims [[REQ_ID] [USERNAME] service] [instance: [VM_UUID]] Claim successful on node [SELECTED_NODE_NAME]
..
INFO os_vif [[REQ_ID] [USERNAME] service] Successfully plugged vif VIFOpenVSwitch(active=False,address=[MAC_ADDRESS],bridge_name='br-int',has_traffic_filtering=True,id=[INTERFACE_ID],network=Network([NETWORK_ID]),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='[VM_TAP_INTERFACE]')
..
INFO nova.compute.manager [[REQ_ID] [USERNAME] service] [instance: [VM_UUID]] Took 3.74 seconds to spawn the instance on the hypervisor.
```

{% endtab %}
{% endtabs %}
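
Once `pf9-ostackhost` hands the request to libvirt, the resulting domain can be inspected directly on the hypervisor; the instance name (by default of the form `instance-0000xxxx`) is a placeholder:

{% tabs %}
{% tab title="Command" %}

```bash
$ sudo virsh list --all
$ sudo virsh dominfo <INSTANCE_NAME>
```

{% endtab %}
{% endtabs %}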
{% endstep %}

{% step %}

### VM Configuration & Finalization

The final step involves configuring the guest OS and updating the status.

* **Cloud-init:** As the VM boots, **cloud-init** retrieves instance metadata from Nova via the metadata service at `169.254.169.254`. The cloud-init logs are available within the VM (sample checks are shown after this list). It performs validations on this metadata before:
  * Injecting the SSH key.
  * Configuring networking and the hostname.
  * Executing any custom user data scripts.
* **Status Update:** The `nova-compute` service updates the VM's status in the database to **ACTIVE**, indicating a successful creation. The VM is now ready for the user to access.
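
Inside the guest, cloud-init progress and the metadata it received can be checked roughly as follows; `169.254.169.254` is the standard metadata endpoint mentioned above:

{% tabs %}
{% tab title="Command" %}

```bash
$ cloud-init status --long
$ sudo less /var/log/cloud-init.log
$ curl http://169.254.169.254/openstack/latest/meta_data.json
```

{% endtab %}
{% endtabs %}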
  {% endstep %}
  {% endstepper %}

## Procedure

{% hint style="info" %}
The OpenStack CLI refers to virtual machines as 'servers'.
{% endhint %}

{% stepper %}
{% step %}

### Get the VM status

Use the PCD UI or CLI to check the error message. Look for `status` and `fault` fields to understand the issue.

{% tabs %}
{% tab title="Command" %}

```bash
$ openstack server show <VM_UUID>
```

{% endtab %}
{% endtabs %}
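
To narrow the output to the fields of interest, the column selector can be used, for example:

{% tabs %}
{% tab title="Command" %}

```bash
$ openstack server show <VM_UUID> -c status -c fault
```

{% endtab %}
{% endtabs %}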
{% endstep %}

{% step %}

### Validate Compute Service Status

List the Compute services and ensure each service's state is `up` and its status is `enabled`.

{% tabs %}
{% tab title="Command" %}

```bash
$ openstack compute service list
```

{% endtab %}
{% endtabs %}
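
If a compute service is reported as disabled (for example, after maintenance), it can be re-enabled as shown below; the host name is a placeholder:

{% tabs %}
{% tab title="Command" %}

```bash
$ openstack compute service set --enable <HOST_NAME> nova-compute
```

{% endtab %}
{% endtabs %}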
{% endstep %}

{% step %}

### Trace the VM Events

Retrieve the Request ID (`REQ_ID`) from the server event list; it uniquely identifies the request. The `REQ_ID` is displayed in the first column of the server event list output and helps track request failures.

{% hint style="warning" %}
This `REQ_ID` is crucial for troubleshooting the VM creation issues.
{% endhint %}

{% tabs %}
{% tab title="Command" %}

```bash
$ openstack server event list <VM_UUID>
$ openstack server event show <VM_UUID> <REQ_ID>
```

{% endtab %}
{% endtabs %}
{% endstep %}

{% step %}

### Review the Pods and their logs on the Management plane

{% hint style="info" %}
This step is applicable only to Self-Hosted Private Cloud Director.
{% endhint %}

The management plane runs pods such as `nova-api-osapi`, `nova-scheduler`, and `nova-conductor`. Review all of these pods:

* Check whether any pod is in a `CrashLoopBackOff`, `OOMKilled`, `Pending`, `Error`, or `Init` state.
* Verify that all containers in the pods are Running.
* Review the Events section in the `kubectl describe` output.
* Review the pod logs using `REQ_ID` or `VM_UUID` for relevant details (an example follows the command block below).

{% tabs %}
{% tab title="Command" %}

```bash
$ kubectl get pods -o wide -n <WORKLOAD_REGION> | grep -i "nova"

$ kubectl describe pod -n <WORKLOAD_REGION> <NOVA_API_OSAPI_POD>
$ kubectl describe pod -n <WORKLOAD_REGION> <NOVA_SCHEDULER_POD>
$ kubectl describe pod -n <WORKLOAD_REGION> <NOVA_CONDUCTOR_POD>

$ kubectl logs -n <WORKLOAD_REGION> <NOVA_API_OSAPI_POD>
$ kubectl logs -n <WORKLOAD_REGION> <NOVA_SCHEDULER_POD>
$ kubectl logs -n <WORKLOAD_REGION> <NOVA_CONDUCTOR_POD>
```

{% endtab %}
{% endtabs %}
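
To narrow the pod logs down to the failing request, filter them by the `REQ_ID` or `VM_UUID` captured earlier, for example:

{% tabs %}
{% tab title="Command" %}

```bash
$ kubectl logs -n <WORKLOAD_REGION> <NOVA_CONDUCTOR_POD> | grep <REQ_ID>
$ kubectl logs -n <WORKLOAD_REGION> <NOVA_SCHEDULER_POD> | grep <VM_UUID>
```

{% endtab %}
{% endtabs %}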
{% endstep %}

{% step %}

### Validate the Image and Flavor

Check if the image is available and not corrupted. Ensure the resources requested in the flavor are available on the underlying hosts.

{% tabs %}
{% tab title="Command" %}

```bash
$ openstack image show <IMAGE_ID>
$ openstack flavor show <FLAVOR_ID>
$ openstack hypervisor stats show
$ openstack hypervisor show <HYPERVISOR_NAME>
```

{% endtab %}
{% endtabs %}
{% endstep %}

{% step %}

### Validate the service status on the affected VM's underlying hypervisor

Validate that the services listed below are running on the underlying hypervisor:

{% tabs %}
{% tab title="Command" %}

```bash
$ sudo systemctl status pf9-hostagent
$ sudo systemctl status pf9-ostackhost 
$ sudo systemctl status pf9-cindervolume-base
$ sudo systemctl status pf9-neutron-ovn-metadata-agent
```

{% endtab %}
{% endtabs %}
{% endstep %}

{% step %}

### Check the logs on the affected VM's hypervisor

* Compute Node: The `pf9-ostackhost` service is responsible for provisioning the compute resources required by the VM. Review its latest logs and search for `REQ_ID` or `VM_UUID`.

{% tabs %}
{% tab title="Command" %}

```bash
$ less /var/log/pf9/ostackhost.log
```

{% endtab %}
{% endtabs %}

* Cinder Storage Node: The `cindervolume-base` service is responsible for provisioning the storage resources required by the VM. Review its latest logs and search for `REQ_ID` or `VM_UUID`.

{% tabs %}
{% tab title="Command" %}

```bash
$ less /var/log/pf9/cindervolume-base.log
```

{% endtab %}
{% endtabs %}

* Network Node: The `pf9-neutron-ovn-metadata-agent` service is responsible for provisioning the connectivity and networking resources required by the VM. Review its latest logs and search for `REQ_ID` or `VM_UUID`.

{% tabs %}
{% tab title="Command" %}

```bash
$ less /var/log/pf9/pf9-neutron-ovn-metadata-agent.log
```

{% endtab %}
{% endtabs %}
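
To narrow these logs down to the failing request, grep for the request ID or VM UUID captured earlier, for example:

{% tabs %}
{% tab title="Command" %}

```bash
$ grep -iE "<REQ_ID>|<VM_UUID>" /var/log/pf9/ostackhost.log
$ grep -iE "<REQ_ID>|<VM_UUID>" /var/log/pf9/cindervolume-base.log
$ grep -iE "<REQ_ID>|<VM_UUID>" /var/log/pf9/pf9-neutron-ovn-metadata-agent.log
```

{% endtab %}
{% endtabs %}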
{% endstep %}
{% endstepper %}

If these steps are insufficient to resolve the issue, contact the [Platform9 Support Team](https://support.platform9.com/) for additional assistance.
