# Highly Available Management Plane Recovery Guide

## Introduction

This guide walks through recovery scenarios for the SMCP Management Station in various deployment configurations.

## Single Node Cluster

There is no high availability in a single-node cluster. If a single-node management plane goes down, the workload clusters lose connectivity to it.

{% hint style="warning" %}
**Warning**

It is recommended to always run at least a 3-node cluster in production environments.
{% endhint %}

## Node Failure in a 3-Node Cluster

In the event of a node failure within a 3-node cluster, no manual intervention is required.

Kubernetes automatically detects the node as NotReady. However, it's important to note that the failover process may take more than 5 minutes before Kubernetes begins to reschedule the pods onto other active nodes.

During this time window while pods are still being rescheduled and until services are running again, you may experience some service disruption on the management plane if critical services such as Keystone and/or K8sniff were running on the failed node. However, it's important to understand that the workload clusters themselves will remain unaffected.

Once Kubernetes has successfully rescheduled all affected services onto the remaining active nodes, the management plane should become operational again. **Please note that the duration of this process can vary and may exceed 5 minutes, depending on various factors.**

{% hint style="info" %}
**Info**

Please make sure that the current active nodes have enough free CPU/Memory/Disk to run the pods from the failed nodes
{% endhint %}
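To monitor the failover, you can check node and pod status from any surviving node. These are standard kubectl commands; `<failed-node>` is a placeholder for the actual node name:

{% tabs %}
{% tab title="Bash" %}

```bash
# Confirm which node Kubernetes has marked NotReady
kubectl get nodes

# List pods still scheduled on the failed node (replace <failed-node> with its name)
kubectl get pods -A -o wide --field-selector spec.nodeName=<failed-node>
```

{% endtab %}
{% endtabs %}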

## Two Nodes Down in a 3-Node Cluster

The management plane will not be accessible, as the database services require a minimum of 2 operational nodes to maintain quorum. Even after one or both of the failed nodes are recovered and brought back up, manual intervention may be required to restore Percona.

### Percona Recovery

In most cases, Percona automatically handles the recovery process when a cluster quorum failure occurs, meaning when two nodes in the cluster fail. This is achieved by setting the `autoRecovery` feature to `true`. However, there may be instances where manual intervention is required to ensure a successful recovery.

The Percona pods in the cluster are named `percona-db-pxc-db-pxc-<0,1,2>`, and each pod contains a container called `pxc`.

If the second node is brought back online but the Percona pods remain stuck in the `Terminating` state, you can forcibly delete all the Percona pods using the following command:

{% tabs %}
{% tab title="Bash" %}

```bash
kubectl delete pods -n percona percona-db-pxc-db-pxc-0 percona-db-pxc-db-pxc-1 percona-db-pxc-db-pxc-2 --force
```

{% endtab %}
{% endtabs %}

This action will not cause any data loss as it does not affect the Persistent Volume Claims (PVCs) associated with the pods.
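To confirm that the data volumes are intact after the force delete, you can list the PVCs in the namespace; this is a standard kubectl command:

{% tabs %}
{% tab title="Bash" %}

```bash
# PVCs should still be Bound; force-deleting pods does not touch them
kubectl get pvc -n percona
```

{% endtab %}
{% endtabs %}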

The Percona pods should now start up. In some cases, however, the pods may report a `FULL_PXC_CLUSTER_CRASH` error after booting and fail to recover automatically.

An easy way to check for this is to look at the logs of the `percona-db-pxc-db-pxc` pods (replace `<0,1,2>` with the pod number):

{% tabs %}
{% tab title="Bash" %}

```bash
kubectl logs -n percona percona-db-pxc-db-pxc-<0,1,2> -c pxc -f
```

{% endtab %}
{% endtabs %}

Look for an error like this:

{% tabs %}
{% tab title="None" %}

```none
#####################################################FULL_PXC_CLUSTER_CRASH:percona-db-pxc-db-pxc-1.percona-db-pxc-db-pxc.percona.svc.cluster.local#####################################################
You have the situation of a full PXC cluster crash. In order to restore your PXC cluster, please check the log
from all pods/nodes to find the node with the most recent data (the one with the highest sequence number (seqno).
It is percona-db-pxc-db-pxc-1.percona-db-pxc-db-pxc.percona.svc.cluster.local node with sequence number (seqno): 41366
Cluster will recover automatically from the crash now.
If you have set spec.pxc.autoRecovery to false, run the following command to recover manually from this node:
kubectl -n percona exec percona-db-pxc-db-pxc-1 -c pxc -- sh -c 'kill -s USR1 1'
#####################################################LAST_LINE:percona-db-pxc-db-pxc-1.percona-db-pxc-db-pxc.percona.svc.cluster.local:41366:#####################################################
```

{% endtab %}
{% endtabs %}

This means the Percona cluster has crashed and cannot determine which of its pods holds the latest data. To recover, obtain the sequence number (seqno) from each Percona pod's crash report and identify the pod with the highest value, which holds the latest data. For example, from the output above:

{% tabs %}
{% tab title="None" %}

```none
It is percona-db-pxc-db-pxc-1.percona-db-pxc-db-pxc.percona.svc.cluster.local node with sequence number (seqno): 41366
```

{% endtab %}
{% endtabs %}

The sequence number for this pod is `41366`. Check the logs of all the Percona pods and identify the pod with the highest sequence number.
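The comparison can be sketched as a small loop that parses the `LAST_LINE:<hostname>:<seqno>:` marker from each pod's log; the pod names assume the default `percona-db-pxc-db-pxc-<0,1,2>` naming described above:

{% tabs %}
{% tab title="Bash" %}

```bash
# Print the last reported seqno for each Percona pod by parsing the
# LAST_LINE marker from the crash report in its logs.
for i in 0 1 2; do
  seqno=$(kubectl logs -n percona "percona-db-pxc-db-pxc-$i" -c pxc 2>/dev/null \
    | grep -o 'LAST_LINE:[^:]*:[0-9]*' | tail -1 | awk -F: '{print $3}')
  echo "percona-db-pxc-db-pxc-$i seqno: ${seqno:-unknown}"
done
```

{% endtab %}
{% endtabs %}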

Then, execute the following command on the pod with the highest sequence number (assuming `percona-db-pxc-db-pxc-1` pod in this example):

{% tabs %}
{% tab title="Bash" %}

```bash
kubectl -n percona exec percona-db-pxc-db-pxc-1 -c pxc -- sh -c 'kill -s USR1 1'
```

{% endtab %}
{% endtabs %}

This command restarts the `mysqld` process within that pod, allowing the `mysqld` cluster to re-form around it. If everything is successful, the pod status should look similar to this:

{% tabs %}
{% tab title="None" %}

```none
[root@test-pf9-du-host-airgap-c7-mm-2499804-940-3 ~]# kubectl get pods -n percona
NAME                                                     READY   STATUS        RESTARTS   AGE
percona-db-pxc-db-haproxy-0                              2/2     Terminating   0          5h46m
percona-db-pxc-db-haproxy-1                              2/2     Running       7          5h45m
percona-db-pxc-db-haproxy-2                              2/2     Running       8          5h44m
percona-db-pxc-db-pxc-0                                  3/3     Running       1          105m
percona-db-pxc-db-pxc-1                                  3/3     Running       0          104m
percona-db-pxc-db-pxc-2                                  0/3     Pending       0          104m
percona-operator-pxc-operator-c547b4cd5-74kzs            1/1     Running       9          5h47m
```

{% endtab %}
{% endtabs %}

Wait for all the containers in the `percona-db-pxc-db-haproxy-*` pods to be up and running (`Ready`); once they are, the recovery is complete.
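To wait for the HAProxy pods without polling manually, you can use `kubectl wait`. The label selector here assumes the Percona operator's default `app.kubernetes.io/component=haproxy` label, and the timeout value is only an example:

{% tabs %}
{% tab title="Bash" %}

```bash
# Block until every haproxy pod reports Ready (example timeout: 10 minutes)
kubectl wait -n percona --for=condition=Ready pod \
  -l app.kubernetes.io/component=haproxy --timeout=600s
```

{% endtab %}
{% endtabs %}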
