# Troubleshooting Cluster Issues

### Cluster Creation <a href="#cluster-creation" id="cluster-creation"></a>

#### Public Cloud Provider <a href="#public-cloud-provider" id="public-cloud-provider"></a>

* Make sure the account you provided to PMK when creating the cloud provider has all the required privileges. See the AWS prerequisites in the Getting Started section for more details.

#### Cluster Creation Fails for BareOS <a href="#cluster-creation-fails-for-bareos" id="cluster-creation-fails-for-bareos"></a>

* Navigate to the Infrastructure -> Clusters tab.
* Click on the cluster name. This will take you to the cluster details page.
* Click on the “Node Health” tab.

Here you should see a detailed breakdown of which nodes failed to install and which specific steps failed. Next, check [Troubleshooting Node Issues](https://docs.platform9.com/managed-kubernetes/support/troubleshooting/troubleshooting-node-issues).

### Etcd <a href="#etcd" id="etcd"></a>

#### Heartbeat/Election Timeout Interval <a href="#heartbeatelection-timeout-interval" id="heartbeatelection-timeout-interval"></a>

Warnings such as the following in the etcd logs indicate that the leader is failing to send heartbeats on time:

```
2021-02-04 18:36:31.380207 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 124.999498ms, to 92d6e239c543436)
2021-02-04 18:36:31.380220 W | etcdserver: server is likely overloaded
2021-02-04 18:36:31.382208 W | etcdserver: read-only range request "key:\"/registry/mutatingwebhookconfigurations/vault-agent-injector-cfg\" " with result "range_response_count:1 size:2723" took too long (264.355727ms) to execute
```
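
To get a rough idea of how often these warnings occur and how large the overruns are, the log can be summarized with a small filter. The following is a minimal sketch, not a Platform9 tool: the function name `summarize_heartbeat_warnings` is hypothetical, and it assumes the overrun is reported in milliseconds in the same format as the sample above.

```bash
#!/usr/bin/env bash

# Hypothetical helper: reads etcd log lines on stdin and reports how many
# heartbeat warnings were seen and the largest overrun (in ms).
summarize_heartbeat_warnings() {
    awk '
        /failed to send out heartbeat on time/ {
            count++
            # The overrun appears as "... timeout for 124.999498ms, to <member>)".
            # Lines using another unit still count but do not affect the maximum.
            if (match($0, /for [0-9.]+ms/)) {
                overrun = substr($0, RSTART + 4, RLENGTH - 6) + 0
                if (overrun > max) max = overrun
            }
        }
        END { printf "warnings=%d max_overrun_ms=%.3f\n", count, max }
    '
}
```

For example, with the etcd log saved to a file (here a hypothetical `etcd.log`), run `summarize_heartbeat_warnings < etcd.log`.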

**ETCD\_HEARTBEAT\_INTERVAL -** This is the frequency with which the leader will notify followers that it is still the leader.

**ETCD\_ELECTION\_TIMEOUT -** This timeout is how long a follower node will go without hearing a heartbeat before attempting to become a leader itself.

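By default, etcd uses a `100ms` heartbeat interval and a `1000ms` election timeout. If the cluster regularly logs the warnings above, these values may need to be raised. The values currently set on a node can be checked in `/etc/pf9/kube.env`: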

```
# cat /etc/pf9/kube.env | grep -i etcd
export ETCD_HEARTBEAT_INTERVAL="1000"
export ETCD_ELECTION_TIMEOUT="10000"
```
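
When changing these values, keep the two settings in proportion. As far as we know, upstream etcd expects the election timeout to be at least five times the heartbeat interval. The sketch below checks that ratio for a `kube.env`-style file. The function names are hypothetical, and falling back to `100`/`1000` when a variable is absent is an assumption based on the etcd defaults mentioned above.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the numeric value of variable $1 from env file $2.
# Accepts lines such as: export ETCD_HEARTBEAT_INTERVAL="1000"
get_env_value() {
    sed -n "s/^\(export \)\{0,1\}$1=\"\{0,1\}\([0-9]*\)\"\{0,1\}.*/\2/p" "$2" 2>/dev/null | tail -n 1
}

# Hypothetical helper: warn if the election timeout is less than 5x the heartbeat.
check_etcd_timeouts() {
    local env_file="${1:-/etc/pf9/kube.env}"
    local heartbeat election
    heartbeat=$(get_env_value ETCD_HEARTBEAT_INTERVAL "$env_file")
    election=$(get_env_value ETCD_ELECTION_TIMEOUT "$env_file")
    # Assumption: fall back to the etcd defaults when a value is not set.
    heartbeat=${heartbeat:-100}
    election=${election:-1000}
    if [ "$election" -lt $((heartbeat * 5)) ]; then
        echo "WARN: election timeout ${election}ms is less than 5x heartbeat ${heartbeat}ms"
        return 1
    fi
    echo "OK: heartbeat=${heartbeat}ms election=${election}ms"
}
```

Calling `check_etcd_timeouts` with no argument reads `/etc/pf9/kube.env`; pass another path to check a different file.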

#### Database Size Exceeded <a href="#database-size-exceeded" id="database-size-exceeded"></a>

<figure><img src="https://914468382-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FdN3UKqOJY9TdzQV1eEIf%2Fuploads%2FToGpcO36zJfjXLRV7HiD%2Fimage.png?alt=media&#x26;token=18d0056a-992e-4057-ad31-cf2a9defb235" alt=""><figcaption></figcaption></figure>

```
etcdserver: failed to apply request,took 2.429µs,request header:<ID:1920634987875929770 > txn:<compare:<target:MOD key:"/registry/services/endpoints/kube-system/kube-controller-manager" mod_revision:287319046 > success:<request_put:<key:"/registry/services/endpoints/kube-system/kube-controller-manager" value_size:473 >> failure:<>>,resp ,err is etcdserver: no space
```

To raise the etcd backend quota, perform the following steps.

* Stop the pf9-hostagent and nodeletd services on the master node(s).

```
sudo systemctl stop pf9-{hostagent,nodeletd}
```

* Issue a `stop` for the Nodelet phases.

```
/opt/pf9/nodelet/nodeletd phases stop
```

* In `/opt/pf9/pf9-kube/master_utils.sh`, modify the function `ensure_etcd_running()` to add the `ETCD_QUOTA_BACKEND_BYTES` environment variable, replacing `<size_in_bytes>` with the desired quota in bytes.

{% tabs %}
{% tab title="/opt/pf9/pf9-kube/master\_utils.sh" %}

```bash
--volume ${ETCD_DATA_DIR}:/var/etcd/data \
        -e ETCD_DEBUG=${DEBUG} \
        -e ETCD_QUOTA_BACKEND_BYTES=<size_in_bytes>"
```

{% endtab %}
{% endtabs %}

* Start the `pf9-hostagent` service.

```
sudo systemctl start pf9-hostagent
```

* Verify the size was correctly set by scraping the etcd metrics endpoint.

```
curl -L http://localhost:2379/metrics | grep etcd_server_quota_backend_bytes
```
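
Two small helpers can make this procedure less error-prone: one to work out a value for `<size_in_bytes>`, and one to compare the current database size against the quota. This is a minimal sketch with hypothetical function names. It assumes the metrics endpoint also exposes `etcd_mvcc_db_total_size_in_bytes`, which upstream etcd provides but which has not been confirmed for every PMK release.

```bash
#!/usr/bin/env bash

# Hypothetical helper: convert a whole number of GiB to bytes,
# suitable for ETCD_QUOTA_BACKEND_BYTES.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Hypothetical helper: read Prometheus-format metrics on stdin and print
# how much of the backend quota the database currently uses.
etcd_quota_usage() {
    awk '
        $1 == "etcd_server_quota_backend_bytes"  { quota = $2 }
        $1 == "etcd_mvcc_db_total_size_in_bytes" { size = $2 }
        END {
            if (quota == "" || size == "" || quota + 0 == 0) {
                print "metrics not found"
                exit 1
            }
            printf "%.1f%% of quota used\n", 100 * size / quota
        }
    '
}
```

For example, `gib_to_bytes 8` prints `8589934592`, and `curl -sL http://localhost:2379/metrics | etcd_quota_usage` reports the current usage as a percentage of the quota.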
