Troubleshooting Cluster Issues

Cluster Creation

Public Cloud Provider

  • Make sure the account you provided to PMK when creating the cloud provider has all the required privileges. See the AWS prerequisites under the Getting Started section for more details.

Cluster Creation Fails for BareOS

  • Navigate to the Infrastructure -> Clusters tab.

  • Click on the cluster name. This will take you to the cluster details page.

  • Click on the “Node Health” tab.

Here you should see a detailed breakdown of which nodes failed to install and which specific steps failed.

Etcd

Heartbeat/Election Timeout Interval

2021-02-04 18:36:31.380207 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 124.999498ms, to 92d6e239c543436)
2021-02-04 18:36:31.380220 W | etcdserver: server is likely overloaded
2021-02-04 18:36:31.382208 W | etcdserver: read-only range request "key:\"/registry/mutatingwebhookconfigurations/vault-agent-injector-cfg\" " with result "range_response_count:1 size:2723" took too long (264.355727ms) to execute

ETCD_HEARTBEAT_INTERVAL - This is the frequency with which the leader will notify followers that it is still the leader.

ETCD_ELECTION_TIMEOUT - This timeout is how long a follower node will go without hearing a heartbeat before attempting to become a leader itself.

By default, etcd uses a 100ms heartbeat interval and a 1000ms election timeout.
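To judge how far the heartbeats are slipping past the configured interval, the warnings shown above can be summarized with a short script. The following is a minimal sketch, not part of PMK: the function name `heartbeat_overruns` is hypothetical, and the regular expression assumes the warning keeps the exact wording seen in the log excerpt above.

```python
import re

# Assumed message format, taken from the log excerpt in this section:
#   etcdserver: failed to send out heartbeat on time
#   (exceeded the 100ms timeout for 124.999498ms, to 92d6e239c543436)
HEARTBEAT_RE = re.compile(
    r"failed to send out heartbeat on time "
    r"\(exceeded the (?P<interval>[\d.]+)ms timeout for (?P<overrun>[\d.]+)ms"
)


def heartbeat_overruns(log_lines):
    """Return (interval_ms, overrun_ms) for every late-heartbeat warning."""
    results = []
    for line in log_lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            results.append((float(match["interval"]), float(match["overrun"])))
    return results


# Two of the lines from the excerpt above, used as sample input.
sample = [
    "2021-02-04 18:36:31.380207 W | etcdserver: failed to send out heartbeat "
    "on time (exceeded the 100ms timeout for 124.999498ms, to 92d6e239c543436)",
    "2021-02-04 18:36:31.380220 W | etcdserver: server is likely overloaded",
]

for interval_ms, overrun_ms in heartbeat_overruns(sample):
    print(f"heartbeat interval {interval_ms:.0f}ms exceeded by {overrun_ms:.1f}ms")
```

If the reported overruns are consistently large relative to the heartbeat interval, that suggests the member is overloaded or the interval is too tight for the environment, which is when adjusting ETCD_HEARTBEAT_INTERVAL and ETCD_ELECTION_TIMEOUT is worth considering.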

Database Size Exceeded

  1. Stop the pf9-hostagent and nodeletd services on the master node(s).

  2. Issue a stop for the Nodelet phases.

  3. In /opt/pf9/pf9-kube/master_utils.sh, modify the function ensure_etcd_running() to add the following environment variable.

  4. Start the pf9-hostagent service.

  5. Verify the size was correctly set by scraping the etcd metrics endpoint.
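The verification in the last step can be sketched as follows. This is an illustration under stated assumptions: it expects the metrics text to have been fetched already (the endpoint URL and any TLS options depend on your deployment and are not shown), the helper `parse_metric` is hypothetical, and the metric names `etcd_server_quota_backend_bytes` and `etcd_mvcc_db_total_size_in_bytes` are those exposed by upstream etcd, which may differ by version. The sample values are made up for the example.

```python
def parse_metric(metrics_text, name):
    """Return the value of an unlabelled metric from Prometheus text output.

    Returns None when the metric is not present.
    """
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1])
    return None


# Illustrative sample only; real output contains many more metrics.
sample_metrics = """\
# TYPE etcd_server_quota_backend_bytes gauge
etcd_server_quota_backend_bytes 6.442450944e+09
# TYPE etcd_mvcc_db_total_size_in_bytes gauge
etcd_mvcc_db_total_size_in_bytes 2.147483648e+09
"""

quota = parse_metric(sample_metrics, "etcd_server_quota_backend_bytes")
db_size = parse_metric(sample_metrics, "etcd_mvcc_db_total_size_in_bytes")
print(f"quota: {quota / 2**30:.2f} GiB, db size: {db_size / 2**30:.2f} GiB")
```

The quota value should match the size you configured, and the database size should sit comfortably below it.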
