Troubleshooting Node Install Failure or Health Issues

As part of cluster creation, PMK installs components on your node that are responsible for the following:

  • Installing required packages on the node to prepare it to be added to the cluster with the right role (master or worker)
  • Performing periodic health checks on the node to verify the health of the installed packages

If an error occurs during either of these two operations, PMK puts the node in an Unhealthy state. Depending on the cluster size and configuration, this may impact the health of the cluster as well.

Analyzing Node Health

Follow these steps to dig deeper when you see a node being reported as Unhealthy.

  • Identify the cluster that the node belongs to in the Infrastructure -> Clusters table
  • Click on the cluster name. This navigates you to the Cluster Details view for that cluster.
  • Click on the “Node Health” tab. Here you will see detailed health information for all the nodes that are part of this cluster.
  • Identify the node in question and click on it in the table on the left. You can now view details of the specific step that failed, either during install or during a status check on that node. This failed step is the root cause of the node being in the Unhealthy state.

Viewing Node Logs

You can easily view the node logs generated by the PMK services that run on the node by clicking on the ‘View Logs’ link on this page.

NOTE however that the logs rotate on the node, and the link in the UI only shows the latest snippet of the log. If you do not find any error in the log file, it is likely that the logs have rotated since. Log into your node and navigate to the log directory:

cd /var/log/pf9/kube

Here you will find the most recent log files for this node generated by the PMK pf9-kube service.

ubuntu@madhura-pmkft-test:/var/log/pf9/kube$ ls
kube.log       kube.log.2.gz  kube.log.4.gz  kube.log.6.gz
kube.log.1.gz  kube.log.3.gz  kube.log.5.gz  kube.log.7.gz

Search through the files for more details on the specific errors.
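Since the older files are gzip-compressed, zgrep is a convenient way to search the current and rotated logs in one pass. A minimal sketch, assuming the log directory from above; the "error" pattern is just an example of what to search for:

```shell
#!/bin/sh
# Search the current and rotated pf9-kube logs for errors.
LOG_DIR="${LOG_DIR:-/var/log/pf9/kube}"

if [ -d "$LOG_DIR" ]; then
    # zgrep transparently reads both kube.log and the kube.log.N.gz rotations
    zgrep -ih "error" "$LOG_DIR"/kube.log* | tail -n 20
else
    echo "Log directory $LOG_DIR not found on this machine"
fi
```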

Common Reasons for Node Failures

Node Disk Ran Out of Space

The node needs sufficient CPU, memory, and storage resources to become part of a Kubernetes cluster. If the node runs out of storage space during or after installation, this will very likely result in at least some services failing on the node, and hence the node going into an Unhealthy state.

To identify if the node has run out of storage space:

  • Navigate to Infrastructure -> Nodes tab.
  • Identify the node in question. Look at the ‘Storage’ (or the ‘Resources’) column for the node
  • If you have < 500MB of free storage space left on the node, this is likely the reason for node failure.
  • WARNING - If the storage space on the node is >= 90% used, you will very likely run into problems on this node soon (if you aren’t running into any already). We recommend adding more storage to the node as soon as possible.
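You can also check usage from the node itself with df. A minimal sketch that applies the 90% threshold mentioned above; checking the root filesystem is an assumption, so adjust the path if your container runtime and PMK state live on a separate volume:

```shell
#!/bin/sh
# Report root filesystem usage and flag it against the 90% threshold.
USED_PCT=$(df -P / | awk 'NR==2 {gsub("%",""); print $5}')

if [ "$USED_PCT" -ge 90 ]; then
    echo "WARNING: root filesystem is ${USED_PCT}% used - add storage soon"
else
    echo "OK: root filesystem is ${USED_PCT}% used"
fi
```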

Node Is Not Accessible

Node Is Running on an Internal Network

If your node is deployed on an internal / private network that is not accessible externally, you will need to associate an external IP address with the node and perform any additional cloud-specific configuration to ensure the node can access the PMK management plane.
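A quick way to confirm the node can reach the management plane is an outbound HTTPS probe. This is a sketch; the FQDN below is a placeholder, so substitute your actual Platform9 account URL:

```shell
#!/bin/sh
# Probe outbound HTTPS connectivity to the management plane.
# NOTE: example.platform9.io is a placeholder - use your own account FQDN.
DU_FQDN="${DU_FQDN:-example.platform9.io}"

if curl -sS -o /dev/null --connect-timeout 5 "https://$DU_FQDN"; then
    echo "Node can reach $DU_FQDN over HTTPS"
else
    echo "Node can NOT reach $DU_FQDN - check external IP, routes, and firewall rules"
fi
```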

Node VIP Not Routable From Other Masters

If you have provided a virtual IP (VIP) as part of your BareOS cluster creation, and your master nodes cannot reach the network that the VIP is allocated from, or the masters are deployed on a private network with restrictive port security rules, the master nodes will not come up properly. Ensure that the VIP is routable from all master nodes.
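You can test routability by attempting a TCP connection to the VIP from each master. A sketch using bash's /dev/tcp so nothing extra needs to be installed; the VIP value is a placeholder, and port 443 is an assumption (the API server behind a VIP commonly listens on 443 or 6443):

```shell
#!/bin/bash
# Run this from each master node. The VIP below is a placeholder -
# substitute the virtual IP you supplied at cluster creation.
VIP="${VIP:-192.0.2.10}"
PORT="${PORT:-443}"

# Attempt a TCP connection to the VIP; /dev/tcp avoids needing nc installed.
if timeout 3 bash -c "echo > /dev/tcp/$VIP/$PORT" 2>/dev/null; then
    echo "VIP $VIP:$PORT is reachable from this master"
else
    echo "VIP $VIP:$PORT is NOT reachable - check routing and port security rules"
fi
```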

Node Has Incompatible Versions of Packages

Ensure that the apt package repositories that the Ubuntu machine is configured to use are configured properly, and that the installed packages on the node are of a compatible architecture. (Some enterprises configure their machines to work with private package repositories, which might have older or incompatible package versions.)
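On Ubuntu you can verify the machine architecture and which repository a given package would be installed from. A sketch; the containerd package name is only an example, so substitute whichever package the failed install step reported:

```shell
#!/bin/sh
# Check machine architecture and which repository a package would come from.
# PKG is an example - substitute the package named in the failed install step.
PKG="${PKG:-containerd}"

uname -m                      # machine architecture (e.g. x86_64)
if command -v apt-cache >/dev/null 2>&1; then
    apt-cache policy "$PKG"   # shows the candidate version and its source repo
else
    echo "apt-cache not available - not a Debian/Ubuntu machine?"
fi
```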

Node is Experiencing Clock Skew

Run the date command to verify that the current date and time on the node are accurate.

If the node clock is out of sync with global time, certificate validation between the node and the Platform9 management plane may fail, causing the TLS handshake between the two to fail, and hence preventing the node from registering with the management plane.

During install, PMK will try to install NTP on the node, which keeps the node time as close to real time as possible, so this issue should be rare. If you are experiencing node issues, make sure the node clock is in sync with real time.
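Beyond eyeballing the output of date, you can ask the system whether time synchronization is actually active. A sketch; timedatectl is assumed to be available on systemd-based distributions such as recent Ubuntu releases, and the command degrades gracefully elsewhere:

```shell
#!/bin/sh
# Print the node's idea of UTC time and whether it is NTP-synchronized.
date -u

if command -v timedatectl >/dev/null 2>&1; then
    # "NTPSynchronized=yes" means the clock is being disciplined by NTP
    timedatectl show -p NTPSynchronized 2>/dev/null || timedatectl status
else
    echo "timedatectl not available - check your NTP daemon (ntpd/chronyd) directly"
fi
```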