# Catapult Rules Alarms

Catapult has 56 built-in rules. Each rule falls into one of the categories below:

* [Calico Monitoring](/managed-kubernetes/5.15/catapult-remote-monitoring/catapult-rules-alarms/calico-monitoring.md)
* [etcd Monitoring](/managed-kubernetes/5.15/catapult-remote-monitoring/catapult-rules-alarms/etcd-monitoring.md)
* [Kubernetes API Monitoring](/managed-kubernetes/5.15/catapult-remote-monitoring/catapult-rules-alarms/kubernetes-monitoring.md)
* [Kube State Monitoring](/managed-kubernetes/5.15/catapult-remote-monitoring/catapult-rules-alarms/kube-state--monitoring.md)
* [Node OS Monitoring](/managed-kubernetes/5.15/catapult-remote-monitoring/catapult-rules-alarms/node-os-monitoring.md)
* [Environment Status](/managed-kubernetes/5.15/catapult-remote-monitoring/catapult-rules-alarms/environment-status-monitoring.md)
* [Node Connectivity Monitoring](/managed-kubernetes/5.15/catapult-remote-monitoring/catapult-rules-alarms/nodes-connectivity-monitoring.md)
* [Platform9 Managed Add-on Monitoring](/managed-kubernetes/5.15/catapult-remote-monitoring/catapult-rules-alarms/addons-catapult.md)

Below is a summary of each rule: where the data is collected, the component that is monitored, and the action required if the alert is received.

### etcd

Below is a list of the rule names, where the data is collected from, the monitored components, and the suggested action to remedy the issue.

| Rule Name                          | <p>Data Collection<br>Point</p> | Monitored Component | Action Required                                                                                                                                |
| ---------------------------------- | ------------------------------- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| EtcdBackupJobFailed                | Cluster                         | etcd                | Review etcd backup task. Possibly lack of disk space.                                                                                          |
| EtcdDown                           | Cluster                         | etcd                | Verify if nodelet can repair etcd. Check for CPU, memory, or storage issues on the node. Review logs of etcd container.                        |
| EtcdInsufficientMembers            | Cluster                         | etcd                | Etcd quorum lacks an odd number of masters.                                                                                                    |
| EtcdNoLeader                       | Cluster                         | etcd                | Etcd quorum has no leader. Investigate network issues preventing etcd members from reaching consensus.                                         |
| EtcdHighNumberOfLeaderChanges      | Cluster                         | etcd                | If leader of etcd quorum is changing frequently, verify master nodes are not constantly failing. Possible network issues between master nodes. |
| EtcdHighNumberOfFailedGrpcRequests | Cluster                         | etcd                | Investigate network issues causing failed gRPC requests in the etcd quorum. Review etcd container logs.                                        |
| EtcdHighNumberOfFailedProposals    | Cluster                         | etcd                | Inspect network issues causing failed proposals in the etcd quorum. Check etcd container logs.                                                 |
| EtcdHighFsyncDurations             | Cluster                         | etcd                | Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk.                                       |
| EtcdHighCommitDurations            | Cluster                         | etcd                | Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk.                                       |

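The preference for an odd number of masters noted under `EtcdInsufficientMembers` follows from majority-quorum arithmetic: a cluster of `n` members needs `floor(n/2) + 1` of them available to agree on writes. The sketch below only illustrates that arithmetic; it is not part of Catapult, and the function names are hypothetical.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """Number of members that can fail while a majority remains."""
    return members - quorum_size(members)


if __name__ == "__main__":
    for n in range(1, 6):
        print(
            f"members={n} quorum={quorum_size(n)} "
            f"tolerated_failures={failure_tolerance(n)}"
        )
```

Under this arithmetic, a 4-member cluster tolerates one failure, the same as a 3-member cluster, so the extra even member adds no resilience.
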
### Kubernetes API Monitoring

| Rule Name                 | <p>Data Collection<br>Point</p> | Monitored Component | Action Required                                                                |
| ------------------------- | ------------------------------- | ------------------- | ------------------------------------------------------------------------------ |
| KubeAPIServerDown         | Cluster                         | Kube API server     | Verify API Server is responding on the node.                                   |
| KubernetesApiServerErrors | Cluster                         | Kube API server     | <p>Check for API Server overload.<br>Review logs.</p>                          |
| KubernetesApiClientErrors | Cluster                         | Kube API server     | <p>Check for API Server overload.<br>Survey logs to audit client requests.</p> |
| KubeSchedulerDown         | Cluster                         | Kube Scheduler      | Audit logs for k8s master pod restarts.                                        |
| KubeControllerManagerDown | Cluster                         | kube controller     | Audit logs for k8s master pod restarts.                                        |
| KubeProxyDown             | Cluster                         | kube proxy          | Verify kube proxy container is running on the node.                            |
| KubeProxyRuleSyncLatency  | Cluster                         | kube proxy          | Explore kube proxy overload via logs.                                          |

### Kube State

| Rule Name                              | <p>Data Collection<br>Point</p> | Monitored Component | Action Required                                                                                                              |
| -------------------------------------- | ------------------------------- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------- |
| KubeNodeNotReady                       | Cluster                         | kube state metrics  | Review issues with kubelet on the node.                                                                                      |
| KubernetesMemoryPressure               | Cluster                         | kube state metrics  | Check available memory on the node.                                                                                          |
| KubernetesDiskPressure                 | Cluster                         | kube state metrics  | Check available space on the node.                                                                                           |
| KubernetesJobFailed                    | Cluster                         | kube state metrics  | Check logs for failed jobs.                                                                                                  |
| KubernetesContainerTerminated          | Cluster                         | kube state metrics  | K8s container was terminated, possibly by the OOM killer.                                                                    |
| KubePodCrashLooping                    | Cluster                         | kube state metrics  | Review logs on the crashed pod.                                                                                              |
| KubePodNotReady                        | Cluster                         | kube state metrics  | Pod may be in pending state, or awaiting a resource.                                                                         |
| KubeDeploymentReplicasMismatch         | Cluster                         | kube state metrics  | K8s deployment has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints. |
| KubeStatefulSetReplicasMismatch        | Cluster                         | kube state metrics  | K8s StatefulSet has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints. |
| KubeDaemonSetRolloutStuck              | Cluster                         | kube state metrics  | K8s DaemonSet rollout has not completed. Possibly related to CPU, memory requests, or node taints.                           |
| KubernetesPersistentvolumeclaimPending | Cluster                         | kube state metrics  | PVC is in a pending state as K8s was unable to create a PV. Verify details of storage class, CSI driver etc.                 |
| KubernetesPersistentvolumeError        | Cluster                         | kube state metrics  | PV is in an error state. Check storage class or CSI drivers, and logs.                                                       |

### Node OS

| Rule Name                       | Data Collection Point | Monitored Component | Action Required                                                      |
| ------------------------------- | --------------------- | ------------------- | -------------------------------------------------------------------- |
| HostHighCpuLoad                 | Node                  | Node exporter       | High CPU usage on node. Check processes or responsible container.    |
| HostOutOfMemory                 | Node                  | Node exporter       | High memory usage on node. Check processes or responsible container. |
| HostMemoryUnderMemoryPressure   | Node                  | Node exporter       | High memory usage on node. Check processes or responsible container. |
| NodeFilesystemAlmostOutOfSpace  | Node                  | Node exporter       | Node is almost out of disk space.                                    |
| NodeFilesystemAlmostOutOfFiles  | Node                  | Node exporter       | Node is almost out of inodes.                                        |
| HostUnusualNetworkThroughputIn  | Node                  | Node exporter       | Host is experiencing unusual inbound network throughput.             |
| HostUnusualNetworkThroughputOut | Node                  | Node exporter       | Host is experiencing unusual outbound network throughput.            |
| NodeNetworkReceiveErrs          | Node                  | Node exporter       | Host is experiencing network receive errors.                         |
| NodeNetworkTransmitErrs         | Node                  | Node exporter       | Host is experiencing network transmit errors.                        |
| HostUnusualDiskWriteRate        | Node                  | Node exporter       | Host is experiencing unusual disk write rate.                        |
| HostUnusualDiskReadRate         | Node                  | Node exporter       | Host is experiencing unusual disk read rate.                         |

### Environment Status

| Rule Name               | Data Collection Point | Monitored Component | Action Required                                                             |
| ----------------------- | --------------------- | ------------------- | --------------------------------------------------------------------------- |
| ClusterStatusNotOK      | PMK                   | SaaS Mgmt Plane     | Cluster status is not ok in PMK Mgmt Plane DB.                              |
| NodeNotReady            | PMK                   | SaaS Mgmt Plane     | K8s node has entered a *not ready* state as noted by a HostAgent extension. |
| K8sApiNotResponding     | PMK                   | SaaS Mgmt Plane     | PMK Mgmt Plane cannot reach k8s apiserver.                                  |
| WorkerNodeNotResponding | PMK                   | SaaS Mgmt Plane     | PMK Mgmt Plane cannot reach worker node.                                    |

### Node Connectivity

| Rule Name          | Data Collection Point | Monitored Component | Action Required                                                                                                                                                    |
| ------------------ | --------------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Host Availability  | PMK                   | SaaS Mgmt Plane     | <p>Node Heartbeat is failing. This could be caused by a node outage or a failed service.<br>Review the node to ensure that all Platform9 services are running.</p> |
| Hosts disconnected | PMK                   | SaaS Mgmt Plane     | Partial node availability. Heartbeat is passing. Review the node to ensure that all Platform9 services are running.                                                |
| host-down          | PMK                   | SaaS Mgmt Plane     | The node is completely disconnected from Platform9. Ensure the node is running and all services are operating.                                                     |

### Managed Add-ons

| Rule Name           | Data Collection Point | Monitored Component | Action Required                                                 |
| ------------------- | --------------------- | ------------------- | --------------------------------------------------------------- |
| AddonNotHealthy     | PMK                   | Nodelet             | ClusterAddon’s *.status.health* shows addon is not healthy.     |
| AddonNotConverging  | PMK                   | Nodelet             | ClusterAddon’s *.status.phase* is not in Installed state.       |
| AddonInstallError   | PMK                   | Nodelet             | ClusterAddon’s *.status.phase* is in an `InstallError` state.   |
| AddonUninstallError | PMK                   | Nodelet             | ClusterAddon’s *.status.phase* is in an `UninstallError` state. |

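The conditions in the table above can be read as simple checks on a ClusterAddon's `.status` fields. The sketch below is an illustration only, not how Catapult evaluates these rules. It assumes the ClusterAddon object has already been parsed into a Python dictionary and that `.status.health` is a boolean, which may not match the real field type. The function name is hypothetical, and any time window the real rules apply is ignored.

```python
def matching_addon_rules(cluster_addon: dict) -> list:
    """Return the add-on rule names whose table condition matches the status."""
    status = cluster_addon.get("status", {})
    phase = status.get("phase")
    rules = []
    # ASSUMPTION: `.status.health` is a boolean that is True when healthy.
    if status.get("health") is not True:
        rules.append("AddonNotHealthy")
    # Any phase other than "Installed" counts as not converged.
    if phase != "Installed":
        rules.append("AddonNotConverging")
    if phase == "InstallError":
        rules.append("AddonInstallError")
    if phase == "UninstallError":
        rules.append("AddonUninstallError")
    return rules


if __name__ == "__main__":
    example = {"status": {"health": False, "phase": "InstallError"}}
    print(matching_addon_rules(example))
```

Note that an add-on in an `InstallError` or `UninstallError` phase also satisfies the `AddonNotConverging` condition, so more than one rule can match the same add-on.
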
### Calico

| Rule Name                      | Data Collection Point | Monitored Component | Action Required                                                                           |
| ------------------------------ | --------------------- | ------------------- | ----------------------------------------------------------------------------------------- |
| PromHTTPRequestErrors          | Node                  | calico-felix        | Prometheus unable to pull data from calico. Underlying network or calico-node pod issues. |
| CalicoDatapaneFailuresHigh     | Node                  | calico-felix        | Calico-node pod is congested. Reduce load or restart the calico-node pod.                 |
| CalicoIpsetErrorsHigh          | Node                  | calico-felix        | Calico-node pod is congested. Reduce load or restart the calico-node pod.                 |
| CalicoIptableSaveErrorsHigh    | Node                  | calico-felix        | Calico-node pod is congested. Reduce load or restart the calico-node pod.                 |
| CalicoIptableRestoreErrorsHigh | Node                  | calico-felix        | Calico-node pod is congested. Reduce load or restart the calico-node pod.                 |
| TyphaPingLatency               | Node                  | calico-typha        | Check network connectivity to the calico-typha pods.                                      |
| TyphaClientWriteLatency        | Node                  | calico-typha        | Verify connectivity between calico-typha and Kubernetes API server/etcd.                  |


