# Catapult Rules Alarms

Catapult has 56 built-in rules. Each rule falls into one of the categories below:

* [Calico](https://github.com/platform9/pcd-docs-gitbook/blob/main/kubernetes/calico-monitoring/README.md)
* [etcd](https://github.com/platform9/pcd-docs-gitbook/blob/main/kubernetes/etcd-monitoring/README.md)
* [Kubernetes API Monitoring](https://github.com/platform9/pcd-docs-gitbook/blob/main/kubernetes/kubernetes-monitoring/README.md)
* [Kube State](https://github.com/platform9/pcd-docs-gitbook/blob/main/kubernetes/kube-state--monitoring/README.md)
* [Node OS](https://github.com/platform9/pcd-docs-gitbook/blob/main/kubernetes/node-os-monitoring/README.md)
* [Environment Status](https://github.com/platform9/pcd-docs-gitbook/blob/main/kubernetes/Environment-Status-monitoring/README.md)
* [Node Connectivity](https://github.com/platform9/pcd-docs-gitbook/blob/main/kubernetes/nodes-connectivity-monitoring/README.md)
* [Managed Add-ons](https://github.com/platform9/pcd-docs-gitbook/blob/main/kubernetes/addons-catapult/README.md)

Below is a summary of each rule: where the data is collected, the component that is monitored, and the action required if the alert is received.
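
For teams that route these alerts into their own tooling, the four columns used throughout this page can be modeled as a small lookup table. The following is a minimal sketch, not part of Catapult: the `Rule` class, the `RULES` dictionary, and the `suggested_action` helper are hypothetical names, and only a handful of rows from the tables on this page are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One row of the tables on this page (hypothetical model)."""
    name: str
    collection_point: str  # "Cluster", "Node", or "PMK" in the tables
    component: str
    action: str


# A small excerpt of the documented rules, not the full rule set.
RULES = {
    rule.name: rule
    for rule in [
        Rule("EtcdNoLeader", "Cluster", "etcd",
             "Investigate network issues causing etcd members not reaching consensus."),
        Rule("KubeProxyDown", "Cluster", "kube proxy",
             "Verify kube proxy container is running on the node."),
        Rule("HostHighCpuLoad", "Node", "Node exporter",
             "Check processes or responsible container."),
        Rule("host-down", "PMK", "SaaS Mgmt Plane",
             "Ensure the node is running and all services are operating."),
    ]
}


def suggested_action(alert_name: str) -> str:
    """Return the documented action for an alert, or a fallback message."""
    rule = RULES.get(alert_name)
    if rule is None:
        return f"No documented action for {alert_name!r}."
    return f"[{rule.collection_point}/{rule.component}] {rule.action}"


if __name__ == "__main__":
    print(suggested_action("KubeProxyDown"))
    print(suggested_action("SomeUnknownAlert"))
```

Keeping the lookup keyed by rule name means an unknown alert falls through to an explicit fallback instead of raising an error.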

### etcd

Below is a list of the rule names, where the data is collected from, the monitored components, and the suggested action to remedy the issue.

| Rule Name                          | <p>Data Collection<br>Point</p> | Monitored Component | Action Required                                                                                                                                |
| ---------------------------------- | ------------------------------- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| EtcdBackupJobFailed                | Cluster                         | etcd                | Review etcd backup task. Possibly lack of disk space.                                                                                          |
| EtcdDown                           | Cluster                         | etcd                | Verify if nodelet can repair etcd. Check for CPU, memory, or storage issues on the node. Review logs of etcd container.                        |
| EtcdInsufficientMembers            | Cluster                         | etcd                | The etcd cluster has too few healthy members for a reliable quorum. Verify that an odd number of master nodes is up and joined to etcd.        |
| EtcdNoLeader                       | Cluster                         | etcd                | Etcd quorum has no leader. Investigate network issues causing etcd members not reaching consensus.                                             |
| EtcdHighNumberOfLeaderChanges      | Cluster                         | etcd                | If leader of etcd quorum is changing frequently, verify master nodes are not constantly failing. Possible network issues between master nodes. |
| EtcdHighNumberOfFailedGrpcRequests | Cluster                         | etcd                | Investigate network issues causing failed grpc requests in the etcd quorum. Review etcd container logs.                                        |
| EtcdHighNumberOfFailedProposals    | Cluster                         | etcd                | Inspect network issues causing failed proposals in the etcd quorum. Check etcd container logs.                                                 |
| EtcdHighFsyncDurations             | Cluster                         | etcd                | Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk.                                       |
| EtcdHighCommitDurations            | Cluster                         | etcd                | Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk.                                       |
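
The EtcdInsufficientMembers entry above ties etcd health to the number of master nodes. The arithmetic behind it is that etcd needs a majority of its members to agree before it can commit a write. The sketch below only illustrates that majority calculation. The function names are hypothetical, and the exact threshold the Catapult rule uses is not stated on this page.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority is still available."""
    return members - quorum_size(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"{n} members: quorum={quorum_size(n)}, "
              f"tolerates {failures_tolerated(n)} failure(s)")
```

A 4-member cluster tolerates one failure, the same as a 3-member cluster, which is why an odd number of masters is the usual recommendation.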

### Kubernetes API Monitoring

| Rule Name                 | <p>Data Collection<br>Point</p> | Monitored Component | Action Required                                                                |
| ------------------------- | ------------------------------- | ------------------- | ------------------------------------------------------------------------------ |
| KubeAPIServerDown         | Cluster                         | Kube API server     | Verify API Server is responding on the node.                                   |
| KubernetesApiServerErrors | Cluster                         | Kube API server     | <p>Check for API Server overload.<br>Review logs.</p>                          |
| KubernetesApiClientErrors | Cluster                         | Kube API server     | <p>Check for API Server overload.<br>Survey logs to audit client requests.</p> |
| KubeSchedulerDown         | Cluster                         | Kube Scheduler      | Audit logs for k8s master pod restarts.                                        |
| KubeControllerManagerDown | Cluster                         | kube controller     | Audit logs for k8s master pod restarts.                                        |
| KubeProxyDown             | Cluster                         | kube proxy          | Verify kube proxy container is running on the node.                            |
| KubeProxyRuleSyncLatency  | Cluster                         | kube proxy          | Explore kube proxy overload via logs.                                          |

### Kube State

| Rule Name                              | <p>Data Collection<br>Point</p> | Monitored Component | Action Required                                                                                                              |
| -------------------------------------- | ------------------------------- | ------------------- | ---------------------------------------------------------------------------------------------------------------------------- |
| KubeNodeNotReady                       | Cluster                         | kube state metrics  | Review issues with kubelet on the node.                                                                                      |
| KubernetesMemoryPressure               | Cluster                         | kube state metrics  | Check available memory on the node.                                                                                          |
| KubernetesDiskPressure                 | Cluster                         | kube state metrics  | Check available space on the node.                                                                                           |
| KubernetesJobFailed                    | Cluster                         | kube state metrics  | Check logs for failed jobs.                                                                                                  |
| KubernetesContainerTerminated          | Cluster                         | kube state metrics  | K8s container was terminated, possible OOM killer.                                                                           |
| KubePodCrashLooping                    | Cluster                         | kube state metrics  | Review logs on the crashed pod.                                                                                              |
| KubePodNotReady                        | Cluster                         | kube state metrics  | Pod may be in pending state, or awaiting a resource.                                                                         |
| KubeDeploymentReplicasMismatch         | Cluster                         | kube state metrics  | K8s deployment has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints. |
| KubeStatefulSetReplicasMismatch        | Cluster                         | kube state metrics  | K8s StatefulSet has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints. |
| KubeDaemonSetRolloutStuck              | Cluster                         | kube state metrics  | K8s DaemonSet rollout is stuck. Possibly related to CPU, memory requests, or node taints.                                    |
| KubernetesPersistentvolumeclaimPending | Cluster                         | kube state metrics  | PVC is in a pending state as K8s was unable to create a PV. Verify details of storage class, CSI driver etc.                 |
| KubernetesPersistentvolumeError        | Cluster                         | kube state metrics  | PV is in an error state. Check storage class or CSI drivers, and logs.                                                       |

### Node OS

| Rule Name                       | Data Collection Point | Monitored Component | Action Required                                                      |
| ------------------------------- | --------------------- | ------------------- | -------------------------------------------------------------------- |
| HostHighCpuLoad                 | Node                  | Node exporter       | High CPU usage on node. Check processes or responsible container.    |
| HostOutOfMemory                 | Node                  | Node exporter       | High memory usage on node. Check processes or responsible container. |
| HostMemoryUnderMemoryPressure   | Node                  | Node exporter       | High memory usage on node. Check processes or responsible container. |
| NodeFilesystemAlmostOutOfSpace  | Node                  | Node exporter       | Node filesystem is almost out of disk space.                         |
| NodeFilesystemAlmostOutOfFiles  | Node                  | Node exporter       | Node filesystem is almost out of inodes.                             |
| HostUnusualNetworkThroughputIn  | Node                  | Node exporter       | Host is experiencing unusual inbound network throughput.             |
| HostUnusualNetworkThroughputOut | Node                  | Node exporter       | Host is experiencing unusual outbound network throughput.            |
| NodeNetworkReceiveErrs          | Node                  | Node exporter       | Host is experiencing network receive errors.                         |
| NodeNetworkTransmitErrs         | Node                  | Node exporter       | Host is experiencing network transmit errors.                        |
| HostUnusualDiskWriteRate        | Node                  | Node exporter       | Host is experiencing unusual disk write rate.                        |
| HostUnusualDiskReadRate         | Node                  | Node exporter       | Host is experiencing unusual disk read rate.                         |

### Environment Status

| Rule Name               | Data Collection Point | Monitored Component | Action Required                                                             |
| ----------------------- | --------------------- | ------------------- | --------------------------------------------------------------------------- |
| ClusterStatusNotOK      | PMK                   | SaaS Mgmt Plane     | Cluster status is not ok in PMK Mgmt Plane DB.                              |
| NodeNotReady            | PMK                   | SaaS Mgmt Plane     | K8s node has entered a *not ready* state as noted by a HostAgent extension. |
| K8sApiNotResponding     | PMK                   | SaaS Mgmt Plane     | PMK Mgmt Plane cannot reach k8s apiserver.                                  |
| WorkerNodeNotResponding | PMK                   | SaaS Mgmt Plane     | PMK Mgmt Plane cannot reach worker node.                                    |

### Node Connectivity

| Rule Name          | Data Collection Point | Monitored Component | Action Required                                                                                                                                                    |
| ------------------ | --------------------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Host Availability  | PMK                   | SaaS Mgmt Plane     | <p>Node Heartbeat is failing. This could be caused by a node outage or a failed service.<br>Review the node to ensure that all Platform9 services are running.</p> |
| Hosts disconnected | PMK                   | SaaS Mgmt Plane     | Partial node availability: the heartbeat is passing. Review the node to ensure that all Platform9 services are running.                                            |
| host-down          | PMK                   | SaaS Mgmt Plane     | The node is completely disconnected from Platform9. Ensure the node is running and all services are operating.                                                     |

### Managed Add-ons

| Rule Name           | Data Collection Point | Monitored Component | Action Required                                                 |
| ------------------- | --------------------- | ------------------- | --------------------------------------------------------------- |
| AddonNotHealthy     | PMK                   | Nodelet             | ClusterAddon’s *.status.health* shows addon is not healthy.     |
| AddonNotConverging  | PMK                   | Nodelet             | ClusterAddon’s *.status.phase* is not in Installed state.       |
| AddonInstallError   | PMK                   | Nodelet             | ClusterAddon’s *.status.phase* is in an `InstallError` state.   |
| AddonUninstallError | PMK                   | Nodelet             | ClusterAddon’s *.status.phase* is in an `UninstallError` state. |
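
The add-on rules above are all driven by two fields of the ClusterAddon status, so their relationship can be sketched as a small function. This is an illustration under stated assumptions, not the actual rule expressions: `addon_alerts` is a hypothetical helper, the caller is assumed to have already reduced *.status.health* to a boolean (its real value format is not described here), and the table is read literally, so an error phase also counts as "not Installed".

```python
from typing import List


def addon_alerts(phase: str, healthy: bool) -> List[str]:
    """Map a ClusterAddon's status to the alert names in the table above.

    phase   -- value of .status.phase, e.g. "Installed" or "InstallError"
    healthy -- assumed boolean summary of .status.health
    """
    alerts = []
    if not healthy:
        alerts.append("AddonNotHealthy")
    if phase != "Installed":
        # Literal reading of the table: any phase other than Installed.
        alerts.append("AddonNotConverging")
    if phase == "InstallError":
        alerts.append("AddonInstallError")
    if phase == "UninstallError":
        alerts.append("AddonUninstallError")
    return alerts


if __name__ == "__main__":
    print(addon_alerts("Installed", True))
    print(addon_alerts("InstallError", True))
```

Real alerting rules usually also require a condition to hold for some duration before firing. That timing is not modeled here.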

### Calico

| Rule Name                      | Data Collection Point | Monitored Component | Action Required                                                                           |
| ------------------------------ | --------------------- | ------------------- | ----------------------------------------------------------------------------------------- |
| PromHTTPRequestErrors          | Node                  | calico-felix        | Prometheus unable to pull data from calico. Underlying network or calico-node pod issues. |
| CalicoDatapaneFailuresHigh     | Node                  | calico-felix        | Calico-node pod is congested. Reduce load or restart calico-node pod.                     |
| CalicoIpsetErrorsHigh          | Node                  | calico-felix        | Calico-node pod is congested. Reduce load or restart calico-node pod.                     |
| CalicoIptableSaveErrorsHigh    | Node                  | calico-felix        | Calico-node pod is congested. Reduce load or restart calico-node pod.                     |
| CalicoIptableRestoreErrorsHigh | Node                  | calico-felix        | Calico-node pod is congested. Reduce load or restart calico-node pod.                     |
| TyphaPingLatency               | Node                  | calico-typha        | Check network connectivity to the calico-typha pods.                                      |
| TyphaClientWriteLatency        | Node                  | calico-typha        | Verify connectivity between calico-typha and kubernetes API server/etcd.                  |
