Catapult Rules Alarms

Catapult has 56 built-in rules. Each rule falls into one of the categories below.

Below is a summary of each rule, where the data is collected, the component that is monitored, and the action required if the alert is received.

etcd

Below is a list of the rule names, where the data is collected from, the monitored components, and the suggested action to remedy the issue.

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| EtcdBackupJobFailed | Cluster | etcd | Review the etcd backup task. Possibly a lack of disk space. |
| EtcdDown | Cluster | etcd | Verify whether nodelet can repair etcd. Check for CPU, memory, or storage issues on the node. Review the etcd container logs. |
| EtcdInsufficientMembers | Cluster | etcd | Etcd quorum lacks an odd number of masters. |
| EtcdNoLeader | Cluster | etcd | Etcd quorum has no leader. Investigate network issues that prevent etcd members from reaching consensus. |
| EtcdHighNumberOfLeaderChanges | Cluster | etcd | If the leader of the etcd quorum is changing frequently, verify that master nodes are not constantly failing. Possible network issues between master nodes. |
| EtcdHighNumberOfFailedGrpcRequests | Cluster | etcd | Investigate network issues causing failed gRPC requests in the etcd quorum. Review etcd container logs. |
| EtcdHighNumberOfFailedProposals | Cluster | etcd | Inspect network issues causing failed proposals in the etcd quorum. Check etcd container logs. |
| EtcdHighFsyncDurations | Cluster | etcd | Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk. |
| EtcdHighCommitDurations | Cluster | etcd | Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk. |
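
The advice to run an odd number of masters follows from majority-quorum arithmetic: a cluster of n members needs n // 2 + 1 of them to agree, so an even-sized cluster tolerates no more failures than the next smaller odd-sized one. The following is a minimal sketch of that arithmetic. The function names are illustrative and are not part of Catapult or etcd.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while a majority remains."""
    return members - quorum_size(members)


if __name__ == "__main__":
    # Compare cluster sizes; note that 4 members tolerate no more failures than 3.
    for n in range(1, 6):
        print(f"members={n} quorum={quorum_size(n)} "
              f"tolerated_failures={fault_tolerance(n)}")
```

Going from three to four members raises the quorum from two to three but leaves the number of tolerated failures at one, which is why odd sizes are preferred.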

Kubernetes API Monitoring

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| KubeAPIServerDown | Cluster | Kube API server | Verify that the API server is responding on the node. |
| KubernetesApiServerErrors | Cluster | Kube API server | Check for API server overload. Review logs. |
| KubernetesApiClientErrors | Cluster | Kube API server | Check for API server overload. Survey logs to audit client requests. |
| KubeSchedulerDown | Cluster | Kube Scheduler | Audit logs for k8s master pod restarts. |
| KubeControllerManagerDown | Cluster | kube controller | Audit logs for k8s master pod restarts. |
| KubeProxyDown | Cluster | kube proxy | Verify that the kube proxy container is running on the node. |
| KubeProxyRuleSyncLatency | Cluster | kube proxy | Check logs for kube proxy overload. |
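
When auditing API server logs for the two error rules, it can help to separate responses by HTTP status class. The sketch below assumes that KubernetesApiServerErrors relates to 5xx responses and KubernetesApiClientErrors to 4xx responses. That mapping is an assumption based on the rule names and is not stated in this document. The function names are illustrative.

```python
from collections import Counter
from typing import Iterable


def classify_status(code: int) -> str:
    """Bucket an HTTP status code (assumed mapping, see lead-in)."""
    if 500 <= code <= 599:
        return "server_error"
    if 400 <= code <= 499:
        return "client_error"
    return "ok"


def error_ratios(codes: Iterable[int]) -> dict[str, float]:
    """Fraction of requests that fall into each error class."""
    counts = Counter(classify_status(c) for c in codes)
    total = sum(counts.values())
    if total == 0:
        return {"server_error": 0.0, "client_error": 0.0}
    return {key: counts[key] / total for key in ("server_error", "client_error")}


if __name__ == "__main__":
    # Status codes as they might be extracted from an audit log sample.
    print(error_ratios([200, 200, 404, 500]))
```

A high server-error share points toward the API server itself, while a high client-error share suggests auditing which clients are sending the failing requests.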

Kube State

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| KubeNodeNotReady | Cluster | kube state metrics | Review issues with kubelet on the node. |
| KubernetesMemoryPressure | Cluster | kube state metrics | Check available memory on the node. |
| KubernetesDiskPressure | Cluster | kube state metrics | Check available disk space on the node. |
| KubernetesJobFailed | Cluster | kube state metrics | Check logs for failed jobs. |
| KubernetesContainerTerminated | Cluster | kube state metrics | K8s container was terminated, possibly by the OOM killer. |
| KubePodCrashLooping | Cluster | kube state metrics | Review logs on the crashed pod. |
| KubePodNotReady | Cluster | kube state metrics | Pod may be in a pending state, or awaiting a resource. |
| KubeDeploymentReplicasMismatch | Cluster | kube state metrics | K8s deployment has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints. |
| KubeStatefulSetReplicasMismatch | Cluster | kube state metrics | K8s StatefulSet has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints. |
| KubeDaemonSetRolloutStuck | Cluster | kube state metrics | K8s DaemonSet has not reconciled the expected number of pods. Possibly related to CPU, memory requests, or node taints. |
| KubernetesPersistentvolumeclaimPending | Cluster | kube state metrics | PVC is in a pending state as K8s was unable to create a PV. Verify details of storage class, CSI driver, etc. |
| KubernetesPersistentvolumeError | Cluster | kube state metrics | PV is in an error state. Check storage class or CSI drivers, and logs. |
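
The replica mismatch rules compare the desired replica count with what is actually running. The sketch below shows that comparison against a workload object loaded as a dictionary, using the standard Kubernetes fields `spec.replicas` and `status.readyReplicas`. The function name is illustrative, and the exact expression Catapult evaluates is not given here, so treat this as an approximation of the idea.

```python
def replicas_mismatch(workload: dict) -> bool:
    """Return True when ready replicas differ from the desired count.

    `workload` is a Deployment or StatefulSet as a plain dict, for
    example parsed from `kubectl get deployment <name> -o json`.
    """
    # spec.replicas defaults to 1 when omitted.
    desired = workload.get("spec", {}).get("replicas", 1)
    # status.readyReplicas is omitted from the object when it is zero.
    ready = workload.get("status", {}).get("readyReplicas", 0)
    return ready != desired


if __name__ == "__main__":
    sample = {"spec": {"replicas": 3}, "status": {"readyReplicas": 2}}
    print(replicas_mismatch(sample))
```

A brief mismatch is normal during a rollout, so a check like this is only meaningful when the difference persists.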

Node OS

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| HostHighCpuLoad | Node | Node exporter | High CPU usage on node. Check processes or responsible container. |
| HostOutOfMemory | Node | Node exporter | High memory usage on node. Check processes or responsible container. |
| HostMemoryUnderMemoryPressure | Node | Node exporter | High memory usage on node. Check processes or responsible container. |
| NodeFilesystemAlmostOutOfSpace | Node | Node exporter | Node is almost out of disk space. |
| NodeFilesystemAlmostOutOfFiles | Node | Node exporter | Node is almost out of inodes. |
| HostUnusualNetworkThroughputIn | Node | Node exporter | Host is experiencing unusual inbound network throughput. |
| HostUnusualNetworkThroughputOut | Node | Node exporter | Host is experiencing unusual outbound network throughput. |
| NodeNetworkReceiveErrs | Node | Node exporter | Host is experiencing network receive errors. |
| NodeNetworkTransmitErrs | Node | Node exporter | Host is experiencing network transmit errors. |
| HostUnusualDiskWriteRate | Node | Node exporter | Host is experiencing unusual disk write rate. |
| HostUnusualDiskReadRate | Node | Node exporter | Host is experiencing unusual disk read rate. |
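
To follow up on the filesystem rules directly on a node, free space and free inodes can be read with Python's `os.statvfs` (available on Unix-like systems). This is a minimal sketch for manual inspection. The helper names are illustrative, and the thresholds Catapult uses are not stated in this document.

```python
import os


def percent_free(available: int, total: int) -> float:
    """Percentage of a resource still available."""
    if total <= 0:
        # Some filesystems do not report a count (for example inodes);
        # treat that as "not constrained" rather than dividing by zero.
        return 100.0
    return 100.0 * available / total


def filesystem_headroom(path: str = "/") -> dict:
    """Free disk space and free inodes for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        "space_free_pct": percent_free(st.f_bavail, st.f_blocks),
        "inodes_free_pct": percent_free(st.f_favail, st.f_files),
    }


if __name__ == "__main__":
    print(filesystem_headroom("/"))
```

Low `space_free_pct` corresponds to the NodeFilesystemAlmostOutOfSpace case and low `inodes_free_pct` to NodeFilesystemAlmostOutOfFiles.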

Environment Status

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| ClusterStatusNotOK | PMK | SaaS Mgmt Plane | Cluster status is not OK in PMK Mgmt Plane DB. |
| NodeNotReady | PMK | SaaS Mgmt Plane | K8s node has entered a not ready state as noted by a HostAgent extension. |
| K8sApiNotResponding | PMK | SaaS Mgmt Plane | PMK Mgmt Plane cannot reach k8s apiserver. |
| WorkerNodeNotResponding | PMK | SaaS Mgmt Plane | PMK Mgmt Plane cannot reach worker node. |

Node Connectivity

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| Host Availability | PMK | SaaS Mgmt Plane | Node heartbeat is failing. This could be caused by a node outage or a failed service. Review the node to ensure that all Platform9 services are running. |
| Hosts disconnected | PMK | SaaS Mgmt Plane | Partial node availability: the heartbeat is passing. Review the node to ensure that all Platform9 services are running. |
| host-down | PMK | SaaS Mgmt Plane | The node is completely disconnected from Platform9. Ensure the node is running and all services are operating. |

Managed Add-ons

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| AddonNotHealthy | PMK | Nodelet | ClusterAddon’s .status.health shows addon is not healthy. |
| AddonNotConverging | PMK | Nodelet | ClusterAddon’s .status.phase is not in Installed state. |
| AddonInstallError | PMK | Nodelet | ClusterAddon’s .status.phase is in an InstallError state. |
| AddonUninstallError | PMK | Nodelet | ClusterAddon’s .status.phase is in an UninstallError state. |
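
The add-on rules above all read two fields of a ClusterAddon's status. The sketch below applies the table literally to a status dictionary to show which rule names would match. It assumes that `.status.health` is a boolean that is true when the add-on is healthy. The field's real type is not documented here, and the function name is illustrative.

```python
def addon_alerts(status: dict) -> list[str]:
    """Rule names whose condition matches a ClusterAddon status dict."""
    alerts = []
    phase = status.get("phase")

    if phase == "InstallError":
        alerts.append("AddonInstallError")
    elif phase == "UninstallError":
        alerts.append("AddonUninstallError")

    # Read literally, any phase other than Installed counts as not
    # converging, including the two error phases above.
    if phase != "Installed":
        alerts.append("AddonNotConverging")

    # ASSUMPTION: `health` is a boolean, True when the add-on is healthy.
    if not status.get("health", False):
        alerts.append("AddonNotHealthy")

    return alerts


if __name__ == "__main__":
    print(addon_alerts({"phase": "InstallError", "health": True}))
```

Whether the real rules fire together for an error phase is not specified, so the overlap shown here reflects only the wording of the table.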

Calico

| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| PromHTTPRequestErrors | Node | calico-felix | Prometheus is unable to pull data from Calico. Possible underlying network or calico-node pod issues. |
| CalicoDatapaneFailuresHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart the calico-node pod. |
| CalicoIpsetErrorsHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart the calico-node pod. |
| CalicoIptableSaveErrorsHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart the calico-node pod. |
| CalicoIptableRestoreErrorsHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart the calico-node pod. |
| TyphaPingLatency | Node | calico-typha | Check network connectivity to the calico-typha pods. |
| TyphaClientWriteLatency | Node | calico-typha | Verify connectivity between calico-typha and the Kubernetes API server/etcd. |
