Catapult Rules Alarms
Catapult has 56 built-in rules. Each rule falls into one of the categories below.
The following sections summarize each rule: where the data is collected, the component that is monitored, and the action required if the alert is received.
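As an illustration of how the four fields of a rule relate to each other, the sketch below models a rule as a small record and shows two lookups over it. The `Rule` class and both helper functions are hypothetical names invented for this example, not part of Catapult, and only a few rows from the tables are included.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Rule:
    """One row of a rule table (hypothetical model, not a Catapult type)."""
    name: str
    collection_point: str  # where the data is collected
    component: str         # the monitored component
    action: str            # the suggested action


# A few sample rows copied from the tables in this document; not the full set.
RULES = [
    Rule("KubePodCrashLooping", "Cluster", "kube state metrics",
         "Review logs on the crashed pod."),
    Rule("HostHighCpuLoad", "Node", "Node exporter",
         "High CPU usage on node. Check processes or responsible container."),
    Rule("ClusterStatusNotOK", "PMK", "SaaS Mgmt Plane",
         "Cluster status is not ok in PMK Mgmt Plane DB."),
]


def rules_by_collection_point(rules: List[Rule]) -> Dict[str, List[str]]:
    """Group rule names by their data collection point."""
    grouped: Dict[str, List[str]] = {}
    for rule in rules:
        grouped.setdefault(rule.collection_point, []).append(rule.name)
    return grouped


def action_for(rules: List[Rule], name: str) -> Optional[str]:
    """Return the suggested action for a rule name, or None if unknown."""
    for rule in rules:
        if rule.name == name:
            return rule.action
    return None


if __name__ == "__main__":
    print(rules_by_collection_point(RULES))
    print(action_for(RULES, "HostHighCpuLoad"))
```

A lookup like `action_for` could be used to attach the suggested action to an incoming alert notification, assuming the alert carries the rule name.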
etcd
Below is a list of the rule names, where the data is collected from, the monitored components, and the suggested action to remedy the issue.
| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| EtcdBackupJobFailed | Cluster | etcd | Review etcd backup task. Possibly lack of disk space. |
| EtcdDown | Cluster | etcd | Verify if nodelet can repair etcd. Check for CPU, memory, or storage issues on the node. Review logs of etcd container. |
| EtcdInsufficientMembers | Cluster | etcd | Etcd quorum lacks an odd number of masters. |
| EtcdNoLeader | Cluster | etcd | Etcd quorum has no leader. Investigate network issues preventing etcd members from reaching consensus. |
| EtcdHighNumberOfLeaderChanges | Cluster | etcd | If the leader of the etcd quorum is changing frequently, verify master nodes are not constantly failing. Possible network issues between master nodes. |
| EtcdHighNumberOfFailedGrpcRequests | Cluster | etcd | Investigate network issues causing failed gRPC requests in the etcd quorum. Review etcd container logs. |
| EtcdHighNumberOfFailedProposals | Cluster | etcd | Inspect network issues causing failed proposals in the etcd quorum. Check etcd container logs. |
| EtcdHighFsyncDurations | Cluster | etcd | Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk. |
| EtcdHighCommitDurations | Cluster | etcd | Review disk or storage issues which may cause problems when an etcd member tries to commit data to disk. |
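The EtcdInsufficientMembers action refers to an odd number of masters. The reason is the majority arithmetic etcd uses for consensus: a cluster of n members needs floor(n/2) + 1 of them available, so adding an even-numbered member does not increase the number of failures that can be tolerated. The sketch below works through that arithmetic only; it is not the expression the Catapult rule evaluates, and the function names are made up for this example.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum_size(members)


if __name__ == "__main__":
    # Compare odd and even cluster sizes side by side.
    for n in range(1, 6):
        print(f"members={n} quorum={quorum_size(n)} "
              f"tolerates={failures_tolerated(n)}")
```

Three and four members both tolerate a single failure, which is why master counts of 1, 3, or 5 are the usual choice.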
Kubernetes API Monitoring
| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| KubeAPIServerDown | Cluster | Kube API server | Verify API Server is responding on the node. |
| KubernetesApiServerErrors | Cluster | Kube API server | Check for API Server overload. Review logs. |
| KubernetesApiClientErrors | Cluster | Kube API server | Check for API Server overload. Survey logs to audit client requests. |
| KubeSchedulerDown | Cluster | Kube Scheduler | Audit logs for k8s master pod restarts. |
| KubeControllerManagerDown | Cluster | kube controller | Audit logs for k8s master pod restarts. |
| KubeProxyDown | Cluster | kube proxy | Verify kube proxy container is running on the node. |
| KubeProxyRuleSyncLatency | Cluster | kube proxy | Explore kube proxy overload via logs. |
Kube State
| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| KubeNodeNotReady | Cluster | kube state metrics | Review issues with kubelet on the node. |
| KubernetesMemoryPressure | Cluster | kube state metrics | Check available memory on the node. |
| KubernetesDiskPressure | Cluster | kube state metrics | Check available space on the node. |
| KubernetesJobFailed | Cluster | kube state metrics | Check logs for failed jobs. |
| KubernetesContainerTerminated | Cluster | kube state metrics | K8s container was terminated, possibly by the OOM killer. |
| KubePodCrashLooping | Cluster | kube state metrics | Review logs on the crashed pod. |
| KubePodNotReady | Cluster | kube state metrics | Pod may be in pending state, or awaiting a resource. |
| KubeDeploymentReplicasMismatch | Cluster | kube state metrics | K8s deployment has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints. |
| KubeStatefulSetReplicasMismatch | Cluster | kube state metrics | K8s StatefulSet has not reconciled the expected number of replicas. Possibly related to CPU, memory requests, or node taints. |
| KubeDaemonSetRolloutStuck | Cluster | kube state metrics | K8s DaemonSet has not reconciled the expected number of pods. Possibly related to CPU, memory requests, or node taints. |
| KubernetesPersistentvolumeclaimPending | Cluster | kube state metrics | PVC is in a pending state as K8s was unable to create a PV. Verify details of storage class, CSI driver, etc. |
| KubernetesPersistentvolumeError | Cluster | kube state metrics | PV is in an error state. Check storage class or CSI drivers, and logs. |
Node OS
| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| HostHighCpuLoad | Node | Node exporter | High CPU usage on node. Check processes or responsible container. |
| HostOutOfMemory | Node | Node exporter | High memory usage on node. Check processes or responsible container. |
| HostMemoryUnderMemoryPressure | Node | Node exporter | High memory usage on node. Check processes or responsible container. |
| NodeFilesystemAlmostOutOfSpace | Node | Node exporter | Node out of disk space. |
| NodeFilesystemAlmostOutOfFiles | Node | Node exporter | Node out of inodes. |
| HostUnusualNetworkThroughputIn | Node | Node exporter | Host is experiencing unusual inbound network throughput. |
| HostUnusualNetworkThroughputOut | Node | Node exporter | Host is experiencing unusual outbound network throughput. |
| NodeNetworkReceiveErrs | Node | Node exporter | Host is experiencing network receive errors. |
| NodeNetworkTransmitErrs | Node | Node exporter | Host is experiencing network transmit errors. |
| HostUnusualDiskWriteRate | Node | Node exporter | Host is experiencing unusual disk write rate. |
| HostUnusualDiskReadRate | Node | Node exporter | Host is experiencing unusual disk read rate. |
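When a NodeFilesystemAlmostOutOfSpace or NodeFilesystemAlmostOutOfFiles alert arrives, a quick check on the node itself can confirm whether blocks or inodes are the scarce resource. The sketch below uses Python's standard `os.statvfs` (Unix only). The 90% threshold and the function names are assumptions chosen for illustration; the thresholds Catapult actually uses are not stated in this document.

```python
import os
from typing import Dict, List


def filesystem_usage(path: str = "/") -> Dict[str, float]:
    """Return percentage of disk space and inodes in use for `path`."""
    st = os.statvfs(path)
    # Some filesystems report zero totals; treat those as 0% used.
    space_used = 1 - st.f_bavail / st.f_blocks if st.f_blocks else 0.0
    inodes_used = 1 - st.f_favail / st.f_files if st.f_files else 0.0
    return {
        "space_used_pct": round(100 * space_used, 1),
        "inodes_used_pct": round(100 * inodes_used, 1),
    }


def triage(usage: Dict[str, float], threshold_pct: float = 90.0) -> List[str]:
    """List which of the two filesystem alerts the usage would explain.

    `threshold_pct` is a hypothetical cutoff, not Catapult's own value.
    """
    findings = []
    if usage["space_used_pct"] >= threshold_pct:
        findings.append("NodeFilesystemAlmostOutOfSpace")
    if usage["inodes_used_pct"] >= threshold_pct:
        findings.append("NodeFilesystemAlmostOutOfFiles")
    return findings


if __name__ == "__main__":
    usage = filesystem_usage("/")
    print(usage, triage(usage))
```

Checking inodes separately matters because a filesystem holding many small files can run out of inodes while still reporting free space.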
Environment Status
| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| ClusterStatusNotOK | PMK | SaaS Mgmt Plane | Cluster status is not ok in PMK Mgmt Plane DB. |
| NodeNotReady | PMK | SaaS Mgmt Plane | K8s node has entered a not ready state as noted by a HostAgent extension. |
| K8sApiNotResponding | PMK | SaaS Mgmt Plane | PMK Mgmt Plane cannot reach k8s apiserver. |
| WorkerNodeNotResponding | PMK | SaaS Mgmt Plane | PMK Mgmt Plane cannot reach worker node. |
Node Connectivity
| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| Host Availability | PMK | SaaS Mgmt Plane | Node heartbeat is failing. This could be caused by a node outage or a failed service. Review the node to ensure that all Platform9 services are running. |
| Hosts disconnected | PMK | SaaS Mgmt Plane | Partial node availability: the heartbeat is passing. Review the node to ensure that all Platform9 services are running. |
| host-down | PMK | SaaS Mgmt Plane | The node is completely disconnected from Platform9. Ensure the node is running and all services are operating. |
Managed Add-ons
| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| AddonNotHealthy | PMK | Nodelet | ClusterAddon’s .status.health shows addon is not healthy. |
| AddonNotConverging | PMK | Nodelet | ClusterAddon’s .status.phase is not in Installed state. |
| AddonInstallError | PMK | Nodelet | ClusterAddon’s .status.phase is in an InstallError state. |
| AddonUninstallError | PMK | Nodelet | ClusterAddon’s .status.phase is in an UninstallError state. |
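The four add-on rules are all driven by two fields of a ClusterAddon's status. The sketch below applies the conditions literally as worded in this section to a status represented as a plain dictionary. It rests on several assumptions: the `health` value is treated as a boolean (its real type is not given here), the function name is invented, and any timing conditions the real rules may apply are ignored.

```python
from typing import Dict, List


def addon_alerts(status: Dict[str, object]) -> List[str]:
    """Return the add-on rule names whose condition matches `status`.

    `status` mirrors a ClusterAddon's .status with `phase` and `health`
    keys. Treating `health` as a boolean is an assumption of this sketch.
    """
    alerts = []
    phase = status.get("phase")
    if not status.get("health", False):
        alerts.append("AddonNotHealthy")
    if phase != "Installed":
        alerts.append("AddonNotConverging")
    if phase == "InstallError":
        alerts.append("AddonInstallError")
    if phase == "UninstallError":
        alerts.append("AddonUninstallError")
    return alerts


if __name__ == "__main__":
    print(addon_alerts({"phase": "Installed", "health": True}))
    print(addon_alerts({"phase": "InstallError", "health": False}))
```

Read literally, an add-on in the InstallError phase also satisfies the AddonNotConverging condition, so more than one alert may be expected for the same add-on.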
Calico
| Rule Name | Data Collection Point | Monitored Component | Action Required |
| --- | --- | --- | --- |
| PromHTTPRequestErrors | Node | calico-felix | Prometheus is unable to pull data from Calico. Check for underlying network or calico-node pod issues. |
| CalicoDatapaneFailuresHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart the calico-node pod. |
| CalicoIpsetErrorsHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart the calico-node pod. |
| CalicoIptableSaveErrorsHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart the calico-node pod. |
| CalicoIptableRestoreErrorsHigh | Node | calico-felix | Calico-node pod is congested. Reduce load or restart the calico-node pod. |
| TyphaPingLatency | Node | calico-typha | Check network connectivity to the calico-typha pods. |
| TyphaClientWriteLatency | Node | calico-typha | Verify connectivity between calico-typha and the Kubernetes API server/etcd. |
