# Nodes Connectivity Monitoring

## Rules for the Hostagent

The Prometheus agent is configured to report numerous node-based metrics. Below is the YAML-based rule set that Catapult uses, including alert names, expression, timeframe, annotations which contain the summary and description of the notification as well as the SaaS management plane and hostname.

{% tabs %}
{% tab title="YAML" %}

```yaml
bash-5.0# cat resmgr.yml
groups:
  - name: Host Availability Alerts
    rules:
    # du_sidekick_last_checkin_seconds will be the time in seconds since a host checked in via sidekick
    # or -1 if the host is not reported by sidekick at all. The latter case may occur if
    # sidekick has restarted and a host hasn't connected since.
    - expr: >-2
        (
          sum by (host_id,host_name) (
            time() - sidekick_host_last_heartbeat_time{job="sidekickserver"} / 1000
          )
        )
        OR
        ignoring(host_name) (
          sum by (host_id,host_name) (
            (resmgr_host_up{job="resmgr"} == 0) - 1
          )
        )
      record: du_sidekick_last_checkin_seconds


  - name: Hosts disconnected
    rules:
    # resmgr says it's been down for at least 10m, sidekick says still reporting
    # NOTE: the `for` delay _must_ exceeed the cutoff (currently 600 seconds) + the scrape
    # period (1m) or else both this and host-down will fire off simultaneously
    - alert: host-disconnected
      expr: >-2
        sum by (du, host_id, host_name) (du_sidekick_last_checkin_seconds < 600) AND
        ON(host_id) resmgr_host_up{job="resmgr"} == 0
      for: 15m
      annotations:
        summary: host-disconnected
        description: "{{ $labels.host_name }} disconnected from control plane {{ $labels.du }} for more than 10 minutes"
        du: "{{ $labels.du }}"
        host_name: "{{ $labels.host_name }}"


    # resmgr says it's been down for at least 10m, sidekick "agrees"
    - alert: host-down
      expr: >-2
        sum by (du, host_id, host_name) (du_sidekick_last_checkin_seconds >= 600 OR du_sidekick_last_checkin_seconds == -1) AND
        ON(host_id) resmgr_host_up{job="resmgr"} == 0
      for: 10m
      annotations:
        summary: host-down
        description: "{{ $labels.host_name }} down {{ $labels.du }} for more than 10 minutes"
        du: "{{ $labels.du }}"
        host_name: "{{ $labels.host_name }}"
bash-5.0#
```

{% endtab %}
{% endtabs %}
