Prometheus Alerting Rules Reference: Notes

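Reference examples of Prometheus alerting rules. (If comparison operators appear as HTML entities anywhere below, that is an editor rendering artifact; read them as the plain operators.)
Rules are defined in files. The locations these rule files are loaded from are set in the prometheus.yaml configuration file, for example: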

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
    - "rule/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

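Here the rule files live under /usr/local/prometheus/server/rule/, and you can define as many alerting rule files as you like. Prometheus evaluates the loaded rules periodically, at the evaluation_interval set under global. The rule files themselves are read at startup and on a configuration reload, so after adding or changing a file, trigger a reload (send SIGHUP to the process, or POST to /-/reload if Prometheus was started with --web.enable-lifecycle). The Prometheus daemon does not need to be restarted.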
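As a minimal sketch of how these settings fit together, the fragment below shows the global block next to rule_files. The two interval values are illustrative assumptions, not values taken from the setup described here.

```yaml
# prometheus.yaml (fragment); interval values are illustrative assumptions
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often every rule group is evaluated

rule_files:
  # Relative paths are resolved against the directory that holds this config file,
  # so this matches /usr/local/prometheus/server/rule/*.yml in the layout above.
  - "rule/*.yml"
```

With this layout, dropping a new .yml file into the rule directory and triggering a reload is enough for it to be picked up by the glob.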

Below are the rules we commonly use.
A Prometheus alerting rule plays a role similar to a Zabbix trigger. The following rule is used to explain each field:

groups:
  - name: tcp port check
    rules:
      - alert: tcp_port_check failed
        for: 5s
        expr: probe_success{job="tcp_port_check"} == 0
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} in {{ $labels.group }}: TCP check failed, current probe_success value is {{ $value }}"
          summary: "Group {{ $labels.group }}, application {{ $labels.app }}: port unreachable"

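(1) name: the rule group name. It is shown on the Rules page under the Status menu of the Prometheus web UI.
(2) alert: the alert name, which serves as the alert title. Notification plugins display it as the title, so it should make the affected service obvious at a glance. Every alert defined in these rules can also be seen on the Alerts page of Prometheus.
(3) for: the hold duration. The expression must stay true for this long before the alert moves from pending to firing. It is not the scrape or evaluation period; those are controlled by scrape_interval and evaluation_interval.
(4) expr: the trigger expression, the core of the alert. It specifies which metric of which target is compared against which value.
(5) labels: extra labels attached to the alert, used here to set the alert level, similar to severity levels in Zabbix.
(6) annotations: descriptive text (summary and description) that makes alerts easier to read and maintain later.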
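To make the relationship between the evaluation period and for concrete, here is a small annotated sketch. The group name, alert name, job name, and durations are all hypothetical and chosen only for illustration.

```yaml
groups:
  - name: example_group            # hypothetical group name
    interval: 30s                  # optional: overrides global evaluation_interval for this group
    rules:
      - alert: ExampleTargetDown   # hypothetical alert name
        expr: up{job="node"} == 0  # job="node" is an assumption about the scrape config
        for: 2m                    # must stay true for 2m: pending first, then firing
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "up is {{ $value }} and has been for at least 2 minutes"
```

With these values the expression is evaluated every 30 seconds, and the alert only fires once the expression has been true on every evaluation across two minutes. A very short for, such as 5s, is shorter than most evaluation intervals, so in practice the alert fires on the next evaluation after the first failure.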

1. Basic Linux resource monitoring, such as CPU, memory, network interfaces, and disks. You can also write other rules of your own with PromQL.

[root@cn-hz-21yunwei-devops rule]# cat  linux_rule.yml 
groups: 
  - name: linux_alert
    rules: 
      - alert: "linux load5 over 5"
        for: 5s
        expr: node_load5 > 5
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} load5 over 5, current value: {{ $value }}"
          summary: "linux load5 over 5"

      - alert: "node exporter is down"
        for: 5s
        expr: up == 0
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
          summary: "node exporter up value equals 0"

      - alert: "cpu used percent over 80% per 1 min"
        for: 5s
        expr: 100 * (1 - avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[1m]))) * on(instance) group_left(hostname) node_uname_info > 80
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
          summary: "cpu used percent over 80% per 1 min"

      - alert: "memory used percent over 85%"
        for: 5m
        expr: ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes{instance!~"172..*"})) * 100 > 85
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
          summary: "memory used percent over 85%"

      - alert: "eth0 input traffic network over 10M"
        for: 3m
        # bytes/s divided by 128 and then 1024 equals Mbit/s (bytes * 8 / 1024 / 1024)
        expr: sum by(instance) (irate(node_network_receive_bytes_total{device="eth0",instance!~"172.1.*|172..*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
          summary: "eth0 input traffic network over 10M"

      - alert: "eth0 output traffic network over 10M"
        for: 3m
        # bytes/s divided by 128 and then 1024 equals Mbit/s (bytes * 8 / 1024 / 1024)
        expr: sum by(instance) (irate(node_network_transmit_bytes_total{device="eth0",instance!~"172.1.*|175.*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
          summary: "eth0 output traffic network over 10M"

      - alert: "disk usage over 80%"
        for: 10m
        expr: (node_filesystem_size_bytes{device=~"/dev/.+"} - node_filesystem_free_bytes{device=~"/dev/.+"}) / node_filesystem_size_bytes{device=~"/dev/.+"} * 100 > 80
        labels:
          severity: critical
        annotations:
          description: "partition {{ $labels.mountpoint }} over 80%, current value: {{ $value }}"
          summary: "disk usage over 80%"
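As an example of setting other rules with PromQL, the sketch below shows two possible variants of the checks above. They are not part of the original rule file. The group name, alert names, thresholds, and for durations are assumptions, and node_memory_MemAvailable_bytes is only exposed by node_exporter on kernels that report MemAvailable.

```yaml
groups:
  - name: linux_alert_variants       # hypothetical group name
    rules:
      # Load normalized by CPU count, so one threshold fits hosts of different sizes.
      - alert: "linux load5 per core over 1.5"   # threshold is an assumption
        for: 5m
        expr: node_load5 / on(instance) group_left() count by(instance) (node_cpu_seconds_total{mode="idle"}) > 1.5
        labels:
          severity: warning
        annotations:
          description: "{{ $labels.instance }} load5 per core, current value: {{ $value }}"
          summary: "linux load5 per core over 1.5"

      # Memory usage based on the kernel's own estimate of available memory.
      - alert: "memory used percent over 85% (MemAvailable)"
        for: 5m
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.instance }}, current value: {{ $value }}"
          summary: "memory used percent over 85%"
```

The first rule counts the idle-mode CPU series per instance to get the number of cores, and group_left() keeps the labels of node_load5 on the result so they remain usable in annotations.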

2. ICMP monitoring (mainly used to tell whether a target is online or whether the network is unstable)

[root@cn-hz-21yunwei-devops rule]# cat check_icmp_rule.yml 
groups:
  - name: icmp check
    rules:
      - alert: icmp_check failed
        for: 5s
        expr: probe_success{job="icmp_check"} == 0
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.hostname }} in {{ $labels.group }}: ICMP check failed, current probe_success value is {{ $value }}"
          summary: "Group {{ $labels.group }}, server {{ $labels.hostname }}: host unreachable"
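The rule above relies on a probe_success series with job="icmp_check" and on group and hostname labels. The notes do not show where these come from; the sketch below assumes the probes are served by blackbox_exporter on its default port 9115, and the target address and label values are placeholders.

```yaml
# prometheus.yaml (fragment); a sketch assuming blackbox_exporter on 127.0.0.1:9115
scrape_configs:
  - job_name: "icmp_check"
    metrics_path: /probe
    params:
      module: [icmp]                     # module name as defined in blackbox.yml
    static_configs:
      - targets: ["192.0.2.10"]          # placeholder host to ping
        labels:
          group: "example-group"         # placeholder, used in the annotations
          hostname: "example-host"       # placeholder, used in the annotations
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target     # pass the host as ?target=...
      - source_labels: [__param_target]
        target_label: instance           # keep the probed host as the instance label
      - target_label: __address__
        replacement: "127.0.0.1:9115"    # scrape the exporter, not the host itself
```

The same pattern would apply to the TCP and HTTP checks by changing job_name, the module, and the target format.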

3. Port monitoring (checks whether a TCP port accepts connections; a port usually corresponds to a service daemon such as a Go process, MySQL, MongoDB, or Redis)

[root@cn-hz-21yunwei-devops rule]# cat  check_port_rule.yml 
groups:
  - name: tcp port check
    rules:
      - alert: tcp_port_check failed
        for: 5s
        expr: probe_success{job="tcp_port_check"} == 0
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} in {{ $labels.group }}: TCP check failed, current probe_success value is {{ $value }}"
          summary: "Group {{ $labels.group }}, application {{ $labels.app }}: port unreachable"

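4. URL monitoring
This usually checks whether a given URL is reachable: the URL is requested directly, and a 200, 301, or 302 status code in the response is taken to mean the service is healthy.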

[root@cn-hz-21yunwei-devops rule]# cat  check_url_rule.yml 
groups:
  - name: httpd url check
    rules:
      - alert: http_url_check failed
        for: 5s
        expr: probe_success{job="http_url_check"} == 0
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.app }} in {{ $labels.group }}: URL check failed, current probe_success value is {{ $value }}"
          summary: "Group {{ $labels.group }}, application {{ $labels.app }}: URL endpoint unreachable"
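Whether a 301 or 302 counts as success is decided by the prober, not by the alerting rule. Assuming the probes come from blackbox_exporter, a module along the following lines would treat 200, 301, and 302 as healthy. The module name is hypothetical, and follow_redirects is the option name in recent blackbox_exporter releases (older releases used no_follow_redirects instead).

```yaml
# blackbox.yml (fragment); a sketch, module name is hypothetical
modules:
  http_200_or_redirect:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200, 301, 302]  # responses treated as success
      follow_redirects: false              # judge the first response instead of following it
```

Without follow_redirects: false the prober follows the redirect and evaluates the final response, so the 301 or 302 itself would never be what is checked.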

When reposting, please credit: 21运维 » Prometheus Alerting Rules Reference: Notes
