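Reference examples of Prometheus alerting rules. [Note: because of the editor, comparison operators may not display correctly.]
Rules are defined in files. The location they are loaded from is set in the prometheus.yaml configuration file, for example: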
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rule/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"
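Here the rule files are placed in /usr/local/prometheus/server/rule/, and you can define as many alert rule files as you like. Prometheus evaluates the loaded rules periodically, at the evaluation_interval set under global. New or changed rule files are picked up by a configuration reload (sending SIGHUP to the process, or an HTTP POST to /-/reload when the server is started with --web.enable-lifecycle), so the Prometheus daemon does not need to be restarted.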
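As a rough sketch of where that global setting lives, the relevant part of the main configuration file might look like the following. The interval values are illustrative assumptions, not taken from the original setup; only the rule_files entry comes from the text above.

```yaml
# Sketch of the top of the main Prometheus configuration file.
global:
  scrape_interval: 15s       # assumed value: how often targets are scraped
  evaluation_interval: 15s   # assumed value: how often loaded rules are evaluated

rule_files:
  - "rule/*.yml"             # relative to the directory of this config file
```

Before reloading, a rule file can be checked for syntax errors with `promtool check rules <file>`.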
The commonly used rules are recorded below.
A Prometheus alerting rule is similar to a Zabbix trigger. The following rule is used to explain its parts:
groups:
- name: tcp port check
  rules:
  - alert: tcp_port_check failed
    for: 5s
    expr: probe_success{job="tcp_port_check"} == 0
    labels:
      serverity: critical
    annotations:
      description: "TCP check failed for {{ $labels.app }} in {{ $labels.group }}; current probe_success value is {{ $value }}"
      summary: "Port check failed for application {{ $labels.app }} in group {{ $labels.group }}"
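(1) name: the name of the rule group. It can be seen on the Status > Rules page of the Prometheus web UI.
(2) alert: the alert title. The notification plugin displays it as the title, so it is clear at a glance which service is affected. All alerts defined under these rules can also be seen on the Alerts page of Prometheus.
(3) for: the hold duration. The expression must stay true for this long before the alert moves from pending to firing. How often data is scraped and rules are evaluated is controlled separately, by scrape_interval and evaluation_interval.
(4) expr: the trigger expression, the core of the alert. It defines which metric of which target fires the alert, and at which value.
(5) labels: the alert level, similar to severity levels in Zabbix. Note that the examples here spell the label serverity; the examples in the Prometheus documentation use severity.
(6) annotations: descriptive text (description and summary) that makes the alert easier to read and to maintain later.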
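The six parts listed above can be mapped onto a rule file roughly as follows. This is a minimal annotated sketch: the group name, alert name, and the 1m hold time are hypothetical placeholders, and `up == 0` is borrowed from the Linux rules further down.

```yaml
groups:
  - name: example_group            # (1) rule group name (hypothetical)
    rules:
      - alert: ExampleTargetDown   # (2) alert title shown in notifications (hypothetical)
        for: 1m                    # (3) expr must hold this long: pending -> firing
        expr: up == 0              # (4) trigger expression
        labels:
          severity: critical       # (5) alert level, attached as a label
        annotations:               # (6) human-readable text for the notification
          summary: "{{ $labels.instance }} is down"
          description: "up is {{ $value }} for {{ $labels.instance }}"
```

A longer `for` trades detection speed for fewer false alarms: a single failed evaluation only puts the alert into the pending state.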
1. Basic Linux resource monitoring, such as CPU, memory, network interfaces, and disks. You can also define other rules of your own with PromQL.
[root@cn-hz-21yunwei-devops rule]# cat linux_rule.yml
groups:
- name: linux_alert
  rules:
  - alert: "linux load5 over 5"
    for: 5s
    expr: node_load5 > 5
    labels:
      serverity: critical
    annotations:
      description: "{{ $labels.app }} over 5, current value: {{ $value }}"
      summary: "linux load5 over 5"
  - alert: "node exporter is down"
    for: 5s
    expr: up==0
    labels:
      serverity: critical
    annotations:
      description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
      summary: "node exporter value equals 0"
  - alert: "cpu used percent over 80% per 1 min"
    for: 5s
    expr: 100 * (1 - avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[1m]))) * on(instance) group_left(hostname) node_uname_info > 80
    labels:
      serverity: critical
    annotations:
      description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
      summary: "cpu used percent over 80% per 1 min"
  - alert: "memory used percent over 85%"
    for: 5m
    expr: ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes{instance!~"172..*"})) * 100 > 85
    labels:
      serverity: critical
    annotations:
      description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
      summary: "memory used percent over 85%"
  - alert: "eth0 input traffic network over 10M"
    for: 3m
    expr: sum by(instance) (irate(node_network_receive_bytes_total{device="eth0",instance!~"172.1.*|172..*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
    labels:
      serverity: critical
    annotations:
      description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
      summary: "eth0 input traffic network over 10M"
  - alert: "eth0 output traffic network over 10M"
    for: 3m
    expr: sum by(instance) (irate(node_network_transmit_bytes_total{device="eth0",instance!~"172.1.*|175.*"}[1m]) / 128/1024) * on(instance) group_left(hostname) node_uname_info > 10
    labels:
      serverity: critical
    annotations:
      description: "{{ $labels.app }} -- {{ $labels.instance }}, current value: {{ $value }}"
      summary: "eth0 output traffic network over 10M"
  - alert: "disk usage over 80%"
    for: 10m
    expr: (node_filesystem_size_bytes{device=~"/dev/.+"} - node_filesystem_free_bytes{device=~"/dev/.+"} )/ node_filesystem_size_bytes{device=~"/dev/.+"} * 100 > 80
    labels:
      serverity: critical
    annotations:
      description: "partition {{ $labels.mountpoint }} over 80%, current value: {{ $value }}"
      summary: "disk usage over 80%"
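The `/ 128/1024` in the network rules is easy to misread. It converts bytes per second to Mbit per second (1024-based): multiplying by 8 and dividing by 1024 twice is the same as dividing by 128 and then by 1024. A sketch of the inbound rule with the conversion spelled out is shown below; the group name is hypothetical, and the instance filter and the node_uname_info join are left out to keep it short.

```yaml
groups:
  - name: linux_alert_readable   # hypothetical group name
    rules:
      - alert: "eth0 input traffic network over 10M"
        for: 3m
        # bytes/s * 8        -> bit/s
        # bit/s / 1024 / 1024 -> Mbit/s (1024-based)
        # 8 / 1024 / 1024 == 1 / 128 / 1024, so this matches "/ 128/1024"
        expr: sum by(instance) (irate(node_network_receive_bytes_total{device="eth0"}[1m])) * 8 / 1024 / 1024 > 10
        labels:
          serverity: critical
```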
2. ICMP monitoring (mainly used to determine whether a target is online, or whether there is network jitter)
[root@cn-hz-21yunwei-devops rule]# cat check_icmp_rule.yml
groups:
- name: icmp check
  rules:
  - alert: icmp_check failed
    for: 5s
    expr: probe_success{job="icmp_check"} == 0
    labels:
      serverity: critical
    annotations:
      description: "ICMP check failed for {{ $labels.hostname }} in {{ $labels.group }}; current probe_success value is {{ $value }}"
      summary: "Server {{ $labels.hostname }} in group {{ $labels.group }} is unreachable"
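3. Port monitoring (checks whether the socket on a given port accepts connections; the port usually belongs to a service daemon such as a Go process, MySQL, MongoDB, or Redis)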
[root@cn-hz-21yunwei-devops rule]# cat check_port_rule.yml
groups:
- name: tcp port check
  rules:
  - alert: tcp_port_check failed
    for: 5s
    expr: probe_success{job="tcp_port_check"} == 0
    labels:
      serverity: critical
    annotations:
      description: "TCP check failed for {{ $labels.app }} in {{ $labels.group }}; current probe_success value is {{ $value }}"
      summary: "Port check failed for application {{ $labels.app }} in group {{ $labels.group }}"
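The rule above depends on a scrape job named tcp_port_check that attaches group and app labels to each target. The original does not show that job. Assuming probe_success comes from blackbox_exporter running locally on its default port 9115, a scrape configuration along these lines would produce matching series. The target address and label values are hypothetical.

```yaml
scrape_configs:
  - job_name: "tcp_port_check"        # must match job="tcp_port_check" in the rule
    metrics_path: /probe
    params:
      module: [tcp_connect]           # module defined in the blackbox_exporter config
    static_configs:
      - targets: ["10.0.0.10:6379"]   # hypothetical host:port to probe
        labels:
          group: "demo"               # hypothetical; used as {{ $labels.group }}
          app: "redis"                # hypothetical; used as {{ $labels.app }}
    relabel_configs:
      # Pass the original target to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed address as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to blackbox_exporter (assumed address)
      - target_label: __address__
        replacement: 127.0.0.1:9115
```

The ICMP and URL checks would follow the same pattern with a different job_name, module, and labels.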
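4. URL monitoring
This usually checks whether a given URL is reachable: the URL is requested directly, and the service is considered healthy if the response status code is 200, 301, or 302.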
[root@cn-hz-21yunwei-devops rule]# cat check_url_rule.yml
groups:
- name: httpd url check
  rules:
  - alert: http_url_check failed
    for: 5s
    expr: probe_success{job="http_url_check"} == 0
    labels:
      serverity: critical
    annotations:
      description: "URL check failed for {{ $labels.app }} in {{ $labels.group }}; current probe_success value is {{ $value }}"
      summary: "URL endpoint check failed for application {{ $labels.app }} in group {{ $labels.group }}"
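Whether probe_success is 1 for a 200, 301, or 302 response is decided by the prober, not by the rule. Assuming blackbox_exporter is the prober, a module roughly like the following would accept exactly those status codes. The module name and timeout are hypothetical and not part of the original setup.

```yaml
modules:
  http_200_301_302:            # hypothetical module name, referenced via params.module
    prober: http
    timeout: 5s                # assumed value
    http:
      valid_status_codes: [200, 301, 302]
      # Redirects must not be followed, otherwise the final status is checked
      # instead of the 301/302 itself. This is the field name in recent
      # blackbox_exporter releases; older releases used no_follow_redirects.
      follow_redirects: false
```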
When reposting, please credit: 21运维 » Prometheus alerting rules reference (notes)