Kubernetes Monitoring Series, Part 04: Prometheus Alert Handling

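Introduction to Prometheus Alerting

The default alerting rules live in prometheus-rules.yaml. Prometheus evaluates these rules and sends the resulting alerts to Alertmanager.

The Alertmanager configuration is stored in a Secret named alertmanager-main: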

kubectl -n monitoring get secret alertmanager-main -o yaml

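Run echo "$CONTENT" | base64 -d on the encoded value to view the actual configuration. An annotated template is given further below; the configuration reference is: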

https://prometheus.io/docs/alerting/configuration/#email_config
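
The decode step can also be scripted. The following is a minimal sketch, assuming the Secret was exported with kubectl -n monitoring get secret alertmanager-main -o json and that the configuration sits under the key alertmanager.yaml (an assumption based on the file name used later in this article). The function name and the sample data are hypothetical.

```python
import base64
import json


def decode_secret_data(secret_json: str) -> dict:
    """Return every key of a Secret's .data map with its value base64-decoded."""
    secret = json.loads(secret_json)
    return {
        key: base64.b64decode(value).decode("utf-8")
        for key, value in secret.get("data", {}).items()
    }


# Hypothetical stand-in for the output of `kubectl ... -o json`.
sample = json.dumps({
    "kind": "Secret",
    "metadata": {"name": "alertmanager-main", "namespace": "monitoring"},
    "data": {
        "alertmanager.yaml": base64.b64encode(
            b"global:\n  resolve_timeout: 1h\n"
        ).decode("ascii"),
    },
})

decoded = decode_secret_data(sample)
print(decoded["alertmanager.yaml"])
```

In practice the sample string would be replaced by the real kubectl output, for example read from standard input.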

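# Options set in the global block act as defaults visible to every other section of this configuration file
global:
  # Every alert managed by Alertmanager is in one of two states: "firing" or "resolved".
  # After Alertmanager sends the first notification, the alert stays in the firing state.
  # resolve_timeout is how long Alertmanager waits, without receiving an update for an alert
  # that carries no end time (EndsAt), before it marks the alert as resolved. Once an alert
  # is resolved, Alertmanager stops sending firing notifications for it.
  # Alerts sent by Prometheus always include EndsAt, so this setting does not affect them.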
  resolve_timeout: 1h

  # Email alert settings
  smtp_smarthost: 'smtp.exmail.qq.com:25'
  smtp_from: 'howell@xxx.com'
  smtp_auth_username: 'howell@xxx.com'
  smtp_auth_password: 'xxxx'
  # HipChat alert settings
  # hipchat_auth_token: '123456789'
  # hipchat_auth_url: 'https://hipchat.foobar.org/'
  # WeChat Work settings
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: 'JJ'
  wechat_api_corp_id: 'ww'

# Notification templates
templates:
- '/etc/alertmanager/config/*.tmpl'

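# route: the root route. This block defines the routing tree: the settings of the root node
# and of its child routes (routes). A child route that does not set an option inherits it
# from its parent. Every alert enters the tree at the root route, so the root route must
# match all alerts and defines no matchers of its own. The alert is then tested against the
# match / match_re settings of each child route: if every label pair listed there matches
# the alert's labels (the <labelname> entries in the labels field of the alert posted to
# the Alertmanager API), the alert enters that child route; otherwise it does not.
route:
  # For example, all alerts with cluster=A and alertname=LatencyHigh are batched into a single group
  group_by: ['job', 'alertname', 'cluster', 'service','severity']
  # When a new group of alerts is created, wait group_wait before sending the first notification,
  # so that alerts firing in quick succession are merged into a single notification
  group_wait: 30s
  # How long to wait before sending a notification about new alerts added to a group
  # that has already been notified
  group_interval: 5m
  # If a notification was sent successfully and the alert is still not resolved after
  # repeat_interval, the notification is sent again
  repeat_interval: 12h
  # Default receiver: alerts that match none of the child routes are sent here
  receiver: 'wechat'
  # The route settings above are inherited by child routes unless a child route overrides them

  # Child routes
  routes:
  # match_re matches alert labels against regular expressions to decide whether the alert enters this child route
  # Here match_re and match both test the service label against the given values; matching alerts are sent to the corresponding receiver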
  - match_re:
      service: ^(foo1|foo2|baz)$
    receiver: 'wechat'
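    # Alerts that match this service route can be routed further by their severity label:
    # alerts with severity != critical stay with this route's receiver (team-ops-mails in the
    # upstream example), while alerts with severity == critical go to the child route's receiver
    # (team-ops-pager in the upstream example). In this template both are set to 'wechat'.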
    routes:
    - match:
        severity: critical
      receiver: 'wechat'
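  # For database alerts: if no child route matches an owner label, they default to this
  # route's receiver (team-DB-pager in the upstream example)
  - match:
      service: database
    receiver: 'wechat'
  # Database alerts can also be selected first by the label service: database and then
  # routed further by additional labels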
  - match:
      severity: critical
    receiver: 'wechat'
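# Inhibition rules: while a critical alert is firing, suppress warning alerts.
# Because the equal list below is commented out, any firing critical alert suppresses every warning alert.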
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  # Apply inhibition if the alertname is the same.
  #   equal: ['alertname', 'cluster', 'service']
  #
# Receivers
receivers:
- name: 'team-ops-mails'
  email_configs:
  - to: 'jiahao.li@xxx.com'
- name: 'wechat'
  wechat_configs:
  - send_resolved: true
    corp_id: 'ww'
    api_secret: 'JJ'
    to_tag: '1'
    agent_id: '1000002'
    api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
    message: '{{ template "wechat.default.message" . }}'
#- name: 'team-X-pager'
#  email_configs:
#  - to: 'team-X+alerts-critical@example.org'
#  pagerduty_configs:
#  - service_key: <team-X-key>
#
#- name: 'team-Y-mails'
#  email_configs:
#  - to: 'team-Y+alerts@example.org'
#
#- name: 'team-Y-pager'
#  pagerduty_configs:
#  - service_key: <team-Y-key>
#
#- name: 'team-DB-pager'
#  pagerduty_configs:
#  - service_key: <team-DB-key>
#  
#- name: 'team-X-hipchat'
#  hipchat_configs:
#  - auth_token: <auth_token>
#    room_id: 85
#    message_format: html
#    notify: true
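
The routing behaviour described in the comments above can be sketched in a few lines of Python. This is a simplified model, not Alertmanager's implementation: it ignores the continue option, treats a missing label as an empty string, and anchors regular expressions to the whole value. The receiver names are the ones mentioned in the comments (in the template itself every route uses 'wechat'), and the name default for the root receiver is hypothetical.

```python
import re

# Route tree mirroring the template above, with distinct receiver names
# (an assumption, so that the routing decision is visible in the result).
ROOT = {
    "receiver": "default",
    "routes": [
        {
            "match_re": {"service": "^(foo1|foo2|baz)$"},
            "receiver": "team-ops-mails",
            "routes": [
                {"match": {"severity": "critical"}, "receiver": "team-ops-pager"},
            ],
        },
        {"match": {"service": "database"}, "receiver": "team-DB-pager"},
        {"match": {"severity": "critical"}, "receiver": "team-ops-pager"},
    ],
}


def route_matches(route, labels):
    """True if every match / match_re entry of the route matches the labels."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def resolve_receiver(route, labels, inherited=None):
    """Walk the tree depth-first; the first matching child route wins."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        if route_matches(child, labels):
            return resolve_receiver(child, labels, receiver)
    return receiver


print(resolve_receiver(ROOT, {"service": "foo1", "severity": "critical"}))
print(resolve_receiver(ROOT, {"service": "web"}))
```

The first call descends into the service route and then into its severity child; the second matches no child route and falls back to the root receiver.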

Structure of the prometheus-rules.yaml rule file:

groups:
- name: <string>
  rules:
  - alert: <string>
    expr: <string>
    for: [ <duration> | default = 0s ]
    labels:
      [ <label_name>: <label_value> ]
    annotations:
      [ <label_name>: <tmpl_string> ]

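| Parameter | Description |
| :-----: | :----: |
| - name: <string> | Name of the alerting rule group |
| - alert: <string> | Name of the alerting rule |
| expr: <string> | PromQL expression that defines the trigger condition; it is evaluated to check whether any series satisfies the condition |
| for | How long the condition must hold before the alert fires and a notification is sent; this guards against false positives |
| <label_name>: <label_value> | Custom labels attached to the alert, for example high or warning |
| annotations: <label_name>: <tmpl_string> | A set of descriptive information about the alert, which can include custom labels and the value computed by expr |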
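
The effect of for can be illustrated with a small simulation. This is a sketch under simplifying assumptions: a single time series, evaluation results given as (timestamp in seconds, condition true or false) pairs, and a hypothetical function name. It models the usual inactive, pending and firing states, where an alert only fires once the condition has held for at least the for duration.

```python
def alert_states(samples, for_seconds):
    """Return the alert state after each rule evaluation.

    samples: list of (timestamp_seconds, condition_is_true) tuples in time order.
    for_seconds: the rule's `for` duration in seconds.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for timestamp, condition in samples:
        if not condition:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Evaluations every 30 seconds; the condition drops out once at t=60.
samples = [(0, True), (30, True), (60, False), (90, True), (120, True), (150, True)]
print(alert_states(samples, for_seconds=60))
```

With for set to 60 seconds the short spike at the start never fires, because the condition is interrupted before it has held for a full minute.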

Sending Alerts by Email with Alertmanager

Create alertmanager.yaml and the corresponding Secret

"global":
  "resolve_timeout": "2h"
  smtp_from: "ljh****@163.com"
  smtp_smarthost: "smtp.163.com:465"
  smtp_hello: "163.com"
  smtp_auth_username: "ljh****@163.com"
  smtp_auth_password: "QSFCJMAEHQDMHCAL"
  smtp_require_tls: false
"inhibit_rules":
- "equal":
  - "namespace"
  - "alertname"
  "source_match":
    "severity": "critical"
  "target_match_re":
    "severity": "warning|info"
- "equal":
  - "namespace"
  - "alertname"
  "source_match":
    "severity": "warning"
  "target_match_re":
    "severity": "info"
"receivers":
- "name": "Default"
  "email_configs":
  - to: "ljh****@163.com"
    send_resolved: true
- "name": "Watchdog"
  "email_configs":
  - to: "ljh****@163.com"
    send_resolved: true
- "name": "Critical"
  "email_configs":
  - to: "ljh****@163.com"
    send_resolved: true
"route":
  "group_by":
  - "namespace"
  "group_interval": "1m"
  "group_wait": "30s"
  "receiver": "Default"
  "repeat_interval": "1m"
  "routes":
  - "match":
      "alertname": "Watchdog"
    "receiver": "Watchdog"
  - "match":
      "severity": "critical"
    "receiver": "Critical"

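How the two inhibit_rules in the configuration above interact with the equal list can be sketched as follows. This is a simplified model rather than Alertmanager's code: alerts are plain label dictionaries, a missing label counts as an empty string, and the function and alert names are hypothetical.

```python
import re

# The two rules from the configuration above, as plain dictionaries.
RULES = [
    {
        "source_match": {"severity": "critical"},
        "target_match_re": {"severity": "warning|info"},
        "equal": ["namespace", "alertname"],
    },
    {
        "source_match": {"severity": "warning"},
        "target_match_re": {"severity": "info"},
        "equal": ["namespace", "alertname"],
    },
]


def is_inhibited(target, firing_alerts, rules=RULES):
    """True if some firing alert suppresses `target` under one of the rules."""
    for rule in rules:
        target_ok = all(
            re.fullmatch(pattern, target.get(name, "")) is not None
            for name, pattern in rule["target_match_re"].items()
        )
        if not target_ok:
            continue
        for source in firing_alerts:
            source_ok = all(
                source.get(name, "") == value
                for name, value in rule["source_match"].items()
            )
            same_labels = all(
                source.get(name, "") == target.get(name, "")
                for name in rule["equal"]
            )
            if source_ok and same_labels:
                return True
    return False


critical = {"alertname": "PodCrashLooping", "namespace": "default", "severity": "critical"}
warning = {"alertname": "PodCrashLooping", "namespace": "default", "severity": "warning"}
other_ns = {"alertname": "PodCrashLooping", "namespace": "kube-system", "severity": "warning"}

print(is_inhibited(warning, [critical]))
print(is_inhibited(other_ns, [critical]))
```

The warning in the same namespace is suppressed, while the one in kube-system is not, because the equal list requires namespace and alertname to be identical on both alerts.
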
Create the Secret from alertmanager.yaml

kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring

After later changes to alertmanager.yaml, update the Kubernetes Secret in place; Alertmanager then hot-reloads the new configuration:

kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring --dry-run=client -o yaml | kubectl replace -f -

Check the logs to confirm that the hot reload succeeded:

kubectl logs -f alertmanager-main-0 -n monitoring -c alertmanager

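Check the mailbox for the alert email

Sending Alerts through WeChat Work

Log in to WeChat Work

Add a sub-department and note its department ID

Look up the corp ID (enterprise ID)

alertmanager.yaml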

global:
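  resolve_timeout: 1m   # An alert with no end time that receives no update for 1 minute is treated as resolved
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'    # WeChat Work API URL; no change needed
  wechat_api_corp_id: 'xxxxxxxxxx'      # Corp ID of the WeChat Work account
  wechat_api_secret: 'xxxxxxxxxxxxx'      # Secret of the Prometheus application in WeChat Work

templates:
  - '/etc/alertmanager/config/*.tmpl'       # WeChat notification templates for Alertmanager

route:
  receiver: 'wechat'
  group_by: ['env','instance','type','group','job','alertname']
  group_wait: 10s       # Delay before the first notification for a new group
  group_interval: 10s   # Wait before notifying about new alerts added to an already-notified group
  repeat_interval: 5m   # Interval for re-sending a notification that is still firing

receivers:
- name: 'wechat'
  wechat_configs: 
  - send_resolved: true
    message: '{{ template "wechat.default.message" . }}'
    to_party: '2'         # ID of the WeChat Work department that receives alerts
    agent_id: '1000002'     # AgentId of the application created in WeChat Work
    api_secret: 'xxxxxxxxxxxxx'      # Secret of the Prometheus application in WeChat Work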
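
To make the effect of group_by concrete, the sketch below buckets alerts by the labels listed in the route above. It is a simplified illustration with hypothetical alert data and function name: alerts that agree on every group_by label land in the same group and are delivered in one notification, even if they differ in other labels.

```python
from collections import defaultdict

# Same label list as the group_by setting in the route above.
GROUP_BY = ["env", "instance", "type", "group", "job", "alertname"]


def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alerts (label dictionaries) by their values for the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # A missing label is treated as an empty value.
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical alerts: the first two differ only in `device`, which is not a group_by label.
alerts = [
    {"alertname": "DiskFull", "instance": "node1", "job": "node", "device": "sda"},
    {"alertname": "DiskFull", "instance": "node1", "job": "node", "device": "sdb"},
    {"alertname": "DiskFull", "instance": "node2", "job": "node", "device": "sda"},
]

groups = group_alerts(alerts)
for key, members in groups.items():
    print(dict(key)["instance"], len(members))
```

Adding a label such as device to group_by would split the node1 group into two separate notifications.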

The templates entry in alertmanager.yaml points at the file below; copy its contents to use it.

wechat.tmpl

{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
{{- if eq $index 0 }}
========= Alert Firing =========
Status: {{ $alert.Status }}
Severity: {{ $alert.Labels.severity }}
Alert name: {{ $alert.Labels.alertname }}
Instance: {{ $alert.Labels.instance }}
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }};
Trigger value: {{ $alert.Annotations.value }}
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
========= = end =  =========
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
{{- if eq $index 0 }}
========= Alert Resolved =========
Alert name: {{ $alert.Labels.alertname }}
Status: {{ $alert.Status }}
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }};
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{ $alert.Labels.instance }}
{{- end }}
========= = end =  =========
{{- end }}
{{- end }}
{{- end }}
{{- end }}
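
The constant 28800e9 in the template above is a duration in nanoseconds: 28800 seconds, or 8 hours, which shifts a UTC timestamp to China Standard Time (UTC+8). The Python sketch below reproduces that arithmetic, assuming the alert timestamps arrive in UTC; the function name and the sample time are hypothetical.

```python
from datetime import datetime, timedelta, timezone

OFFSET_NS = 28800e9  # the same constant as in the template, in nanoseconds


def format_like_template(ts_utc: datetime) -> str:
    """Shift a UTC timestamp by 8 hours and format it as the template does."""
    shifted = ts_utc + timedelta(seconds=OFFSET_NS / 1e9)
    # "%Y-%m-%d %H:%M:%S" corresponds to the Go layout "2006-01-02 15:04:05".
    return shifted.strftime("%Y-%m-%d %H:%M:%S")


starts_at = datetime(2024, 1, 1, 20, 30, 0, tzinfo=timezone.utc)
print(format_like_template(starts_at))  # 2024-01-02 04:30:00
```

For a different time zone only the offset constant would change, for example 3600e9 for UTC+1.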

kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml --from-file=wechat.tmpl -n monitoring --dry-run=client -o yaml | kubectl replace -f -

Note: the enterprise trusted IP must be configured on the application management page in WeChat Work.
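
Before debugging Alertmanager itself, it can help to check the corp ID and Secret directly against the WeChat Work API. The sketch below builds the gettoken request from the api_url used in the configuration; the function names are hypothetical, and the credentials shown are placeholders. It only verifies the credentials: whether the trusted-IP restriction blocks the later message send from the Alertmanager host is not tested here.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://qyapi.weixin.qq.com/cgi-bin/"  # same value as wechat_api_url


def token_url(corp_id: str, secret: str, api_url: str = API_URL) -> str:
    """Build the gettoken URL for the given corp ID and application Secret."""
    query = urllib.parse.urlencode({"corpid": corp_id, "corpsecret": secret})
    return urllib.parse.urljoin(api_url, "gettoken") + "?" + query


def fetch_token(corp_id: str, secret: str) -> dict:
    """Call the API and return the decoded JSON response (requires network access)."""
    with urllib.request.urlopen(token_url(corp_id, secret), timeout=10) as response:
        return json.load(response)


# Placeholder credentials; substitute the real corp ID and Secret before calling fetch_token.
print(token_url("ww123", "s3cret"))
```

The JSON response carries an errcode field, which should be 0 when the credentials are accepted, together with an errmsg describing any failure.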