Grafana Alerting

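Grafana has supported alerting since version 4.0. This makes Grafana a more complete monitoring dashboard, because a dashboard without alerting can hardly be called monitoring.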

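Besides Prometheus's AlertManager, Grafana can also send alerts. Grafana lets you attach alert rules directly to the data shown in a dashboard, define thresholds visually, and receive notifications through channels such as DingTalk and email. Most importantly, alert rules are defined intuitively, evaluated continuously, and trigger notifications. Alerting is only available in Grafana 4.0 and later. Grafana 5.3 and later also support reminders, whose interval can be specified in seconds, minutes, or hours.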

Official documentation: https://grafana.com/docs/alerting/notifications/

Email alert configuration

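To enable email alerts, the SMTP service must be turned on in /etc/grafana/grafana.ini before Grafana starts. My Grafana runs in a k8s environment, so I use a ConfigMap here. For a binary or Docker installation, edit grafana.ini directly with the same settings.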

cat > grafana_configmap.yaml << EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config
  namespace: kube-system
data:
  grafana.ini: |
    [server]
    [smtp]
    enabled = true
    host = smtp.163.com:465
    user = xxx@163.com
    # authorization code issued by the mail provider, not the login password
    password = <mailbox authorization code>
    skip_verify = true
    from_address = <sender email address>
    [alerting]
    enabled = true
    execute_alerts = true
EOF

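I use a 163 mailbox to send mail here.

Note that the password in this configuration is not the mailbox login password but the mailbox authorization code.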

https://www.lingkb.com/wp-content/uploads/2020/01/1578154157-0ca874f20c28971.png-348.7kB

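For how to obtain the authorization code, see the Zabbix email alert configuration in the article "Zabbix Web 邮件报警" (Zabbix Web email alerts).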

Next, create the Grafana ConfigMap:

[root@abcdocker grafana]# kubectl create -f grafana_configmap.yaml
configmap/grafana-config created
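
Before restarting anything, it can help to confirm that the stored grafana.ini really enables SMTP. The following is a minimal sketch and not part of the original setup: `smtp_enabled` is a hypothetical helper that reads ini text on standard input and succeeds only if the [smtp] section contains enabled = true.

```shell
#!/bin/sh
# smtp_enabled: hypothetical helper, not a Grafana or kubectl command.
# Reads grafana.ini text on stdin; exit status 0 if [smtp] has "enabled = true".
smtp_enabled() {
  awk -F= '
    # Remember the current section header, e.g. "[smtp]".
    /^[ \t]*\[/ { section = $0; gsub(/[ \t]/, "", section); next }
    # Inside [smtp], look for the key "enabled" with the value "true".
    section == "[smtp]" {
      key = $1; value = $2
      gsub(/[ \t]/, "", key); gsub(/[ \t]/, "", value)
      if (key == "enabled" && value == "true") ok = 1
    }
    END { exit !ok }
  '
}

# Possible usage with cluster access (names taken from the ConfigMap above):
#   kubectl get configmap grafana-config -n kube-system \
#     -o jsonpath='{.data.grafana\.ini}' | smtp_enabled && echo "SMTP enabled"
```

The helper only inspects the [smtp] section, so the enabled = true line under [alerting] does not produce a false positive.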

After the ConfigMap is created, it also has to be mounted in the Grafana Deployment.

Add the following:

        volumeMounts:
        - mountPath: "/etc/grafana"
          name: config1
          ....
      - name: config1
        configMap:
          name: grafana-config

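If you installed Grafana by following my earlier article "Prometheus 持久化安装" (Prometheus persistent installation), you can use the complete Deployment below directly.
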

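cat > grafana_deployment.yaml <<EOF
# Note: extensions/v1beta1 Deployments were removed in Kubernetes 1.16.
# On newer clusters use apps/v1 and add spec.selector.matchLabels (app: grafana).
apiVersion: extensions/v1beta1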
kind: Deployment
metadata:
  name: grafana
  namespace: kube-system
  labels:
    app: grafana
spec:
  revisionHistoryLimit: 10
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:5.3.4
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
          name: grafana
        env:
        - name: GF_SECURITY_ADMIN_USER
          value: admin
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: abcdocker
        readinessProbe:
          failureThreshold: 10
          httpGet:
            path: /api/health
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /api/health
            port: 3000
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
        - mountPath: /var/lib/grafana
          subPath: grafana
          name: storage
        - mountPath: "/etc/grafana"
          name: config1
      securityContext:
        fsGroup: 472
        runAsUser: 472
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: grafana
      - name: config1
        configMap:
          name: grafana-config
EOF

Apply the updated Deployment:

[root@abcdocker grafana]# kubectl apply -f grafana_deployment.yaml
deployment.extensions/grafana unchanged

# The pod still has to be restarted at this point

[root@abcdocker grafana]# kubectl get pod -n kube-system |grep grafana
grafana-849f76ddc7-mz4z4                1/1     Running     0          5h53m
grafana-chown-qsctd                     0/1     Completed   0          5d6h
[root@abcdocker grafana]# kubectl delete pod -n kube-system grafana-849f76ddc7-mz4z4
pod "grafana-849f76ddc7-mz4z4" deleted

# After finding the pod, deleting it is equivalent to a restart, because the Deployment recreates it

https://www.lingkb.com/wp-content/uploads/2020/01/1578154157-1ec5e0ef05f541d.png-202.2kB
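
The find-then-delete step above can be scripted. This is only a sketch under assumptions: `running_pods` is a hypothetical helper that reads `kubectl get pod` output on standard input and prints the names of Running pods with a given name prefix, so completed helper pods such as grafana-chown are skipped.

```shell
#!/bin/sh
# running_pods: hypothetical helper, not part of kubectl.
# Reads "kubectl get pod" output on stdin and prints the names of pods that
# start with the prefix given as $1 and whose STATUS column is "Running".
running_pods() {
  awk -v prefix="$1" 'index($1, prefix) == 1 && $3 == "Running" { print $1 }'
}

# Possible usage with cluster access:
#   kubectl get pod -n kube-system | running_pods grafana- \
#     | xargs kubectl delete pod -n kube-system
```

An alternative would be to select the pod by the app=grafana label from the Deployment (kubectl delete pod -n kube-system -l app=grafana), assuming no other pod in the namespace carries that label.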

After the pod has been recreated, test whether the SMTP mail server works.

# First check the Grafana service port
[root@abcdocker grafana]# kubectl get svc -n kube-system |grep grafana
grafana                NodePort    10.98.192.213    <none>        3000:32452/TCP           5d6h
monitoring-grafana     ClusterIP   10.104.104.130   <none>        80/TCP                   34d

Open port 32452 on the IP of any node in a browser.

https://www.lingkb.com/wp-content/uploads/2020/01/1578154157-f2dfcaa9e3cbaa4.png-128.1kB

Send on all alerts: send notifications through this channel for all alerts

Send image: include an image in the notification

https://www.lingkb.com/wp-content/uploads/2020/01/1578154158-5641a3d663310d0.png-82kB

Click Send Test to send a test email.

https://www.lingkb.com/wp-content/uploads/2020/01/1578154158-eb7c6d2b61eff17.png-271.5kB

Multiple email addresses can be separated by semicolons.

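DingTalk alert configuration

Grafana supports not only email alerts but also DingTalk alerts.

My company does not use DingTalk, so I picked an arbitrary group for testing.

Click the group robot entry.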

https://www.lingkb.com/wp-content/uploads/2020/01/1578154161-f8939e87be9ed6a.png-205.3kB

https://www.lingkb.com/wp-content/uploads/2020/01/1578154158-3ba1c258b42d697.png-59.2kB

Choose Custom.

https://www.lingkb.com/wp-content/uploads/2020/01/1578154158-0b2ab129e6c6538.png-131.9kB

Click Add.

https://www.lingkb.com/wp-content/uploads/2020/01/1578154158-12054502e3ff901.png-97kB

Set a name.

https://www.lingkb.com/wp-content/uploads/2020/01/1578154158-69ade2a34023bee.jpg-80.3kB

Copy the webhook URL shown in DingTalk.

https://www.lingkb.com/wp-content/uploads/2020/01/1578154159-1296b036cd5ac71.png-48.2kB

In Grafana, the notification channel is created in the same place as the email channel. Just change Type to DingDing.

https://www.lingkb.com/wp-content/uploads/2020/01/1578154159-d8292f48f8520b8.jpg-117.1kB

Click Test, and the test message arrives in DingTalk.

https://www.lingkb.com/wp-content/uploads/2020/01/1578154159-9722081d16d5b24.png-33.7kB
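
If the test message does not arrive, the webhook can also be checked outside Grafana. The sketch below assumes the DingTalk custom-robot text message format (a JSON body with msgtype and text.content). `dingtalk_text_payload` and `WEBHOOK_URL` are hypothetical names introduced for illustration, and the helper escapes only backslashes and double quotes, so it is not meant for multi-line messages.

```shell
#!/bin/sh
# dingtalk_text_payload: hypothetical helper that builds the JSON body for a
# DingTalk custom-robot text message. Escapes backslashes and double quotes only.
dingtalk_text_payload() {
  content=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"msgtype":"text","text":{"content":"%s"}}' "$content"
}

# Possible usage; WEBHOOK_URL is assumed to hold the URL copied from the robot settings:
#   curl -s -X POST "$WEBHOOK_URL" \
#     -H 'Content-Type: application/json' \
#     -d "$(dingtalk_text_payload 'Grafana test message')"
```

Depending on the robot's security settings, the message may have to contain a configured keyword before DingTalk accepts it.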

Alert test

Once the email and DingTalk tests pass, click Save to save the notification channels.

One note: if mail cannot be sent, check the Grafana logs with kubectl logs.

https://www.lingkb.com/wp-content/uploads/2020/01/1578154159-316db25b2dd1fda.png-83.8kB
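
When the log is long, filtering it for mail-related lines makes the cause easier to spot. This is a rough sketch: `mail_lines` is a hypothetical helper, and the keywords smtp and email are an assumption about what the relevant log lines contain.

```shell
#!/bin/sh
# mail_lines: hypothetical helper. Reads log lines on stdin and keeps those that
# mention "smtp" or "email" (case-insensitive). Succeeds even if nothing matches.
mail_lines() {
  grep -i -E 'smtp|email' || true
}

# Possible usage; replace the pod name with the one shown by "kubectl get pod":
#   kubectl logs -n kube-system grafana-849f76ddc7-mz4z4 | mail_lines
```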

I have already added both channels here.

https://www.lingkb.com/wp-content/uploads/2020/01/1578154159-5969f7972c69098.png-50.8kB

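Alerting does not yet support plugins. Alerts can only be defined in dashboards. Here we use the k8s monitoring dashboard template from the earlier article "Prometheus监控Kubernetes 集群节点及应用" (monitoring Kubernetes cluster nodes and applications with Prometheus) to configure an alert.
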

https://www.lingkb.com/wp-content/uploads/2020/01/1578154159-09bcddeb8409810.png-300.1kB

Click Alert, then click Create Alert.

https://www.lingkb.com/wp-content/uploads/2020/01/1578154160-980b8c56862ddef.png-25.9kB

The alert is configured as follows:

https://www.lingkb.com/wp-content/uploads/2020/01/1578154160-356736913b9322b.png-107.8kB

The threshold can also be set by dragging it on the graph.

I set it to 0.02 here.

https://www.lingkb.com/wp-content/uploads/2020/01/1578154160-f03109257bdfb5b.png-233.8kB

Click Alert and select Notifications. The notification channels offered here are the ones saved earlier, and Message is the text sent with the alert.

https://www.lingkb.com/wp-content/uploads/2020/01/1578154160-90cd044e1b0fe17.png-53.8kB

https://www.lingkb.com/wp-content/uploads/2020/01/1578154161-e89d6d28bcf33cd.png-54.2kB

Click Test Rule to test the rule. If it returns true, the alert works.

https://www.lingkb.com/wp-content/uploads/2020/01/1578154161-44f217ff5e183d0.png-180.6kB
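
To make the true/false result more concrete, here is a small sketch of what a threshold rule boils down to. It assumes a condition of the form "WHEN avg() OF query IS ABOVE 0.02". The actual condition is only visible in the screenshot, so both the avg() reducer and the hypothetical helper `is_above` are illustrative.

```shell
#!/bin/sh
# is_above: hypothetical helper illustrating an "avg() IS ABOVE threshold" check.
# Reads one numeric sample per line on stdin and prints "true" if the average
# of the samples is greater than the threshold given as $1, otherwise "false".
is_above() {
  awk -v threshold="$1" '
    { sum += $1; n++ }
    END {
      if (n == 0) { print "false"; exit }
      if (sum / n > threshold + 0) print "true"; else print "false"
    }
  '
}

# Example: the samples 0.01 and 0.05 average to 0.03, which is above 0.02.
#   printf '0.01\n0.05\n' | is_above 0.02
```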

The alert email then arrives.

https://www.lingkb.com/wp-content/uploads/2020/01/1578154160-731fe09f30cc56e.png-84.8kB

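DingTalk alerts work the same way as email. Remember to save at the end.

Because Grafana alerting is relatively limited, most alerts are still sent through Prometheus AlertManager.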
