CoinexChain Node Monitoring and Alerting Application

CoinexChain Node Monitoring and Alerting

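Sharing a small monitoring and alerting tool for CoinEx Chain nodes.
It currently monitors three conditions: the number of blocks a node has missed, a node falling behind on blocks, and a node not participating in consensus.
When a configured condition is triggered, it sends an alert through Slack; see the repository documentation for details.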

More monitoring metrics and email alerts will be added later. Everyone is welcome to take a look and help maintain it.


prometheus

CoinEx Chain itself also supports Prometheus:

  • Set prometheus = true in config.toml

    • default file location: ~/.cetd/config/config.toml
  • Start the node. Metrics will be provided at: http://localhost:26660/metrics

  • Install Prometheus and Alertmanager

  • Configure prometheus.yml: enable the alertmanagers targets, set the rule_files path, and configure the scrape job

    # my global config
    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).
    
    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ['localhost:9093']
    
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - /Users/app/prometheus-2.13.1.darwin-amd64/etc/rules/*.rules
      # - "second_rules.yml"
    
    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's the CoinEx Chain node's metrics endpoint.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'coinexchain'
    
        # metrics_path defaults to '/metrics'
        # scheme defaults to 'http'.
    
        static_configs:
        - targets: ['localhost:26660']
    
  • Configure the rule file so that an alert fires as soon as the node goes down

    groups:
    - name: hostStatsAlert
      rules:
      - alert: InstanceStatus
        expr: up {job="coinexchain"} == 0
        for: 15s
        labels:
          instance: ""
        annotations:
          summary: "Node status"
          description: "The node is down"
          link: "http://localhost:9090"
          color: "#0000FF"
          username: "@fanzc912"
    
  • Configure alertmanager.yml: set the Slack webhook URL and the Slack alert notification

    global:
      resolve_timeout: 5m
    
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'slack'
    receivers:
      - name: 'slack'
        slack_configs:
          - api_url: "https://hooks.slack.com/services/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"
            channel: "#test"
            text: "{{ range .Alerts }} {{ .Annotations.description}}\n {{end}} {{ .CommonAnnotations.username}} <{{.CommonAnnotations.link}}| click here>"
            title: "{{.CommonAnnotations.summary}}"
            title_link: "{{.CommonAnnotations.link}}"
            color: "{{.CommonAnnotations.color}}"
    inhibit_rules:
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'dev', 'instance']
    
  • Start Prometheus and Alertmanager. Once the node goes down, a notification arrives in Slack

    [AlertManager] [8:14 AM]

    [Node status]

    The node is down
    @fanzc912 click here

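Alertmanager also supports other alert receivers such as email and WeCom (WeChat Work), and CoinEx Chain exposes many metrics, including tendermint_mempool_size, tendermint_state_block_processing_time, and tendermint_consensus_num_txs, so you can configure alerts to suit your needs.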
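
As an illustration of configuring one of these metrics, here is a minimal sketch of an extra rule file that alerts on tendermint_mempool_size. The group name, alert name, threshold of 1000 transactions, 1m duration, and color are all assumptions chosen for the example, not values from the CoinEx Chain documentation; tune them to your node's normal load. The annotations reuse the summary, description, link, and color keys that the Slack template above reads.

```yaml
groups:
- name: mempoolAlert            # hypothetical group name
  rules:
  - alert: MempoolBacklog       # hypothetical alert name
    # Fires when the mempool holds more than 1000 transactions
    # (assumed threshold) for a full minute (assumed duration).
    expr: tendermint_mempool_size{job="coinexchain"} > 1000
    for: 1m
    labels:
      severity: warning         # lets the inhibit_rules above suppress it during a critical alert
    annotations:
      summary: "Mempool backlog"
      description: "Mempool size has stayed above 1000 transactions for 1 minute"
      link: "http://localhost:9090"
      color: "#FFA500"
```

Saving this under the directory matched by rule_files (with a .rules extension, to match the glob used above) and restarting or reloading Prometheus should be enough for the rule to be picked up.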