Prod / dev divergence monitoring #315

@bodymindarts

The settings in the default monitoring/values.yml are currently somewhat out of date.
Many values are being overridden in production.

Ideally we would like to:

  • have fewer prod overrides (i.e. bring as many of the prod settings as possible into the defaults)
  • have a setup that verifiably works in dev; in particular, the values coming from the blackbox exporter (which monitors the main backend) should work locally.

Here are the current production overrides. Note that some values are injected via Terraform templates (e.g. ${graphql_playground_url}), and I don't know how best to set defaults for those. At least for the dev setup we should probably hard-code the values.

prometheus:
  extraScrapeConfigs: |
    - job_name: 'prometheus-blackbox-exporter-noauth'
      metrics_path: /probe
      params:
        module: [buildParameters]
      static_configs:
        - targets:
          - ${graphql_playground_url}
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: monitoring-prometheus-blackbox-exporter:9115
        - source_labels: [__meta_kubernetes_namespace]
          target_label: namespace
    - job_name: 'prometheus-blackbox-exporter-auth'
      scrape_timeout: 30s
      metrics_path: /probe
      params:
        module: [walletAuth]
      static_configs:
        - targets:
          - ${graphql_playground_url}
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: monitoring-prometheus-blackbox-exporter:9115

  alerts:
    groups:
    - name: Ingress Controller
      rules:
      - alert: NGINXTooMany500s
        expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"5.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          description: Too many 5XXs
          summary: More than 5% of all requests returned 5XX
      - alert: NGINXTooMany400s
        expr: 100 * ( sum( nginx_ingress_controller_requests{status=~"4.+"} ) / sum(nginx_ingress_controller_requests) ) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          description: Too many 4XXs
          summary: More than 5% of all requests returned 4XX
    - name: ${instance_name}
      rules:
      - alert: PodRestart
        expr: increase(kube_pod_container_status_restarts_total{namespace=~'${galoy_namespace}|${bitcoin_namespace}'}[10m]) >= 2
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.container}} restarted too many times"
      - alert: PodStartupError
        for: 1m
        expr: kube_pod_container_status_waiting_reason{reason!="ContainerCreating",namespace=~'${galoy_namespace}|${bitcoin_namespace}'} == 1
        labels:
          severity: critical
        annotations:
          summary: "{{$labels.container}} is unable to start"
      - alert: GraphqlIssue
        for: 3m
        expr: probe_success{job="prometheus-blackbox-exporter-mainnet"} == 0
        labels:
          severity: critical
        annotations:
          summary: "Graphql is down"
      - alert: GraphqlNoAuthIssue
        for: 3m
        expr: probe_success{namespace=~'${galoy_namespace}', job="prometheus-blackbox-exporter-noauth"} == 0
        labels:
          severity: critical
        annotations:
          summary: "Graphql is down"
  
  alertmanagerFiles:
    alertmanager.yml:
      global:
        pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
      route:
        group_wait: 10s
        group_interval: 10m
        receiver: slack
        repeat_interval: 6h
        routes:
        - receiver: slack-pagerduty
          matchers:
            - severity="critical"
          group_interval: 2m

prometheus-blackbox-exporter:
  secretConfig: true
  config:
    modules:
      buildParameters:
        prober: http
        timeout: 3s
        http:
          method: POST
          headers:
            Content-Type: application/json
          body: '{"query":"query buildParameters { buildParameters { id commitHash buildTime helmRevision minBuildNumberAndroid minBuildNumberIos lastBuildNumberAndroid lastBuildNumberIos }}","variables":{}}'
      walletAuth:
        prober: http
        timeout: 30s
        http:
          method: POST
          fail_if_body_matches_regexp:
            - "errors+"
          headers:
            Content-Type: application/json
          body: '{"query":"query gql_query_logged { prices { __typename id o } earnList { __typename id value completed } wallet { __typename id balance currency transactions { __typename id amount description created_at hash type usd fee feeUsd pending } } getLastOnChainAddress { __typename id } me { __typename id level username phone } maps { __typename id title coordinate { __typename latitude longitude } } nodeStats { __typename id } }","variables":{}}'
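
For the dev setup, hard-coding the templated target as suggested above might look roughly like this. It is only a sketch: the job is copied from the production override, and the target URL (service name, namespace and port) is a hypothetical placeholder that would need to match the local backend.

```yaml
# Dev-only sketch: same noauth probe job as the production override,
# with the Terraform placeholder replaced by a hard-coded URL.
prometheus:
  extraScrapeConfigs: |
    - job_name: 'prometheus-blackbox-exporter-noauth'
      metrics_path: /probe
      params:
        module: [buildParameters]
      static_configs:
        - targets:
          # Hypothetical in-cluster address of the dev backend; adjust as needed.
          - http://graphql.galoy-dev.svc.cluster.local:4000/graphql
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: monitoring-prometheus-blackbox-exporter:9115
        - source_labels: [__meta_kubernetes_namespace]
          target_label: namespace
```

Because extraScrapeConfigs is a single string value, a dev override replaces the whole block rather than merging with it, so every job that should run locally has to be repeated in full.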

Another file containing sensitive information is also merged in:

prometheus:
  alertmanagerFiles:
    alertmanager.yml:
      global:
        slack_api_url: ${slack_api_url}
      receivers:
        - name: slack
          slack_configs:
          - channel: '#${slack_alerts_channel_name}'
            title: "{{range .Alerts}}{{.Annotations.summary}}\n{{end}}"
            send_resolved: true
        - name: slack-pagerduty
          pagerduty_configs:
          - service_key: ${pagerduty_service_key}
            send_resolved: true
          slack_configs:
          - channel: '#${slack_alerts_channel_name}'
            title: "{{range .Alerts}}{{.Annotations.summary}}\n{{end}}"
            send_resolved: true

prometheus-blackbox-exporter:
  config:
    modules:
      walletAuth:
        http:
          headers:
            Authorization: Bearer ${probe_auth_token}
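
A dev counterpart to this sensitive file could avoid real credentials entirely. The sketch below assumes it is acceptable to drop notifications in dev: it keeps the two receiver names that the routes refer to but gives them no integrations, and uses a placeholder token (`dev-probe-token` is hypothetical, not a real value).

```yaml
# Dev-only sketch: receivers keep the names used by the routes above,
# but have no slack/pagerduty configs, so alerts are not sent anywhere.
prometheus:
  alertmanagerFiles:
    alertmanager.yml:
      receivers:
        - name: slack
        - name: slack-pagerduty

prometheus-blackbox-exporter:
  config:
    modules:
      walletAuth:
        http:
          headers:
            # Placeholder; a token issued by the local backend would go here.
            Authorization: Bearer dev-probe-token
```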
