The Grafana dashboard recipe defines PromQL alerts for ban signals, start failures, prod pool low, and DLQ depth. Alertmanager must route those to **actionable Slack channels** — not a generic #alerts firehose. This recipe wires Alertmanager receivers with severity-based routing and runbook deep links.

Alert → channel map

| Alert | Severity | Slack channel | First action |
|---|---|---|---|
| MLXBanSignalProd | critical | #mlx-ban-ops | Ban recovery runbook |
| MLXDLQDepthHigh | warning | #mlx-automation | DLQ handler |
| MLXProdPoolLow | warning | #mlx-capacity | Profile pool |
| MLXStartFailureHigh | warning | #mlx-automation | Profile debug |

alertmanager.yml receivers

global:
  resolve_timeout: 5m

route:
  receiver: slack-default
  group_by: ['alertname', 'platform', 'client_id']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
        alertname: MLXBanSignalProd
      receiver: slack-ban-ops
      continue: false
    - match:
        severity: warning
        alertname: MLXDLQDepthHigh
      receiver: slack-automation
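    - match:
        severity: warning
        alertname: MLXProdPoolLow
      receiver: slack-capacity
    # Without this route, start-failure alerts fall through to slack-default.
    - match:
        severity: warning
        alertname: MLXStartFailureHigh
      receiver: slack-automation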

receivers:
  - name: slack-default
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_DEFAULT}'
        channel: '#mlx-alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }} — {{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'

  - name: slack-ban-ops
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_BAN_OPS}'
        channel: '#mlx-ban-ops'
        send_resolved: true
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: '🚨 BAN SIGNAL — {{ .CommonLabels.platform }}'
        text: |
          *Client:* {{ .CommonLabels.client_id }}
          *Profile tier:* prod
          *Runbook:* https://mlx-mmo.github.io/guides/multilogin-ban-recovery-runbook.html
          *Forensics:* clone before any proxy change
          {{ range .Alerts }}• {{ .Annotations.summary }}
          {{ end }}

  - name: slack-automation
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_AUTOMATION}'
        channel: '#mlx-automation'
        send_resolved: true
        title: 'Automation — {{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'

  - name: slack-capacity
    slack_configs:
      - api_url: '${SLACK_WEBHOOK_CAPACITY}'
        channel: '#mlx-capacity'
        send_resolved: true
        title: 'Pool capacity — {{ .CommonLabels.alertname }}'
        text: 'Prod pool below threshold. Check warm tier promotion and burn tier drain.'
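The routing intent can be sanity-checked offline before touching a live Alertmanager. The sketch below is a simplified model, not Alertmanager's own matcher: it assumes exact label equality and first-match-wins among sibling routes, and its route list mirrors the alert → channel map (including MLXStartFailureHigh going to slack-automation). The names `Route` and `resolve_receiver` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    match: dict      # label name -> required value; every pair must match
    receiver: str


# Child routes in evaluation order, mirroring the alert -> channel map.
ROUTES = [
    Route({"severity": "critical", "alertname": "MLXBanSignalProd"}, "slack-ban-ops"),
    Route({"severity": "warning", "alertname": "MLXDLQDepthHigh"}, "slack-automation"),
    Route({"severity": "warning", "alertname": "MLXProdPoolLow"}, "slack-capacity"),
    Route({"severity": "warning", "alertname": "MLXStartFailureHigh"}, "slack-automation"),
]
DEFAULT_RECEIVER = "slack-default"


def resolve_receiver(labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receiver of the first route whose labels all match."""
    for route in routes:
        if all(labels.get(name) == value for name, value in route.match.items()):
            return route.receiver
    # No child route matched: the root route's receiver handles the alert.
    return default
```

A ban alert that arrives with the wrong severity label lands in the default channel, which is a useful thing to catch in a unit test rather than during an incident.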

Slack incoming webhook setup

  1. Create Slack app → **Incoming Webhooks** → enable per workspace.
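  2. Create one webhook URL per channel (#mlx-ban-ops, etc.). App-based webhooks are bound to the channel chosen at install time, so do not reuse a single URL with a channel override unless you are still on a legacy incoming-webhook integration.
  3. Store the URLs in a secrets manager and inject them as env vars at Alertmanager deploy. Alertmanager does not expand ${VAR} placeholders in its config file, so render the file before startup.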
  4. Test with amtool alert add before enabling prod routes.
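A minimal sketch of the deploy-time substitution in step 3, assuming alertmanager.yml is kept as a template containing the ${SLACK_WEBHOOK_*} placeholders and that the rendered output is what Alertmanager actually loads. `render_config` is a hypothetical helper, not part of any Alertmanager tooling.

```python
import os
import string


def render_config(template_text, env=None):
    """Fill ${VAR} placeholders from the environment, failing on any gap."""
    env = os.environ if env is None else env
    try:
        # substitute() raises KeyError for a placeholder with no value,
        # so a missing secret stops the deploy instead of shipping "${...}".
        return string.Template(template_text).substitute(env)
    except KeyError as exc:
        raise RuntimeError(f"missing webhook variable: {exc.args[0]}") from None
```

`string.Template` treats every `$` as the start of a placeholder, so this only works while the notification templates contain no Go-template `$` variables, which holds for the config shown here.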

Ban alert enrichment (optional sidecar)

When MLXBanSignalProd fires, a small webhook sidecar can call your CMDB API to tag the profile tier=burn and pause queue routing for that profile_id. Keep Alertmanager dumb: the sidecar handles the idempotent CMDB writes. Pair it with the webhook receiver HMAC pattern.

# Sidecar listens on /hooks/alertmanager
# Payload: Alertmanager webhook v4 JSON
# On ban alert: PATCH cmdb/profiles/{id} tier=burn, LPUSH mlx:dlq review job
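The comments above can be fleshed out roughly as follows. This is a sketch under several assumptions: each ban alert carries a `profile_id` label, the review job is a small JSON object (its shape is made up here), and the CMDB and Redis calls are left as a placeholder `print`. Request authentication is deliberately omitted. `plan_ban_actions`, `AlertHook`, and `serve` are illustrative names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

BAN_ALERT = "MLXBanSignalProd"


def plan_ban_actions(payload):
    """Map an Alertmanager webhook payload to one action pair per profile."""
    actions, seen = [], set()
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") != "firing" or labels.get("alertname") != BAN_ALERT:
            continue
        profile_id = labels.get("profile_id")  # assumption: rule exposes this label
        if not profile_id or profile_id in seen:
            continue  # skip duplicates so repeated notifications stay idempotent
        seen.add(profile_id)
        job = json.dumps({"job": "review", "profile_id": profile_id}, sort_keys=True)
        actions.append(("PATCH", f"cmdb/profiles/{profile_id}", {"tier": "burn"}))
        actions.append(("LPUSH", "mlx:dlq", job))
    return actions


class AlertHook(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/hooks/alertmanager":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
        except json.JSONDecodeError:
            self.send_error(400)
            return
        for action in plan_ban_actions(payload):
            print(action)  # placeholder: replace with real CMDB and Redis calls
        self.send_response(204)
        self.end_headers()


def serve(port=9095):
    """Run the sidecar on localhost; the port is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), AlertHook).serve_forever()
```

Keeping the planning step as a pure function makes the idempotency rule testable without a running Alertmanager, CMDB, or Redis.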

Metrics dependency

Pool and DLQ alerts require Redis exporter sidecar metrics (mlx_pool_depth, mlx_dlq_depth). Ban signals come from worker instrumentation in the observability guide.
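Before enabling the pool and DLQ routes, it can help to confirm that the exporter really exposes both series. The sketch below scans a scraped /metrics body in Prometheus text exposition format for the two metric names given above. `missing_metrics` is a hypothetical helper, and how you fetch the text is left to you.

```python
import re

REQUIRED_METRICS = ("mlx_pool_depth", "mlx_dlq_depth")

# A sample line starts with the metric name, then optional {labels} and a value.
_METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def missing_metrics(exposition_text, required=REQUIRED_METRICS):
    """Return the required metric names that have no sample line in the text."""
    present = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and # HELP / # TYPE comments
        found = _METRIC_NAME.match(line)
        if found:
            present.add(found.group(0))
    return [name for name in required if name not in present]
```

An empty result means both series are present. A non-empty one names the metric whose alert would otherwise never fire.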

Disclosure: MLX-MMO is affiliated with Multilogin.