Slack is fine for DLQ warnings, but **ban signals on prod profiles** need wake-up-grade routing. This recipe extends the Grafana alert rules through Alertmanager into **PagerDuty** with severity mapping, deduplication, and runbook links. Teams using Opsgenie can follow the same pattern with Atlassian's Prometheus integration.

Severity map

| Alert | PagerDuty severity | Escalation |
| --- | --- | --- |
| MLXBanSignalProd | P1 (critical) | Immediate page → ops lead → agency owner |
| MLXStartFailureHigh | P2 (error) | Page after 15m sustained |
| MLXDLQDepthHigh | P3 (warning) | Slack + ticket, no page |
| MLXProdPoolLow | P3 (warning) | Slack #mlx-capacity |

Keep P3 on Slack via the Slack recipe and reserve PagerDuty for P1/P2.
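
The split in the severity map could be expressed as an Alertmanager routing tree along these lines. This is a minimal sketch, not the recipe's actual config: the receiver names slack-mlx-ops and pagerduty-start-failure are hypothetical placeholders, and the sketch assumes the alert names from the table arrive unchanged as the alertname label.

```yaml
route:
  receiver: slack-mlx-ops                  # hypothetical default: anything unmatched stays on Slack
  routes:
    # P1: ban signal on a prod profile pages immediately
    - matchers:
        - alertname = "MLXBanSignalProd"
      receiver: pagerduty-ban-ops
    # P2: start failures page through a separate (hypothetical) receiver
    - matchers:
        - alertname = "MLXStartFailureHigh"
      receiver: pagerduty-start-failure
    # P3: warnings never reach PagerDuty
    - matchers:
        - alertname =~ "MLXDLQDepthHigh|MLXProdPoolLow"
      receiver: slack-mlx-ops
```

The explicit P3 route is redundant with the default receiver, but it documents the intent and keeps P3 off the pager if the default is ever changed.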

PagerDuty service setup

  1. Create service MLX Automation Ops with Events API v2 integration.
  2. Escalation policy: primary on-call (5 min) → secondary (10 min) → manager.
  3. Enable **alert grouping** by client_id + platform — one incident per seller brand, not per metric tick.
  4. Add runbook link custom field → ban recovery URL.

alertmanager.yml receiver

receivers:
  - name: pagerduty-ban-ops
    pagerduty_configs:
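      # Alertmanager does not expand ${...} itself: render this file first (e.g. envsubst) or use routing_key_file.
      - routing_key: '${PAGERDUTY_ROUTING_KEY}'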
        severity: critical
        class: ban-signal
        component: multilogin-automation
        group: '{{ .CommonLabels.client_id }}'
        description: |
          BAN SIGNAL — {{ .CommonLabels.platform }}
          Profile tier: prod
          Client: {{ .CommonLabels.client_id }}
        details:
          runbook: https://mlx-mmo.github.io/guides/multilogin-ban-recovery-runbook.html
          forensics: https://mlx-mmo.github.io/guides/multilogin-profile-clone-forensics-recipe.html
          platform: '{{ .CommonLabels.platform }}'
          profile_id: '{{ .CommonLabels.profile_id }}'
        client: MLX-MMO
        client_url: https://mlx-mmo.github.io/guides/multilogin-grafana-health-dashboard-recipe.html

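route:
  # The root route also needs a default receiver; set one that fits your setup.
  routes:
    # matchers replaces the deprecated match syntax (Alertmanager 0.22+)
    - matchers:
        - severity = "critical"
        - alertname = "MLXBanSignalProd"
      receiver: pagerduty-ban-ops
      repeat_interval: 1h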
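
For the P2 row of the severity map, a second receiver with a lower PagerDuty severity might look like the sketch below. The receiver name pagerduty-start-failure and the class value are hypothetical, and the routing key placeholder assumes the same pre-rendering step as the receiver above. The "15m sustained" condition is not something Alertmanager evaluates; it would normally be set as the pending period on the alert rule itself.

```yaml
receivers:
  - name: pagerduty-start-failure          # hypothetical receiver name
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_ROUTING_KEY}'   # placeholder, rendered before Alertmanager loads the file
        severity: error                    # P2 in the severity map
        class: start-failure               # hypothetical class label
        component: multilogin-automation
        group: '{{ .CommonLabels.client_id }}'
        description: 'Start failures high: {{ .CommonLabels.platform }} / {{ .CommonLabels.client_id }}'
```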

Deduplication with Alertmanager

Set group_by: ['alertname', 'client_id', 'platform', 'profile_id'] on the ban route so that one ban produces one incident per profile rather than 20 incidents from repeated or correlated firings (queue stall + ban signal + start failure). Set group_wait: 1m for the ban group: fast enough for ops, slow enough to batch flapping.
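
As a sketch of where these settings might sit in alertmanager.yml: the route below carries the grouping keys, and an optional inhibit rule mutes the start-failure alert while a ban signal is firing for the same client and platform. Because alertname is one of the group_by keys, alerts with different names always land in separate groups, so grouping alone does not merge them. The group_interval value, the inhibit rule, and the slack-mlx-ops receiver name are assumptions added for illustration, and the rule assumes both alerts carry client_id and platform labels.

```yaml
route:
  receiver: slack-mlx-ops                  # hypothetical default receiver
  routes:
    - matchers:
        - alertname = "MLXBanSignalProd"
      receiver: pagerduty-ban-ops
      group_by: ['alertname', 'client_id', 'platform', 'profile_id']
      group_wait: 1m                       # batch flapping before the first page
      group_interval: 5m                   # assumed value: delay before notifying about new alerts in the group
      repeat_interval: 1h

inhibit_rules:
  # Assumption: while a ban signal fires, suppress the correlated start-failure
  # alert for the same client and platform.
  - source_matchers:
      - alertname = "MLXBanSignalProd"
    target_matchers:
      - alertname = "MLXStartFailureHigh"
    equal: ['client_id', 'platform']
```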

Opsgenie equivalent

receivers:
  - name: opsgenie-ban
    opsgenie_configs:
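      # Alertmanager does not expand ${...} itself: render this file first (e.g. envsubst) or use api_key_file.
      - api_key: '${OPSGENIE_API_KEY}'
        priority: P1
        tags: 'multilogin,ban-signal,{{ .CommonLabels.platform }}'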
        message: 'MLX Ban — {{ .CommonLabels.platform }} / {{ .CommonLabels.client_id }}'
        description: '{{ .CommonAnnotations.description }}'
        details:
          runbook: https://mlx-mmo.github.io/guides/multilogin-ban-recovery-runbook.html

Post-incident workflow

  1. Acknowledge PagerDuty incident → pause queue routing for burned UUID
  2. Run ban recovery — CMDB audit, proxy check
  3. Clone profile for forensics before any proxy change
  4. Resolve incident only after tier=burn tagged and DLQ cleared

Disclosure: MLX-MMO is affiliated with Multilogin.