Multilogin Alertmanager PagerDuty Recipe

Slack is fine for DLQ warnings — **ban signals on prod profiles** need wake-up-grade routing. This recipe extends the Grafana alert rules through Alertmanager into **PagerDuty** with severity mapping, deduplication, and runbook links. Teams using Opsgenie follow the same pattern with Atlassian's Prometheus integration.

Severity map

Alert	PagerDuty severity	Escalation
`MLXBanSignalProd`	P1 — critical	Immediate page → ops lead → agency owner
`MLXStartFailureHigh`	P2 — error	Page after 15m sustained
`MLXDLQDepthHigh`	P3 — warning	Slack + ticket, no page
`MLXProdPoolLow`	P3 — warning	Slack #mlx-capacity

Keep P3 on Slack via the Slack recipe — PagerDuty only for P1/P2.
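
The severity map above can be sketched as an Alertmanager routing tree. This is a minimal illustration, not the recipe's own config: the receiver names slack-mlx-default, pagerduty-start-failure, and slack-mlx-warnings are hypothetical, and it assumes the 15-minute "sustained" condition for P2 is enforced by the alert rule's pending period rather than by Alertmanager.

```yaml
route:
  receiver: slack-mlx-default              # hypothetical catch-all receiver
  routes:
    # P1: ban signals on prod profiles page immediately.
    - matchers:
        - 'alertname = "MLXBanSignalProd"'
      receiver: pagerduty-ban-ops
    # P2: start failures also page. The 15m window is assumed to live in the
    # alert rule (pending period / for: 15m), so Alertmanager only sees the
    # alert once it has been failing that long.
    - matchers:
        - 'alertname = "MLXStartFailureHigh"'
      receiver: pagerduty-start-failure    # hypothetical receiver
    # P3: warnings stay on Slack and never reach PagerDuty.
    - matchers:
        - 'alertname =~ "MLXDLQDepthHigh|MLXProdPoolLow"'
      receiver: slack-mlx-warnings         # hypothetical receiver
```

Matching on alertname keeps the tree readable next to the table. Matching on a severity label works equally well if every rule sets one consistently.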

PagerDuty service setup

Create a service named MLX Automation Ops with an Events API v2 integration.
Escalation policy: primary on-call (5 min) → secondary (10 min) → manager.
Enable **alert grouping** by client_id + platform: one incident per seller brand, not per metric tick.
Add a custom field for the runbook link that points to the ban recovery URL.

alertmanager.yml receiver

receivers:
  - name: pagerduty-ban-ops
    pagerduty_configs:
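      # Alertmanager does not expand ${VAR} on its own. Render this file with
      # envsubst (or similar) at deploy time, or use routing_key_file instead.
      - routing_key: '${PAGERDUTY_ROUTING_KEY}'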
        severity: critical
        class: ban-signal
        component: multilogin-automation
        group: '{{ .CommonLabels.client_id }}'
        description: |
          BAN SIGNAL — {{ .CommonLabels.platform }}
          Profile tier: prod
          Client: {{ .CommonLabels.client_id }}
        details:
          runbook: https://mlx-mmo.github.io/guides/multilogin-ban-recovery-runbook.html
          forensics: https://mlx-mmo.github.io/guides/multilogin-profile-clone-forensics-recipe.html
          platform: '{{ .CommonLabels.platform }}'
          profile_id: '{{ .CommonLabels.profile_id }}'
        client: MLX-MMO
        client_url: https://mlx-mmo.github.io/guides/multilogin-grafana-health-dashboard-recipe.html

route:
  # The root route also needs its own default `receiver:` (omitted here).
  routes:
    - match:
        severity: critical
        alertname: MLXBanSignalProd
      receiver: pagerduty-ban-ops
      repeat_interval: 1h
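
The severity map lists MLXStartFailureHigh as P2 (error), while the receiver above covers only the P1 ban signal. A matching P2 receiver might look like the sketch below. The receiver name pagerduty-start-failure, the class value, and the description wording are assumptions made for illustration. The routing key placeholder follows the same convention as the ban receiver and likewise has to be rendered before Alertmanager loads the file.

```yaml
receivers:
  - name: pagerduty-start-failure        # hypothetical receiver name
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_ROUTING_KEY}'   # substitute at deploy time
        severity: error                  # PagerDuty Events API v2 severity for P2
        class: start-failure             # assumed class label
        component: multilogin-automation
        group: '{{ .CommonLabels.client_id }}'
        description: 'START FAILURES HIGH: {{ .CommonLabels.platform }} / {{ .CommonLabels.client_id }}'
        details:
          platform: '{{ .CommonLabels.platform }}'
```

Keeping P1 and P2 in separate receivers lets each carry a fixed severity, so the PagerDuty side can apply different urgency rules without inspecting the payload.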

Deduplication with Alertmanager

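Use group_by: ['alertname', 'client_id', 'platform', 'profile_id'] on the ban route so repeated firings for the same profile collapse into one notification. Alertmanager derives the PagerDuty dedup key from the group, so each group maps to a single PagerDuty alert rather than 20. Because alertname is part of the key, correlated alerts (queue stall + ban signal + start failure) still land in separate groups; PagerDuty's alert grouping by client_id + platform is what folds those into one incident. Set group_wait: 1m for the ban group: fast enough for ops, slow enough to batch flapping.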
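
As a rough sketch, the grouping settings could sit on the ban route as shown here. The catch-all receiver name slack-mlx-default is hypothetical, and group_interval is simply Alertmanager's default written out. The inhibit_rules block is an optional addition that is not part of the original recipe: it assumes MLXStartFailureHigh carries the same client_id, platform, and profile_id labels as the ban alert.

```yaml
route:
  receiver: slack-mlx-default            # hypothetical catch-all receiver
  routes:
    - matchers:
        - 'alertname = "MLXBanSignalProd"'
      receiver: pagerduty-ban-ops
      group_by: ['alertname', 'client_id', 'platform', 'profile_id']
      group_wait: 1m        # batch flapping signals before the first page
      group_interval: 5m    # Alertmanager default, shown for clarity
      repeat_interval: 1h   # re-notify hourly while the ban alert keeps firing

# Optional (assumption, not from the recipe): mute start-failure alerts for a
# profile while its ban signal is already firing.
inhibit_rules:
  - source_matchers:
      - 'alertname = "MLXBanSignalProd"'
    target_matchers:
      - 'alertname = "MLXStartFailureHigh"'
    equal: ['client_id', 'platform', 'profile_id']
```

Grouping reduces repeats of one alert, while inhibition suppresses a different alert that is a known side effect of the first. Whether the second is wanted depends on how much the team trusts the ban signal.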

Opsgenie equivalent

receivers:
  - name: opsgenie-ban
    opsgenie_configs:
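      # As with PagerDuty, ${VAR} is not expanded by Alertmanager. Substitute
      # it at deploy time or use api_key_file instead.
      - api_key: '${OPSGENIE_API_KEY}'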
        priority: P1
        tags: 'multilogin,ban-signal,{{ .CommonLabels.platform }}'
        message: 'MLX Ban — {{ .CommonLabels.platform }} / {{ .CommonLabels.client_id }}'
        description: '{{ .CommonAnnotations.description }}'
        details:
          runbook: https://mlx-mmo.github.io/guides/multilogin-ban-recovery-runbook.html

Post-incident workflow

Acknowledge the PagerDuty incident → pause queue routing for the burned UUID
Run ban recovery: CMDB audit, proxy check
Clone the profile for forensics before any proxy change
Resolve the incident only after the profile is tagged tier=burn and the DLQ is cleared


Disclosure: MLX-MMO is affiliated with Multilogin.
