Slack is fine for DLQ warnings — **ban signals on prod profiles** need wake-up-grade routing. This recipe extends the Grafana alert rules through Alertmanager into **PagerDuty** with severity mapping, deduplication, and runbook links. Teams using Opsgenie follow the same pattern with Atlassian's Prometheus integration.
Severity map
| Alert | PagerDuty severity | Escalation |
|---|---|---|
| MLXBanSignalProd | P1 (critical) | Immediate page → ops lead → agency owner |
| MLXStartFailureHigh | P2 (error) | Page after 15m sustained |
| MLXDLQDepthHigh | P3 (warning) | Slack + ticket, no page |
| MLXProdPoolLow | P3 (warning) | Slack #mlx-capacity |
Keep P3 alerts on Slack via the Slack recipe; use PagerDuty only for P1 and P2.
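The severity map can be expressed as an Alertmanager routing tree. The following is a minimal sketch, not the recipe's actual configuration: the receiver names `slack-mlx-warnings` and `pagerduty-start-failure` are hypothetical, and it assumes the 15-minute delay for `MLXStartFailureHigh` is enforced by the alert rule's pending period rather than by Alertmanager.

```yaml
route:
  # Hypothetical default receiver: the Slack recipe handles everything
  # that no child route claims, which covers both P3 alerts.
  receiver: slack-mlx-warnings
  routes:
    # P1: ban signals on prod profiles page immediately.
    - match:
        alertname: MLXBanSignalProd
      receiver: pagerduty-ban-ops
    # P2: start failures page as well; the "15m sustained" condition is
    # assumed to be set on the alert rule itself.
    - match:
        alertname: MLXStartFailureHigh
      receiver: pagerduty-start-failure   # hypothetical, would use severity: error
```

With this layout `MLXDLQDepthHigh` and `MLXProdPoolLow` never reach PagerDuty because no child route matches them, so they fall through to the default Slack receiver.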
PagerDuty service setup
- Create service MLX Automation Ops with Events API v2 integration.
- Escalation policy: primary on-call (5 min) → secondary (10 min) → manager.
- Enable **alert grouping** by `client_id` + `platform`: one incident per seller brand, not per metric tick.
- Add runbook link custom field → ban recovery URL.
alertmanager.yml receiver
```yaml
receivers:
  - name: pagerduty-ban-ops
    pagerduty_configs:
      # Alertmanager does not expand environment variables itself;
      # substitute this placeholder at deploy time (for example with envsubst).
      - routing_key: '${PAGERDUTY_ROUTING_KEY}'
        severity: critical
        class: ban-signal
        component: multilogin-automation
        group: '{{ .CommonLabels.client_id }}'
        description: |
          BAN SIGNAL — {{ .CommonLabels.platform }}
          Profile tier: prod
          Client: {{ .CommonLabels.client_id }}
        details:
          runbook: https://mlx-mmo.github.io/guides/multilogin-ban-recovery-runbook.html
          forensics: https://mlx-mmo.github.io/guides/multilogin-profile-clone-forensics-recipe.html
          platform: '{{ .CommonLabels.platform }}'
          profile_id: '{{ .CommonLabels.profile_id }}'
        client: MLX-MMO
        client_url: https://mlx-mmo.github.io/guides/multilogin-grafana-health-dashboard-recipe.html

route:
  # The root route also needs a default receiver; it is omitted in this fragment.
  routes:
    - match:
        severity: critical
        alertname: MLXBanSignalProd
      receiver: pagerduty-ban-ops
      repeat_interval: 1h
```
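The runbook URL can also be attached as a clickable link on the incident, in addition to the `details` entry. Alertmanager exposes this through the `links` field of `pagerduty_configs` (Events API v2 only). A minimal sketch; the link text is an illustrative label, not something the recipe prescribes:

```yaml
receivers:
  - name: pagerduty-ban-ops
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_ROUTING_KEY}'
        # Other fields (severity, class, details, ...) stay as in the receiver above.
        links:
          - href: https://mlx-mmo.github.io/guides/multilogin-ban-recovery-runbook.html
            text: Ban recovery runbook   # illustrative label
```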
Deduplication with Alertmanager
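Set `group_by: ['alertname', 'client_id', 'platform', 'profile_id']` on the ban route so that repeated or flapping firings for the same profile collapse into one notification instead of spawning 20 incidents. Grouping on these labels also guarantees that the `.CommonLabels` values used in the receiver templates are populated. Because `alertname` is part of the key, correlated alerts with other names (queue stall, start failure) still form their own groups; folding those into the same incident relies on the PagerDuty alert grouping configured in the service setup. Set `group_wait: 1m` for the ban group: fast enough for ops, slow enough to batch flapping.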
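In `alertmanager.yml` the grouping settings sit on the route, not on the receiver. A minimal sketch of the ban route with the values from this section; the default receiver name `slack-mlx-warnings` is a placeholder, not something the recipe defines:

```yaml
route:
  receiver: slack-mlx-warnings   # hypothetical default receiver
  routes:
    - match:
        severity: critical
        alertname: MLXBanSignalProd
      receiver: pagerduty-ban-ops
      # One notification group per alert, client, platform and profile.
      group_by: ['alertname', 'client_id', 'platform', 'profile_id']
      # Wait 1m before the first notification so flapping alerts batch together.
      group_wait: 1m
      # Re-notify an unresolved group at most once per hour.
      repeat_interval: 1h
```

Alertmanager derives the PagerDuty `dedup_key` from the group key, so each distinct combination of the four labels maps to one PagerDuty alert that later notifications update rather than duplicate.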
Opsgenie equivalent
```yaml
receivers:
  - name: opsgenie-ban
    opsgenie_configs:
      - api_key: '${OPSGENIE_API_KEY}'
        priority: P1
        tags: 'multilogin,ban-signal,{{ .CommonLabels.platform }}'
        message: 'MLX Ban — {{ .CommonLabels.platform }} / {{ .CommonLabels.client_id }}'
        description: '{{ .CommonAnnotations.description }}'
        details:
          runbook: https://mlx-mmo.github.io/guides/multilogin-ban-recovery-runbook.html
```
Post-incident workflow
- Acknowledge PagerDuty incident → pause queue routing for burned UUID
- Run ban recovery — CMDB audit, proxy check
- Clone profile for forensics before any proxy change
- Resolve incident only after `tier=burn` tagged and DLQ cleared
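By default Alertmanager sends a resolve event to PagerDuty as soon as the alert stops firing, which would close the incident before the last step of this workflow is done. One way to keep resolution manual, sketched here as an assumption about how the workflow could be enforced rather than as part of the original recipe, is to disable `send_resolved` on the receiver:

```yaml
receivers:
  - name: pagerduty-ban-ops
    pagerduty_configs:
      - routing_key: '${PAGERDUTY_ROUTING_KEY}'
        # Default is true for PagerDuty; false leaves the incident open
        # until an operator resolves it by hand.
        send_resolved: false
```

The trade-off is that incidents for false positives also stay open until someone closes them.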
Disclosure: MLX-MMO is affiliated with Multilogin.