Teams at 100+ profiles cannot debug from Telegram screenshots. You need metrics per profile, proxy pool, and job type — tied to your queue worker and CDP layer. This guide defines the event schema and Prometheus counters that actually predict bans before finance notices.
Golden signals (Multilogin-specific)
| Metric | Type | Alert if |
|---|---|---|
| mlx_profile_start_total{status} | Counter | error rate > 5% / 15m |
| mlx_profile_start_duration_seconds | Histogram | p95 > 45s |
| mlx_cdp_connect_fail_total | Counter | spike vs baseline |
| mlx_cdp_reconnect_total | Counter | > 2 per job avg |
| mlx_job_duration_seconds | Histogram | SLA breach |
| mlx_ban_signal_total{platform} | Counter | any prod tier uptick |
| mlx_proxy_preflight_fail_total | Counter | pool unhealthy |
| cloud_phone_adb_offline_total | Counter | device farm degradation |
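The table leaves "spike vs baseline" open. One way to make it concrete, as a rough sketch only: compare the failure count of the current window with the mean plus a few standard deviations of earlier windows of the same length. The function name `is_spike` and the defaults `sigmas=3.0` and `min_count=5.0` are assumptions for illustration, not values from this guide.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_spike(current: float, baseline: Sequence[float],
             sigmas: float = 3.0, min_count: float = 5.0) -> bool:
    """Return True when `current` sits well above the recent baseline.

    `baseline` holds failure counts from earlier windows of the same length
    (for example, the previous 24 hourly values). `sigmas` and `min_count`
    are illustrative defaults, not tuned recommendations.
    """
    if not baseline:
        # No history yet: only the absolute floor applies.
        return current >= min_count
    threshold = mean(baseline) + sigmas * pstdev(baseline)
    # The floor stops an all-zero baseline from alerting on a single failure.
    return current >= max(threshold, min_count)
```

The absolute floor matters for small pools, where a baseline of zeros would otherwise turn every isolated CDP failure into an alert.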
Structured log event (JSON)
{
  "ts": "2026-06-17T10:22:01Z",
  "event": "job_finished",
  "job_id": "uuid",
  "profile_id": "mlx-uuid",
  "client_id": "acme",
  "platform": "tiktok_shop",
  "tier": "prod",
  "proxy_pool": "vn-resi-01",
  "start_ms": 8200,
  "cdp_connect_ms": 1100,
  "reconnect_count": 0,
  "status": "ok",
  "ban_signal": false
}
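A worker could emit this event as one JSON line per job. The sketch below uses only the standard library; the helper name `build_job_event` and the choice to treat the six identity fields as mandatory are assumptions, not part of the schema above.

```python
import json
from datetime import datetime, timezone

# Assumption: these context fields must always be present on an event.
REQUIRED_FIELDS = ("job_id", "profile_id", "client_id",
                   "platform", "tier", "proxy_pool")


def build_job_event(status: str, start_ms: int, cdp_connect_ms: int,
                    reconnect_count: int = 0, ban_signal: bool = False,
                    **context: str) -> str:
    """Serialize one job_finished event as a single compact JSON line."""
    missing = [name for name in REQUIRED_FIELDS if name not in context]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    event = {
        # UTC timestamp in the same format as the sample event.
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "event": "job_finished",
        **context,
        "start_ms": start_ms,
        "cdp_connect_ms": cdp_connect_ms,
        "reconnect_count": reconnect_count,
        "status": status,
        "ban_signal": ban_signal,
    }
    return json.dumps(event, separators=(",", ":"))
```

Rejecting events with missing identity fields at write time keeps later per-client and per-pool breakdowns from silently dropping rows.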
Python instrumentation snippet
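import os

from prometheus_client import Counter, Histogram, start_http_server

# Platform this worker serves; the MLX_PLATFORM variable name is an assumption.
PLATFORM = os.environ.get("MLX_PLATFORM", "tiktok_shop")

START = Counter("mlx_profile_start_total", "Profile starts", ["status"])
# The default buckets stop at 10s, so a p95 > 45s alert needs wider ones.
START_LAT = Histogram("mlx_profile_start_duration_seconds", "Start latency",
                      buckets=(1, 2, 5, 10, 20, 30, 45, 60, 90, 120))
RECONNECT = Counter("mlx_cdp_reconnect_total", "CDP reconnects")
BAN = Counter("mlx_ban_signal_total", "Ban signals detected", ["platform"])

# Call start_http_server(9100) once at worker startup to serve /metrics.

def record_start(ok: bool, seconds: float) -> None:
    START.labels(status="ok" if ok else "error").inc()
    if ok:
        # Only successful starts feed the latency histogram.
        START_LAT.observe(seconds)

def detect_ban_signal(page) -> bool:
    # Platform-specific: captcha wall, logout redirect, appeal banner.
    url = page.url
    if "suspend" in url or "captcha" in url:
        BAN.labels(platform=PLATFORM).inc()
        return True
    return False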
Expose :9100/metrics on workers; scrape with Prometheus. Distributed traces: OpenTelemetry recipe. Pair with queue: queue worker + webhook receiver.
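The URL check in detect_ban_signal matches anywhere in the URL, so a query string such as ?ref=captcha would count as a ban signal. A slightly stricter variant, sketched here under assumptions, looks only at host and path and keeps markers per platform. The marker table is a placeholder: the real redirect paths for each platform have to come from your own observations.

```python
from urllib.parse import urlsplit

# Hypothetical marker table; replace with paths observed on each platform.
BAN_PATH_MARKERS = {
    "tiktok_shop": ("suspend", "captcha"),
}
DEFAULT_MARKERS = ("suspend", "captcha")


def url_has_ban_marker(url: str, platform: str) -> bool:
    """Return True when the host or path of `url` contains a ban marker."""
    parts = urlsplit(url)
    # Ignore query string and fragment to avoid matching tracking parameters.
    haystack = f"{parts.netloc}{parts.path}".lower()
    markers = BAN_PATH_MARKERS.get(platform, DEFAULT_MARKERS)
    return any(marker in haystack for marker in markers)
```

The function stays free of metrics calls so it can be unit-tested with plain strings; the caller increments the ban counter when it returns True.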
Grafana dashboard panels
- Start success rate: sum(rate(mlx_profile_start_total{status="ok"}[5m])) / sum(rate(mlx_profile_start_total[5m]))
- p95 start latency: by worker host
- Ban signals / day: split by platform and client_id
- Reconnect rate: profiles needing CDP reconnect
- Proxy pool health: preflight fail counter
- Queue depth: Redis LLEN mlx:jobs
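The queue depth panel needs the LLEN value exposed as a metric. As a dependency-free sketch, the function below reads the list length from any client with an `llen` method (redis-py's `Redis.llen` has this shape) and renders it in the Prometheus text exposition format. The metric name `mlx_queue_depth` and the `queue` label are assumptions; in a real sidecar a Gauge from prometheus_client would normally replace the hand-built string.

```python
from typing import Protocol


class ListLengthClient(Protocol):
    """Anything that can report a Redis list length, e.g. redis.Redis."""

    def llen(self, name: str) -> int: ...


def render_queue_depth(client: ListLengthClient, key: str = "mlx:jobs") -> str:
    """Render the current queue depth as Prometheus text exposition lines."""
    depth = int(client.llen(key))
    return (
        "# HELP mlx_queue_depth Jobs waiting in the Redis list\n"
        "# TYPE mlx_queue_depth gauge\n"
        f'mlx_queue_depth{{queue="{key}"}} {depth}\n'
    )
```

Passing the client in as a parameter keeps the function testable without a running Redis instance.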
Alert rules (examples)
# Prod ban signal: any increase (assumes the series carry a tier label)
increase(mlx_ban_signal_total{tier="prod"}[1h]) > 0
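# Start failures: more than 5% of starts over 15 minutes.
# sum() is required: without it the division matches error series only
# against themselves and always yields 1.
sum(rate(mlx_profile_start_total{status="error"}[15m]))
/ sum(rate(mlx_profile_start_total[15m])) > 0.05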
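The intent of the start-failure rule is the share of profile starts that failed during the window. The same arithmetic on two counter snapshots, as a small sketch for checking thresholds against logged values (the function name and the snapshot format are assumptions):

```python
from typing import Mapping


def start_error_ratio(before: Mapping[str, float],
                      after: Mapping[str, float]) -> float:
    """Fraction of starts that failed between two counter snapshots.

    Each mapping is keyed by the `status` label value ("ok", "error").
    """
    deltas = {}
    for status in set(before) | set(after):
        old = before.get(status, 0.0)
        new = after.get(status, 0.0)
        # Counters only go up; a lower value means the worker restarted,
        # so the new value is the whole increase since the reset.
        deltas[status] = new if new < old else new - old
    total = sum(deltas.values())
    return deltas.get("error", 0.0) / total if total else 0.0
```

For example, going from 100 ok / 2 error to 190 ok / 12 error gives 10 failures out of 100 starts, a ratio of 0.10, which is above the 0.05 threshold.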
CMDB feedback loop
When ban_signal fires, auto-tag CMDB tier=burn, pause queue routing, notify ops Slack. Never auto-delete profiles — clone for forensics: clone & forensics recipe. Full dashboard: Grafana recipe. Custom Redis metrics: Redis exporter sidecar.
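The feedback loop can be sketched as one handler that receives a log event and calls out to your own integrations. Everything in `BanActions` is hypothetical: the four callables stand in for whatever CMDB, queue, profile, and Slack clients you already run, and the order of the calls is a choice made here, not something the recipes prescribe.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BanActions:
    """Hypothetical integration points; each wraps one of your own clients."""
    tag_cmdb: Callable[[str, str], None]   # (profile_id, tier)
    pause_routing: Callable[[str], None]   # (profile_id)
    clone_profile: Callable[[str], str]    # (profile_id) -> clone id
    notify_ops: Callable[[str], None]      # (message)


def handle_ban_signal(event: dict, actions: BanActions) -> Optional[str]:
    """Run the feedback loop for one event; return the clone id, if any."""
    if not event.get("ban_signal"):
        return None
    profile_id = event["profile_id"]
    actions.tag_cmdb(profile_id, "burn")
    actions.pause_routing(profile_id)
    # Clone instead of deleting so the profile state survives for forensics.
    clone_id = actions.clone_profile(profile_id)
    actions.notify_ops(
        f"ban signal: profile={profile_id} "
        f"platform={event.get('platform')} clone={clone_id}"
    )
    return clone_id
```

There is deliberately no delete call anywhere in the handler, which matches the rule that profiles are never removed automatically.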
Disclosure: MLX-MMO is affiliated with Multilogin.