Teams running 100+ profiles cannot debug from Telegram screenshots. You need metrics broken down by profile, proxy pool, and job type, tied to your queue worker and CDP layer. This guide defines the event schema and the Prometheus counters that flag likely bans early, before finance notices.

Golden signals (Multilogin-specific)

Metric | Type | Alert if
mlx_profile_start_total{status} | Counter | error rate > 5% / 15m
mlx_profile_start_duration_seconds | Histogram | p95 > 45s
mlx_cdp_connect_fail_total | Counter | spike vs baseline
mlx_cdp_reconnect_total | Counter | > 2 per job avg
mlx_job_duration_seconds | Histogram | SLA breach
mlx_ban_signal_total{platform} | Counter | any prod tier uptick
mlx_proxy_preflight_fail_total | Counter | pool unhealthy
cloud_phone_adb_offline_total | Counter | device farm degradation

Structured log event (JSON)

{
  "ts": "2026-06-17T10:22:01Z",
  "event": "job_finished",
  "job_id": "uuid",
  "profile_id": "mlx-uuid",
  "client_id": "acme",
  "platform": "tiktok_shop",
  "tier": "prod",
  "proxy_pool": "vn-resi-01",
  "start_ms": 8200,
  "cdp_connect_ms": 1100,
  "reconnect_count": 0,
  "status": "ok",
  "ban_signal": false
}
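
One way to produce this event from a worker is to build it from a typed record and write it as a single JSON line. The sketch below is a minimal illustration, not the author's implementation: the `JobEvent` class and the `format_event` and `emit` helpers are hypothetical names, and writing to stdout assumes a log shipper picks the lines up from there.

```python
import json
import sys
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class JobEvent:
    """Mirrors the fields of the job_finished event (hypothetical helper class)."""
    job_id: str
    profile_id: str
    client_id: str
    platform: str
    tier: str
    proxy_pool: str
    start_ms: int
    cdp_connect_ms: int
    reconnect_count: int
    status: str
    ban_signal: bool
    event: str = "job_finished"


def format_event(ev: JobEvent, now: Optional[datetime] = None) -> str:
    """Serialize one event as a single JSON line with a UTC timestamp."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {"ts": now.strftime("%Y-%m-%dT%H:%M:%SZ"), **asdict(ev)}
    return json.dumps(record, separators=(",", ":"))


def emit(ev: JobEvent) -> None:
    # One event per line on stdout so a log shipper can parse it directly.
    sys.stdout.write(format_event(ev) + "\n")
```

Keeping the schema in one dataclass means a missing field fails at construction time in the worker rather than surfacing later as a gap in a dashboard.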

Python instrumentation snippet

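from prometheus_client import Counter, Histogram, start_http_server

START = Counter("mlx_profile_start_total", "Profile starts", ["status"])
# Default buckets stop at 10s; extend them so the p95 > 45s alert is measurable.
START_LAT = Histogram(
    "mlx_profile_start_duration_seconds",
    "Start latency",
    buckets=(1, 2.5, 5, 10, 20, 30, 45, 60, 90, 120),
)
RECONNECT = Counter("mlx_cdp_reconnect_total", "CDP reconnects")
BAN = Counter("mlx_ban_signal_total", "Ban signals detected", ["platform"])

# Call start_http_server(9100) once at worker startup to serve /metrics.

def record_start(ok: bool, seconds: float) -> None:
    # Count every attempt; record latency only for successful starts.
    START.labels(status="ok" if ok else "error").inc()
    if ok:
        START_LAT.observe(seconds)

def detect_ban_signal(page, platform: str) -> bool:
    # platform-specific: captcha wall, logout redirect, appeal banner
    # `page` is any object exposing the current URL as a .url attribute.
    url = page.url.lower()
    if "suspend" in url or "captcha" in url:
        BAN.labels(platform=platform).inc()
        return True
    return False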
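
The "platform-specific" comment in detect_ban_signal can be made concrete with a per-platform marker table. The following is a sketch under stated assumptions: `BAN_MARKERS` and `url_has_ban_marker` are hypothetical names, and the marker strings are placeholders to be replaced with paths actually observed on each platform.

```python
from urllib.parse import urlparse

# Hypothetical marker lists: replace with paths you have actually observed.
BAN_MARKERS = {
    "tiktok_shop": ("suspend", "captcha", "appeal"),
    "default": ("suspend", "captcha"),
}


def url_has_ban_marker(url: str, platform: str) -> bool:
    """Return True if the URL path or query contains a marker for this platform."""
    parsed = urlparse(url.lower())
    # Only the path and query are searched, so a hostname that happens to
    # contain "captcha" does not count as a ban signal.
    haystack = parsed.path + "?" + parsed.query
    markers = BAN_MARKERS.get(platform, BAN_MARKERS["default"])
    return any(marker in haystack for marker in markers)
```

URL matching alone will miss in-page banners, so treat this as a first filter rather than a complete detector.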

Expose /metrics on port 9100 of each worker and scrape it with Prometheus. For distributed traces, see the OpenTelemetry recipe. Pair this with the queue setup: queue worker + webhook receiver.

Grafana dashboard panels

  1. Start success rate: sum(rate(mlx_profile_start_total{status="ok"}[5m])) / sum(rate(mlx_profile_start_total[5m]))
  2. p95 start latency: by worker host
  3. Ban signals / day: split by platform (a per-client split needs a client_id label on the counter, or the client_id field from the log events)
  4. Reconnect rate: profiles needing CDP reconnect
  5. Proxy pool health: preflight fail counter
  6. Queue depth: Redis LLEN mlx:jobs

Alert rules (examples)

# Prod ban signal: any increase
# Assumes mlx_ban_signal_total carries a tier label; the snippet above sets only platform.
increase(mlx_ban_signal_total{tier="prod"}[1h]) > 0

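# Start failures: error share above 5% over 15m
# sum() on both sides; without it the division matches error against error and always yields 1.
sum(rate(mlx_profile_start_total{status="error"}[15m]))
  / sum(rate(mlx_profile_start_total[15m])) > 0.05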
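
The start-failure rule compares how much the error counter grew with how much all start counters grew over the same window. A simplified sketch of that arithmetic, useful for checking a threshold against real counter samples, is below. The `increase` and `error_ratio` helpers are hypothetical, and the reset handling is a simplification of what Prometheus does.

```python
def increase(start: float, end: float) -> float:
    """Counter growth over a window; a drop is treated as a reset to zero."""
    return end - start if end >= start else end


def error_ratio(start: dict[str, float], end: dict[str, float]) -> float:
    """Error starts divided by all starts in the window, summed across statuses."""
    per_status = {s: increase(start.get(s, 0.0), v) for s, v in end.items()}
    total = sum(per_status.values())
    if total == 0:
        return 0.0  # no starts in the window, nothing to alert on
    return per_status.get("error", 0.0) / total
```

For example, counters moving from ok=100, error=2 to ok=190, error=12 give 10 errors out of 100 starts, a ratio of 0.1, which is above the 0.05 threshold.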

CMDB feedback loop

When ban_signal fires, automatically tag the profile tier=burn in the CMDB, pause queue routing, and notify ops in Slack. Never auto-delete profiles; clone them for forensics instead (see the clone & forensics recipe). For the full dashboard, see the Grafana recipe. For custom Redis metrics, see the Redis exporter sidecar.
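
The feedback loop can be sketched as a single handler that reacts to each job event. This is an illustration only: `FeedbackLoop` and its fields are hypothetical in-memory stand-ins for the CMDB, the queue router, and Slack, and pausing routing per profile is an assumption about how the queue is organized.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackLoop:
    """In-memory stand-ins for the CMDB, queue router, and Slack (all hypothetical)."""
    cmdb_tags: dict[str, dict[str, str]] = field(default_factory=dict)
    paused_profiles: set[str] = field(default_factory=set)
    notifications: list[str] = field(default_factory=list)

    def on_job_event(self, event: dict) -> bool:
        """Apply the ban response for one job event; return True if it acted."""
        if not event.get("ban_signal"):
            return False
        profile_id = event["profile_id"]
        # 1. Tag the profile in the CMDB so reports show it as burned.
        self.cmdb_tags.setdefault(profile_id, {})["tier"] = "burn"
        # 2. Stop routing new jobs to it. The profile itself is never deleted.
        self.paused_profiles.add(profile_id)
        # 3. Tell ops, with enough context to start forensics.
        self.notifications.append(
            f"ban signal: profile={profile_id} "
            f"platform={event.get('platform')} "
            f"proxy_pool={event.get('proxy_pool')}"
        )
        return True
```

In a real deployment each of the three steps would call the corresponding service. Keeping them in one handler makes the order explicit and leaves no code path that deletes a profile.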

Disclosure: MLX-MMO is affiliated with Multilogin.