Multilogin Automation Observability

Teams running 100+ profiles cannot debug from Telegram screenshots. You need metrics per profile, proxy pool, and job type, all tied to your queue worker and CDP layer. This guide defines the event schema and Prometheus counters that surface ban risk early, before it shows up as lost revenue.

Golden signals (Multilogin-specific)

Metric	Type	Alert if
`mlx_profile_start_total{status}`	Counter	error rate > 5% / 15m
`mlx_profile_start_duration_seconds`	Histogram	p95 > 45s
`mlx_cdp_connect_fail_total`	Counter	spike vs baseline
`mlx_cdp_reconnect_total`	Counter	> 2 per job avg
`mlx_job_duration_seconds`	Histogram	SLA breach
`mlx_ban_signal_total{platform,tier}`	Counter	any prod tier uptick
`mlx_proxy_preflight_fail_total`	Counter	pool unhealthy
`cloud_phone_adb_offline_total`	Counter	device farm degradation

Structured log event (JSON)

{
  "ts": "2026-06-17T10:22:01Z",
  "event": "job_finished",
  "job_id": "uuid",
  "profile_id": "mlx-uuid",
  "client_id": "acme",
  "platform": "tiktok_shop",
  "tier": "prod",
  "proxy_pool": "vn-resi-01",
  "start_ms": 8200,
  "cdp_connect_ms": 1100,
  "reconnect_count": 0,
  "status": "ok",
  "ban_signal": false
}
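
One way to emit this event from a worker is sketched below. It is a minimal illustration, not the guide's own implementation: the helper names build_job_event and emit are hypothetical, and the field list simply mirrors the JSON above. It writes one JSON object per line so a log shipper can parse events line by line.

```python
import json
import sys
from datetime import datetime, timezone


def build_job_event(*, job_id, profile_id, client_id, platform, tier,
                    proxy_pool, start_ms, cdp_connect_ms, reconnect_count,
                    status, ban_signal, now=None):
    """Return a job_finished event dict with the fields shown above."""
    # Hypothetical helper: 'now' is injectable so the timestamp is testable.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "ts": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "event": "job_finished",
        "job_id": job_id,
        "profile_id": profile_id,
        "client_id": client_id,
        "platform": platform,
        "tier": tier,
        "proxy_pool": proxy_pool,
        "start_ms": start_ms,
        "cdp_connect_ms": cdp_connect_ms,
        "reconnect_count": reconnect_count,
        "status": status,
        "ban_signal": ban_signal,
    }


def emit(event, stream=sys.stdout):
    """Write the event as a single compact JSON line."""
    stream.write(json.dumps(event, separators=(",", ":")) + "\n")
```

Keeping the builder separate from the writer means the same dict can also be handed to a metrics or alerting hook without re-parsing the log line.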

Python instrumentation snippet

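from prometheus_client import Counter, Histogram, start_http_server

START = Counter("mlx_profile_start_total", "Profile starts", ["status"])
# Default buckets stop at 10s, so extend them past the 45s alert threshold.
START_LAT = Histogram(
    "mlx_profile_start_duration_seconds",
    "Start latency",
    buckets=(1, 2.5, 5, 10, 20, 30, 45, 60, 90, 120),
)
RECONNECT = Counter("mlx_cdp_reconnect_total", "CDP reconnects")
# tier is a label so alert rules can filter on tier="prod".
BAN = Counter("mlx_ban_signal_total", "Ban signals detected", ["platform", "tier"])

def record_start(ok: bool, seconds: float) -> None:
    START.labels(status="ok" if ok else "error").inc()
    if ok:
        # Only successful starts feed the latency histogram.
        START_LAT.observe(seconds)

def detect_ban_signal(page, platform: str, tier: str) -> bool:
    # platform-specific: captcha wall, logout redirect, appeal banner
    url = page.url
    if "suspend" in url or "captcha" in url:
        BAN.labels(platform=platform, tier=tier).inc()
        return True
    return False

def serve_metrics(port: int = 9100) -> None:
    # Call once at worker startup; serves /metrics from a background thread.
    start_http_server(port)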

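Expose :9100/metrics on each worker and scrape it with Prometheus. Port 9100 is also the node_exporter default, so pick another port on hosts that already run it. For distributed traces, see the OpenTelemetry recipe; for the queue side, see the queue worker and webhook receiver recipes.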

Grafana dashboard panels

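Start success rate: sum(rate(mlx_profile_start_total{status="ok"}[5m])) / sum(rate(mlx_profile_start_total[5m]))
p95 start latency: histogram_quantile(0.95, sum by (le, instance) (rate(mlx_profile_start_duration_seconds_bucket[5m]))), one series per worker host
Ban signals / day: increase(mlx_ban_signal_total[1d]), split by platform, and by client_id if you add that label to the counter
Reconnect rate: rate(mlx_cdp_reconnect_total[5m]), to spot profiles needing CDP reconnects
Proxy pool health: rate(mlx_proxy_preflight_fail_total[5m])
Queue depth: Redis LLEN mlx:jobs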
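
The counters in the instrumentation snippet carry no client_id label, so one way to get the per-client ban split and the reconnects-per-job average is to aggregate the structured log events instead. The sketch below assumes events shaped like the JSON example earlier; the function name summarize_jobs is hypothetical.

```python
import collections


def summarize_jobs(events):
    """Aggregate job_finished events into dashboard-style numbers."""
    jobs = [e for e in events if e.get("event") == "job_finished"]
    # Count ban signals per (platform, client_id) pair.
    bans = collections.Counter(
        (e["platform"], e["client_id"]) for e in jobs if e.get("ban_signal")
    )
    total_reconnects = sum(e.get("reconnect_count", 0) for e in jobs)
    # Average CDP reconnects per job; 0.0 when there are no jobs.
    avg_reconnects = total_reconnects / len(jobs) if jobs else 0.0
    return {
        "jobs": len(jobs),
        "avg_reconnects": avg_reconnects,
        "bans": dict(bans),
    }
```

Keeping client_id in logs rather than in metric labels avoids growing the number of Prometheus series with every new client.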

Alert rules (examples)

# Prod ban signal any increase
increase(mlx_ban_signal_total{tier="prod"}[1h]) > 0

# Start failures
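# (aggregate both sides; without sum() the division matches error-to-error series and always yields 1)
sum(rate(mlx_profile_start_total{status="error"}[15m]))
  / sum(rate(mlx_profile_start_total[15m])) > 0.05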
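
To sanity-check the 5% start-failure threshold outside Prometheus, the same ratio can be computed from two snapshots of the counter taken a window apart. This is a simplified sketch with hypothetical function names; unlike Prometheus, it does not handle counter resets after a worker restart.

```python
def error_ratio(prev, curr):
    """Share of failed starts between two counter snapshots.

    prev and curr map a status label ("ok", "error") to the cumulative
    counter value at the start and end of the window.
    """
    deltas = {s: curr.get(s, 0) - prev.get(s, 0) for s in curr}
    total = sum(deltas.values())
    if total <= 0:
        # No starts in the window (or a counter reset): nothing to report.
        return 0.0
    return deltas.get("error", 0) / total


def should_alert(prev, curr, threshold=0.05):
    """True when the failure share exceeds the threshold (5% by default)."""
    return error_ratio(prev, curr) > threshold
```

For example, going from 100 ok / 2 error to 190 ok / 12 error is 10 failures out of 100 starts, which is above the threshold.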

CMDB feedback loop

When ban_signal fires, auto-tag the profile in the CMDB as tier=burn, pause queue routing, and notify ops in Slack. Never auto-delete profiles; clone them for forensics instead (see the clone & forensics recipe). The full dashboard is in the Grafana recipe, and custom Redis metrics are covered by the Redis exporter sidecar recipe.
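
The feedback loop could be wired roughly as follows. This is a sketch under stated assumptions: BanResponder and its four callables are hypothetical stand-ins for your CMDB client, queue router, Slack notifier, and profile-clone call, none of which are specified in this guide. It also assumes routing is paused per profile and takes the forensic clone before notifying so the message can include the clone id.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BanResponder:
    # All four callables are hypothetical integration points.
    tag_cmdb: Callable[[str, str], None]    # (profile_id, tier)
    pause_routing: Callable[[str], None]    # (profile_id)
    clone_profile: Callable[[str], str]     # (profile_id) -> clone id
    notify_ops: Callable[[str], None]       # (message)

    def handle(self, event: dict) -> bool:
        """React to a job_finished event; return True if a ban was handled."""
        if not event.get("ban_signal"):
            return False
        profile_id = event["profile_id"]
        self.tag_cmdb(profile_id, "burn")
        self.pause_routing(profile_id)
        # Clone instead of deleting so the profile state survives for forensics.
        clone_id = self.clone_profile(profile_id)
        self.notify_ops(
            f"ban signal on {profile_id} ({event.get('platform')}); "
            f"forensic clone {clone_id}"
        )
        return True
```

Injecting the side effects as callables keeps the handler testable without a live CMDB, queue, or Slack webhook.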


Disclosure: MLX-MMO is affiliated with Multilogin.
