The profile pool manager defines health probes inline. This recipe packages them as a scheduled cron job with concurrency limits, Prometheus counters, and ops alerts. Goal: catch Mimic OOM, stale proxies, and API flakiness before a job fails mid-queue during seller payout week.
Probe contract
- No platform login: start profile headless → CDP WebSocket ping → stop
- Skip leased profiles: if mlx:lease:{id} exists, defer the probe
- 3 fails in 24h → SMOVE prod → warm or burn, per policy
- Success → reset the mlx:health_fail:{id} counter
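The contract above reduces to a small decision table. The sketch below is one way to express it as a pure function; `ProbeAction` and `decide` are hypothetical names introduced for illustration, and the default threshold of 3 is taken from the rule above.

```python
from enum import Enum


class ProbeAction(Enum):
    DEFER = "defer"    # profile is leased, probe it on a later run
    RESET = "reset"    # probe passed, clear the failure counter
    COUNT = "count"    # probe failed, still below the demotion threshold
    DEMOTE = "demote"  # probe failed and the threshold is reached


def decide(leased: bool, ok: bool, prior_fails: int, max_fails: int = 3) -> ProbeAction:
    """Map one probe outcome to an action (hypothetical helper).

    prior_fails is the failure count already recorded in the last 24h,
    before this probe ran. When leased is True the probe never runs,
    so ok is ignored.
    """
    if leased:
        return ProbeAction.DEFER
    if ok:
        return ProbeAction.RESET
    if prior_fails + 1 >= max_fails:
        return ProbeAction.DEMOTE
    return ProbeAction.COUNT
```

Keeping the decision separate from Redis and the Multilogin API makes the policy easy to unit test before it is wired into the runner.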
Cron runner (Python)
import asyncio
import os
import time

import redis
from prometheus_client import Counter, Histogram, start_http_server

# Assumption: a synchronous redis-py client built from the REDIS_URL env var
# that the CronJob sets; health_probe and handle_result are defined elsewhere.
r = redis.Redis.from_url(os.environ["REDIS_URL"], decode_responses=True)

PROBE_OK = Counter("mlx_health_probe_total", "Health probes", ["tier", "status"])
PROBE_DURATION = Histogram("mlx_health_probe_seconds", "Probe duration", ["tier"])


async def probe_batch(profile_ids: list[str], tier: str, sem: asyncio.Semaphore):
    async def one(pid):
        async with sem:
            # Leased profiles are in use by a worker; defer them to a later run.
            if r.exists(f"mlx:lease:{pid}"):
                return
            t0 = time.perf_counter()
            ok = await health_probe(pid)  # from profile pool recipe
            PROBE_DURATION.labels(tier=tier).observe(time.perf_counter() - t0)
            PROBE_OK.labels(tier=tier, status="ok" if ok else "fail").inc()
            handle_result(pid, tier, ok)

    await asyncio.gather(*(one(pid) for pid in profile_ids))


async def main():
    start_http_server(9102)  # expose metrics for Prometheus to scrape
    sem = asyncio.Semaphore(3)  # match Multilogin concurrent cap
    for tier in ("prod", "warm"):
        ids = list(r.smembers(f"mlx:pool:{tier}"))
        await probe_batch(ids, tier, sem)


if __name__ == "__main__":
    asyncio.run(main())
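To see how the semaphore bounds the number of profiles started at once, here is a minimal self-contained sketch. `fake_probe` and `run_demo` are hypothetical stand-ins: the real `health_probe` is replaced with a short sleep, and the sketch only records how many probes overlap.

```python
import asyncio


async def fake_probe(pid: str, state: dict, sem: asyncio.Semaphore) -> bool:
    """Stand-in for health_probe(); tracks how many probes run concurrently."""
    async with sem:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(0.01)  # simulated start -> ping -> stop
        state["running"] -= 1
        return True


async def run_demo(n_profiles: int = 10, cap: int = 3) -> int:
    """Probe n_profiles with at most `cap` in flight; return the observed peak."""
    sem = asyncio.Semaphore(cap)
    state = {"running": 0, "peak": 0}
    await asyncio.gather(
        *(fake_probe(f"p{i}", state, sem) for i in range(n_profiles))
    )
    return state["peak"]


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

All tasks are created up front by `gather`, but only `cap` of them are inside the `async with sem` body at any moment, so the observed peak never exceeds the cap.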
Kubernetes CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mlx-health-probe
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: probe
              image: your-registry/mlx-health-probe:latest
              env:
                - name: MLX_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: mlx-secrets
                      key: token
                - name: REDIS_URL
                  value: redis://mlx-redis:6379/0
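Run every 15 minutes against the prod and warm tiers, not the burn tier (kept for forensics only). Offset the schedule from queue worker peak hours so probes do not compete with workers for the concurrent profile cap.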
Prometheus & Grafana
| Metric | Alert |
|---|---|
| mlx_health_probe_total{status="fail"} | Spike > 10% of pool in 1h |
| mlx_pool_depth{tier="prod"} | < 5; pair with Redis sidecar |
| mlx_health_probe_seconds p99 | > 60s; MLX API degradation |
Dashboard panels and health probe alerts: Grafana recipe (PromQL + Alertmanager for mlx_health_probe_*). Traces on slow probes: OpenTelemetry.
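As a worked example of the first alert row, the sketch below computes the failure share of a pool and compares it with the 10% threshold. It assumes the alert counts distinct failing profiles over the last hour relative to pool size; `fail_ratio` and `should_alert` are hypothetical helpers, not part of any Prometheus or Grafana API.

```python
def fail_ratio(fail_count: int, pool_size: int) -> float:
    """Fraction of the pool that failed a probe in the window."""
    if pool_size <= 0:
        return 0.0  # an empty pool is covered by the pool depth alert instead
    return fail_count / pool_size


def should_alert(fail_count: int, pool_size: int, threshold: float = 0.10) -> bool:
    """True when failures strictly exceed the threshold share of the pool."""
    return fail_ratio(fail_count, pool_size) > threshold
```

With a pool of 40 profiles, 5 failures is 12.5% and crosses the threshold, while 4 failures is exactly 10% and does not, because the comparison is strict.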
Notification on demotion
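def handle_result(profile_id: str, tier: str, ok: bool):
    # sync_pool and post_slack are defined elsewhere (not shown in this recipe).
    if ok:
        # A passing probe clears the failure history.
        r.delete(f"mlx:health_fail:{profile_id}")
        return
    fails = r.incr(f"mlx:health_fail:{profile_id}")
    # The TTL is refreshed on every failure, so the 24h window slides.
    r.expire(f"mlx:health_fail:{profile_id}", 86400)
    if fails >= 3:
        sync_pool(profile_id, tier, "burned")
        post_slack("#mlx-capacity", f"Profile {profile_id} demoted after 3 probe fails")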
Route these notifications through Alertmanager to Slack. Escalate to PagerDuty only when the prod pool drops below its minimum.
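The counting logic can be exercised without a live Redis or Slack. The sketch below is a simplified version of the handler with the store and the demotion side effect injected; `FakeRedis` and the `demoted` list are hypothetical test doubles that implement only the three calls used here.

```python
class FakeRedis:
    """Tiny in-memory stand-in for incr/expire/delete (test-only)."""

    def __init__(self):
        self.data = {}
        self.ttl = {}

    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def expire(self, key, seconds):
        self.ttl[key] = seconds

    def delete(self, key):
        self.data.pop(key, None)
        self.ttl.pop(key, None)


def handle_result(r, profile_id, ok, demoted, max_fails=3):
    """Count failures per profile and record a demotion at the threshold."""
    key = f"mlx:health_fail:{profile_id}"
    if ok:
        r.delete(key)
        return
    fails = r.incr(key)
    r.expire(key, 86400)
    if fails >= max_fails:
        demoted.append(profile_id)  # stands in for sync_pool + post_slack


r = FakeRedis()
demoted = []
# Two failures, a success that resets the counter, then three failures in a row.
for outcome in (False, False, True, False, False, False):
    handle_result(r, "p1", outcome, demoted)
```

The success in the middle of the sequence clears the counter, so the profile is demoted only once, after the final run of three consecutive failures.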
Disclosure: MLX-MMO is affiliated with Multilogin.