The profile pool manager defines health probes inline. This recipe packages them as a scheduled cron job with concurrency limits, Prometheus counters, and ops alerts. Goal: catch Mimic OOM, stale proxies, and API flakiness before a job fails mid-queue during a seller payout week.

Probe contract

Cron runner (Python)

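import asyncio
import os
import time

import redis
from prometheus_client import Counter, Histogram, start_http_server

# health_probe comes from the profile pool recipe; handle_result is defined
# under "Notification on demotion". Import both from wherever you keep them.

# Assumes REDIS_URL is set as in the CronJob manifest below.
# decode_responses=True makes smembers() return str IDs instead of bytes.
r = redis.Redis.from_url(os.environ["REDIS_URL"], decode_responses=True)

PROBE_OK = Counter("mlx_health_probe_total", "Health probes", ["tier", "status"])
PROBE_DURATION = Histogram("mlx_health_probe_seconds", "Probe duration", ["tier"])

async def probe_batch(profile_ids: list[str], tier: str, sem: asyncio.Semaphore):
    async def one(pid):
        async with sem:
            # Skip profiles that are currently leased.
            if r.exists(f"mlx:lease:{pid}"):
                return
            t0 = time.perf_counter()
            try:
                ok = await health_probe(pid)  # from profile pool recipe
            except Exception:
                ok = False  # a probe that raises counts as a failure, not a crashed batch
            PROBE_DURATION.labels(tier=tier).observe(time.perf_counter() - t0)
            PROBE_OK.labels(tier=tier, status="ok" if ok else "fail").inc()
            handle_result(pid, tier, ok)
    await asyncio.gather(*(one(pid) for pid in profile_ids))

async def main():
    # The metrics endpoint only exists while this job is running.
    start_http_server(9102)
    sem = asyncio.Semaphore(3)  # match Multilogin concurrent cap
    for tier in ("prod", "warm"):
        ids = list(r.smembers(f"mlx:pool:{tier}"))
        await probe_batch(ids, tier, sem)

if __name__ == "__main__":
    asyncio.run(main())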
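
The concurrency cap in the runner above can be checked in isolation. The following is a minimal sketch, not the recipe's own code: `probe_all` and `fake_probe` are hypothetical names, and `fake_probe` stands in for `health_probe` so the example runs without Redis or the MLX API. It records how many probes are in flight at once and simulates one probe raising an error.

```python
import asyncio

async def probe_all(profile_ids, probe, limit=3):
    """Run probe(pid) for every id with at most `limit` probes in flight."""
    sem = asyncio.Semaphore(limit)
    results = {}

    async def one(pid):
        async with sem:
            try:
                results[pid] = await probe(pid)
            except Exception:
                # A probe that raises is recorded as a failure so the
                # rest of the batch still completes.
                results[pid] = False

    await asyncio.gather(*(one(pid) for pid in profile_ids))
    return results

# Hypothetical stand-in for health_probe that tracks peak concurrency.
in_flight = 0
peak = 0

async def fake_probe(pid):
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)  # simulate a slow API call
    in_flight -= 1
    if pid == "p4":
        raise RuntimeError("simulated API error")
    return True

results = asyncio.run(probe_all([f"p{i}" for i in range(10)], fake_probe))
print("peak concurrency:", peak)
print("p4 healthy:", results["p4"])
```

Because every probe acquires the same semaphore, the peak never exceeds the limit regardless of pool size. Sharing one semaphore across tiers, as the runner does, keeps the cap global rather than per tier.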

Kubernetes CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: mlx-health-probe
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: probe
              image: your-registry/mlx-health-probe:latest
              env:
                - name: MLX_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: mlx-secrets
                      key: token
                - name: REDIS_URL
                  value: redis://mlx-redis:6379/0

Run every 15 minutes on the prod and warm tiers. Skip the burn tier, which is kept for forensics only. Stagger runs away from queue worker peak hours.

Prometheus & Grafana

Metric | Alert
mlx_health_probe_total{status="fail"} | Spike > 10% of pool in 1h
mlx_pool_depth{tier="prod"} | < 5 (pair with Redis sidecar)
mlx_health_probe_seconds p99 | > 60s (MLX API degradation)
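
To make the thresholds in the table concrete, here is a small sketch that applies them to plain numbers. It is an illustration only: the function name, the alert labels, and the reading of "10% of pool" as failures in the last hour divided by pool size are assumptions, and in practice these rules would live in Prometheus alerting rules rather than in Python.

```python
def evaluate_alerts(fails_1h, pool_size, prod_depth, p99_seconds):
    """Return the names of alerts that would fire for the given readings.

    Thresholds mirror the table: fail ratio > 10%, prod depth < 5,
    p99 probe duration > 60 s. Alert names are hypothetical.
    """
    alerts = []
    # Assumed interpretation: failed probes in the last hour / pool size.
    if pool_size > 0 and fails_1h / pool_size > 0.10:
        alerts.append("probe_fail_spike")
    if prod_depth < 5:
        alerts.append("prod_pool_low")
    if p99_seconds > 60:
        alerts.append("probe_slow")
    return alerts

# 3 failures in a pool of 20 is 15%, prod depth is 4, p99 is 75 s.
print(evaluate_alerts(fails_1h=3, pool_size=20, prod_depth=4, p99_seconds=75))
# A healthy reading fires nothing.
print(evaluate_alerts(fails_1h=1, pool_size=20, prod_depth=8, p99_seconds=12))
```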

Dashboard panels and health probe alerts: Grafana recipe (PromQL + Alertmanager for mlx_health_probe_*). Traces on slow probes: OpenTelemetry.

Notification on demotion

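def handle_result(profile_id: str, tier: str, ok: bool):
    # r, sync_pool, and post_slack come from the runner and the profile pool recipe.
    key = f"mlx:health_fail:{profile_id}"
    if ok:
        # A healthy probe clears the failure streak.
        r.delete(key)
        return
    fails = r.incr(key)
    r.expire(key, 86400)  # streak expires 24 hours after the last failure
    if fails >= 3:
        # Third consecutive failure: move the profile to the burned tier.
        sync_pool(profile_id, tier, "burned")
        post_slack("#mlx-capacity", f"Profile {profile_id} demoted after 3 probe fails")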
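
The three-strikes logic can be exercised without Redis. This sketch assumes a hypothetical in-memory `FakeCounterStore` that mimics only the `incr` and `delete` calls used above, and a hypothetical `record_probe` helper that returns whether the profile should be demoted. It leaves out the 24-hour expiry, `sync_pool`, and the Slack post.

```python
class FakeCounterStore:
    """In-memory stand-in for the Redis counter calls (hypothetical)."""

    def __init__(self):
        self.data = {}

    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def delete(self, key):
        self.data.pop(key, None)


def record_probe(store, profile_id, ok, threshold=3):
    """Update the failure streak; return True when demotion is due."""
    key = f"mlx:health_fail:{profile_id}"
    if ok:
        store.delete(key)  # success resets the streak
        return False
    return store.incr(key) >= threshold


store = FakeCounterStore()
# Two failures, a recovery, then three failures in a row.
outcomes = [False, False, True, False, False, False]
demoted = [record_probe(store, "p1", ok) for ok in outcomes]
print(demoted)
```

Only the final probe triggers demotion, because the successful probe in the middle resets the counter. Failures therefore have to be consecutive to count toward the threshold.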

Route through Alertmanager to Slack. Escalate to PagerDuty only when the prod pool drops below its minimum.

Disclosure: MLX-MMO is affiliated with Multilogin.