Prometheus counters tell you **that** jobs fail; OpenTelemetry traces tell you **where**. When a Shopee order-export job times out, you need span-level visibility: Was the lease acquired? Was the profile start slow? Did the CDP attach fail? Did the webhook retry? This recipe instruments the queue worker and exports its traces to Grafana Tempo or Jaeger.

Span hierarchy

mlx.job.run                    (root — job_id, client_id, platform)
├── mlx.lease.acquire          (profile_id, worker_id)
├── mlx.profile.start          (MLX API latency)
├── mlx.cdp.connect            (Playwright connectOverCDP)
├── mlx.task.execute           (task name, e.g. sync_orders)
├── mlx.profile.stop
└── mlx.webhook.post           (callback_url, status)

Python setup (OTLP exporter)

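import os
import socket

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Assumption: the worker ID comes from the environment, falling back to the hostname.
WORKER_ID = os.environ.get("WORKER_ID", socket.gethostname())

# Resource attributes are attached to every span this process emits.
resource = Resource.create({"service.name": "mlx-worker", "worker.id": WORKER_ID})
provider = TracerProvider(resource=resource)
# BatchSpanProcessor exports in the background; 4317 is the OTLP/gRPC port.
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="tempo:4317", insecure=True)
))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("mlx.automation")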

Instrument run_job

async def run_job(job: dict):
    # try_lease, start_profile, execute_task, stop_profile, RetryLater and the started
    # `playwright` object are expected to come from the worker module.
    with tracer.start_as_current_span("mlx.job.run") as root:
        root.set_attribute("mlx.job_id", job["job_id"])
        root.set_attribute("mlx.profile_id", job["profile_id"])
        root.set_attribute("mlx.platform", job.get("platform", ""))
        root.set_attribute("mlx.client_id", job.get("client_id", ""))

        with tracer.start_as_current_span("mlx.lease.acquire") as lease:
            lease.set_attribute("mlx.worker_id", WORKER_ID)
            if not try_lease(job["profile_id"], WORKER_ID):
                # By default the span records the exception and is marked as an error.
                raise RetryLater("profile leased")

        with tracer.start_as_current_span("mlx.profile.start"):
            port = await start_profile(job["profile_id"])

        try:
            with tracer.start_as_current_span("mlx.cdp.connect"):
                browser = await playwright.chromium.connect_over_cdp(f"http://127.0.0.1:{port}")

            with tracer.start_as_current_span("mlx.task.execute"):
                await execute_task(browser, job)
        finally:
            # Always stop the profile so a failed step does not leave a browser running.
            with tracer.start_as_current_span("mlx.profile.stop"):
                await stop_profile(job["profile_id"])

Propagate trace context to webhook

Inject the W3C traceparent header into the webhook POST so that your FastAPI receiver continues the same trace. This pairs with the webhook receiver.

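import httpx
from opentelemetry.propagate import inject

# Runs inside run_job, within the mlx.webhook.post span, so the header carries that span's IDs.
headers = {"Content-Type": "application/json"}
inject(headers)  # adds traceparent from the current context
async with httpx.AsyncClient() as client:  # async client: does not block the worker's event loop
    await client.post(callback_url, json=body, headers=headers)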
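When a webhook arrives without the expected trace link, inspecting the raw header is a quick first check. The W3C Trace Context format is `version-traceid-parentid-flags` in lowercase hex. The sketch below parses it with the standard library only; `parse_traceparent` and `TraceParent` are hypothetical names for this example, and in the receiver itself OpenTelemetry's propagator would normally do this work.

```python
import re
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex (four-field form only).
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed (hypothetical helper)."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff is forbidden and all-zero IDs are invalid per the spec.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


example = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

The `trace_id` field is the value to paste into Tempo's trace search when following a webhook back to the job that sent it.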

Grafana Tempo correlation

In Grafana, link traces to Prometheus metrics via exemplars on mlx_job_duration_seconds. In the dashboard, add a Tempo data source and filter traces with a TraceQL query such as:

{ resource.service.name = "mlx-worker" && span.mlx.platform = "lazada_id" && status = error }

This complements the Grafana health dashboard and the observability metrics.
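For the exemplar link to work, each observation on mlx_job_duration_seconds has to carry the trace ID of the job that produced it. The sketch below shows one way to build the exemplar labels from OpenTelemetry's integer IDs using only the standard library. `exemplar_labels` is a hypothetical helper, and the `trace_id` label name is an assumption that has to match whatever the Grafana Prometheus data source is configured to read.

```python
from typing import Dict

# OpenMetrics caps an exemplar's label set at 128 UTF-8 characters in total.
MAX_EXEMPLAR_CHARS = 128


def exemplar_labels(trace_id: int, span_id: int) -> Dict[str, str]:
    """Build exemplar labels from OpenTelemetry's integer IDs (hypothetical helper)."""
    if trace_id == 0 or span_id == 0:
        return {}  # invalid context: record the observation without an exemplar
    labels = {
        "trace_id": format(trace_id, "032x"),  # 128-bit ID as 32 hex chars
        "span_id": format(span_id, "016x"),    # 64-bit ID as 16 hex chars
    }
    size = sum(len(k) + len(v) for k, v in labels.items())
    if size > MAX_EXEMPLAR_CHARS:
        raise ValueError(f"exemplar labels use {size} characters")
    return labels


labels = exemplar_labels(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7)
# With prometheus_client this could be passed as:
#   JOB_DURATION.observe(elapsed_seconds, exemplar=labels)
# where JOB_DURATION is an assumed Histogram named mlx_job_duration_seconds.
```

The integer IDs would come from the current span's context inside run_job. Formatting them as zero-padded hex keeps them identical to the IDs Tempo displays.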

What to alert on (trace-derived)

| Pattern | Signal | Action |
| --- | --- | --- |
| mlx.profile.start p99 > 30s | MLX API slowness | Scale workers down, check plan cap |
| mlx.cdp.connect errors spike | Orphan browsers | Profile debug |
| Lease span OK but task fails | Platform ban / UI change | Ban recovery |
| Webhook span timeout | Ops dashboard down | DLQ + retry |
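To make the first row concrete, here is a small sketch of the p99 check, assuming you already have a list of mlx.profile.start span durations in seconds. Both function names and the sample data are hypothetical. In practice the percentile would usually come from your metrics backend, so this only illustrates the arithmetic behind the threshold.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def profile_start_is_slow(durations_s: Sequence[float], threshold_s: float = 30.0) -> bool:
    """True when the p99 of mlx.profile.start durations exceeds the threshold."""
    return percentile(durations_s, 99) > threshold_s


# 100 hypothetical samples: 98 fast starts and two slow ones.
samples = [2.0] * 98 + [45.0, 50.0]
slow = profile_start_is_slow(samples)
```

With these samples the 99th-ranked value is 45.0 seconds, so the check fires even though the median start time is only 2 seconds. An average-based alert would miss that tail.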

Disclosure: MLX-MMO is affiliated with Multilogin.