Prometheus counters tell you **that** jobs fail — OpenTelemetry traces tell you **where**. When a Shopee order-export job times out, you need span-level visibility: lease acquired? profile start slow? CDP attach failed? webhook retry? This recipe instruments the queue worker and exports to Grafana Tempo or Jaeger.
Span hierarchy
mlx.job.run (root — job_id, client_id, platform) ├── mlx.lease.acquire (profile_id, worker_id) ├── mlx.profile.start (MLX API latency) ├── mlx.cdp.connect (Playwright connectOverCDP) ├── mlx.task.execute (task name, e.g. sync_orders) ├── mlx.profile.stop └── mlx.webhook.post (callback_url, status)
Python setup (OTLP exporter)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
resource = Resource.create({"service.name": "mlx-worker", "worker.id": WORKER_ID})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(
OTLPSpanExporter(endpoint="tempo:4317", insecure=True)
))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("mlx.automation")
Instrument run_job
async def run_job(job: dict):
with tracer.start_as_current_span("mlx.job.run") as root:
root.set_attribute("mlx.job_id", job["job_id"])
root.set_attribute("mlx.profile_id", job["profile_id"])
root.set_attribute("mlx.platform", job.get("platform", ""))
root.set_attribute("mlx.client_id", job.get("client_id", ""))
with tracer.start_as_current_span("mlx.lease.acquire"):
if not try_lease(job["profile_id"], WORKER_ID):
raise RetryLater("profile leased")
with tracer.start_as_current_span("mlx.profile.start"):
port = await start_profile(job["profile_id"])
with tracer.start_as_current_span("mlx.cdp.connect"):
browser = await playwright.chromium.connect_over_cdp(f"http://127.0.0.1:{port}")
with tracer.start_as_current_span("mlx.task.execute"):
await execute_task(browser, job)
with tracer.start_as_current_span("mlx.profile.stop"):
await stop_profile(job["profile_id"])
Propagate trace context to webhook
Inject W3C traceparent header on webhook POST so your FastAPI receiver continues the trace — pairs with webhook receiver.
from opentelemetry.propagate import inject
headers = {"Content-Type": "application/json"}
inject(headers) # adds traceparent
httpx.post(callback_url, json=body, headers=headers)
Grafana Tempo correlation
In Grafana, link traces to Prometheus metrics via exemplars on mlx_job_duration_seconds. Dashboard: add Tempo data source → traceQL:
{ resource.service.name = "mlx-worker" && span.mlx.platform = "lazada_id" && status = error }
Complements Grafana health dashboard and observability metrics.
What to alert on (trace-derived)
| Pattern | Signal | Action |
|---|---|---|
mlx.profile.start p99 > 30s | MLX API slowness | Scale workers down, check plan cap |
mlx.cdp.connect errors spike | Orphan browsers | Profile debug |
| Lease span OK but task fails | Platform ban / UI change | Ban recovery |
| Webhook span timeout | Ops dashboard down | DLQ + retry |
Related
Disclosure: MLX-MMO affiliated with Multilogin.