What You’ll Do
Start InsureWatch on the lab/1-propagation branch. Submit a claim. Notice the trace is broken — spans from claims-service appear as disconnected root spans instead of children of the gateway span. Diagnose the propagator mismatch, understand why it silently breaks distributed tracing, and fix it.
Branch: lab/1-propagation
Primary file: claims-service/src/instrumentation.py
Setup
cd insurewatch
git checkout lab/1-propagation
docker compose up --build
Wait for all services to start (watch for claims-service | INFO: Application startup complete).
The Symptom
Submit a claim through the UI at http://localhost:5173, or via curl:
curl -s -X POST http://localhost:3000/api/claims \
-H "Content-Type: application/json" \
-d '{
"customer_id": "CUST001",
"policy_number": "POL-001",
"claim_type": "medical",
"amount": 500,
"description": "Lab 1 test",
"incident_date": "2026-03-01"
}'
Open Grafana at http://localhost:3100 → Explore → Tempo.
Run a Search query: service name = api-gateway, time range = last 5 minutes. Click a trace.
What you should see on a working system:
api-gateway: POST /api/claims [============================================]
claims-service: submit_claim [ =============================== ]
claims-service: pymongo.insert_one [ ==== ]
policy-service: GET /policy/.../coverage [ ======= ]
notification-service: POST /notify [ ==== ]
What you see on this branch:
api-gateway: POST /api/claims [============================================]
Then, in a completely separate trace with a different trace ID:
claims-service: submit_claim [===============================]
The gateway span has no children from claims-service. The claims spans exist but they’re orphaned — root spans with their own trace ID, completely disconnected from the request that triggered them.
Diagnosis
Step 1: Confirm the trace IDs don’t match
In Tempo, search service = claims-service. Note a trace ID. Then search service = api-gateway. The trace IDs don’t overlap. Two separate traces — one request.
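The same check can be expressed over exported span data. The sketch below is a minimal illustration, assuming each span is a dict with hypothetical `trace_id`, `span_id`, `parent_span_id`, and `service` fields; the shortened IDs are placeholders, not real Tempo output.

```python
from collections import defaultdict


def root_spans_by_trace(spans):
    """Map each trace ID to the services that own a root span in it."""
    roots = defaultdict(list)
    for span in spans:
        if span["parent_span_id"] is None:  # no parent means this is a root span
            roots[span["trace_id"]].append(span["service"])
    return dict(roots)


# Hypothetical spans for ONE submitted claim on the broken branch.
broken = [
    {"trace_id": "aaaa", "span_id": "01", "parent_span_id": None, "service": "api-gateway"},
    {"trace_id": "bbbb", "span_id": "02", "parent_span_id": None, "service": "claims-service"},
    {"trace_id": "bbbb", "span_id": "03", "parent_span_id": "02", "service": "claims-service"},
]

print(root_spans_by_trace(broken))
# {'aaaa': ['api-gateway'], 'bbbb': ['claims-service']}
```

A healthy request would produce a single trace ID whose only root span belongs to api-gateway.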
Step 2: Understand what connects traces
Trace context travels between services via HTTP headers. When api-gateway makes an HTTP call to claims-service, the OTel SDK automatically injects the current span context into the outgoing request headers. When claims-service receives the request, the SDK extracts that context and uses it as the parent for new spans.
This inject/extract cycle requires both services to agree on the propagation format — the specific header name and value encoding.
Two common formats:
| Format | Headers injected |
|---|---|
| W3C TraceContext (default) | traceparent: 00-<traceId>-<spanId>-<flags> |
| B3 (Zipkin/Jaeger legacy) | X-B3-TraceId, X-B3-SpanId, X-B3-Sampled |
If the sender injects W3C headers but the receiver only recognizes B3, the receiver finds no headers it understands and starts a new root span — silently.
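The mismatch can be reproduced without any tracing library. The following is a simplified, standard-library-only model of the two formats, not the OTel SDK's implementation. The function names (`inject_w3c`, `extract_w3c`, `extract_b3`) are hypothetical, and the model treats header names as case-sensitive for brevity, which real HTTP does not.

```python
import re

# W3C traceparent: version "00", 32-hex trace ID, 16-hex span ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject_w3c(trace_id, span_id, sampled=True):
    """Model of a sender that writes only the W3C header."""
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


def extract_w3c(headers):
    """Return (trace_id, span_id) from traceparent, or None if absent or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    return (match.group(1), match.group(2)) if match else None


def extract_b3(headers):
    """Return (trace_id, span_id) from the B3 multi headers, or None if absent."""
    trace_id = headers.get("X-B3-TraceId")
    span_id = headers.get("X-B3-SpanId")
    return (trace_id, span_id) if trace_id and span_id else None


# Example IDs as used in the W3C Trace Context specification.
headers = inject_w3c("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")

print(extract_w3c(headers))  # ('4bf92f3577b34da6a3ce929d0e0e4736', '00f067aa0ba902b7')
print(extract_b3(headers))   # None: a B3-only receiver would start a new root span
```

Note that `extract_b3` returns `None` without raising anything, which mirrors why the failure is silent.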
Step 3: Find the mismatch in the code
Open claims-service/src/instrumentation.py. The bug is near the top:
from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.propagate import set_global_textmap
# ...
set_global_textmap(B3MultiFormat())
This replaces the SDK’s default W3C propagator with B3. The api-gateway (Node.js) uses the default W3C propagator and injects a traceparent header. When claims-service receives the request, it looks for X-B3-TraceId — which isn’t there. It starts a fresh root span.
Step 4: Confirm with logs
docker compose logs api-gateway 2>&1 | grep "claims" | head -3
docker compose logs claims-service 2>&1 | grep "traceId" | head -3
On the broken branch, the trace IDs in both outputs won’t match. After the fix, they will.
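Comparing 32-character IDs by eye is error-prone, so the comparison can be scripted. This is a minimal sketch that assumes the trace ID appears in each log line as a 32-character lowercase hex token; the log lines below are invented for the example and the real InsureWatch log format may differ.

```python
import re

# A trace ID is 32 lowercase hex characters; the \b anchors avoid matching
# a fragment of a longer hex string.
TRACE_ID_RE = re.compile(r"\b[0-9a-f]{32}\b")


def trace_ids(log_text):
    """Collect every 32-hex-character token found in the log text."""
    return set(TRACE_ID_RE.findall(log_text))


def shared_trace_ids(gateway_log, claims_log):
    """Trace IDs that appear in both services' logs."""
    return trace_ids(gateway_log) & trace_ids(claims_log)


# Hypothetical log lines, made up for this example.
gateway_log = "POST /api/claims traceId=4bf92f3577b34da6a3ce929d0e0e4736"
claims_broken = "submit_claim traceId=0af7651916cd43dd8448eb211c80319c"
claims_fixed = "submit_claim traceId=4bf92f3577b34da6a3ce929d0e0e4736"

print(shared_trace_ids(gateway_log, claims_broken))  # set(): propagation is broken
print(shared_trace_ids(gateway_log, claims_fixed))   # one shared ID: propagation works
```

To use it on real output, save each `docker compose logs` command to a file and pass the file contents to `shared_trace_ids`.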
The Fix
Remove the three B3 lines from claims-service/src/instrumentation.py:
# DELETE these two imports:
from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.propagate import set_global_textmap
# DELETE this line:
set_global_textmap(B3MultiFormat())
Also remove opentelemetry-propagator-b3==1.24.0 from claims-service/requirements.txt.
The OTel SDK defaults to W3C TraceContext + Baggage when no propagator is explicitly set. Not calling set_global_textmap() is correct.
Verification
Rebuild claims-service:
docker compose up --build claims-service
Submit another claim, then check Tempo. The trace from api-gateway should now show a complete waterfall: gateway → claims → policy and notification — all sharing one trace ID.
You can also verify via logs:
# Get the trace ID from the gateway log for a recent request
docker compose logs api-gateway 2>&1 | grep "POST /api/claims" | tail -3
# Then find that trace ID in claims logs
docker compose logs claims-service 2>&1 | grep "<that-trace-id>"
Both should reference the same trace ID.
What You Learned
Propagation is a bilateral contract. Both sides of every service boundary must use the same format. W3C and B3 carry equivalent information (trace ID, span ID, sampling decision) but encode it in different headers. A format mismatch is silent: no errors, no exceptions, just orphaned spans.
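The default is W3C TraceContext. Unless a propagator is set in code or through the OTEL_PROPAGATORS environment variable, OTel SDKs default to W3C TraceContext plus Baggage. The main reason to call set_global_textmap() is to integrate with a legacy system (for example, Zipkin-style B3) that predates the W3C standard.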
This failure mode is common in practice. You’ll encounter it when:
- Migrating from Zipkin or Jaeger to OTel
- Adding a new service to a partially-migrated stack
- Using a third-party middleware or proxy that sets its own propagator
- Copy-pasting instrumentation code from a legacy codebase
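During a migration, a receiver may need to accept both formats for a while. The OTel Python API offers `CompositePropagator` for combining propagators; the sketch below only models the idea in plain Python and does not reproduce that class's exact behavior. All function names here are hypothetical.

```python
def extract_w3c(headers):
    """Parse 'traceparent: 00-<traceId>-<spanId>-<flags>' into (trace_id, span_id)."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        return parts[1], parts[2]
    return None


def extract_b3(headers):
    """Read the B3 multi-header pair into (trace_id, span_id)."""
    trace_id = headers.get("X-B3-TraceId")
    span_id = headers.get("X-B3-SpanId")
    return (trace_id, span_id) if trace_id and span_id else None


def extract_any(headers, extractors=(extract_w3c, extract_b3)):
    """Try each format in order; the first one that yields a context wins."""
    for extract in extractors:
        context = extract(headers)
        if context is not None:
            return context
    return None  # no known format present: the caller would start a new root span


w3c_request = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
b3_request = {
    "X-B3-TraceId": "4bf92f3577b34da6a3ce929d0e0e4736",
    "X-B3-SpanId": "00f067aa0ba902b7",
    "X-B3-Sampled": "1",
}

print(extract_any(w3c_request))  # context recovered from the W3C header
print(extract_any(b3_request))   # same context recovered from the B3 headers
print(extract_any({}))           # None
```

Accepting both formats is a transitional measure; once every upstream service emits W3C headers, the B3 path can be removed.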
The diagnostic checklist:
- Do spans from the same user request share a trace ID? (Check Tempo or logs)
- Are downstream spans appearing as root spans (no parentSpanId)?
- What headers is the upstream service injecting? (traceparent = W3C, X-B3-* = B3)
- Does the downstream service have a set_global_textmap() call with a non-default propagator?
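The header question in the checklist can be answered mechanically once you have captured a request's headers. The helper below is an illustrative sketch with a hypothetical name (`detect_propagation_formats`); besides the `X-B3-*` headers, it also recognizes the single-header `b3` variant defined by the B3 specification.

```python
def detect_propagation_formats(headers):
    """Return the set of propagation formats whose headers are present."""
    # HTTP header names are case-insensitive, so normalize before comparing.
    names = {name.lower() for name in headers}
    formats = set()
    if "traceparent" in names:
        formats.add("w3c")
    if "b3" in names or any(name.startswith("x-b3-") for name in names):
        formats.add("b3")
    return formats


# Hypothetical captured headers from an upstream request.
captured = {
    "Content-Type": "application/json",
    "Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
}

print(detect_propagation_formats(captured))                  # {'w3c'}
print(detect_propagation_formats({"X-B3-TraceId": "abc"}))   # {'b3'}
print(detect_propagation_formats({"Accept": "*/*"}))         # set()
```

An empty result means the upstream service is not propagating context at all, which is a different problem from a format mismatch.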
Bonus Challenge
Add debug logging to the OTel SDK to see exactly what headers are being sent and received. In claims-service/src/instrumentation.py, add:
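import logging
# DEBUG records need a handler to be emitted; basicConfig adds one to the root logger if none is configured.
logging.basicConfig(level=logging.INFO)
logging.getLogger("opentelemetry").setLevel(logging.DEBUG)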
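Rebuild and watch the Docker logs. The exact messages depend on the SDK version. On the broken branch, claims-service is configured for B3, so it never looks for a traceparent header, and a propagator that finds no header it recognizes may log nothing at all. Treat missing propagation messages as consistent with the silent failure, not as proof that propagation works.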