Module 2 45 min read

Lab 1: Find the Broken Trace

InsureWatch is running but traces are fragmented. Context propagation is misconfigured in claims-service. Find the mismatch, understand why it breaks the trace, and fix it.

What You’ll Do

Start InsureWatch on the lab/1-propagation branch. Submit a claim. Notice the trace is broken — spans from claims-service appear as disconnected root spans instead of children of the gateway span. Diagnose the propagator mismatch, understand why it silently breaks distributed tracing, and fix it.

Branch: lab/1-propagation
Primary file: claims-service/src/instrumentation.py


Setup

cd insurewatch
git checkout lab/1-propagation
docker compose up --build

Wait for all services to start (watch for claims-service | INFO: Application startup complete).


The Symptom

Submit a claim through the UI at http://localhost:5173, or via curl:

curl -s -X POST http://localhost:3000/api/claims \
  -H "Content-Type: application/json" \
  -d '{
    "customer_id": "CUST001",
    "policy_number": "POL-001",
    "claim_type": "medical",
    "amount": 500,
    "description": "Lab 1 test",
    "incident_date": "2026-03-01"
  }'

Open Grafana at http://localhost:3100, then go to Explore → Tempo.

Run a Search query: service name = api-gateway, time range = last 5 minutes. Click a trace.

What you should see on a working system:

api-gateway: POST /api/claims                [============================================]
  claims-service: submit_claim               [       ===============================      ]
    claims-service: pymongo.insert_one       [                      ====                  ]
    policy-service: GET /policy/.../coverage [             =======                        ]
    notification-service: POST /notify       [                           ====             ]

What you see on this branch:

api-gateway: POST /api/claims  [============================================]

Then, in a completely separate trace with a different trace ID:

claims-service: submit_claim   [===============================]

The gateway span has no children from claims-service. The claims spans exist but they’re orphaned — root spans with their own trace ID, completely disconnected from the request that triggered them.
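The orphaned-root pattern described above can be checked mechanically. The following is a minimal sketch, assuming span records have been exported as plain dictionaries; the field names (`trace_id`, `parent_span_id`, `service`, `name`) and the sample IDs are hypothetical and do not reflect Tempo's actual export schema.

```python
from collections import defaultdict


def find_orphan_roots(spans, entry_service):
    """Return root spans (no parent) that belong to a service other than the entry point."""
    return [
        s for s in spans
        if s["parent_span_id"] is None and s["service"] != entry_service
    ]


def services_by_trace(spans):
    """Map each trace ID to the set of services that contributed spans to it."""
    services = defaultdict(set)
    for s in spans:
        services[s["trace_id"]].add(s["service"])
    return dict(services)


# Hypothetical span records mirroring the broken branch: one request, two traces.
broken = [
    {"trace_id": "a" * 32, "span_id": "1" * 16, "parent_span_id": None,
     "service": "api-gateway", "name": "POST /api/claims"},
    {"trace_id": "b" * 32, "span_id": "2" * 16, "parent_span_id": None,
     "service": "claims-service", "name": "submit_claim"},
]

orphans = find_orphan_roots(broken, entry_service="api-gateway")
print([s["name"] for s in orphans])
print(services_by_trace(broken))
```

On a healthy system, only the entry service should own root spans, and each trace ID should map to every service the request touched.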


Diagnosis

Step 1: Confirm the trace IDs don’t match

In Tempo, search service = claims-service. Note a trace ID. Then search service = api-gateway. The trace IDs don’t overlap. Two separate traces — one request.

Step 2: Understand what connects traces

Trace context travels between services via HTTP headers. When api-gateway makes an HTTP call to claims-service, its OTel instrumentation injects the current span context into the outgoing request headers. When claims-service receives the request, its instrumentation extracts that context through the configured propagator and uses it as the parent for new spans.

This inject/extract cycle requires both services to agree on the propagation format — the specific header name and value encoding.

Two common formats:

| Format | Headers injected |
| --- | --- |
| W3C TraceContext (default) | traceparent: 00-<traceId>-<spanId>-<flags> |
| B3 (Zipkin/Jaeger legacy) | X-B3-TraceId, X-B3-SpanId, X-B3-Sampled |

If the sender injects W3C headers but the receiver only recognizes B3, the receiver finds no headers it understands and starts a new root span — silently.
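To make the silent failure concrete, here is a simplified, standard-library-only sketch of the inject/extract cycle. It is not the OTel SDK's implementation: the function names are hypothetical, header lookups assume lowercase keys, and the validation is looser than the W3C specification (for example, it does not reject all-zero IDs).

```python
import re

# version-traceid-spanid-flags, all lowercase hex
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject_w3c(trace_id, span_id, sampled=True):
    """Build the single W3C header a sender would attach."""
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


def inject_b3(trace_id, span_id, sampled=True):
    """Build the B3 multi-header set a sender would attach."""
    return {
        "x-b3-traceid": trace_id,
        "x-b3-spanid": span_id,
        "x-b3-sampled": "1" if sampled else "0",
    }


def extract_w3c(headers):
    """Return (trace_id, span_id), or None if no valid traceparent is present."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None
    _version, trace_id, span_id, _flags = match.groups()
    return trace_id, span_id


def extract_b3(headers):
    """Return (trace_id, span_id), or None if the B3 headers are absent."""
    trace_id = headers.get("x-b3-traceid")
    span_id = headers.get("x-b3-spanid")
    if not trace_id or not span_id:
        return None
    return trace_id, span_id


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
span_id = "00f067aa0ba902b7"

outgoing = inject_w3c(trace_id, span_id)  # what a W3C sender attaches
matched = extract_w3c(outgoing)           # receiver speaking W3C finds the parent
mismatched = extract_b3(outgoing)         # receiver speaking B3 finds nothing
print(matched, mismatched)
```

The mismatched extractor returns `None` rather than raising, which is why the receiver simply starts a new trace and no error ever surfaces.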

Step 3: Find the mismatch in the code

Open claims-service/src/instrumentation.py. The bug is near the top:

from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.propagate import set_global_textmap

# ...

set_global_textmap(B3MultiFormat())

This replaces the SDK’s default W3C propagator with B3. The api-gateway (Node.js) uses the default W3C propagator and injects a traceparent header. When claims-service receives the request, it looks for X-B3-TraceId — which isn’t there. It starts a fresh root span.

Step 4: Confirm with logs

docker compose logs api-gateway 2>&1 | grep "claims" | head -3
docker compose logs claims-service 2>&1 | grep "traceId" | head -3

On the broken branch, the trace IDs in both outputs won’t match. After the fix, they will.
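Comparing trace IDs by eye is error-prone, so a small script can do it instead. This sketch assumes trace IDs appear in the logs as 32 lowercase hex characters; the sample log lines below are invented for illustration and the real log format of either service may differ.

```python
import re

# A trace ID is 16 bytes, usually logged as 32 lowercase hex characters.
TRACE_ID_RE = re.compile(r"\b[0-9a-f]{32}\b")


def trace_ids(log_text):
    """Collect every 32-hex-character token found in the log text."""
    return set(TRACE_ID_RE.findall(log_text))


def shared_trace_ids(gateway_logs, claims_logs):
    """Return the trace IDs that appear in both services' logs."""
    return trace_ids(gateway_logs) & trace_ids(claims_logs)


# Hypothetical log excerpts: the broken branch shows two unrelated IDs.
gateway_logs = 'api-gateway | {"msg":"POST /api/claims","traceId":"4bf92f3577b34da6a3ce929d0e0e4736"}'
claims_broken = 'claims-service | {"msg":"submit_claim","traceId":"0af7651916cd43dd8448eb211c80319c"}'
claims_fixed = 'claims-service | {"msg":"submit_claim","traceId":"4bf92f3577b34da6a3ce929d0e0e4736"}'

print(shared_trace_ids(gateway_logs, claims_broken))  # empty set on the broken branch
print(shared_trace_ids(gateway_logs, claims_fixed))   # one shared ID after the fix
```

To run it against real output, redirect each service's `docker compose logs` output to a file and pass the file contents to `shared_trace_ids`.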


The Fix

Remove the three B3 lines from claims-service/src/instrumentation.py:

# DELETE these two imports:
from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.propagate import set_global_textmap

# DELETE this line:
set_global_textmap(B3MultiFormat())

Also remove opentelemetry-propagator-b3==1.24.0 from claims-service/requirements.txt.

The OTel SDK defaults to W3C TraceContext + Baggage when no propagator is explicitly set. Not calling set_global_textmap() is correct.


Verification

Rebuild claims-service:

docker compose up --build claims-service

Submit another claim, then check Tempo. The trace from api-gateway should now show a complete waterfall: gateway → claims → policy and notification — all sharing one trace ID.

You can also verify via logs:

# Get the trace ID from the gateway log for a recent request
docker compose logs api-gateway 2>&1 | grep "POST /api/claims" | tail -3

# Then find that trace ID in claims logs
docker compose logs claims-service 2>&1 | grep "<that-trace-id>"

Both should reference the same trace ID.


What You Learned

Propagation is a bilateral contract. Both sides of every service boundary must use the same format. W3C and B3 carry equivalent information — trace ID, span ID, sampling decision — but as different bytes. A format mismatch is silent: no errors, no exceptions, just orphaned spans.

The default is W3C TraceContext. OTel SDKs default to W3C TraceContext plus Baggage unless the OTEL_PROPAGATORS environment variable or application code overrides it. The main reason to call set_global_textmap() is to interoperate with a legacy system (Zipkin, Jaeger B3) that predates the W3C standard.

This failure mode is common in practice. You’ll encounter it when:

  • Migrating from Zipkin or Jaeger to OTel
  • Adding a new service to a partially-migrated stack
  • Using a third-party middleware or proxy that sets its own propagator
  • Copy-pasting instrumentation code from a legacy codebase

The diagnostic checklist:

  1. Do spans from the same user request share a trace ID? (Check Tempo or logs)
  2. Are downstream spans appearing as root spans (no parentSpanId)?
  3. What headers is the upstream service injecting? (traceparent = W3C, X-B3-* = B3)
  4. Does the downstream service have a set_global_textmap() call with a non-default propagator?
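Items 3 and 4 of the checklist can be sketched as a small header classifier. This is an illustrative helper and not part of any OTel package; the format labels (`w3c`, `b3-multi`, `b3-single`) and function names are made up for this example.

```python
def detect_formats(headers):
    """Report which propagation formats are present in a set of request headers."""
    names = {name.lower() for name in headers}  # HTTP header names are case-insensitive
    found = set()
    if "traceparent" in names:
        found.add("w3c")
    if "b3" in names:
        found.add("b3-single")
    if {"x-b3-traceid", "x-b3-spanid"} <= names:
        found.add("b3-multi")
    return found


def diagnose(incoming_headers, receiver_formats):
    """Compare what the upstream sent with what the receiver is configured to read."""
    sent = detect_formats(incoming_headers)
    accepted = set(receiver_formats)
    if not sent:
        return "no trace headers: upstream is not injecting context"
    if sent & accepted:
        return "ok: receiver can extract " + ", ".join(sorted(sent & accepted))
    return "mismatch: upstream sends {} but receiver expects {}".format(
        ", ".join(sorted(sent)), ", ".join(sorted(accepted))
    )


# The situation on the lab branch: W3C arrives, the receiver is set up for B3.
incoming = {"Traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
result = diagnose(incoming, ["b3-multi"])
print(result)
```

Feeding it the headers captured at the receiving service, together with the propagator that service configures, turns the checklist into a one-line answer.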

Bonus Challenge

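Add debug logging to the OTel SDK to see what it reports while handling requests. In claims-service/src/instrumentation.py, add:

import logging

# DEBUG records need a handler to be printed; basicConfig() adds one to the
# root logger and is a no-op if the application already configured one.
logging.basicConfig()
logging.getLogger("opentelemetry").setLevel(logging.DEBUG)

Rebuild and watch the Docker logs. Keep in mind that on the broken branch claims-service is configured for B3, so it looks for X-B3-* headers and never inspects traceparent. Whether the propagator logs anything about the missing headers, and the exact wording, depends on the SDK version; the reliable signal remains the new trace ID on the claims-service spans.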