What You’ll Do
This is the capstone lab. Three problems from Labs 1–3 are active at once:
- Collector pipeline incomplete: no telemetry reaches Grafana at all
- Propagator mismatch: traces break at the api-gateway → claims-service boundary
- Missing instrumentation: claims-service is missing most of its spans, even once the Collector is working
You’ll work through them in order — fix the plumbing first, then fix connectivity, then fill in visibility — and verify each fix restores a new layer of observability.
Branch: lab/4-chaos
Primary files: collector/skeleton.yml, claims-service/src/instrumentation.py, claims-service/src/main.py
Setup
cd insurewatch
git checkout lab/4-chaos
docker compose up --build
Submit a test claim:
curl -s -X POST http://localhost:3000/api/claims \
-H "Content-Type: application/json" \
-d '{
"customer_id": "CUST001",
"policy_number": "POL-001",
"claim_type": "medical",
"amount": 500,
"description": "Capstone test",
"incident_date": "2026-03-01"
}'
Open Grafana at http://localhost:3100 → Explore → Tempo. No results. The system is running but completely dark.
Problem 1: Collector Pipeline {#problem-1}
Symptoms
The Collector container starts but immediately logs errors:
docker compose logs collector | head -20
You’ll see config validation failures — the pipeline sections reference components that aren’t defined (empty receivers, processors, exporters sections).
Even if the Collector were running, no data would reach Grafana. All seven services send OTLP to collector:4318, but there’s no valid route to the backend.
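To make the failure mode concrete, here is a minimal Python sketch of the kind of check the Collector performs at startup: every component named in a pipeline must be defined in the matching top-level section, and each pipeline needs at least one receiver and one exporter. The function name `find_pipeline_problems` and the `broken` config are hypothetical, and the config is written as a plain dict rather than loaded from collector/skeleton.yml. The messages are illustrative, not the Collector's actual log output.

```python
def find_pipeline_problems(config):
    """Return one message per problem found in the service pipelines."""
    problems = []
    pipelines = (config.get("service") or {}).get("pipelines") or {}
    for pipeline_name, pipeline in pipelines.items():
        pipeline = pipeline or {}
        for section in ("receivers", "processors", "exporters"):
            defined = config.get(section) or {}
            referenced = pipeline.get(section) or []
            # Processors are optional; receivers and exporters are not.
            if not referenced and section != "processors":
                problems.append(
                    f"pipeline {pipeline_name!r}: needs at least one {section[:-1]}"
                )
            for component in referenced:
                if component not in defined:
                    problems.append(
                        f"pipeline {pipeline_name!r}: {section[:-1]} "
                        f"{component!r} is not defined"
                    )
    return problems


# Hypothetical broken config: the top-level sections are empty,
# but the traces pipeline still references components by name.
broken = {
    "receivers": {},
    "processors": {},
    "exporters": {},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["otlphttp/lgtm"],
            }
        }
    },
}

for message in find_pipeline_problems(broken):
    print(message)
```

A config passes this check only when each name in a pipeline list has a matching key under receivers, processors, or exporters, which is what the fix below provides.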
Fix
Open collector/skeleton.yml. It has empty pipeline arrays with TODOs. Complete it:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 256
    spike_limit_mib: 64
  batch:
    timeout: 5s
    send_batch_size: 1024
exporters:
  otlphttp/lgtm:
    endpoint: http://lgtm:4318
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/lgtm]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/lgtm]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp/lgtm]
Restart the Collector:
docker compose restart collector
docker compose logs -f collector
Wait for: Everything is ready. Begin running and processing data.
Verify
Submit a claim. Open Tempo. You should now see traces — but only for api-gateway. The claims-service spans are absent or appear disconnected. This is Problem 2.
Problem 2: Propagator Mismatch {#problem-2}
Symptoms
With the Collector fixed, you can now see traces. Open Tempo, search service = api-gateway. You’ll find a trace — but it stops at the gateway. No claims-service spans are nested under it.
Search service = claims-service. Spans exist — they have their own trace IDs, different from the gateway trace.
Two separate traces, one request. This is the B3/W3C mismatch from Lab 1.
Diagnose
Open claims-service/src/instrumentation.py. Find the propagator setup near the TracerProvider:
from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.propagate import set_global_textmap
set_global_textmap(B3MultiFormat())
The api-gateway injects a W3C traceparent header. claims-service is listening for X-B3-TraceId. Mismatch.
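The mismatch can be reproduced without any OpenTelemetry packages. The sketch below is a simplified stand-in for the two extractors, not the library's implementation: the function names are hypothetical, and the header value is the example from the W3C Trace Context specification. One reads the W3C `traceparent` header, the other looks for the B3 multi-header `X-B3-TraceId`.

```python
import re

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def _lower_keys(headers):
    """HTTP header names are case-insensitive, so normalise them."""
    return {key.lower(): value for key, value in headers.items()}


def extract_w3c_trace_id(headers):
    """Return the trace ID from a W3C traceparent header, or None."""
    value = _lower_keys(headers).get("traceparent")
    if value is None:
        return None
    match = TRACEPARENT_RE.match(value.strip())
    return match.group(2) if match else None


def extract_b3_multi_trace_id(headers):
    """Return the trace ID from the B3 multi-header format, or None."""
    return _lower_keys(headers).get("x-b3-traceid")


# What the gateway sends (W3C only, no X-B3-* headers).
incoming = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "Content-Type": "application/json",
}

print("W3C extractor sees:", extract_w3c_trace_id(incoming))
print("B3 extractor sees: ", extract_b3_multi_trace_id(incoming))
```

When the B3 extractor finds nothing, the receiving service has no parent context and starts a fresh trace, which is why the same request shows up under two trace IDs.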
Fix
Remove those three lines. Delete:
from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.propagate import set_global_textmap
And:
set_global_textmap(B3MultiFormat())
Also remove opentelemetry-propagator-b3==1.24.0 from claims-service/requirements.txt.
Rebuild:
docker compose up --build claims-service
Verify
Submit a claim. In Tempo, search service = api-gateway. The trace should now show claims-service spans nested under the gateway span — all sharing one trace ID.
But the claims spans are still sparse. The FastAPIInstrumentor is missing, so instead of seeing framework-level HTTP spans with route and status, you only see whatever manual spans remain. This is Problem 3.
Problem 3: Missing Instrumentation {#problem-3}
Symptoms
The trace is now connected (Problems 1 and 2 fixed), but claims-service is still a partial black box:
- No POST /claims framework span (FastAPIInstrumentor missing)
- No MongoDB operation spans (PymongoInstrumentor missing)
- No outgoing HTTP spans to policy-service (HTTPXClientInstrumentor missing)
- No trace ID in log lines (LoggingInstrumentor missing)
- No business attributes on claims (manual spans stripped from main.py)
Fix: Auto-instrumentation
In claims-service/src/instrumentation.py, restore three of the four instrumentors (the fourth, FastAPIInstrumentor, is restored in main.py).
Add these imports:
from opentelemetry.instrumentation.pymongo import PymongoInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.logging import LoggingInstrumentor
Replace the TODO block with:
PymongoInstrumentor().instrument()
HTTPXClientInstrumentor().instrument()
LoggingInstrumentor().instrument(set_logging_format=True)
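To illustrate what log correlation buys you, the following standard-library sketch attaches a trace ID to every log record. It shows the idea only and is not how LoggingInstrumentor is implemented: the `current_trace_id` context variable, the `TraceIdFilter` class, the logger name, and the format string are all assumptions made for this example.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the active span context; the real SDK tracks its own.
current_trace_id = contextvars.ContextVar("current_trace_id", default="0" * 32)


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto each record so the formatter can print it."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream):
    """Create an isolated logger that writes trace-correlated lines to stream."""
    logger = logging.getLogger("claims-sketch")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s traceId=%(trace_id)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)

# Simulate handling one request inside an active trace.
token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("claim accepted")
current_trace_id.reset(token)

print(buffer.getvalue(), end="")
```

Every line logged while a span is active then carries the ID needed to jump from a log entry to its trace.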
In claims-service/src/main.py, restore FastAPI instrumentation:
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
# after app = FastAPI(...):
FastAPIInstrumentor.instrument_app(app)
Fix: Manual spans
In claims-service/src/main.py, wrap each handler in a manual span with business attributes. For submit_claim:
@app.post("/claims", response_model=ClaimResponse)
async def submit_claim(claim: ClaimSubmission, request: Request):
    start = time.time()
    with tracer.start_as_current_span("submit_claim") as span:
        apply_chaos()
        span.set_attribute("claim.customer_id", claim.customer_id)
        span.set_attribute("claim.type", claim.claim_type)
        span.set_attribute("claim.amount", claim.amount)
        span.set_attribute("claim.policy_number", claim.policy_number)
        # ... existing handler logic ...
        span.set_attribute("claim.id", claim_id)
        span.set_attribute("claim.status", status)
For get_claim:
with tracer.start_as_current_span("get_claim") as span:
    span.set_attribute("claim.id", claim_id)
For list_claims:
with tracer.start_as_current_span("list_claims") as span:
    if customer_id:
        span.set_attribute("filter.customer_id", customer_id)
    # ...
    span.set_attribute("claims.count", len(results))
Rebuild:
docker compose up --build claims-service
Final Verification {#solution}
Submit a claim:
curl -s -X POST http://localhost:3000/api/claims \
-H "Content-Type: application/json" \
-d '{
"customer_id": "CUST003",
"policy_number": "POL-003",
"claim_type": "property",
"amount": 8000,
"description": "Capstone complete",
"incident_date": "2026-03-15"
}'
Check 1: Full trace waterfall
In Tempo, search service = api-gateway. The trace should show:
api-gateway: POST /api/claims [=====================================]
claims-service: POST /claims [ ============================= ]
claims-service: submit_claim [ ============================ ]
claims-service: GET http://policy-service [ ======= ]
claims-service: pymongo.insert_one [ ==== ]
claims-service: POST http://notification.. [ === ]
policy-service: GET /policy/.../coverage [ ======= ]
notification-service: POST /notify [ === ]
Check 2: Business attributes
Click the submit_claim span. The Attributes panel should show claim.type, claim.amount, claim.status, policy.valid, etc.
Check 3: Metrics
Grafana → Explore → Prometheus:
claims_submitted_total
claims_approved_total
claims_processing_duration_bucket
Check 4: Logs with trace correlation
Grafana → Explore → Loki, filter: {service_name="claims-service"}. Log lines should contain traceId= values. Copy a trace ID from a log line and paste it into Tempo — it should jump directly to that trace.
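If you would rather script the copy-and-paste step, a small helper can pull trace IDs out of exported log text. This is a sketch under assumptions: the function name `trace_ids_in` and the sample lines are invented for illustration, and the pattern assumes the ID appears as `traceId=` followed by 32 lowercase hex characters.

```python
import re

# A trace ID is 16 bytes, rendered as 32 lowercase hex characters.
TRACE_ID_RE = re.compile(r"traceId=([0-9a-f]{32})\b")
INVALID_TRACE_ID = "0" * 32  # all zeros means "no active trace"


def trace_ids_in(log_text):
    """Return the distinct, valid trace IDs found in log_text, in first-seen order."""
    seen = []
    for trace_id in TRACE_ID_RE.findall(log_text):
        if trace_id != INVALID_TRACE_ID and trace_id not in seen:
            seen.append(trace_id)
    return seen


# Hypothetical log lines, shaped like the output described above.
sample = """\
INFO traceId=4bf92f3577b34da6a3ce929d0e0e4736 claim received
INFO traceId=4bf92f3577b34da6a3ce929d0e0e4736 claim stored
INFO traceId=00000000000000000000000000000000 background tick
"""

for trace_id in trace_ids_in(sample):
    print(trace_id)
```

Each printed ID can be pasted into the Tempo search box; an all-zero ID means the line was logged outside any span and has no trace to open.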
Check 5: Chaos
Open the InsureWatch UI at http://localhost:5173. Go to System Status. Enable High Latency. Submit a claim. In Tempo, the claims-service: submit_claim span should be visibly longer. The chaos is observable.
Reflective Summary
You’ve just restored full observability to a broken distributed system using only OpenTelemetry primitives. What you fixed:
| Problem | OTel concept | Consequence of the break |
|---|---|---|
| Collector pipeline | Collector architecture | Zero visibility — no data left the services |
| Propagator mismatch | Context propagation | Fragmented traces — impossible to follow a request |
| Missing instrumentation | SDK instrumentation API | Blind spots — service ran but was invisible |
These three failure modes (pipeline misconfiguration, propagation breaks, and instrumentation gaps) are among the most common causes of “our traces are broken” tickets in production environments.
The pattern for diagnosing observability problems:
- Can I see anything? If Grafana is empty, check the export pipeline first. Are services sending? Is the Collector/backend receiving?
- Are the traces connected? If spans exist but are disconnected, check propagator configuration on every service boundary.
- Is anything missing from a trace I can see? Check instrumentation coverage — which services have auto-instrumentation, which have manual spans, which have neither.
Start from the pipeline and work inward toward the code.
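The three questions can be written as a coarse triage function over whatever spans the backend returns. This is a sketch, not a tool from the lab: the function name `triage`, the span dictionaries, and the returned labels are assumptions. It can only detect a service that emits no spans at all, so gaps inside a service that is already reporting still need a manual look.

```python
def triage(spans, entry_service, expected_services):
    """Return which layer to inspect first: pipeline, propagation, instrumentation, or ok."""
    # 1. Can I see anything?
    if not spans:
        return "pipeline"

    # 2. Are the traces connected? Every service that reports spans should
    #    share at least one trace ID with the entry service.
    entry_trace_ids = {s["trace_id"] for s in spans if s["service"] == entry_service}
    services_seen = {s["service"] for s in spans}
    for service in expected_services:
        if service in services_seen and not any(
            s["service"] == service and s["trace_id"] in entry_trace_ids
            for s in spans
        ):
            return "propagation"

    # 3. Is anything missing from a trace I can see?
    if any(service not in services_seen for service in expected_services):
        return "instrumentation"
    return "ok"


expected = ["api-gateway", "claims-service"]

# The first two lab states, as hypothetical span summaries.
dark = []
split = [
    {"trace_id": "aaa", "service": "api-gateway"},
    {"trace_id": "bbb", "service": "claims-service"},
]
connected = [
    {"trace_id": "aaa", "service": "api-gateway"},
    {"trace_id": "aaa", "service": "claims-service"},
]

print(triage(dark, "api-gateway", expected))
print(triage(split, "api-gateway", expected))
print(triage(connected, "api-gateway", expected))
```

The order of the checks mirrors the order of the fixes in this lab: a later check is meaningless until the earlier one passes.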
What Comes Next
You now have a working, fully-instrumented polyglot application. Where to go from here:
- Sampling: not every trace is worth keeping. Learn about head-based and tail-based sampling to control costs without losing signal.
- Semantic conventions: the attributes you set (claim.type, policy.valid) should follow the OTel semantic conventions where applicable. Review Module 4 for the standard namespaces.
- Production patterns: scaling the Collector to an agent + gateway topology, and separating concerns between the service mesh layer and the application layer.
- Certification: the OpenTelemetry Certified Associate (OTCA) exam tests the concepts this lab exercises: SDK initialization, instrumentation APIs, Collector pipeline configuration, and context propagation mechanics.