Lab 4: Restore Full Observability

What You’ll Do

This is the capstone lab. Three problems from Labs 1–3 are active at once:

Collector pipeline incomplete: no telemetry reaches Grafana at all
Propagator mismatch: traces break at the api-gateway → claims-service boundary
Missing instrumentation: claims-service generates no spans even if the Collector were working

You’ll work through them in order — fix the plumbing first, then fix connectivity, then fill in visibility — and verify each fix restores a new layer of observability.

Branch: lab/4-chaos
Primary files: collector/skeleton.yml, claims-service/src/instrumentation.py, claims-service/src/main.py

Setup

cd insurewatch
git checkout lab/4-chaos
docker compose up --build

Submit a test claim:

curl -s -X POST http://localhost:3000/api/claims \
  -H "Content-Type: application/json" \
  -d '{
    "customer_id": "CUST001",
    "policy_number": "POL-001",
    "claim_type": "medical",
    "amount": 500,
    "description": "Capstone test",
    "incident_date": "2026-03-01"
  }'

Open Grafana at http://localhost:3100 → Explore → Tempo. No results. The system is running but completely dark.

Problem 1: Collector Pipeline {#problem-1}

Symptoms

The Collector container starts but immediately logs errors:

docker compose logs collector | head -20

You’ll see config validation failures — the pipeline sections reference components that aren’t defined (empty receivers, processors, exporters sections).

Even if the Collector were running, no data would reach Grafana. All seven services send OTLP to collector:4318, but there’s no valid route to the backend.
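
The validation failure above comes down to one rule: every component a pipeline names must be defined in the matching top-level section, and a pipeline needs at least one receiver and one exporter. The sketch below illustrates that rule in Python. It is not the Collector's own validator; `find_config_problems` is a hypothetical helper, and the `skeleton` dict is an assumed stand-in for a parsed collector/skeleton.yml.

```python
# Illustrative only: mimics the rule behind the Collector's config error.
# `find_config_problems` is a hypothetical helper, not a Collector API.

SECTIONS = ("receivers", "processors", "exporters")


def find_config_problems(config: dict) -> list:
    """List pipeline entries that are empty or point at undefined components."""
    problems = []
    pipelines = (config.get("service") or {}).get("pipelines") or {}
    for name, pipeline in pipelines.items():
        pipeline = pipeline or {}
        for section in SECTIONS:
            defined = config.get(section) or {}
            referenced = pipeline.get(section) or []
            # A pipeline needs at least one receiver and one exporter;
            # processors are optional.
            if not referenced and section != "processors":
                problems.append(f"{name}: no {section} listed")
            for component in referenced:
                if component not in defined:
                    problems.append(
                        f"{name}: {section[:-1]} {component!r} is not defined"
                    )
    return problems


# Assumed skeleton-like input: the traces pipeline names components, but the
# top-level sections are still empty (empty YAML sections parse to None).
skeleton = {
    "receivers": None,
    "processors": None,
    "exporters": None,
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["otlphttp/lgtm"],
            }
        }
    },
}

for problem in find_config_problems(skeleton):
    print(problem)
```

Running the same check against a completed configuration should print nothing, which is a quick sanity test before restarting the container.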

Fix

Open collector/skeleton.yml. It has empty pipeline arrays with TODOs. Complete it:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 256
    spike_limit_mib: 64
  batch:
    timeout: 5s
    send_batch_size: 1024

exporters:
  otlphttp/lgtm:
    endpoint: http://lgtm:4318

service:
  pipelines:
    traces:
      receivers:  [otlp]
      processors: [memory_limiter, batch]
      exporters:  [otlphttp/lgtm]
    metrics:
      receivers:  [otlp]
      processors: [memory_limiter, batch]
      exporters:  [otlphttp/lgtm]
    logs:
      receivers:  [otlp]
      processors: [memory_limiter, batch]
      exporters:  [otlphttp/lgtm]

Restart the Collector:

docker compose restart collector
docker compose logs -f collector

Wait for: Everything is ready. Begin running and processing data.

Verify

Submit a claim. Open Tempo. You should now see traces — but only for api-gateway. The claims-service spans are absent or appear disconnected. This is Problem 2.

Problem 2: Propagator Mismatch {#problem-2}

Symptoms

With the Collector fixed, you can now see traces. Open Tempo, search service = api-gateway. You’ll find a trace — but it stops at the gateway. No claims-service spans are nested under it.

Search service = claims-service. Spans exist — they have their own trace IDs, different from the gateway trace.

Two separate traces, one request. This is the B3/W3C mismatch from Lab 1.

Diagnose

Open claims-service/src/instrumentation.py. Find the propagator setup near the TracerProvider:

from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.propagate import set_global_textmap

set_global_textmap(B3MultiFormat())

The api-gateway injects a W3C traceparent header. claims-service is listening for X-B3-TraceId. Mismatch.
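
To make the mismatch concrete, here is a stdlib-only sketch of what each side looks for in the incoming headers. The two functions are simplified stand-ins for the real propagators, and their names (`extract_w3c_trace_id`, `extract_b3_trace_id`) are hypothetical. The sample `traceparent` value is assumed for illustration.

```python
import re
from typing import Mapping, Optional

# W3C Trace Context header layout: version-traceid-parentid-flags (hex).
TRACEPARENT_RE = re.compile(
    r"^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$"
)


def _lower_keys(headers: Mapping[str, str]) -> dict:
    """HTTP header names are case-insensitive, so normalise them."""
    return {key.lower(): value for key, value in headers.items()}


def extract_w3c_trace_id(headers: Mapping[str, str]) -> Optional[str]:
    """Simplified W3C extraction: read the trace ID out of `traceparent`."""
    value = _lower_keys(headers).get("traceparent", "")
    match = TRACEPARENT_RE.match(value.strip())
    return match.group(1) if match else None


def extract_b3_trace_id(headers: Mapping[str, str]) -> Optional[str]:
    """Simplified B3 multi-header extraction: read `X-B3-TraceId`."""
    return _lower_keys(headers).get("x-b3-traceid")


# What the gateway sends (assumed sample value): W3C only, no B3 headers.
incoming = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
}

print("W3C extractor:", extract_w3c_trace_id(incoming))
print("B3 extractor: ", extract_b3_trace_id(incoming))
```

The B3 side finds no trace context at all, so the receiving service has nothing to continue and starts a fresh trace.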

Fix

Remove those three lines. Delete:

from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.propagate import set_global_textmap

And:

set_global_textmap(B3MultiFormat())

Also remove opentelemetry-propagator-b3==1.24.0 from claims-service/requirements.txt.

Rebuild:

docker compose up --build claims-service

Verify

Submit a claim. In Tempo, search service = api-gateway. The trace should now show claims-service spans nested under the gateway span — all sharing one trace ID.

But the claims spans are still sparse. The FastAPIInstrumentor is missing, so instead of seeing framework-level HTTP spans with route and status, you only see whatever manual spans remain. This is Problem 3.

Problem 3: Missing Instrumentation {#problem-3}

Symptoms

The trace is now connected (Problems 1 and 2 fixed), but claims-service is still a partial black box:

No POST /claims framework span (FastAPIInstrumentor missing)
No MongoDB operation spans (PymongoInstrumentor missing)
No outgoing HTTP spans to policy-service (HTTPXClientInstrumentor missing)
No trace ID in log lines (LoggingInstrumentor missing)
No business attributes on claims (manual spans stripped from main.py)

Fix: Auto-instrumentation

In claims-service/src/instrumentation.py, restore three of the four instrumentors. The fourth, FastAPIInstrumentor, goes in main.py because it needs the app object.

Add these imports:

from opentelemetry.instrumentation.pymongo import PymongoInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.logging import LoggingInstrumentor

Replace the TODO block with:

PymongoInstrumentor().instrument()
HTTPXClientInstrumentor().instrument()
LoggingInstrumentor().instrument(set_logging_format=True)

In claims-service/src/main.py, restore FastAPI instrumentation:

from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# after app = FastAPI(...):
FastAPIInstrumentor.instrument_app(app)
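
For intuition about what the logging piece contributes, the sketch below shows the general idea of stamping each log record with the current trace ID using only the standard library. It is not how LoggingInstrumentor is implemented. The context variable stands in for the SDK's active span context, the `"0"` placeholder for "no active span" is an assumption of this sketch, and `TraceIdFilter` and `build_logger` are hypothetical names.

```python
import contextvars
import io
import logging

# Stand-in for the active span context; the real SDK tracks this itself.
current_trace_id = contextvars.ContextVar("current_trace_id", default="0")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.otelTraceID = current_trace_id.get()
        return True  # never drop the record, only enrich it


def build_logger(stream) -> logging.Logger:
    """Create a logger whose format includes the injected trace ID."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [trace_id=%(otelTraceID)s] %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("claims-demo")
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)

log.info("no active span")
token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("claim submitted")
current_trace_id.reset(token)

print(buffer.getvalue(), end="")
```

The second line carries the trace ID, which is the property that later lets you jump from a log line to its trace.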

Fix: Manual spans

In claims-service/src/main.py, wrap each handler in a manual span with business attributes. For submit_claim:

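# Assumes main.py already has `import time` and a module-level
# `tracer = trace.get_tracer(__name__)`; add them if they were stripped.
@app.post("/claims", response_model=ClaimResponse)
async def submit_claim(claim: ClaimSubmission, request: Request):
    start = time.time()
    with tracer.start_as_current_span("submit_claim") as span: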
        apply_chaos()
        span.set_attribute("claim.customer_id",   claim.customer_id)
        span.set_attribute("claim.type",          claim.claim_type)
        span.set_attribute("claim.amount",        claim.amount)
        span.set_attribute("claim.policy_number", claim.policy_number)

        # ... existing handler logic ...

        span.set_attribute("claim.id",     claim_id)
        span.set_attribute("claim.status", status)

For get_claim:

with tracer.start_as_current_span("get_claim") as span:
    span.set_attribute("claim.id", claim_id)

For list_claims:

with tracer.start_as_current_span("list_claims") as span:
    if customer_id:
        span.set_attribute("filter.customer_id", customer_id)
    # ...
    span.set_attribute("claims.count", len(results))
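
One optional way to keep the handlers tidy is to build the attribute dictionary in a small helper and drop anything that is not a plain value, so the conditional checks do not have to be repeated around every `set_attribute` call. This is a sketch of that design choice, not part of the lab's solution; `span_attributes` and `list_claims_attributes` are hypothetical names.

```python
from typing import Mapping, Optional

# Plain scalar types are the safe choice for span attribute values.
ALLOWED_TYPES = (str, bool, int, float)


def span_attributes(raw: Mapping[str, object]) -> dict:
    """Keep only entries whose values are plain scalars (this drops None)."""
    return {
        key: value
        for key, value in raw.items()
        if isinstance(value, ALLOWED_TYPES)
    }


def list_claims_attributes(customer_id: Optional[str], result_count: int) -> dict:
    """Attributes for the list_claims span; the filter is optional."""
    return span_attributes(
        {
            # `or None` mirrors the lab's `if customer_id:` check, so an
            # empty string is treated the same as no filter.
            "filter.customer_id": customer_id or None,
            "claims.count": result_count,
        }
    )


print(list_claims_attributes(None, 3))
print(list_claims_attributes("CUST001", 2))
```

Inside a handler the result could be applied in one call with `span.set_attributes(list_claims_attributes(customer_id, len(results)))`.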

Rebuild:

docker compose up --build claims-service

Final Verification {#solution}

Submit a claim:

curl -s -X POST http://localhost:3000/api/claims \
  -H "Content-Type: application/json" \
  -d '{
    "customer_id": "CUST003",
    "policy_number": "POL-003",
    "claim_type": "property",
    "amount": 8000,
    "description": "Capstone complete",
    "incident_date": "2026-03-15"
  }'

Check 1: Full trace waterfall

In Tempo, search service = api-gateway. The trace should show:

api-gateway: POST /api/claims                    [=====================================]
  claims-service: POST /claims                   [     =============================   ]
    claims-service: submit_claim                 [      ============================   ]
      claims-service: GET http://policy-service  [       =======                       ]
        policy-service: GET /policy/.../coverage [      =======                        ]
      claims-service: pymongo.insert_one         [                     ====            ]
      claims-service: POST http://notification.. [                          ===        ]
        notification-service: POST /notify       [                          ===        ]

Check 2: Business attributes

Click the submit_claim span. The Attributes panel should show claim.type, claim.amount, claim.status, policy.valid, etc.

Check 3: Metrics

Grafana → Explore → Prometheus:

claims_submitted_total
claims_approved_total
claims_processing_duration_bucket

Check 4: Logs with trace correlation

Grafana → Explore → Loki, filter: {service_name="claims-service"}. Log lines should contain traceId= values. Copy a trace ID from a log line and paste it into Tempo — it should jump directly to that trace.
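
If you export a batch of log lines and want the trace IDs without copying them by hand, a small script can pull them out. The sketch below assumes the ID appears as 32 hex characters after a `trace_id=` or `traceId=` key. The exact key depends on the log format, so the pattern tolerates both. The sample lines are invented for illustration, and `extract_trace_ids` is a hypothetical helper.

```python
import re

# 32 hex characters after trace_id= or traceId= (case-insensitive).
TRACE_ID_RE = re.compile(r"\btrace_?id=([0-9a-f]{32})\b", re.IGNORECASE)


def extract_trace_ids(log_lines):
    """Return unique trace IDs in first-seen order, normalised to lowercase."""
    found = []
    for line in log_lines:
        for match in TRACE_ID_RE.finditer(line):
            trace_id = match.group(1).lower()
            if trace_id not in found:
                found.append(trace_id)
    return found


# Invented sample lines; real output depends on the service's log format.
sample_lines = [
    "INFO [trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=00f067aa0ba902b7] claim stored",
    "INFO traceId=4BF92F3577B34DA6A3CE929D0E0E4736 notification sent",
    "WARN [trace_id=0] no active span",
]

print(extract_trace_ids(sample_lines))
```

Each returned ID can be pasted into the Tempo search box as described above.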

Check 5: Chaos

Open the InsureWatch UI at http://localhost:5173. Go to System Status. Enable High Latency. Submit a claim. In Tempo, the claims-service: submit_claim span should be visibly longer. The chaos is observable.

Reflective Summary

You’ve just restored full observability to a broken distributed system using only OpenTelemetry primitives. What you fixed:

Problem	OTel concept	Consequence of the break
Collector pipeline	Collector architecture	Zero visibility — no data left the services
Propagator mismatch	Context propagation	Fragmented traces — impossible to follow a request
Missing instrumentation	SDK instrumentation API	Blind spots — service ran but was invisible

These three failure modes (pipeline misconfiguration, propagation breaks, and instrumentation gaps) are among the most common causes of “our traces are broken” tickets in production environments.

The pattern for diagnosing observability problems:

Can I see anything? If Grafana is empty, check the export pipeline first. Are services sending? Is the Collector/backend receiving?
Are the traces connected? If spans exist but are disconnected, check propagator configuration on every service boundary.
Is anything missing from a trace I can see? Check instrumentation coverage — which services have auto-instrumentation, which have manual spans, which have neither.

Start from the pipeline and work inward toward the code.
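
The three questions above form an ordered decision procedure, which can be sketched as a tiny function. The names `Layer` and `triage` are hypothetical, and the three boolean inputs are answers you would gather by hand in Grafana, not values read from any API.

```python
from enum import Enum


class Layer(Enum):
    """Where to look next, in the order the questions are asked."""

    PIPELINE = "check the export pipeline: are services sending, is the backend receiving?"
    PROPAGATION = "check propagator configuration on every service boundary"
    INSTRUMENTATION = "check instrumentation coverage: auto, manual, or neither"
    HEALTHY = "nothing to fix at these three layers"


def triage(any_telemetry: bool, traces_connected: bool, spans_complete: bool) -> Layer:
    """Return the first layer that fails, working from the pipeline inward."""
    if not any_telemetry:
        return Layer.PIPELINE
    if not traces_connected:
        return Layer.PROPAGATION
    if not spans_complete:
        return Layer.INSTRUMENTATION
    return Layer.HEALTHY


# The state of this lab after Problem 1 is fixed but before Problem 2:
next_step = triage(any_telemetry=True, traces_connected=False, spans_complete=False)
print(next_step.name, "->", next_step.value)
```

The early returns encode the ordering: there is no point checking propagation while nothing reaches the backend, and no point auditing span coverage while traces are still fragmented.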

What Comes Next

You now have a working, fully-instrumented polyglot application. Where to go from here:

Sampling — not every trace is worth keeping. Learn about head-based and tail-based sampling to control costs without losing signal.
Semantic conventions — the attributes you set (claim.type, policy.valid) should follow the OTel semantic conventions where applicable. Review Module 4 for the standard namespaces.
Production patterns — scaling the Collector to agent + gateway topology, separating concerns between the service mesh layer and the application layer.
Certification — the OTel Certified Associate exam tests exactly the concepts this lab exercises: SDK initialization, instrumentation APIs, Collector pipeline configuration, and context propagation mechanics.
