Lab 2: Instrument the Black Box

What You’ll Do

Start InsureWatch on the lab/2-instrumentation branch. claims-service starts and handles requests normally — but it generates zero spans. The TracerProvider is configured and exporting, but all instrumentation has been stripped out. You’ll restore auto-instrumentation first, then add manual spans with meaningful business attributes.

Branch: lab/2-instrumentation
Primary files: claims-service/src/instrumentation.py, claims-service/src/main.py

Setup

cd insurewatch
git checkout lab/2-instrumentation
docker compose up --build

The Symptom

Submit a claim:

curl -s -X POST http://localhost:3000/api/claims \
  -H "Content-Type: application/json" \
  -d '{
    "customer_id": "CUST001",
    "policy_number": "POL-001",
    "claim_type": "auto",
    "amount": 2500,
    "description": "Fender bender",
    "incident_date": "2026-03-01"
  }'

Open Grafana → Explore → Tempo. Search service = api-gateway. Click the trace.

What you see:

api-gateway: POST /api/claims  [============================================]

The gateway span is there. But claims-service contributes nothing. The trace stops at the gateway boundary, even though the claim was processed and saved to MongoDB.

Now search service = claims-service. No results. The service is running — it’s just invisible.

Part 1: Restore Auto-Instrumentation

What was removed

Open claims-service/src/instrumentation.py. Find the comment block near the bottom:

# LAB 2: Auto-instrumentations REMOVED — add them back!
# TODO: PymongoInstrumentor().instrument()
# TODO: HTTPXClientInstrumentor().instrument()
# TODO: LoggingInstrumentor().instrument(set_logging_format=True)

Open claims-service/src/main.py. Find:

# LAB 2: FastAPIInstrumentor import and .instrument_app() call REMOVED
# TODO: from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

And:

# LAB 2: FastAPIInstrumentor.instrument_app(app) REMOVED
# TODO: FastAPIInstrumentor.instrument_app(app)

Four instrumentors were removed. Each covers a different gap:

Instrumentor	What it captures
`FastAPIInstrumentor`	A span per HTTP request, status code, route
`PymongoInstrumentor`	A span per MongoDB operation — collection, command, duration
`HTTPXClientInstrumentor`	A span for each outgoing HTTP call claims makes to policy-service and notification-service
`LoggingInstrumentor`	Injects the current trace and span IDs (`otelTraceID`, `otelSpanID`) into every Python log record

The fix

In claims-service/src/instrumentation.py, replace the TODO block:

from opentelemetry.instrumentation.pymongo import PymongoInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.logging import LoggingInstrumentor

# ... (keep the existing TracerProvider/MeterProvider/LoggerProvider setup) ...

PymongoInstrumentor().instrument()
HTTPXClientInstrumentor().instrument()
LoggingInstrumentor().instrument(set_logging_format=True)

In claims-service/src/main.py, restore the FastAPI instrumentor:

from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# ...after app = FastAPI(...):
FastAPIInstrumentor.instrument_app(app)

Placement matters: instrument_app() must be called after app = FastAPI(...) but before the app starts handling requests. The module-level PymongoInstrumentor().instrument() in instrumentation.py must run before any MongoDB connections are made — which is why instrumentation.py is imported first.
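
The ordering rule for MongoDB follows from how listener-based instrumentation attaches. The toy model below is a sketch, not pymongo or OpenTelemetry code: ToyClient, register, and recorded_spans are hypothetical names. It assumes, as pymongo documents for its global monitoring listeners, that a client picks up the registered listeners when it is constructed, so a client created too early never reports anything.

```python
from typing import Callable, List

# Hypothetical global registry, standing in for a driver's listener list.
_global_listeners: List[Callable[[str], None]] = []


def register(listener: Callable[[str], None]) -> None:
    """Stand-in for the instrument() step: add a listener globally."""
    _global_listeners.append(listener)


class ToyClient:
    def __init__(self, name: str) -> None:
        self.name = name
        # Snapshot taken once, at construction time.
        self._listeners = list(_global_listeners)

    def insert_one(self, document: dict) -> bool:
        # Notify only the listeners this client saw when it was created.
        for listener in self._listeners:
            listener(f"{self.name}: insert_one")
        return True


recorded_spans: List[str] = []

early_client = ToyClient("early")   # created BEFORE instrumentation
register(recorded_spans.append)     # the "instrument()" step
late_client = ToyClient("late")     # created AFTER instrumentation

early_client.insert_one({"claim_id": "demo-1"})
late_client.insert_one({"claim_id": "demo-2"})

print(recorded_spans)  # ['late: insert_one']
```

Only the client created after registration produces a record. The early client works normally but stays invisible, which is the same symptom this lab starts with.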

Verify auto-instrumentation

Rebuild:

docker compose up --build claims-service

Submit a claim. In Tempo, search service = claims-service. You should now see spans:

claims-service: POST /claims                 [=================================]
  claims-service: pymongo.insert_one         [             ====                ]
  claims-service: GET http://policy-service  [      =====                      ]
  claims-service: POST http://notification.. [                   ===           ]

All automatically generated — no code changes to the handler itself.

Part 2: Add Manual Spans with Business Context

Auto-instrumentation covers the infrastructure layer — HTTP, database, framework routing. It doesn’t know anything about your business logic. It can tell you a MongoDB insert took 12ms, but not what kind of claim was submitted, whether it was auto-approved, or what the coverage limit was.

That context comes from manual spans and attributes.

What to add

In claims-service/src/main.py, the submit_claim handler currently looks like:

@app.post("/claims", response_model=ClaimResponse)
async def submit_claim(claim: ClaimSubmission, request: Request):
    start = time.time()
    # LAB 2: Manual span REMOVED
    # TODO: add tracer.start_as_current_span("submit_claim") and set attributes
    apply_chaos()
    ...

Wrap the handler body in a manual span and attach business attributes:

@app.post("/claims", response_model=ClaimResponse)
async def submit_claim(claim: ClaimSubmission, request: Request):
    start = time.time()
    with tracer.start_as_current_span("submit_claim") as span:
        apply_chaos()

        span.set_attribute("claim.customer_id",   claim.customer_id)
        span.set_attribute("claim.type",          claim.claim_type)
        span.set_attribute("claim.amount",        claim.amount)
        span.set_attribute("claim.policy_number", claim.policy_number)

        # ... (existing handler logic) ...

        # After MongoDB insert, add the claim ID
        span.set_attribute("claim.id",     claim_id)
        span.set_attribute("claim.status", status)

        # After policy validation, record the result
        span.set_attribute("policy.valid", True)
        span.set_attribute("policy.coverage_limit", policy_data.get("coverage_limit", 0))

Do the same for get_claim and list_claims:

@app.get("/claims/{claim_id}")
async def get_claim(claim_id: str):
    with tracer.start_as_current_span("get_claim") as span:
        span.set_attribute("claim.id", claim_id)
        apply_chaos()
        # ... existing logic ...

@app.get("/claims")
async def list_claims(customer_id: Optional[str] = None):
    with tracer.start_as_current_span("list_claims") as span:
        if customer_id:
            span.set_attribute("filter.customer_id", customer_id)
        # ... existing logic ...
        span.set_attribute("claims.count", len(results))

Why manual spans on top of auto-instrumentation?

FastAPIInstrumentor already creates a span for POST /claims. Why add another?

Because the auto-generated span is an HTTP span — it knows the method, route, status code. The manual span is a business span — it knows this was a medical claim for CUST001 worth $2,500 that was auto-approved against a $50,000 policy.

In Grafana, when you’re investigating a slow claim or an unexpectedly rejected one, the HTTP span tells you something happened here. The manual span tells you what happened and why.

The two spans are nested — the manual submit_claim span is a child of the auto-generated POST /claims span. In Tempo, you see the full context: infrastructure (auto) + business (manual) in the same waterfall.

Record exceptions on errors

When the policy service is unreachable, record it on the span:

except httpx.RequestError as e:
    logger.error(f"Policy service unreachable: {e}")
    span.record_exception(e)
    span.set_status(StatusCode.ERROR, str(e))
    span.set_attribute("policy.valid", False)

This requires one import at the top of main.py:

from opentelemetry.trace import StatusCode

span.record_exception(e) attaches the exception type, message, and stack trace as a span event. span.set_status(StatusCode.ERROR, ...) marks the span as failed — Tempo surfaces this as a red error indicator in the trace view.

Verification

Rebuild claims-service:

docker compose up --build claims-service

Submit a high-value claim (won’t be auto-approved):

curl -s -X POST http://localhost:3000/api/claims \
  -H "Content-Type: application/json" \
  -d '{
    "customer_id": "CUST001",
    "policy_number": "POL-001",
    "claim_type": "property",
    "amount": 15000,
    "description": "Water damage",
    "incident_date": "2026-03-15"
  }'

In Tempo, find this trace. Click the submit_claim span. In the Attributes panel, you should see:

claim.customer_id: CUST001
claim.type: property
claim.amount: 15000
claim.status: pending
policy.valid: true
policy.coverage_limit: <value>

Now stop the policy-service to trigger an error:

docker compose stop policy-service

Submit another claim. In Tempo, the submit_claim span should show as an error (red) with the exception event attached. Restart policy-service:

docker compose start policy-service

What You Learned

Auto-instrumentation covers infrastructure; manual spans cover intent. FastAPIInstrumentor knows it processed an HTTP POST. It doesn’t know the claim was for property damage worth $15,000 that went to manual review. That business context only exists if you add it.

Span attributes are queryable. In production, you can filter Tempo traces by claim.type, claim.status, or any attribute you set. This turns distributed traces into a searchable audit trail of business events — not just infrastructure telemetry.
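
As a hedged sketch of what such a filter can look like, the helper below turns attribute/value pairs into a TraceQL span filter and a search URL. It assumes Tempo has TraceQL search enabled and that its HTTP API is reachable at http://localhost:3200, which may not match the lab's compose file. traceql_filter and tempo_search_url are hypothetical helpers, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def traceql_filter(attributes: dict) -> str:
    """Render span attributes as a TraceQL filter joined with &&."""
    parts = []
    for name, value in attributes.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            # Quote strings and escape embedded double quotes.
            rendered = '"' + str(value).replace('"', '\\"') + '"'
        parts.append(f"span.{name} = {rendered}")
    return "{ " + " && ".join(parts) + " }"


def tempo_search_url(base_url: str, attributes: dict, limit: int = 20) -> str:
    """Build a URL for Tempo's search endpoint (base_url is an assumption)."""
    query = urlencode({"q": traceql_filter(attributes), "limit": limit})
    return f"{base_url}/api/search?{query}"


query = traceql_filter({"claim.type": "property", "policy.valid": False})
url = tempo_search_url(
    "http://localhost:3200",
    {"claim.type": "property", "policy.valid": False},
)

print(query)  # { span.claim.type = "property" && span.policy.valid = false }
print(url)
```

You can paste the printed filter into the TraceQL editor in Grafana → Explore → Tempo to find property claims whose policy check failed.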

Order and placement are critical for auto-instrumentation:

instrumentation.py must be imported before anything else — it patches the libraries at module load time
FastAPIInstrumentor.instrument_app(app) must be called after app = FastAPI(...) but before the first request
PymongoInstrumentor().instrument() must run before the MongoDB client is created

record_exception() vs a log message. logger.error() writes a log. span.record_exception(e) attaches the exception to the trace — it becomes visible in Tempo without needing to correlate trace IDs with Loki. Both are good practice; the span attachment makes the error immediately visible to whoever is looking at the trace.

Bonus Challenges

1. Add a span event for the auto-approval decision:

if claim.amount < 1000:
    status = "auto_approved"
    span.add_event("auto_approved", {"threshold": 1000, "amount": claim.amount})

Span events (as opposed to attributes) are timestamped points within the span’s duration. In Tempo, they appear as markers on the span timeline. Use events for things that happened at a moment, attributes for things that describe the whole span.

2. Check the metric counters alongside the span:

The claims_submitted and claims_approved counters are already in the code. Check that they appear in Grafana → Explore → Prometheus. Query:

claims_approved_total

Metrics and traces are complementary: the metric tells you how many claims were approved in the last hour; the trace tells you which specific claim was approved and what happened during processing.