Module 3: Instrumentation — SDK, API, Auto vs Manual

The Assumption to Destroy

“Auto-instrumentation means you don’t need to understand OTel.”

This is the most dangerous thing you can believe about OpenTelemetry, and it’s where most teams plateau. They install the Java agent or run opentelemetry-instrument in front of their Python service, see traces in Jaeger, and declare victory. The spans look right. The HTTP routes are there. The database queries show up. What else is there?

Here’s the problem: the agent only knows about libraries. It has no idea what your code does.

When InsureWatch’s Python API calculates an insurance quote — loading risk factor tables, running actuarial models, applying discounts, computing the final premium — the auto-instrumentation agent sees none of it. It sees the HTTP request come in and the database query go out. The 200 milliseconds of business logic that happens between them is a black box. When that logic is slow, or wrong, or raises an exception that gets swallowed somewhere in a try/except, you have no observability into it whatsoever.

Auto-instrumentation is a starting point, not a destination. Understanding the API/SDK split, provider initialization, and manual instrumentation is what turns “I have traces” into “I understand my system.”

The OTCA exam has a 46% weighting on the API and SDK domain. That is not because the exam authors wrote trick questions; it is because instrumentation is genuinely the most important part of OTel to understand correctly. Let’s build the full picture.

The API/SDK Split

This is the foundational design decision that everything else in OTel is built on, and it’s one of the most elegant pieces of the specification.

OpenTelemetry ships two separate packages for every language: the API and the SDK. They are deliberately kept separate, and the reason matters.

The API: A Stable Contract With No Behavior

The opentelemetry-api package is tiny. In Python, it’s around 200KB installed. It has almost no dependencies. When you call trace.get_tracer() from the API, it returns a Tracer object. When you call tracer.start_span() on that object, it returns a Span. Everything compiles and runs without error.

But by default, with no SDK installed, those spans do nothing. They record nothing, export nothing, and consume essentially zero CPU or memory. The API’s default implementation is a no-op.

This is intentional. The API is a stable contract — a set of interfaces and method signatures that library authors can code against without knowing anything about where the telemetry will go, or whether any SDK is even configured.

The SDK: The Pluggable Implementation

The opentelemetry-sdk package is where behavior lives. The SDK provides the actual TracerProvider, MeterProvider, and LoggerProvider implementations that do real work: attaching Resources, applying samplers, batching spans, and exporting them via OTLP to a Collector or backend.

When your application initializes the SDK and registers a provider, the API suddenly has something to talk to. That same trace.get_tracer() call that previously returned a no-op tracer now returns a fully functional tracer backed by the SDK.

┌─────────────────────────────────────────────────────────────┐
│                  Your Application                            │
│                                                              │
│  app_code.py ──────────► opentelemetry-sdk                  │
│                                   │                          │
│                                   ▼                          │
│                         opentelemetry-api ◄── shared_lib.py  │
│                                   │                          │
│                                   ▼                          │
│                         OTLP Exporter → Collector            │
└─────────────────────────────────────────────────────────────┘

Your application code imports the SDK directly. It initializes a TracerProvider, configures exporters, and registers the provider with the global API. Library code — the shared library in this diagram, or any third-party package you’ve installed — imports only the API. It creates spans using the API interfaces. The SDK provides the implementation that those API calls dispatch to.

Why Library Authors Must Never Import the SDK

This is the rule that matters most for the exam, and for real production code.

If a library author imports opentelemetry-sdk as a dependency, they’ve made a critical error. They’ve locked in a specific implementation and forced it on every application that uses their library. Applications that want to use a different SDK version, or a different exporter, or no telemetry at all, can’t — because the library brought along its own SDK with its own opinions.

The API is versioned with strict backward compatibility guarantees. Library authors can depend on the API safely, knowing it will remain stable across minor version bumps. The SDK makes no such guarantees about its internal implementation details.

The correct pattern: libraries import opentelemetry-api. They call trace.get_tracer(__name__) to get a tracer. If the application has initialized an SDK-backed provider, the library’s spans flow through it. If the application hasn’t configured any SDK, the library’s tracer calls are no-ops — zero overhead. The library has no knowledge of and no dependency on whichever SDK the application chose.

Exam callout: The OTCA tests this directly. Library authors import the API only, never the SDK. Applications import both. The SDK registers its providers with the API’s global registry. This is how OTel achieves zero-cost telemetry in environments where no SDK is configured.

Takeaway: The API/SDK split is not a packaging quirk — it’s the design choice that makes OTel safe to embed in shared libraries. When you see opentelemetry-api in a library’s dependencies, that’s correct. When you see opentelemetry-sdk in a library’s dependencies, that’s a red flag.

TracerProvider, MeterProvider, LoggerProvider

Each OTel signal type has its own provider. The provider is the factory and lifecycle manager for that signal. Understanding how providers are initialized — and shut down — is load-bearing knowledge for both production systems and the exam.

Providers Are Entry Points, Not Utilities

Every Tracer, Meter, and Logger your application creates comes from a provider. The provider is where you attach:

Resource — the identity of the process (covered in the next section)
Exporters — where telemetry goes (OTLP to a Collector, direct to a backend, or nowhere)
Processors — for traces: how spans are batched and exported
Samplers — for traces: which spans to keep
Views — for metrics: how instruments are aggregated

When you call trace.get_tracer("insurewatch.api"), you’re asking the registered TracerProvider to create a Tracer scoped to that instrumentation scope. All spans created by that tracer carry insurewatch.api as their instrumentation scope name — a field that tells you which component produced the span.

Initializing a TracerProvider

Here’s a complete, production-ready TracerProvider initialization for InsureWatch’s Python API service. Every line matters.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "insurewatch-api",
    "service.version": "1.2.0",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("insurewatch.api", "1.2.0")

Walk through what each piece does:

Resource.create(...) — builds the identity of this process. These attributes attach to every span exported. The service.name attribute is mandatory for meaningful observability. Without it, your backend has no idea which service emitted the spans.
TracerProvider(resource=resource) — creates the provider with the Resource attached. The Resource is immutable once the provider is created. You can’t change it after initialization.
BatchSpanProcessor(OTLPSpanExporter(...)) — wraps the exporter in a BatchSpanProcessor. This is almost always the right choice in production. The batch processor queues spans in memory and exports them in batches on a background thread. The alternative, SimpleSpanProcessor, exports synchronously on the same thread that ends each span — it’s useful for debugging, not for production.
trace.set_tracer_provider(provider) — registers this provider as the global default. After this call, any call to trace.get_tracer() anywhere in the process will use this provider, including in third-party libraries that imported the API.
trace.get_tracer("insurewatch.api", "1.2.0") — creates a Tracer with a scope name and version. The scope name identifies which component in your application created these spans. Use a reverse-domain or module-path style name that makes the source obvious in your backend’s UI.

Graceful Shutdown — Why It Matters

The BatchSpanProcessor uses an in-memory queue. When your process exits, spans in that queue that haven’t been exported yet will be lost. In a normal production deployment, you might lose the last few seconds of traces before a pod terminates — the exact window where a deployment error manifests.

You avoid this with a shutdown call during process teardown:

import atexit

atexit.register(provider.shutdown)

Or, in a FastAPI application:

from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    yield
    provider.shutdown()

app = FastAPI(lifespan=lifespan)

provider.shutdown() flushes the BatchSpanProcessor’s queue, waits for all in-flight exports to complete, and then shuts down the exporter connection. It’s a blocking call and it can take a few seconds. That’s acceptable — losing your last traces during a deployment event is worse than a slightly slower shutdown.

Exam callout: The OTCA tests graceful shutdown. Know that BatchSpanProcessor buffers spans in memory. Know that provider.shutdown() flushes that buffer. Know that skipping shutdown means losing buffered telemetry on process exit.

The Global Provider Pattern: Application vs Library

The trace.set_tracer_provider(provider) call sets a global. Any code in the process that calls trace.get_tracer() gets a tracer backed by this provider. This is convenient for application code — you initialize the provider once at startup, and every module in your application gets it automatically.

For library code, this convenience becomes a constraint. A library that calls trace.get_tracer(__name__) without supplying a provider will use whatever provider the application registered. If the application registered a no-op provider (or no provider at all), the library’s spans disappear silently. That’s fine; it’s the intended behavior.

But what if a library wants to make its tracer configurable? The correct pattern is to accept an optional TracerProvider parameter:

def create_quote_engine(
    tracer_provider: trace.TracerProvider | None = None
) -> QuoteEngine:
    tp = tracer_provider or trace.get_tracer_provider()
    tracer = tp.get_tracer("insurewatch.quote_engine", "1.0.0")
    return QuoteEngine(tracer=tracer)

If the caller passes a provider explicitly, use it. If not, fall back to the global. This makes the library testable in isolation — you can pass a mock provider in tests without touching the global state.

Exam callout: Libraries should never call trace.set_tracer_provider(). Setting the global provider is the application’s responsibility. Libraries accept provider parameters or use the global — they never set it.

Takeaway: Initialize all three providers at application startup, register them globally, and call provider.shutdown() on process exit. For library code, accept providers as parameters rather than relying on or mutating global state.

Resource: The Identity of a Process

Every span, metric data point, and log record produced by a process carries a Resource. The Resource is a set of attributes that describe the entity producing the telemetry — the service, its version, where it’s running, and what environment it’s in.

The Resource is not per-request or per-span. It is set once, at SDK initialization, and attached to every piece of telemetry that process emits for its entire lifetime.

┌─────────────────────────────────────────────────────────┐
│  RESOURCE (set once at SDK init, attached to everything) │
│                                                           │
│  service.name              = "insurewatch-api"            │
│  service.version           = "1.2.0"                     │
│  service.instance.id       = "pod-7b9d4f-xk2p1"          │
│  deployment.environment    = "production"                 │
│  host.name                 = "10.0.1.45"                 │
│  cloud.provider            = "aws"                       │
│  cloud.region              = "us-east-1"                 │
│  k8s.namespace.name        = "insurewatch"               │
│  k8s.pod.name              = "api-7b9d4f-xk2p1"          │
│                                                           │
└───────────────────┬───────────────────┬──────────────────┘
                    │                   │                   │
                    ▼                   ▼                   ▼
             ┌──────────┐        ┌──────────┐        ┌──────────┐
             │  TRACES  │        │  METRICS │        │   LOGS   │
             │          │        │          │        │          │
             │ All spans│        │ All meter│        │ All log  │
             │ carry    │        │ datapts  │        │ records  │
             │ Resource │        │ carry    │        │ carry    │
             └──────────┘        │ Resource │        │ Resource │
                                 └──────────┘        └──────────┘

Key Resource Attributes

The OTel semantic conventions define standardized attribute names for resources. The ones you need to know:

service.name — The logical name of your service. This is the most important attribute. If you don’t set it, most backends will show “unknown_service” and you’ll have a bad time. Required.
service.version — The version of your deployed code. Set this from your build pipeline or environment variable. When a regression appears, service.version is how you correlate it to a specific deployment.
service.instance.id — Identifies a specific running instance. In Kubernetes, this is typically the pod name. Critical for distinguishing between multiple replicas of the same service. If you’re running five pods of insurewatch-api and one of them is misbehaving, service.instance.id is what isolates it.
deployment.environment — "production", "staging", "development". Essential for keeping telemetry from different environments separated in your backend. Set it from an environment variable in your deployment manifests.

Resource Detectors: Auto-Populating Infrastructure Attributes

Manually setting host.name, cloud.region, and k8s.pod.name in code is error-prone and couples your application to its deployment environment. OTel solves this with Resource detectors — components that auto-detect infrastructure attributes from the environment at startup.

from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.extension.aws.resource.ec2 import AwsEc2ResourceDetector
from opentelemetry.sdk.extension.aws.resource.eks import AwsEksResourceDetector

# Detectors run at startup and merge their results into the base Resource
resource = Resource.create({
    "service.name": "insurewatch-api",
    "service.version": "1.2.0",
    "deployment.environment": "production",
}).merge(AwsEc2ResourceDetector().detect()).merge(AwsEksResourceDetector().detect())

The detectors query the AWS metadata API, environment variables, and downward API files (for Kubernetes) to populate attributes like cloud.region, host.id, k8s.pod.name, and k8s.namespace.name automatically. Your application code stays environment-agnostic. The same code runs on a laptop (where the cloud detector finds nothing and produces no attributes) and in production EKS (where it populates a full set of infrastructure attributes).

Exam callout: The Resource is set at SDK initialization and cannot be changed at runtime. It is the same for every span, metric, and log from that process. Resource detectors run at startup to auto-populate cloud, host, and container attributes. Know that service.name is the most important Resource attribute, and that omitting it causes most backends to label your telemetry as “unknown_service.”

Takeaway: Define service.name, service.version, and deployment.environment explicitly in code. Use resource detectors for infrastructure attributes. Never hardcode pod names or host IPs — let the detectors find them.

Auto-Instrumentation: What It Is and What It Misses

Auto-instrumentation is the fastest way to get telemetry into an application. Install the agent or run the instrumentation CLI, and within minutes you have spans for every incoming HTTP request, every outgoing database query, every Redis call, every downstream service hop. For a new service that has zero observability, it’s transformative.

The mechanism varies by language:

Java — The opentelemetry-javaagent.jar uses Java’s agent API to apply bytecode manipulation at class load time. When the JVM loads org.springframework.web.servlet.DispatcherServlet, the agent rewrites its bytecode to inject span creation and context propagation. Your compiled code is never changed — the manipulation happens in memory at runtime.
Python — The opentelemetry-instrument CLI (or programmatic bootstrap) uses monkey-patching. At startup, it imports the target library and replaces key methods with instrumented versions. requests.Session.send gets replaced with a version that creates a CLIENT span and injects W3C Trace Context headers before calling the original method.
Node.js — Similar to Python: patches are applied at module load time using Node’s module system hooks.
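The monkey-patching idea fits in a few lines of plain Python. This is a toy illustration of the technique, not the code the OTel instrumentation packages actually ship: HttpClient is a made-up stand-in for a third-party library, and the wrapper appends to a list instead of creating spans.

```python
import functools
import time


class HttpClient:
    """Hypothetical stand-in for a third-party client library."""

    def send(self, url: str) -> int:
        return 200


recorded = []  # (operation name, url, duration in seconds)


def instrument(cls, method_name: str) -> None:
    """Replace cls.method_name with a wrapper that times each call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, url, *args, **kwargs):
        start = time.perf_counter()
        try:
            return original(self, url, *args, **kwargs)
        finally:
            # Runs on success and on failure, like ending a span.
            elapsed = time.perf_counter() - start
            recorded.append((f"{cls.__name__}.{method_name}", url, elapsed))

    setattr(cls, method_name, wrapper)


instrument(HttpClient, "send")

# Caller code is unchanged, yet the call is now observed.
status = HttpClient().send("http://rates.internal/v1/rates")
print(status, recorded[0][0])
```

Notice what the wrapper can and cannot know: it sees the method name and the URL, because those belong to the library, but nothing about why your code made the call.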

Setting Up Python Auto-Instrumentation

The simplest approach is the CLI wrapper:

opentelemetry-instrument \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --exporter_otlp_endpoint http://collector:4317 \
  --service_name insurewatch-api \
  python app.py

This starts your application with auto-instrumentation applied, without modifying a single line of application code. For programmatic control, such as when you need to initialize your SDK configuration before the instrumentation runs, call the individual instrumentor classes directly:

from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Call after SDK/provider initialization
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument(engine=db.engine)

Each instrument() call patches the specific library. FlaskInstrumentor wraps every route handler with a SERVER span. RequestsInstrumentor wraps requests.Session.send to create CLIENT spans and propagate headers. SQLAlchemyInstrumentor wraps query execution to create spans with db.statement and db.system attributes.

What Auto-Instrumentation Covers

A well-configured auto-instrumentation setup gives you:

Inbound HTTP — SERVER spans for every request your framework handles, with http.method, http.route, http.status_code, and other HTTP semantic convention attributes
Outbound HTTP — CLIENT spans for every requests or httpx call, with trace context propagated as W3C headers
Database drivers — CLIENT spans for SQLAlchemy, psycopg2, pymongo, redis-py, and others, with db.system, db.name, and optionally db.statement
Messaging clients — PRODUCER and CONSUMER spans for Kafka, RabbitMQ, and SQS clients
Async frameworks — Spans for Celery tasks, asyncio-based frameworks, gRPC calls

For InsureWatch’s Python API, auto-instrumentation produces traces that show each incoming HTTP request, the SQLAlchemy queries it triggers, and any downstream HTTP calls. That’s a solid foundation.

The Coverage Gap: Where Auto-Instrumentation Ends

┌────────────────────────────────────────────────────────┐
│  HTTP Layer (Flask/Express/Spring)  <- AUTO-INSTRUMENTED│
│  ┌──────────────────────────────────────────────────┐  │
│  │  Route Handler                                    │  │
│  │  ┌────────────────────────────────────────────┐  │  │
│  │  │  Business Logic                             │  │  │
│  │  │  calculate_quote()   <- INVISIBLE TO AGENT  │  │  │
│  │  │  apply_risk_factors()  <- INVISIBLE TO AGENT │  │  │
│  │  │  compute_discount()  <- INVISIBLE TO AGENT  │  │  │
│  │  └────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────┘  │
│  DB Driver (SQLAlchemy/pg)         <- AUTO-INSTRUMENTED │
└────────────────────────────────────────────────────────┘

Everything inside the route handler, between the framework handing off the request and the database query executing, is invisible. The agent cannot infer meaning from your domain.

Consider InsureWatch’s quote calculation. The API receives a POST /api/quote request. The agent creates a SERVER span. Inside the handler, the code:

1. Loads the applicant’s profile from the database (SQLAlchemy, auto-instrumented)
2. Runs calculate_base_premium() against a risk factor model
3. Calls apply_loyalty_discount() with the applicant’s policy history
4. Calls apply_regional_risk_adjustment() with the applicant’s postal code
5. Runs final compliance checks
6. Stores the quote to the database (SQLAlchemy, auto-instrumented)

Steps 2 through 5 are completely opaque. When a quote calculation takes 400ms instead of the usual 40ms, you have no idea which of those four steps is responsible. The trace shows: SERVER span started, two database queries, SERVER span ended. The 350ms gap in the middle is a black box.

That gap is your business logic. The agent works by wrapping known library calls; it cannot wrap code that has no library boundary.

Exam callout: The OTCA tests this boundary explicitly. Auto-instrumentation covers framework and library calls. Application-specific business logic — functions you wrote — requires manual instrumentation. Know which is which.

Takeaway: Auto-instrumentation is a starting point, not a complete solution. It covers the I/O boundary. Everything between the I/O calls is invisible until you instrument it manually.

Manual Instrumentation: Filling the Gaps

Once you understand what the agent misses, manual instrumentation becomes purposeful rather than speculative. You’re not adding spans everywhere — you’re adding spans where the agent has no visibility and where the context matters for debugging.

Creating Spans with the Tracer

The fundamental pattern is the context manager:

from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("insurewatch.quotes")

# _run_pricing_model and PricingModelError are defined elsewhere in the application
def calculate_quote(policy_type: str, risk_factors: dict) -> float:
    with tracer.start_as_current_span("calculate_quote") as span:
        span.set_attribute("policy.type", policy_type)
        span.set_attribute("policy.risk_score", risk_factors.get("score", 0))

        try:
            result = _run_pricing_model(policy_type, risk_factors)
            span.set_attribute("quote.amount_usd", result)
            return result
        except PricingModelError as e:
            span.record_exception(e)
            span.set_status(StatusCode.ERROR, str(e))
            raise

Walk through what this does:

tracer.start_as_current_span("calculate_quote") — creates a span and installs it as the active span in the current context. Any code that runs inside this with block, including downstream function calls, will see this as their parent span. When the with block exits, the span ends automatically.
span.set_attribute(...) — attaches key/value metadata to the span. These attributes are indexed by most backends and are queryable — you can filter traces by policy.type="home" or find all quotes above a certain quote.amount_usd. Set attributes that represent facts about the operation, not things that happened during it.
span.record_exception(e) — records the full exception — type, message, stack trace — as a span event. The event is timestamped to the moment the exception was caught. This is separate from setting the span status; you want both.
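span.set_status(StatusCode.ERROR, str(e)) — marks the span as errored, with a description you control. A span whose status is never set completes as UNSET, so a recorded exception on its own does not make the span surface as an error in trace UIs. Note that in the Python SDK, start_as_current_span by default also records any exception that escapes the with block and sets ERROR for you (its record_exception and set_status_on_exception parameters), so the explicit calls matter most when you catch and handle the exception inside the span.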
raise — re-raises the exception. It’s critical that you don’t swallow the exception here just because you’ve recorded it. The span records it; the calling code still needs to handle it.

Span Attributes vs Span Events

These are both ways to add information to a span, and they serve different purposes. Getting this distinction right matters for both exam and production use.

Span attributes are key/value pairs that describe the operation itself. They’re set before the span ends. They describe what the operation was doing: what input it received, what output it produced, what decisions it made. Attributes are indexed and queryable — your backend can filter, aggregate, and alert on them.

span.set_attribute("policy.type", "home")          # what type of policy
span.set_attribute("policy.risk_score", 0.72)       # what risk score was calculated
span.set_attribute("quote.amount_usd", 1240.00)     # what the result was
span.set_attribute("pricing_model.version", "v3.1") # which model was used

Span events are timestamped annotations that record things that happened during the span’s execution. They’re like structured log entries that are attached to a specific span. Use them for significant moments within the operation’s lifecycle.

span.add_event("cache_miss", {"cache.key": "risk_table:home:CA"})
span.add_event("pricing_model_loaded", {"model.cache_hit": True})
span.add_event("discount_applied", {
    "discount.type": "loyalty",
    "discount.percentage": 5.0
})

The rule of thumb: if you’d filter traces by it, use an attribute. If you’d read it to understand the sequence of events, use an event. Attributes answer “what was this span”; events answer “what happened while this span was running.”

A span carrying 50 attributes is usually a sign that span events should be used for some of them. A span event that only has two fields and no timestamp relevance is usually better as an attribute.

Exam callout: The OTCA tests the attributes vs events distinction. Attributes are indexed key/value pairs that describe the operation. Events are timestamped annotations that record things that happened during the span. Attributes are for filtering; events are for sequencing.

Span Status: The Three Values

Span status has exactly three values. Know them, know when to use each, and know the exam traps.

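UNSET — The default. No explicit status has been set. The span is assumed to represent a successful operation. Most spans in a healthy system should have UNSET status. You do not need to explicitly set OK unless you want to mark the span as definitively successful, for example to override an ERROR that instrumentation would otherwise set on that same span. A child span’s status never propagates to its parent, so there is nothing to override there.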
OK — Explicitly marked as successful. The spec says: once a span has been marked OK, it cannot be changed to ERROR. Use OK only when you need to assert explicit success, which is rare. If you’re setting OK on every successful span, you’re doing extra work with no benefit.
ERROR — An unexpected failure occurred. Set this when an exception propagates out of the span, when a downstream call returns an error that breaks your operation, or when your code detects a condition that represents a failure.

The exam trap: HTTP status codes are not span status. A 404 Not Found is not automatically an error span. If your API is designed to return 404 when a requested policy doesn’t exist, that’s a valid, expected outcome. The operation succeeded: it correctly determined the resource doesn’t exist. The span should remain UNSET, which is also what the HTTP semantic conventions prescribe for 4xx responses on SERVER spans.

ERROR is for when something broke unexpectedly: the database is down, the pricing model threw an unhandled exception, the upstream service returned 503, a timeout expired.

from opentelemetry.trace import StatusCode

# Wrong: 404 is not an error
if response.status_code == 404:
    span.set_status(StatusCode.ERROR, "not found")  # don't do this

# Right: 404 is a valid outcome, no status change needed
if response.status_code == 404:
    span.set_attribute("policy.found", False)
    return None  # expected path

# Right: unexpected failure is an error
try:
    result = pricing_service.get_rates(postal_code)
except PricingServiceUnavailable as e:
    span.record_exception(e)
    span.set_status(StatusCode.ERROR, "pricing service unavailable")
    raise

Exam callout: UNSET is not an error. A span with UNSET status represents a completed operation where no explicit status was set — it is assumed successful. OK is explicit success and is rarely needed. ERROR is for unexpected failures. HTTP 4xx responses are not automatically error spans.

Span Kinds for Manual Spans

Module 1 covered span kinds. When you’re creating spans manually, the choice matters.

If you’re creating a span for an internal function — something that doesn’t cross a network boundary — use INTERNAL. This is the default for manually created spans and the correct choice for calculate_quote, apply_risk_factors, and similar business functions.
If you’re creating a span for an outbound call you’re making directly (not via an auto-instrumented library), use CLIENT.
Do not set SERVER on a manually created span unless it genuinely represents handling an inbound request that the framework instrumentation missed.

from opentelemetry.trace import SpanKind

with tracer.start_as_current_span(
    "calculate_quote",
    kind=SpanKind.INTERNAL
) as span:
    ...

In practice, INTERNAL is the right choice for the vast majority of manually added spans in application code.

Takeaway: Add manual spans at the boundaries of meaningful business operations — where the agent ends and your logic begins. Use attributes for queryable facts about the operation. Use events for things that happened during it. Always call record_exception and set_status(ERROR) together when handling failures.

Span Links vs Span Events

These two concepts are both “additional context attached to a span” and they’re frequently confused. They serve completely different purposes.

Span Events Revisited

A span event is a timestamped annotation attached to a single span. It records something that happened during the span’s execution window. Events belong to one span and one trace. They are not a relationship between spans.

with tracer.start_as_current_span("process_claims_batch") as span:
    span.add_event("batch_started", {"batch.size": len(claims)})

    success_count = 0
    failure_count = 0
    for claim in claims:
        try:
            process_claim(claim)
            success_count += 1
        except ValidationError as e:
            failure_count += 1
            # One event per failed claim; the batch span itself keeps running
            span.add_event("claim_validation_failed", {
                "claim.id": claim.id,
                "error": str(e)
            })

    span.add_event("batch_completed", {
        "batch.processed": success_count,
        "batch.failed": failure_count
    })

Events are linear: they record what happened, in order, within this span’s lifetime.

Span Links: References Across Trace Boundaries

A span link is a reference to another span, potentially in a completely different trace. Links are how you record that a span was causally influenced by something outside its trace context.

The canonical use case is async message queue processing.

Producer (Trace A):                    Consumer (Trace B):

POST /api/claims                       process_claim (Kafka consumer)
  span_id: a1b2                          span_id: c3d4
  trace_id: TRACE-A                      trace_id: TRACE-B
  |                                      |
  |-- publish_to_kafka                   |-- link --> {trace_id: TRACE-A, span_id: b3c4}
       span_id: b3c4
       trace_id: TRACE-A

┌─────────────────────────────────────────────────────────────┐
│  Trace A (POST /api/claims)                                  │
│                                                              │
│  [POST /api/claims]  ──►  [publish_to_kafka]                 │
│   span_id: a1b2              span_id: b3c4                   │
│   trace_id: TRACE-A          trace_id: TRACE-A               │
│                                          │                   │
│                             Kafka Topic  │                   │
└─────────────────────────────────────────┼───────────────────┘
                                           │
                          link reference   │
                                           ▼
┌─────────────────────────────────────────────────────────────┐
│  Trace B (Kafka Consumer)                                    │
│                                                              │
│  [process_claim]                                             │
│   span_id: c3d4                                              │
│   trace_id: TRACE-B                                          │
│   links: [{trace_id: TRACE-A, span_id: b3c4}]               │
│                                                              │
└─────────────────────────────────────────────────────────────┘

The producer and consumer are modeled here as separate traces because the async handoff breaks the synchronous call chain. Making the producer span the consumer’s parent is technically possible, but it is often a poor fit: by the time the consumer runs, the producer’s trace may have completed long ago, and a consumer that handles a batch may depend on many producer spans while a span can have only one parent. There is still a real causal relationship: the consumer ran because the producer put a message in the queue.

The span link preserves that relationship without forcing a single trace tree that spans both sides of the async boundary.

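from opentelemetry import propagate, trace
from opentelemetry.trace import Link

# Consumer side: the producer's context arrives in the message headers.
# Assumes message.headers is a dict of str keys and values; raw Kafka headers
# (a list of (str, bytes) tuples) need decoding first.
extracted = propagate.extract(message.headers)

# extract() returns a Context; Link needs the SpanContext stored inside it
producer_span_context = trace.get_current_span(extracted).get_span_context()

with tracer.start_as_current_span(
    "process_claim",
    links=[Link(producer_span_context)]
) as span:
    # process the claim
    ...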

In practice, auto-instrumented messaging clients usually handle context propagation for you, and depending on the instrumentation they record the relationship as either a link or a parent-child edge. Knowing the mechanism is still necessary for the exam, and for the cases where you’re consuming from a queue that isn’t auto-instrumented.
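To make the link target concrete: assuming the default W3C Trace Context propagator, the producer’s context travels in a traceparent header shaped like version-trace_id-span_id-flags. The sketch below parses a version 00 header using only the standard library, so you can see which two identifiers a link actually stores. The names parse_traceparent and LinkTarget are illustrative and not part of the OTel API; in real code, propagate.extract does this work for you.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace_id (32 hex) - span_id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class LinkTarget(NamedTuple):
    """Hypothetical holder for the fields a span link points at."""
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[LinkTarget]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    if version != "00":
        return None  # this sketch only understands version 00
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero identifiers are invalid
    # Bit 0 of the flags byte is the "sampled" flag
    return LinkTarget(trace_id, span_id, bool(int(flags, 16) & 0x01))


target = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```

The trace_id and span_id recovered here correspond to TRACE-A and b3c4 in the diagram: they identify the producer’s publish span, which the consumer span references without joining its trace.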

The Instrumentation Library Pattern

You’ve been thinking about instrumenting your own application code. But what about shared internal libraries — packages your team maintains that are imported by a dozen different services?

When to Write an Instrumentation Library

The bar is: if a library is used by multiple services and performs operations that are meaningful to observe (database calls, HTTP calls, message publishing, business calculations), it should have an instrumentation package.

InsureWatch’s claims-common library is used by the Python API, the Java claims processor, and the Node.js gateway. It contains shared data access logic, validation utilities, and the canonical risk scoring algorithm. Every service that imports it benefits from instrumentation, and centralizing that instrumentation in one package is far better than copy-pasting tracer.start_span() calls across three codebases.

The Pattern

The instrumentation library wraps the original library and injects spans at the right boundaries:

# claims_common/instrumentation.py

from opentelemetry import trace
from opentelemetry.trace import TracerProvider

_DEFAULT_TRACER_NAME = "claims_common"
_DEFAULT_TRACER_VERSION = "1.0.0"


class ClaimsCommonInstrumentor:
    def __init__(self, tracer_provider: TracerProvider | None = None):
        tp = tracer_provider or trace.get_tracer_provider()
        self._tracer = tp.get_tracer(_DEFAULT_TRACER_NAME, _DEFAULT_TRACER_VERSION)
        self._original_score = None  # holds the unpatched function while instrumented

    def instrument(self):
        """Patch claims_common functions to add spans."""
        import claims_common.risk as risk_module
        if self._original_score is not None:
            return  # already instrumented; avoid wrapping twice
        self._original_score = risk_module.calculate_risk_score
        risk_module.calculate_risk_score = self._instrumented_risk_score

    def uninstrument(self):
        """Remove patches (useful for testing)."""
        import claims_common.risk as risk_module
        if self._original_score is None:
            return  # never instrumented; nothing to restore
        risk_module.calculate_risk_score = self._original_score
        self._original_score = None

    def _instrumented_risk_score(self, applicant_id: str, policy_type: str) -> float:
        with self._tracer.start_as_current_span("claims_common.calculate_risk_score") as span:
            span.set_attribute("applicant.id", applicant_id)
            span.set_attribute("policy.type", policy_type)
            result = self._original_score(applicant_id, policy_type)
            span.set_attribute("risk.score", result)
            return result

Application code then uses it exactly like any OTel instrumentation package:

from claims_common.instrumentation import ClaimsCommonInstrumentor

# In your SDK initialization block, after setting up the provider:
ClaimsCommonInstrumentor().instrument()

Key design decisions in this pattern:

Accept an optional TracerProvider — if not provided, fall back to the global. This makes the instrumentation work in any application without configuration, and testable with an explicit provider.
Provide uninstrument() — allows cleanup in tests. Libraries that can’t be uninstrumented make unit testing painful.
Import the API only — this file has zero dependency on opentelemetry-sdk. The TracerProvider type hint comes from the API. The actual provider implementation is the application’s concern.
Use a stable scope name — claims_common as the tracer scope name means every span this library creates is identifiable by its source in trace UIs.
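The patch-and-restore mechanics behind instrument() and uninstrument() can be exercised without OTel at all. The sketch below is a stripped-down stand-in, not the claims_common code: risk_module is a SimpleNamespace playing the role of claims_common.risk, and a plain list named recorded takes the place of real spans. It also guards against double instrumentation and against uninstrument() being called first, two cases worth covering in tests.

```python
import types
from typing import Callable, List, Optional

# Stand-in for the claims_common.risk module (hypothetical fixed score)
risk_module = types.SimpleNamespace(
    calculate_risk_score=lambda applicant_id, policy_type: 0.42
)


class PatchingInstrumentor:
    """Patch-and-restore skeleton; `recorded` stands in for real spans."""

    def __init__(self) -> None:
        self._original: Optional[Callable[[str, str], float]] = None
        self.recorded: List[str] = []

    def instrument(self) -> None:
        if self._original is not None:
            return  # already patched: do not wrap the wrapper
        self._original = risk_module.calculate_risk_score
        risk_module.calculate_risk_score = self._wrapped

    def uninstrument(self) -> None:
        if self._original is None:
            return  # never patched: nothing to restore
        risk_module.calculate_risk_score = self._original
        self._original = None

    def _wrapped(self, applicant_id: str, policy_type: str) -> float:
        # A real instrumentor would open a span here instead of appending
        self.recorded.append(f"calculate_risk_score:{policy_type}")
        return self._original(applicant_id, policy_type)


original = risk_module.calculate_risk_score
instrumentor = PatchingInstrumentor()
instrumentor.instrument()
instrumentor.instrument()  # second call is a no-op
score = risk_module.calculate_risk_score("A-1", "auto")
instrumentor.uninstrument()
restored = risk_module.calculate_risk_score is original
```

One caveat of module-level patching: any caller that ran from claims_common.risk import calculate_risk_score before instrument() holds a reference to the unpatched function, so call instrument() early in startup.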

Exam callout: The OTCA tests the separation between library instrumentation and application instrumentation. Library instrumentors accept an optional TracerProvider and import only the API. They do not call trace.set_tracer_provider(). The application initializes the provider; the library uses whichever provider the application registered.

Takeaway: For shared internal libraries, write an instrumentation package using this pattern. It centralizes the instrumentation code, keeps the library’s core modules free of OTel calls (the only added dependency is the lightweight API package), and follows the same conventions as the official OTel instrumentation packages.

Adding a Counter Metric Manually

Traces tell you what happened in a specific request. Metrics tell you the aggregate behavior across thousands of requests. Both matter, and the manual instrumentation pattern for metrics mirrors traces closely enough that once you understand one, the other is straightforward.

Creating a Counter

from opentelemetry import metrics

meter = metrics.get_meter("insurewatch.quotes", "1.2.0")

quote_counter = meter.create_counter(
    "insurewatch.quotes.total",
    description="Total number of insurance quotes generated",
    unit="1",
)

quote_errors = meter.create_counter(
    "insurewatch.quotes.errors",
    description="Total number of quote calculation errors",
    unit="1",
)

quote_duration = meter.create_histogram(
    "insurewatch.quotes.duration",
    description="Duration of quote calculation in seconds",
    unit="s",
)

A few notes on instrument naming and design:

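unit="1" marks a dimensionless value. For counts of discrete things, the current OTel semantic conventions recommend a curly-brace annotation such as "{quote}" instead, so treat "1" here as a simplification. Use "s" for seconds, "ms" for milliseconds, "By" for bytes.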
Name with dots as namespaces — insurewatch.quotes.total scopes the metric to the service and subsystem. This avoids collisions when multiple services send metrics to the same backend.
Instruments are created once and reused. Don’t create a new counter inside your request handler — create it at module or class initialization time and reference it from your handler.

Using the Counter in the Handler

import time

def calculate_quote(policy_type: str, risk_factors: dict) -> float:
    start = time.perf_counter()

    try:
        result = _run_pricing_model(policy_type, risk_factors)
    except PricingModelError as e:
        # Failures go to their own instrument, tagged by exception type
        quote_errors.add(1, {
            "policy.type": policy_type,
            "error.type": type(e).__name__
        })
        raise
    else:
        # Count each generated quote exactly once, so sums are not doubled
        quote_counter.add(1, {
            "policy.type": policy_type,
            "status": "success"
        })
        return result
    finally:
        # Record latency for successes and failures alike
        quote_duration.record(
            time.perf_counter() - start,
            {"policy.type": policy_type}
        )

The add() call takes a value and a dictionary of attributes. These attributes become the dimensions of the metric in your backend — you can break down quote counts by policy.type, error counts by error.type, and latency distributions by policy.type.
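A rough mental model of what happens to those attributes, assuming cumulative sums: each distinct attribute set identifies its own time series, and a query sums across the series that match its filter. ToyCounter below is a hypothetical in-memory stand-in written for illustration, not an OTel class.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Mapping, Tuple

# One time series per distinct set of (key, value) attribute pairs
SeriesKey = FrozenSet[Tuple[str, str]]


class ToyCounter:
    """In-memory model of a monotonic counter with attributes."""

    def __init__(self) -> None:
        self.series: Dict[SeriesKey, int] = defaultdict(int)

    def add(self, value: int, attributes: Mapping[str, str]) -> None:
        if value < 0:
            raise ValueError("a counter can only increase")
        self.series[frozenset(attributes.items())] += value

    def total(self, where: Mapping[str, str]) -> int:
        """Sum every series whose attributes include all pairs in `where`."""
        wanted = set(where.items())
        return sum(v for key, v in self.series.items() if wanted <= key)


counter = ToyCounter()
counter.add(1, {"policy.type": "auto", "status": "success"})
counter.add(1, {"policy.type": "auto", "status": "success"})
counter.add(1, {"policy.type": "home", "status": "success"})

auto_quotes = counter.total({"policy.type": "auto"})
all_quotes = counter.total({})
series_count = len(counter.series)
```

Two adds with identical attributes land in the same series, while a new attribute value opens a new one. That second behavior is what the next section warns about.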

The Attribute Cardinality Warning

Every unique combination of attribute values creates a new time series in your metrics backend. policy.type probably has 5-10 values (home, auto, life, etc.), which is safe. If you add applicant.id as a metric attribute, you now have one time series per customer, potentially millions. That’s a cardinality explosion, and it can exhaust memory and degrade or take down backends such as Prometheus.

The rule: metric attributes should have bounded, low-cardinality values. Span attributes can be high-cardinality (trace data is stored differently and isn’t indexed the same way). When in doubt, put high-cardinality values in spans, not metrics.
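The arithmetic behind the warning is a simple product. The sketch below computes the worst-case series count for one instrument; the value counts (8 policy types, 2 statuses, one million applicants) are made-up numbers for illustration, and worst_case_series is a hypothetical helper.

```python
from math import prod
from typing import Mapping


def worst_case_series(distinct_values: Mapping[str, int]) -> int:
    """Upper bound on time series: product of distinct values per attribute."""
    return prod(distinct_values.values())


# Bounded attributes only
bounded = worst_case_series({"policy.type": 8, "status": 2})

# Same instrument after adding an unbounded identifier
exploded = worst_case_series({
    "policy.type": 8,
    "status": 2,
    "applicant.id": 1_000_000,
})

growth_factor = exploded // bounded
```

Real series counts are usually lower than the product because not every combination occurs, but an unbounded identifier dominates the result regardless.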

Exam callout: Instrument creation happens once. The Meter is obtained from the MeterProvider the same way Tracers are obtained from TracerProvider. Metric attributes must be low-cardinality — high-cardinality attributes cause time series explosion in metrics backends. High-cardinality data belongs in trace spans.

Connecting the MeterProvider

The MeterProvider is initialized just like the TracerProvider:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://collector:4317"),
    export_interval_millis=30_000,
)

meter_provider = MeterProvider(
    resource=resource,  # same Resource as TracerProvider
    metric_readers=[metric_reader],
)
metrics.set_meter_provider(meter_provider)

Notice resource=resource uses the same Resource object defined for the TracerProvider. This is intentional and important. When your traces show service.name=insurewatch-api and your metrics show the same, backends can correlate them. Use the same Resource instance for every provider you configure: TracerProvider, MeterProvider, and, if you export logs through OTel, LoggerProvider.

Takeaway: Create metric instruments once at module initialization, not inside request handlers. Use the same Resource for all providers. Keep metric attributes low-cardinality — high-cardinality data belongs in trace attributes.

Putting It Together: Full Initialization for InsureWatch’s Python API

Here’s what a complete, production-ready SDK initialization looks like — bringing together everything in this module.

# telemetry.py — initialize once, import everywhere

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
import atexit
import os


def init_telemetry(app=None, db_engine=None):
    resource = Resource.create({
        "service.name": os.environ.get("SERVICE_NAME", "insurewatch-api"),
        "service.version": os.environ.get("SERVICE_VERSION", "unknown"),
        "deployment.environment": os.environ.get("ENVIRONMENT", "development"),
    })

    collector_endpoint = os.environ.get(
        "OTEL_EXPORTER_OTLP_ENDPOINT",
        "http://collector:4317"
    )

    # Trace provider
    trace_provider = TracerProvider(resource=resource)
    trace_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=collector_endpoint))
    )
    trace.set_tracer_provider(trace_provider)

    # Metric provider
    metric_reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint=collector_endpoint),
        export_interval_millis=30_000,
    )
    meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
    metrics.set_meter_provider(meter_provider)

    # Auto-instrumentation
    if app is not None:
        FlaskInstrumentor().instrument_app(app)
    RequestsInstrumentor().instrument()
    if db_engine is not None:
        SQLAlchemyInstrumentor().instrument(engine=db_engine)

    # Graceful shutdown
    atexit.register(trace_provider.shutdown)
    atexit.register(meter_provider.shutdown)

Called from your application entry point:

# app.py
from flask import Flask
from telemetry import init_telemetry
from database import engine

app = Flask(__name__)
init_telemetry(app=app, db_engine=engine)

This is the complete picture: Resource defined once and shared across both providers, providers registered globally, auto-instrumentation applied to the known libraries, and graceful shutdown registered for process exit. Everything in this module flows into this initialization block.

What’s Next

Lab 2 is the hands-on complement to this module. It walks through InsureWatch’s Python service from scratch: you’ll remove the existing instrumentation, run the service with auto-instrumentation only and observe the coverage gaps, then manually instrument the calculate_quote function, the risk factor pipeline, and add a counter metric for quote throughput by policy type.

By the end of Lab 2, you’ll have spans that cover the full request lifecycle — HTTP layer through business logic through database — and a metric dashboard that shows quote volume, error rate, and latency broken down by policy type. It’s the difference between “I have traces” and “I understand my system.”

Lab 2 is available in the paid tier.