The Assumption to Destroy
“Traces, metrics, and logs are three separate things that you correlate after the fact.”
If you’ve been in observability for any amount of time, this is probably how you think about it. You have Prometheus for metrics. You have Loki or Splunk for logs. You have Jaeger or Tempo for traces. Three systems, three query languages, three dashboards. When something breaks, you context-switch between them, manually correlating timestamps and service names until you find the thread that connects them.
It works. Kind of. Until you’re 30 minutes into an incident at 2 AM and you’re copy-pasting trace IDs between browser tabs, hoping the timestamps line up.
Here’s the mental model we’re going to build in this module:
Traces, metrics, and logs are not three separate systems. They are three different views of the same event stream, unified by a shared context schema. OpenTelemetry defines that schema. Whether your pipeline actually preserves it end-to-end — that’s a separate question, and the right question to start asking.
Let’s take these apart, piece by piece.
Traces: Spans as Causally-Linked Units of Work
A trace is a tree of spans. That’s it. A span represents one unit of work — one function call, one HTTP request, one database query. When spans share the same trace_id and link to each other through parent_span_id, they form a trace.
But the exam doesn’t test whether you know that. It tests whether you know what’s inside a span.
The Full Anatomy of a Span
Every field here is fair game on the OTCA. Let’s walk through them.
┌─────────────────────────────────────────────────────────────┐
│ SPAN │
│ │
│ name: "GET /api/policy/quote" │
│ kind: SERVER │
│ trace_id: 7f3b8a2c... (128-bit, shared across trace)│
│ span_id: a1b2c3d4 (64-bit, unique to this span) │
│ parent_span_id: (none) (root span — no parent) │
│ start_time: 2026-03-25T14:30:00.000Z │
│ end_time: 2026-03-25T14:30:00.045Z │
│ status: UNSET │
│ │
│ attributes: │
│ http.method = "GET" │
│ http.route = "/api/policy/quote" │
│ http.status_code = 200 │
│ policy.type = "home" │
│ │
│ events: │
│ [14:30:00.012] "cache miss — fetching from DB" │
│ [14:30:00.038] "quote calculated successfully" │
│ │
│ links: │
│ (none in this example) │
└─────────────────────────────────────────────────────────────┘
Let’s break each field down:
- name — A human-readable operation name. Keep it low-cardinality: GET /api/policy/quote, not GET /api/policy/quote?id=7382&type=home. The route, not the full URL.
- kind — What role this span plays in the system. Five options: SERVER, CLIENT, PRODUCER, CONSUMER, INTERNAL. We'll cover these in detail below.
- trace_id — A 128-bit identifier shared by every span in the distributed request. When a request crosses from service A to service B, both services' spans carry the same trace_id. This is the glue.
- span_id — A 64-bit identifier unique to this specific span. No two spans share a span_id.
- parent_span_id — Points to the parent span's span_id. This is how the tree structure forms. Root spans have no parent.
- start_time / end_time — Wall-clock timestamps. Duration = end_time - start_time. OTel uses nanosecond precision.
- status — Three possible values: UNSET, OK, ERROR. We'll talk about why this is an exam trap in a moment.
- attributes — Key/value pairs that annotate the span. Values can be strings, integers, booleans, or arrays of those. This is where you put the context that makes a span useful: HTTP methods, database systems, custom business attributes like policy.type.
- events — Timestamped annotations within a span's lifetime. Think of them as structured log lines attached to the span. Use them for significant moments: "cache miss", "retry attempt 2", "circuit breaker opened".
- links — Connect this span to spans in a different trace. The classic use case: a batch job processes 50 messages from a queue. The batch span links to each of the 50 producer spans. They're in different traces, but causally related.
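The field list above can be modeled in a few lines of plain Python. This is a toy sketch for intuition, not the OTel SDK's actual Span class (the SDK builds spans through a Tracer, not a constructor), but the field shapes match the spec: a 128-bit trace_id, a 64-bit span_id, and UNSET as the default status.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """Toy model of the span fields above -- not the OTel SDK's Span class."""
    name: str
    kind: str = "INTERNAL"
    trace_id: str = field(default_factory=lambda: secrets.token_hex(16))  # 128-bit
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))    # 64-bit
    parent_span_id: Optional[str] = None
    start_time_ns: int = 0
    end_time_ns: int = 0
    status: str = "UNSET"
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    links: list = field(default_factory=list)

    def duration_ms(self) -> float:
        return (self.end_time_ns - self.start_time_ns) / 1_000_000

# The root span from the diagram: no parent, SERVER kind, 45ms duration.
root = Span(
    name="GET /api/policy/quote",
    kind="SERVER",
    end_time_ns=45_000_000,
    attributes={"http.method": "GET", "http.route": "/api/policy/quote"},
)
```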
A Trace in Practice
Here’s what an actual trace looks like — an insurance quote request hitting our API, running a calculation, then querying the database:
Trace ID: 7f3b8a2c...
[GET /api/policy/quote] <- root span, kind=SERVER
| span_id: a1b2
| start: 0ms end: 45ms
| status: UNSET
| attrs: http.method=GET, http.route=/api/policy/quote
|
|--[quote_engine.calculate] <- child span, kind=INTERNAL
| span_id: c3d4
| parent: a1b2
| start: 5ms end: 30ms
| events: [12ms] "cache miss - fetching from DB"
|
+--[SELECT policies WHERE...] <- child span, kind=CLIENT
span_id: e5f6
parent: a1b2
start: 31ms end: 43ms
attrs: db.system=postgresql, db.statement=SELECT...
Three spans. One trace. The root span is the API handler. It spawned two children: the calculation engine and the database query. The parent_span_id on each child points back to a1b2, forming the tree.
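The tree structure is nothing more than an index on parent_span_id. A minimal sketch of how a trace viewer might reassemble the three spans above (span names and IDs taken from the example; the function name is ours):

```python
from collections import defaultdict

def build_trace_tree(spans):
    """Index spans by parent_span_id; return (root spans, children-map).
    spans: dicts with at least span_id and parent_span_id keys."""
    children = defaultdict(list)
    roots = []
    for s in spans:
        if s["parent_span_id"] is None:
            roots.append(s)  # no parent: this is a root span
        else:
            children[s["parent_span_id"]].append(s)
    return roots, children

trace = [
    {"name": "GET /api/policy/quote", "span_id": "a1b2", "parent_span_id": None},
    {"name": "quote_engine.calculate", "span_id": "c3d4", "parent_span_id": "a1b2"},
    {"name": "SELECT policies WHERE...", "span_id": "e5f6", "parent_span_id": "a1b2"},
]
roots, children = build_trace_tree(trace)
```

Both children point back at a1b2, so the root has exactly two child spans, matching the ASCII diagram.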
Span Kinds — Why They Matter
The five span kinds aren’t just labels. They tell visualization tools how to connect spans across service boundaries.
- SERVER — This span is handling an incoming request. Your API receiving an HTTP call.
- CLIENT — This span is making an outgoing request. Your API calling a database or another service.
- INTERNAL — Work happening within a single process. No network boundary crossed.
- PRODUCER — Sending a message to an async system (Kafka, SQS, RabbitMQ).
- CONSUMER — Receiving a message from an async system.
Here’s why kind matters: when service A makes a CLIENT call to service B, service B creates a SERVER span. Jaeger, Tempo, and other trace viewers use the kind pairing to draw the connection between the two services. If you set the kinds wrong, the visualization breaks — the tools can’t figure out which spans represent the two sides of the same network call.
Exam callout: The OTCA tests whether you can match span kinds to scenarios. A CLIENT span on service A should have a corresponding SERVER span on service B with the same
trace_id. PRODUCER/CONSUMER pairs work the same way for async messaging.
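The kind-pairing logic trace viewers rely on can be sketched in a few lines: a SERVER span whose parent is a CLIENT span in the same trace represents the two sides of one network call. (Function and span names here are illustrative, not from any real viewer.)

```python
def remote_call_pairs(spans):
    """Find CLIENT->SERVER pairs: the SERVER span's parent is the CLIENT
    span on the calling side, and both share one trace_id."""
    by_id = {s["span_id"]: s for s in spans}
    pairs = []
    for s in spans:
        parent = by_id.get(s.get("parent_span_id"))
        if (s["kind"] == "SERVER" and parent
                and parent["kind"] == "CLIENT"
                and parent["trace_id"] == s["trace_id"]):
            pairs.append((parent["name"], s["name"]))
    return pairs

spans = [
    {"name": "GET /quote", "kind": "SERVER", "trace_id": "7f3b",
     "span_id": "a1", "parent_span_id": None},
    # Service A's outgoing call...
    {"name": "call billing", "kind": "CLIENT", "trace_id": "7f3b",
     "span_id": "b2", "parent_span_id": "a1"},
    # ...becomes service B's incoming request, same trace_id.
    {"name": "POST /charge", "kind": "SERVER", "trace_id": "7f3b",
     "span_id": "c3", "parent_span_id": "b2"},
]
```

If service B had mislabeled its span INTERNAL, no pair would be found, which is exactly how the visualization "breaks" in practice.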
Span Status — The Exam Trap
This catches people. OTel span status is not the same as HTTP status code.
Span status has three values:
- UNSET — The default. Nobody explicitly set it. The span is assumed successful.
- OK — Explicitly marked as successful. Rarely needed — UNSET already implies success.
- ERROR — Something went wrong unexpectedly.
Here’s the trap: a 404 Not Found is not necessarily an ERROR span. If your API is supposed to return 404 when a resource doesn’t exist, that’s a valid, expected response. The span should stay UNSET. You only set ERROR when something broke that shouldn’t have — a database connection timeout, an unhandled exception, a downstream service that’s down.
Exam callout: The OTCA specifically tests this distinction. A 404 response is not inherently an error span. Span status reflects whether the operation itself failed unexpectedly, not whether the HTTP status code is in the 4xx range.
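A minimal sketch of that decision rule for a SERVER span, following the spirit of the HTTP semantic conventions (5xx or an unhandled exception means ERROR; an expected 4xx stays UNSET). The function name is ours:

```python
def span_status_for(http_status: int, unhandled_exception: bool) -> str:
    """Only unexpected failures mark the span ERROR. An expected 404
    lookup miss stays UNSET, which already implies success."""
    if unhandled_exception or http_status >= 500:
        return "ERROR"
    return "UNSET"
```

So a designed-in 404 keeps the default status, while a 503 or an uncaught exception flips the span to ERROR.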
Metrics: Counter, Gauge, Histogram
OTel defines three core metric instrument types. Each has a distinct shape, a distinct use case, and a distinct way of breaking if you use it wrong.
Counter — It Only Goes Up
A Counter is a monotonically increasing value. It never decreases (except on process restart, when it resets to zero).
Counter: quote_requests_total
quote_requests_total{method="GET", route="/api/quote"} 1423
quote_requests_total{method="POST", route="/api/quote"} 87
Use a Counter for: request counts, error counts, bytes sent, items processed — anything that accumulates over time.
The rate of change is usually more interesting than the raw value. In PromQL, rate(quote_requests_total[5m]) gives you requests per second. The raw number 1423 is less useful on its own.
Exam callout: Counters reset to zero on process restart. This is expected and normal. Monitoring systems like Prometheus detect these resets and handle them in rate calculations.
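Here is a stdlib-only sketch of reset-tolerant counter math, roughly what Prometheus does inside rate() and increase(): when a sample drops below its predecessor, the counter must have reset, so the post-reset value counts as fresh growth rather than a negative delta.

```python
def increase(samples):
    """Total growth of a counter across samples taken in time order,
    tolerating resets: a drop means the process restarted."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        # Normal growth: cur - prev. After a reset: cur is all new growth.
        total += cur - prev if cur >= prev else cur
    return total

# 1400 -> 1423 (+23), restart resets to 5 (+5), then 5 -> 87 (+82)
window = [1400, 1423, 5, 87]
```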
Gauge — A Snapshot in Time
A Gauge is a point-in-time measurement. It can go up or down.
Gauge: memory_used_bytes
memory_used_bytes{service="claims-api"} 536870912
Use a Gauge for: CPU utilization, memory usage, queue depth, active connections, temperature — anything where the current value matters, not the cumulative total.
You can’t meaningfully take the rate() of a Gauge. A Gauge tells you “what is the value right now.” A Counter tells you “how much has accumulated.”
Histogram — The Shape of Your Data
A Histogram records the distribution of values. It’s the most powerful instrument and the most misunderstood.
Histogram: request_duration_seconds
request_duration_seconds_bucket{le="0.05"} 80
request_duration_seconds_bucket{le="0.1"} 95
request_duration_seconds_bucket{le="0.25"} 110
request_duration_seconds_bucket{le="0.5"} 145
request_duration_seconds_bucket{le="1.0"} 150
request_duration_seconds_bucket{le="+Inf"} 150
request_duration_seconds_count 150
request_duration_seconds_sum 48.5
Each bucket says “how many observations were less than or equal to this boundary.” 80 requests completed in under 50ms. 95 completed in under 100ms. All 150 completed in under infinity (obviously).
Here’s why Histograms matter: p50, p95, and p99 latencies come from Histograms. They do not come from Gauges. They do not come from Counters. If your SLO says “99% of requests complete in under 500ms,” you need a Histogram to prove it.
A Histogram isn’t just a fancier Counter. It records the shape of your data. When your p99 latency is 3 seconds but your p50 is 50ms, a Counter tells you nothing useful. A Histogram shows you that 99% of requests are fast and 1% are suffering. That 1% is your incident.
Instrument Type Summary:
Counter -> Only goes up -> request count, errors, bytes
Gauge -> Goes up and down -> CPU, memory, queue depth
Histogram -> Distribution -> latency, payload size, SLOs
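To make the bucket mechanics concrete, here is a sketch of quantile estimation from cumulative buckets, interpolating linearly inside the bucket where the target rank falls, which is roughly how PromQL's histogram_quantile works. The function name is ours; the bucket data is the request_duration_seconds example above with the +Inf bucket dropped.

```python
def quantile_from_buckets(q, buckets):
    """Estimate quantile q from sorted (upper_bound, cumulative_count)
    buckets by linear interpolation inside the target bucket."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if rank <= count:  # the target rank falls in this bucket
            in_bucket = count - lower_count
            frac = (rank - lower_count) / in_bucket if in_bucket else 0.0
            return lower_bound + frac * (upper_bound - lower_bound)
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

buckets = [(0.05, 80), (0.1, 95), (0.25, 110), (0.5, 145), (1.0, 150)]
p95 = quantile_from_buckets(0.95, buckets)  # rank 142.5, in the 0.25-0.5 bucket
```

The p95 comes out around 0.48s even though more than half the requests finished under 50ms, which is exactly the "shape of your data" that a Counter or Gauge cannot show.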
The Temporality Trap
This is one of the most common OTCA exam traps, and one of the most common production pitfalls.
Metrics have a temporality — the way they report values over time:
- Cumulative temporality: The value represents the total since process start. At minute 1, the counter reads 100. At minute 2, it reads 250. That means 250 total requests since the process started. Prometheus works this way.
- Delta temporality: The value represents only the change since the last report. At minute 1, the counter reads 100 (100 new requests). At minute 2, it reads 150 (150 new requests in that minute). OTLP push-based pipelines can use this.
Why this matters in practice: if you send delta metrics to a backend expecting cumulative, the math breaks catastrophically. The backend sees “150” and thinks that’s the total, not the increment. Counter resets get misreported as negative values. Dashboards show nonsense.
Exam callout: The OTCA exam specifically tests whether you know the difference between cumulative and delta temporality. Prometheus uses cumulative. If you’re exporting delta metrics to a Prometheus-compatible backend, you need an aggregation temporality conversion in the Collector — typically handled by the cumulativetodelta or deltatocumulative processor.
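The conversion itself is simple arithmetic, which is why getting the direction wrong corrupts everything so quietly. A sketch of both directions (function names are ours, not the Collector's):

```python
from itertools import accumulate

def delta_to_cumulative(deltas):
    """Per-interval deltas -> running totals (what a Prometheus-style
    cumulative backend expects to see)."""
    return list(accumulate(deltas))

def cumulative_to_delta(totals):
    """Running totals -> per-interval deltas. A drop signals a counter
    reset, so the post-reset value is emitted as a fresh delta."""
    out, prev = [], 0
    for v in totals:
        out.append(v - prev if v >= prev else v)
        prev = v
    return out
```

Feed the delta stream [100, 150, 50] to a cumulative backend unconverted and it reads the last sample as "50 total requests", a huge phantom drop; converted, it correctly sees [100, 250, 300].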
Logs: OTel Log Records vs Traditional Logs
Let’s be honest: most logging in production today is still logger.info("Processing request for user %s", user_id). An unstructured string dumped to stdout, maybe picked up by a log shipper, maybe ending up in Splunk or Loki.
OTel doesn’t try to replace that. Instead, it defines a structured Log Record that can carry the same context as traces and metrics.
OTel Log Record Fields
┌──────────────────────────────────────────────────────────┐
│ LOG RECORD │
│ │
│ Timestamp: 2026-03-25T14:30:00.012Z │
│ ObservedTimestamp: 2026-03-25T14:30:00.015Z │
│ SeverityText: "WARN" │
│ SeverityNumber: 13 │
│ Body: "Cache miss for policy_id=7382" │
│ TraceId: 7f3b8a2c... │
│ SpanId: c3d4e5f6 │
│ Attributes: │
│ policy_id = 7382 │
│ cache.hit = false │
│ Resource: │
│ service.name = "claims-api" │
│ service.version = "2.1.0" │
└──────────────────────────────────────────────────────────┘
Key fields to know:
- Timestamp — When the event actually happened.
- ObservedTimestamp — When the SDK or Collector first saw the log. These can differ for async log shipping — the log might have been written to a file and picked up later.
- TraceId / SpanId — This is the magic. These fields link the log record to a specific span in a specific trace. When your log says “cache miss,” you can click straight through to the trace that was running when it happened.
- SeverityText / SeverityNumber — “WARN” and 13. OTel defines a severity number scale (1-24) that maps to standard severity names. The text is human-readable; the number is machine-comparable.
- Body — The actual log message.
- Attributes — Key/value annotations, just like span attributes.
- Resource — The process that emitted it. Same Resource definition as traces and metrics.
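The severity scale is worth internalizing: OTel's 1-24 range is six named levels of four numbers each, so 13 is the base of the WARN block. A small sketch of that mapping (table per the OTel log data model; the function name is ours):

```python
SEVERITY_BLOCKS = [  # OTel's 1-24 scale: six names, four numbers each
    (1, "TRACE"), (5, "DEBUG"), (9, "INFO"),
    (13, "WARN"), (17, "ERROR"), (21, "FATAL"),
]

def severity_text(severity_number: int) -> str:
    """Map a SeverityNumber to its base severity name."""
    name = "UNSPECIFIED"  # 0 means unspecified in the data model
    for start, label in SEVERITY_BLOCKS:
        if severity_number >= start:
            name = label
    return name
```

So the record above with SeverityNumber 13 maps to WARN, and 12 would still be INFO: the number, not the text, is what backends compare and filter on.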
The Log Bridge — Working With What You Have
Most applications already have a logging framework. Python uses logging. Java uses SLF4J and Logback. Node.js has Winston or Pino. Nobody is going to rewrite their logging code to use OTel directly.
OTel solves this with the log bridge pattern. Instead of replacing your logging framework, OTel provides appenders and handlers that plug into your existing framework. When your code calls logger.warn("Cache miss for policy_id=%s", policy_id), the OTel log bridge:
- Intercepts the log output from your existing framework
- Reads the active span context (the TraceId and SpanId of whatever span is currently running)
- Wraps it into a structured OTel Log Record with all the fields above
- Sends it through the OTel pipeline
Your code doesn’t change. Your logging calls stay the same. But now every log line carries the trace context, which means your logs are automatically correlated with your traces.
Exam callout: The log bridge pattern is frequently tested on the OTCA. Know that OTel does NOT require you to replace your existing log framework. It bridges from existing frameworks (Python logging, Java SLF4J/Logback, etc.) and injects trace context into the output.
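To demystify the bridge, here is a toy version built only on stdlib logging and a contextvar standing in for the SDK's active span context. It is a sketch of the pattern, not the real OTel LoggingHandler, but it shows the key property: the application's logging call is untouched, yet every record comes out stamped with trace context.

```python
import contextvars
import logging

# Hypothetical stand-in for the SDK's "current span" context.
current_span = contextvars.ContextVar("current_span", default=None)

class ToyBridgeHandler(logging.Handler):
    """Toy log bridge: plugs into stdlib logging and stamps each record
    with the active trace context, as the real OTel handler does."""
    def __init__(self):
        super().__init__()
        self.exported = []  # stand-in for the OTel export pipeline

    def emit(self, record):
        ctx = current_span.get()
        self.exported.append({
            "body": record.getMessage(),
            "severity_text": record.levelname,  # stdlib says WARNING, OTel says WARN
            "trace_id": ctx["trace_id"] if ctx else None,
            "span_id": ctx["span_id"] if ctx else None,
        })

log = logging.getLogger("claims-api")
log.propagate = False
bridge = ToyBridgeHandler()
log.addHandler(bridge)
log.setLevel(logging.INFO)

current_span.set({"trace_id": "7f3b8a2c", "span_id": "a1b2c3d4"})
log.warning("Cache miss for policy_id=%s", 7382)  # app code unchanged
```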
The Context Layer: How Everything Connects
This is the conceptual core of the entire module. If you understand this section, the rest of the OTel specification starts making sense.
The three signals — traces, metrics, and logs — share a common context schema. That schema has three layers:
┌──────────────────────────────────────────────────────────────┐
│ RESOURCE (attached to every signal from this process) │
│ │
│ service.name = "claims-api" │
│ service.version = "2.1.0" │
│ deployment.environment = "production" │
│ host.name = "claims-api-7b9d4f-xk2p1" │
│ │
│ Every trace span, every metric data point, every log record │
│ emitted by this process carries this Resource. │
└──────────────────────────────────────────────────────────────┘
| | |
v v v
+--------------+ +--------------+ +--------------+
| TRACE | | METRIC | | LOG |
| | | | | |
| trace_id: | | (no trace_id | | trace_id: |
| 7f3b... | | -- metrics | | 7f3b... |
| span_id: | | aggregate | | span_id: |
| a1b2... | | across many | | c3d4... |
| attributes: | | requests) | | attributes: |
| http.method | | | | policy_id |
| http.route | | attributes: | | |
+--------------+ | method | +--------------+
| route |
+--------------+
Same Resource = same process
Same TraceID = same request (traces + logs only)
Here’s the key insight that trips people up: a metric data point does not carry a TraceID. Metrics are aggregations — a counter that says “1,423 requests” doesn’t belong to a single request. It belongs to a process. So metrics connect to the process via Resource, not to a specific request via TraceID.
The correlation chain works like this:
- Log to Trace: Direct. The log record carries TraceId and SpanId. Click through from any log line to the exact span that was executing when that log was emitted.
- Metric to Process: Via Resource. The metric carries service.name=claims-api. You know which process emitted it.
- Metric to Logs: Via shared Resource and time. When a metric spikes, query logs from the same service.name at that timestamp.
- Logs to Metric context: Via shared Resource. If you see error logs from claims-api, you can check if the error rate metric from the same service is elevated.
This is why, in Grafana, you can click from a metric spike to the logs from that service at that time, and from those logs to the specific trace. Three different data sources, three different queries, but the same Resource is the bridge that makes it work.
Exam callout: Know the correlation chain. Traces and logs share TraceID for direct correlation. Metrics correlate to traces/logs through shared Resource attributes, not through TraceID. The OTCA tests this distinction.
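The two correlation mechanisms reduce to two lookups, sketched below with hypothetical data shapes (dicts standing in for exported signals). Note that the metric record has no trace_id field at all: it can only be reached through its Resource.

```python
def logs_for_trace(logs, trace_id):
    """Direct correlation: log records carry the trace_id."""
    return [l for l in logs if l.get("trace_id") == trace_id]

def from_service(signals, service_name):
    """Resource correlation: works for metrics too, which have no trace_id."""
    return [s for s in signals if s["resource"]["service.name"] == service_name]

logs = [
    {"body": "Cache miss for policy_id=7382", "trace_id": "7f3b8a2c",
     "resource": {"service.name": "claims-api"}},
    {"body": "Payment settled", "trace_id": "0badcafe",
     "resource": {"service.name": "billing-api"}},
]
metric = {"name": "quote_requests_total",  # no trace_id field at all
          "resource": {"service.name": "claims-api"}}
```

Grafana's click-throughs are, conceptually, these two filters chained: metric spike, filter by Resource to get the service's logs, then follow a log's trace_id to the exact trace.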
Worked Example: One Request, All Three Signals
Let’s tie it all together. A GET /api/quote?policy_type=home request arrives at our insurance quoting API. Here’s what OTel produces — all three signals, from one request.
Step 1: The Trace
The OTel SDK creates a root span when the request arrives:
# Trace Span (root)
trace_id: 7f3b8a2c
span_id: a1b2c3d4
parent_span: (none)
name: "GET /api/quote"
kind: SERVER
start_time: 14:30:00.000
end_time: 14:30:00.045
status: UNSET
resource:
service.name: "claims-api"
service.version: "2.1.0"
attributes:
http.method: "GET"
http.route: "/api/quote"
http.status_code: 200
policy.type: "home"
events:
- [14:30:00.012] "cache miss — fetching from database"
- [14:30:00.038] "quote calculated: $1,240/year"
Step 2: The Metric
During request processing, the Counter instrument increments:
# Metric Data Point
instrument: Counter
name: "quote_requests_total"
value: 1 (delta) or 1424 (cumulative)
resource:
service.name: "claims-api"
service.version: "2.1.0"
attributes:
policy_type: "home"
method: "GET"
And the Histogram records the latency:
# Metric Data Point
instrument: Histogram
name: "request_duration_seconds"
value: 0.045 (this observation)
resource:
service.name: "claims-api"
service.version: "2.1.0"
attributes:
http.route: "/api/quote"
Notice: no trace_id on the metrics. They don’t belong to a single request. They aggregate across many requests. But they share the same resource.
Step 3: The Log
The application’s existing Python logging call:
logger.warning("Cache miss for policy_id=%s", policy_id)
Gets intercepted by the OTel log bridge and becomes:
# Log Record
timestamp: 14:30:00.012
severity_text: "WARN"
severity_num: 13
body: "Cache miss for policy_id=7382"
trace_id: 7f3b8a2c # <-- injected by log bridge
span_id: a1b2c3d4 # <-- injected by log bridge
resource:
service.name: "claims-api"
service.version: "2.1.0"
attributes:
policy_id: 7382
cache.hit: false
The Result: Three Views, One Story
What you see in Grafana at 2 AM when the pager fires:
1. Dashboard: quote_requests_total rate spiked 3x
|
| "Show me logs from claims-api at 14:30"
v
2. Logs: 47 WARN entries: "Cache miss for policy_id=..."
|
| "Show me trace 7f3b8a2c"
v
3. Trace: GET /api/quote took 450ms (p99 normally 45ms)
-> quote_engine.calculate: 380ms (cache miss)
-> SELECT policies: 60ms (normal)
Root cause: cache invalidation event at 14:28 caused
a thundering herd of cache misses.
One request. Three signals. Same trace_id on the span and the log. Same service.name on all three. The shared context schema is what lets you navigate between them without copy-pasting IDs between browser tabs.
That’s not three separate observability systems. That’s one data model, three views.
What’s Next
We’ve covered the shape of each signal — what fields they carry, how they relate to each other, and why the shared context schema matters. But we haven’t answered a critical question: how does the trace_id actually get from service A to service B?
When service A makes an HTTP call to service B, something has to carry the trace_id across that network boundary. That “something” is context propagation — the W3C Trace Context headers, the baggage spec, and the Propagator API that makes it all work.
That’s Module 2: Context Propagation — how signals cross process boundaries without losing their identity.