Observability 101: Signals, Layers, and Why Context Wins

If you’re new to observability, every tool, blog post, and conference talk seems to assume you already know what a trace is, why you’d want one, and how it differs from a log. This module fills that gap.

By the end you’ll understand the three signals, which layer of your system each one comes from, how data gets from your running application to a dashboard, and why correlation across signals is where the real value lives.

The Core Problem

Running software fails in ways you didn’t predict. When it does, you need to answer questions like:

Which service caused this latency spike?
Which users were affected by that error?
Was the problem in my code, the database, or the network?
Did this work yesterday? What changed?

The ability to answer arbitrary questions about your production system — including questions you didn’t think to ask when you wrote the code — is what observability means.

Traditional monitoring answered a narrower question: is it up? You defined checks in advance, and the system told you when something you’d anticipated went wrong. Observability is about unknown unknowns — the failure modes you haven’t seen yet, the performance characteristics that only emerge at scale, the interactions between services you didn’t model.

The raw material of observability is telemetry: data that a running system emits about itself.

The Three Signals

Observability telemetry is conventionally grouped into three categories, called signals. The name is descriptive rather than technical: each one signals something different about your system.

graph LR
    subgraph "What happened: end-to-end"
        T["🔍 Traces"]
    end
    subgraph "How many / how fast / how often"
        M["📊 Metrics"]
    end
    subgraph "What was happening at this moment"
        L["📋 Logs"]
    end

Traces — What Happened

A trace is the record of a single operation from start to finish, across every service it touched.

When a user submits an insurance claim in InsureWatch, the request flows through the API gateway, into the claims service, out to the policy service, into MongoDB, and finally to the notification service. A trace captures all of that as a tree of spans — each span representing one unit of work, with a start time, end time, and metadata.

gantt
    title A claim submission trace (simplified)
    dateFormat x
    axisFormat %Lms
    section api-gateway
    POST /api/claims : 0, 280
    section claims-service
    submit_claim : 10, 260
    GET /policy/coverage : 20, 80
    pymongo.insert_one : 150, 190
    POST /notify : 200, 240
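To make the span tree concrete, here is a minimal Python sketch that models the trace above as plain data and computes each span's self time (its duration minus the time spent in its children). The `Span` class is a hypothetical illustration, not the OpenTelemetry data model. The chart only gives start and end times, so the parent/child links are an assumption: the three downstream calls are treated as children of `submit_claim`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical, simplified span: one unit of work with a start and end."""
    name: str
    service: str
    start_ms: int
    end_ms: int
    parent: Optional["Span"] = field(default=None, repr=False, compare=False)
    children: List["Span"] = field(default_factory=list, repr=False, compare=False)

    def __post_init__(self):
        # Register with the parent so the spans form a tree.
        if self.parent is not None:
            self.parent.children.append(self)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

    @property
    def self_time_ms(self) -> int:
        # Assumes children run one after another (no overlap), as in the chart.
        return self.duration_ms - sum(c.duration_ms for c in self.children)


# Timings copied from the chart; the parent links are assumed.
root = Span("POST /api/claims", "api-gateway", 0, 280)
submit = Span("submit_claim", "claims-service", 10, 260, parent=root)
Span("GET /policy/coverage", "claims-service", 20, 80, parent=submit)
Span("pymongo.insert_one", "claims-service", 150, 190, parent=submit)
Span("POST /notify", "claims-service", 200, 240, parent=submit)


def walk(span, depth=0):
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


for depth, span in walk(root):
    print(f"{'  ' * depth}{span.name}: {span.duration_ms} ms total, "
          f"{span.self_time_ms} ms self")
```

Under these assumptions `submit_claim` accounts for 110 ms of its own time, more than any single downstream call. Spotting that is the kind of question a trace view answers at a glance.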

Every span in a trace shares the same trace ID, a 128-bit random identifier generated when the root span starts. When the API gateway calls the claims service, it passes this ID, together with the ID of its own current span, in an HTTP header (traceparent). The claims service reads the header, keeps the trace ID, and records the gateway’s span as the parent of the first span it creates. This is context propagation: the mechanism that connects spans from different processes into a single coherent trace.
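The header itself is small. In the W3C Trace Context format, traceparent is four hex fields joined by dashes: a version, the 128-bit trace ID, the 64-bit ID of the calling span, and a flags byte. The sketch below builds and reads that header using only the standard library. The function names are hypothetical, and it checks only the header's shape; in a real service the OpenTelemetry SDK's propagators do this for you.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_span_id() -> str:
    """Random 64-bit span ID as 16 hex characters."""
    return secrets.token_hex(8)


def start_trace() -> str:
    """Header a root service would send: fresh trace ID, sampled flag set."""
    trace_id = secrets.token_hex(16)  # 128 bits
    return f"00-{trace_id}-{new_span_id()}-01"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its fields, rejecting malformed input."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace or span ID is invalid")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def continue_trace(incoming_header: str) -> dict:
    """What a downstream service does: keep the trace ID, adopt the caller's
    span as parent, and mint a new span ID for its own work."""
    ctx = parse_traceparent(incoming_header)
    return {"trace_id": ctx["trace_id"],
            "parent_span_id": ctx["parent_id"],
            "span_id": new_span_id()}


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(continue_trace(incoming))
```

If the header is missing or malformed, the downstream service has nothing to continue and would start a new trace, which is how broken traces appear.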

Traces answer: What exactly happened during this request? Which service was slow? Where did this error occur? What was the sequence of calls?

When to reach for traces: Latency investigations, error diagnosis, understanding the full lifecycle of a user action, service dependency mapping.

Metrics — How Many, How Fast, How Often

A metric is a numerical measurement over time. Unlike a trace (which is tied to a specific request), a metric is an aggregate: it summarizes many requests into a single number.

The primary metric types in OpenTelemetry:

Type	What it measures	Example
Counter	A value that only goes up	`claims.submitted.total`
Gauge	A value that can go up or down	`claims.active` (currently processing)
Histogram	Distribution of values across buckets (percentiles are estimated from the bucket counts)	`claims.processing.duration`
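The difference between the three types is easiest to see in what each one stores. The following is a toy sketch in plain Python, not the OpenTelemetry metrics API: the class names mirror the table, and the bucket boundaries are made up for illustration.

```python
import bisect


class Counter:
    """A value that only goes up."""

    def __init__(self, name):
        self.name = name
        self.value = 0

    def add(self, amount=1):
        if amount < 0:
            raise ValueError("a counter can only go up")
        self.value += amount


class Gauge:
    """A value that can go up or down; only the latest reading is kept."""

    def __init__(self, name):
        self.name = name
        self.value = None

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket instead of storing every value."""

    def __init__(self, name, bounds):
        self.name = name
        self.bounds = sorted(bounds)
        # One bucket per upper bound, plus an overflow bucket at the end.
        self.bucket_counts = [0] * (len(self.bounds) + 1)
        self.count = 0
        self.sum = 0.0

    def record(self, value):
        # bisect_left puts a value equal to a bound into that bound's bucket.
        self.bucket_counts[bisect.bisect_left(self.bounds, value)] += 1
        self.count += 1
        self.sum += value


submitted = Counter("claims.submitted.total")
active = Gauge("claims.active")
duration = Histogram("claims.processing.duration", bounds=[10, 50, 100])

for processing_ms in (47, 5, 120, 50):
    submitted.add()
    duration.record(processing_ms)
active.set(3)

print(submitted.value, active.value, duration.bucket_counts)
```

However many claims arrive, the histogram still holds only four bucket counts, a total, and a sum, which is why metrics stay small as traffic grows.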

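Metrics are cheap. Each series is a stream of numbers whose size does not grow with request volume, so you can retain long histories at fine resolution (15-second samples, for example) at a cost that would be impractical for traces. They’re the right signal for alerting, dashboards, and SLO tracking, not because they’re more powerful than traces, but because they’re fast and cheap to query at scale.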

Metrics answer: Is the error rate elevated? Is latency trending up over the last hour? How many claims were processed today? Did throughput drop after the 3pm deployment?

When to reach for metrics: Alerting, dashboards, SLO measurement, capacity planning, long-term trends.

Logs — What Was Happening at This Moment

A log is a timestamped record of something that happened — an event, an error, a state change — with arbitrary detail attached. Unlike traces (which model causality) or metrics (which aggregate numbers), logs are the place where you record context: what values were in scope, what decision was made, what the system was thinking.

2026-03-26 14:23:11 INFO [claims-service] [traceId=abc123 spanId=def456]
  Claim CLAIM-789 created with status auto_approved for customer CUST001
  policy_number=POL-001 amount=500.0 processing_time_ms=47

The critical evolution in modern logging is structured logs: instead of a free-text message, log entries are JSON (or another structured format) with typed fields. This makes them queryable — you can filter Loki for claim.status = "rejected" instead of grepping for strings.

The even more critical evolution is trace-correlated logs: when your logging library injects the current traceId and spanId into every log line, you can jump from a specific span in Tempo directly to the logs generated during that span — in one click, without timestamp arithmetic.
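Both ideas fit in a few lines with Python's standard logging module. This sketch assumes a hypothetical `current_trace` context variable that holds the active trace and span IDs, and a hypothetical `fields` key for the typed values; with OpenTelemetry installed, the SDK's logging integration would supply the IDs instead.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace context (an assumption for this
# sketch; a real setup would read it from the tracing SDK).
current_trace = contextvars.ContextVar("current_trace", default=None)


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record):
        ctx = current_trace.get() or {}
        record.trace_id = ctx.get("trace_id", "")
        record.span_id = ctx.get("span_id", "")
        return True


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object with typed fields."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        }
        # Structured fields passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(stream):
    logger = logging.getLogger("claims-service")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
current_trace.set({"trace_id": "abc123", "span_id": "def456"})
log.info(
    "Claim created",
    extra={"fields": {"claim.id": "CLAIM-789",
                      "claim.status": "auto_approved",
                      "amount": 500.0}},
)
print(stream.getvalue())
```

Because the IDs are attached by a filter rather than written into each message, application code logs as usual and every line still carries the correlation fields.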

Logs answer: What was the application doing at this exact moment? What were the values of these variables? What did this specific user request look like in detail?

When to reach for logs: Debugging specific incidents, understanding error details, audit trails, forensic analysis of individual events.

How Telemetry Gets From Your Code to a Dashboard

The path from “code running in production” to “span visible in Grafana” involves three components: the SDK, the Collector, and the backend.

The SDK runs inside your application process. It creates spans, records metrics, and captures log events. In OpenTelemetry, the SDK is language-specific — there’s a Python SDK, a Node.js SDK, a Java SDK — but they all produce the same wire format: OTLP (OpenTelemetry Protocol).

Auto-instrumentation patches popular libraries automatically. If you install opentelemetry-instrumentation-fastapi, the SDK wraps every FastAPI request handler with a span — no code changes needed. Manual instrumentation is what you add on top: the business-context spans that capture information the framework can’t know.

The Collector is an optional but important piece of middleware. It receives OTLP from your services and routes it to your backends. With it you can sample high-volume traces, add resource attributes to every span, or route traces to Datadog and metrics to Prometheus, all without touching service code. It’s the right place for cross-cutting telemetry policy.
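Conceptually, the Collector is a chain of processors between receiving and exporting. The real Collector is configured in YAML, not written in Python, so treat the following purely as a mental model. Every name in it is hypothetical, and the sampling rule (keep a trace when its ID modulo 100 falls below a percentage) is just one simple deterministic scheme chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Telemetry:
    """One item flowing through the pipeline."""
    signal: str  # "traces", "metrics", or "logs"
    trace_id: str = ""
    attributes: dict = field(default_factory=dict)


def add_resource_attributes(extra):
    """Processor: stamp every item with shared attributes."""
    def process(item):
        item.attributes = {**extra, **item.attributes}
        return item
    return process


def sample_traces(keep_percent):
    """Processor: keep a deterministic share of traces, pass other signals."""
    def process(item):
        if item.signal != "traces":
            return item
        bucket = int(item.trace_id, 16) % 100
        return item if bucket < keep_percent else None
    return process


def run_pipeline(items, processors, exporters):
    """Run each item through the processors, then route it by signal type."""
    for item in items:
        for process in processors:
            item = process(item)
            if item is None:  # dropped by a processor
                break
        else:
            exporters[item.signal].append(item)


received = [
    Telemetry("traces", trace_id="0a"),  # 10 % 100 = 10, kept at 50%
    Telemetry("traces", trace_id="63"),  # 99 % 100 = 99, dropped at 50%
    Telemetry("metrics"),
]
# Lists stand in for backends such as a trace store and a metrics store.
backends = {"traces": [], "metrics": [], "logs": []}

run_pipeline(
    received,
    processors=[add_resource_attributes({"deployment.environment": "lab"}),
                sample_traces(keep_percent=50)],
    exporters=backends,
)
print({signal: len(items) for signal, items in backends.items()})
```

None of the services that produced the telemetry had to change for the sampling rate or the extra attribute to take effect, which is the point of putting this policy in one place.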

The backend stores and queries the data. Grafana Tempo stores traces. Prometheus stores metrics. Loki stores logs. Grafana provides the UI that queries all three.

The Observability Stack Has Layers

Not all telemetry comes from your application code. A full observability picture comes from three layers of your system, each with different ownership:

graph TB
    subgraph AppLayer ["Application Layer: your code"]
        A1[Distributed traces]
        A2[Business metrics]
        A3[Application logs]
    end
    subgraph PlatLayer ["Platform Layer: Kubernetes, service mesh"]
        P1[Container CPU / memory / restarts]
        P2[Pod scheduling events]
        P3[Network latency between services]
    end
    subgraph InfraLayer ["Infrastructure Layer: hosts, cloud"]
        I1[Host CPU, disk, network]
        I2[Cloud provider metrics]
        I3[Database performance]
    end
    AppLayer -->|OTel SDK| Collector[OTel Collector]
    PlatLayer -->|node-exporter / kube-state-metrics| Collector
    InfraLayer -->|cloud APIs / node-exporter| Collector

Infrastructure layer: physical hosts, VMs, hypervisors, cloud services. Tells you whether the hardware or cloud is healthy. High CPU steal? Disk saturation? RDS read latency elevated? This data comes from infrastructure agents (Prometheus Node Exporter, Amazon CloudWatch), not from your application.

Platform layer: Kubernetes, service meshes, load balancers. Tells you whether the orchestration layer is healthy. Are pods being OOM-killed? Is Istio injecting retries that are masking errors? This data comes from Kubernetes metrics, mesh telemetry, and ingress logs.

Application layer: your code. Only you know what a “claim submission” means. Only you know that auto_approved means the amount was under $1,000. This context can only come from instrumentation you write.

A latency spike might be caused by:

Slow MongoDB query (application layer — you see it in traces)
Pod being throttled (platform layer — you see it in container metrics)
Network congestion in the datacenter (infra layer — you see it in host metrics)

Connecting signals across layers — the same trace ID appearing in both your application log and the service mesh access log, for instance — is how you identify which layer caused the problem.

Why the Signals Are Worth More Together

Each signal answers a different question. The real power comes when you correlate them.

graph TD
    Alert["📟 Alert: p99 latency > 2s for claims-service"] --> M
    M["📊 Metrics: latency elevated, error rate stable<br/>Duration: last 35 minutes"] --> T
    T["🔍 Traces: find the slow traces<br/>Bottleneck: pymongo.insert_one taking 1.8s"] --> L
    L["📋 Logs: in that trace context<br/>MongoDB connection pool exhausted, 12 retries"]
    L --> RCA["Root cause: MongoDB connection pool too small<br/>Fix: increase pool size from 10 → 50"]

The metric tells you something is wrong. The trace tells you where it’s slow. The log (correlated via trace ID) tells you why. All three are necessary. Any one alone would leave you guessing.

This is the observability loop — alert on metrics, drill into traces, confirm with logs. It’s the workflow that the rest of this course is built around.
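The loop can be written down as three small steps over in-memory data. This is a sketch under stated assumptions: the sample latencies, span records, and log entries are invented, the field names are hypothetical, and a real investigation would run these as queries against your metrics, trace, and log backends.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def investigate(latencies_ms, spans, logs, threshold_ms=2000):
    # Step 1 (metrics): is anything wrong at all?
    if percentile(latencies_ms, 99) <= threshold_ms:
        return None
    # Step 2 (traces): which unit of work is slow? For simplicity the
    # sample data holds only leaf spans, so the slowest one is the bottleneck.
    bottleneck = max(spans, key=lambda s: s["duration_ms"])
    # Step 3 (logs): what was logged inside that same trace?
    related = [entry["message"] for entry in logs
               if entry["trace_id"] == bottleneck["trace_id"]]
    return {"bottleneck": bottleneck["name"],
            "trace_id": bottleneck["trace_id"],
            "logs": related}


# Invented sample data for illustration.
latencies_ms = [120, 140, 135, 2300, 150]
spans = [
    {"trace_id": "t1", "name": "GET /policy/coverage", "duration_ms": 60},
    {"trace_id": "t2", "name": "pymongo.insert_one", "duration_ms": 1800},
    {"trace_id": "t2", "name": "POST /notify", "duration_ms": 40},
]
logs = [
    {"trace_id": "t1", "message": "Coverage lookup ok"},
    {"trace_id": "t2",
     "message": "MongoDB connection pool exhausted, 12 retries"},
]

print(investigate(latencies_ms, spans, logs))
```

The join in step 3 only works because the log entries carry a trace ID; without it you would be matching on timestamps and hoping.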

What OpenTelemetry Adds

You could build the above with any combination of tools. OpenTelemetry (OTel) standardizes the pieces that connect them:

A single SDK per language: instrument once, export anywhere
OTLP: a standard wire protocol, so any SDK can send to any backend or Collector that accepts OTLP
Semantic conventions: agreed names for common attributes (http.request.method, db.system, service.name), so your traces from Python and Java use the same attribute names
Context propagation: a standard mechanism for passing trace context across service boundaries via HTTP headers (traceparent)
The Collector: a standard pipeline component between your services and your backends

The result: your instrumentation doesn’t depend on your observability vendor. You can switch backends, add a second backend, or change your Collector pipeline without modifying a line of application code.

What’s Next

This module gave you the conceptual foundation. The rest of the course builds on it:

Module 1: Signals — the data model in detail: spans, attributes, events, metrics instruments, log records
Module 2: Context Propagation — how trace context crosses service boundaries via headers, and what happens when it doesn’t
Module 3: Instrumentation — SDK initialization, auto-instrumentation vs manual spans, the API vs SDK boundary
Module 4: Semantic Conventions — the standard attribute names and why they matter
Module 5: The Collector — receivers, processors, exporters, and pipeline design

The labs apply all of this to InsureWatch — a real polyglot microservices application running locally in Docker. By Lab 4, you’ll be diagnosing broken traces, restoring instrumentation, and building Collector pipelines from scratch.

Start with the InsureWatch application guide to understand the lab environment, then continue to Module 1.