You Don’t Have Time for Bad Telemetry
Let’s be honest. You’re an SRE. You’re on-call. You’re juggling incidents, capacity planning, and a backlog of “observability improvements” that never gets prioritized.
You don’t need another “Getting Started with OpenTelemetry” guide. You need to know what actually matters in production.
This post is for you.
The Three Signals: Traces, Metrics, Logs
OpenTelemetry standardizes three core signals:
| Signal | What It Tells You | When You Need It |
|---|---|---|
| Traces | The journey of a request across services | Debugging latency, finding bottlenecks |
| Metrics | Aggregated measurements over time | Alerting, capacity planning, SLOs |
| Logs | Discrete events with context | Forensic debugging, audit trails |
The magic of OTel is that these three signals share context. A trace ID in your logs links directly to a distributed trace, which correlates with the metric spike you’re investigating.
In production, this correlation is everything. Without it, you’re grepping through logs at 3 AM hoping to find a needle in a haystack.
Context Propagation: The Invisible Glue
The single most important concept in OTel for production SREs is Context Propagation.
When Service A calls Service B, a traceparent header is passed along:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
This header contains:
- Trace ID (4bf92f...): Unique identifier for the entire request journey.
- Span ID (00f067...): Identifier for this specific hop.
- Trace Flags (01): Whether this trace is sampled.
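Pulling those fields apart takes only a few lines of plain Python. This is a sketch for illustration; in real services the SDK’s propagator parses and validates the header for you:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields (sketch only)."""
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,                      # "00" is the only version so far
        "trace_id": trace_id,                    # 32 hex chars: the whole journey
        "span_id": span_id,                      # 16 hex chars: this specific hop
        "sampled": bool(int(flags, 16) & 0x01),  # bit 0 of trace-flags
    }

fields = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
print(fields["trace_id"])  # prints 4bf92f3577b34da6a3ce929d0e0e4736
```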
Why This Breaks in Production
In a lab, you control every service. In production:
- Legacy services don’t propagate headers. One missing hop = broken trace.
- Message queues (Kafka, RabbitMQ) require explicit context injection into message headers.
- Load balancers and proxies can strip custom headers if not configured properly.
The Guru Lesson: Before you instrument a single line of code, map your request flow and identify every boundary where context could be lost. Fix propagation gaps first. Everything else is noise without connected traces.
Instrumentation Strategy: Where to Start
Don’t instrument everything at once. Start with the critical path:
Priority 1: Entry Points
- API gateways, load balancers, ingress controllers.
- These are your “root spans.” Every trace starts here.
Priority 2: Database Calls
- SQL queries, Redis lookups, Elasticsearch requests.
- These are the most common source of latency issues.
Priority 3: External Dependencies
- Third-party APIs, payment gateways, email services.
- You can’t control their performance, but you can measure it.
Priority 4: Inter-Service Communication
- gRPC calls, HTTP requests between microservices.
- This is where you find cascading failures and retry storms.
```python
# Example: Instrument a Flask endpoint with OTel
from flask import Flask
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Auto-instrument Flask and outbound HTTP (before creating the app)
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

app = Flask(__name__)
tracer = trace.get_tracer(__name__)

@app.route("/checkout")
def checkout():
    with tracer.start_as_current_span("checkout-flow") as span:
        span.set_attribute("user.id", current_user.id)
        span.set_attribute("cart.items", len(cart))
        # Your business logic here
        result = process_payment()
        return result
```
The Guru Lesson: Add business context to your spans (user.id, cart.items, order.value). Generic traces are useless for debugging. Rich context turns a trace into a story.
Sampling: The Budget vs. Signal Trade-off
In the lab, you sample 100% of traces. In production, that’s financial suicide.
Head-Based Sampling
- Decision made at the start of a trace.
- Simple: “Sample 10% of all requests.”
- Problem: You might miss the one request that caused the outage.
Tail-Based Sampling
- Decision made after the trace completes.
- Smart: “Keep all traces with errors or latency > 2s. Sample 5% of the rest.”
- Problem: Requires buffering complete traces before deciding. More memory, more complexity.
```yaml
# Collector config: Tail-based sampling
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: latency-policy
        type: latency
        latency: {threshold_ms: 2000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```
The Guru Lesson: Start with head-based sampling (simple, cheap). Graduate to tail-based when you have the infrastructure to support it. Never sample 100% in production unless you enjoy surprise bills.
The Collector: Your Safety Net
The OpenTelemetry Collector is the most important piece of your production pipeline. It sits between your apps and your backends.
Mandatory Processors
These are not optional in production:
```yaml
processors:
  # Prevents OOM kills
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  # Reduces export overhead
  batch:
    send_batch_size: 8192
    timeout: 200ms
```
Without memory_limiter, a traffic spike will crash your Collector pod. Without batch, you’ll overwhelm your backend with individual span exports.
The Guru Lesson: The Collector is powerful but dangerous. Treat it like a production service: monitor it, set resource limits, and test it under load before deploying.
Building an SLO from Traces
One of the most powerful production patterns is deriving Service Level Objectives (SLOs) from trace data:
- Define the SLI: “Percentage of checkout requests completing in under 500ms.”
- Measure with OTel: Use the checkout-flow span duration.
- Alert on the SLO: “If the 30-day error budget drops below 10%, page the on-call.”
This closes the loop: your instrumentation directly drives your reliability targets.
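The arithmetic behind that alert fits in a few lines. A sketch with made-up numbers (the 99.9% target and request counts are illustrative, not from the text):

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget left over the SLO window.

    slo_target: e.g. 0.999 for "99.9% of requests under 500ms".
    good/total: counts measured from span durations over the window.
    """
    allowed_bad = (1 - slo_target) * total  # budget, in request units
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad     # 1.0 = untouched, < 0 = blown

# Hypothetical 30-day window: 10M checkout requests, 9,992,000 fast enough.
remaining = error_budget_remaining(0.999, 9_992_000, 10_000_000)
print(f"{remaining:.0%} of the error budget left")  # prints "20% of the error budget left"
```

When that number trends toward zero faster than the window resets, you page someone; when it stays near 1.0, you have room to ship risky changes.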
Takeaways for the Busy SRE
- Fix propagation first. Broken traces are worse than no traces.
- Instrument the critical path. Don’t boil the ocean.
- Add business context. user.id beats span.kind.
- Sample intelligently. Head-based to start, tail-based when ready.
- Protect the Collector. memory_limiter and batch are mandatory.
- Derive SLOs from traces. Close the loop between observability and reliability.
You don’t need perfect telemetry. You need useful telemetry. OpenTelemetry gives you the tools. Now go build something that helps you sleep through the night.