
OpenTelemetry for Busy SREs: The Real-World Advantage


You Don’t Have Time for Bad Telemetry

Let’s be honest. You’re an SRE. You’re on-call. You’re juggling incidents, capacity planning, and a backlog of “observability improvements” that never gets prioritized.

You don’t need another “Getting Started with OpenTelemetry” guide. You need to know what actually matters in production.

This post is for you.


The Three Signals: Traces, Metrics, Logs

OpenTelemetry standardizes three core signals:

| Signal  | What It Tells You                        | When You Need It                        |
|---------|------------------------------------------|-----------------------------------------|
| Traces  | The journey of a request across services | Debugging latency, finding bottlenecks  |
| Metrics | Aggregated measurements over time        | Alerting, capacity planning, SLOs       |
| Logs    | Discrete events with context             | Forensic debugging, audit trails        |

The magic of OTel is that these three signals share context. A trace ID in your logs links directly to a distributed trace, which correlates with the metric spike you’re investigating.

In production, this correlation is everything. Without it, you’re grep-ing through logs at 3 AM hoping to find a needle in a haystack.
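That trace-ID-in-the-logs link can be wired up with nothing but the standard library. The sketch below stamps a hard-coded example ID onto every record; in a real service you would read the ID from the active span (e.g. `trace.get_current_span().get_span_context().trace_id`) instead of passing a constant:

```python
import logging

class TraceContextFilter(logging.Filter):
    """Stamp every log record with a trace ID so log lines can be
    joined to distributed traces. The fixed ID here is illustrative;
    a real filter would read it from the current OTel span."""

    def __init__(self, trace_id: int):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        # W3C trace IDs are 128-bit, rendered as 32 lowercase hex chars
        record.trace_id = format(self.trace_id, "032x")
        return True

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(TraceContextFilter(trace_id=0x4BF92F3577B34DA6A3CE929D0E0E4736))
logger.setLevel(logging.INFO)

logger.info("payment declined")  # the log line now carries the trace ID
```

With that one filter in place, pasting the `trace_id` from any log line into your tracing backend jumps you straight to the request that produced it.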


Context Propagation: The Invisible Glue

The single most important concept in OTel for production SREs is Context Propagation.

When Service A calls Service B, a traceparent header is passed along:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

This header contains four dash-separated fields: the version (`00`, the only version defined today), the 128-bit trace ID, the 64-bit parent span ID, and the trace flags (`01` means sampled).
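A few lines of Python are enough to unpack the header above (a sketch of the W3C Trace Context layout, not the SDK's own parser):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,      # "00" is the only version defined today
        "trace_id": trace_id,    # 128-bit trace ID, 32 hex chars
        "parent_id": parent_id,  # 64-bit span ID of the caller, 16 hex chars
        "sampled": int(flags, 16) & 0x01 == 0x01,  # bit 0 = sampled flag
    }

fields = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
# fields["trace_id"] == "4bf92f3577b34da6a3ce929d0e0e4736"; fields["sampled"] is True
```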

Why This Breaks in Production

In a lab, you control every service. In production:

  1. Legacy services don’t propagate headers. One missing hop = broken trace.
  2. Message queues (Kafka, RabbitMQ) require explicit context injection into message headers.
  3. Load balancers and proxies can strip custom headers if not configured properly.
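Point 2 deserves a sketch. In the real SDK this is `opentelemetry.propagate.inject` on the producer and `extract` on the consumer; the hand-rolled version below only shows the carrier pattern those calls implement, with a plain dict standing in for Kafka or RabbitMQ message headers:

```python
from typing import Optional

def inject_context(traceparent: str, headers: dict) -> None:
    """Producer side: copy the current trace context into the
    message headers, alongside the payload."""
    headers["traceparent"] = traceparent

def extract_context(headers: dict) -> Optional[str]:
    """Consumer side: pull the context back out before starting a
    span, so the consumer's work joins the producer's trace."""
    return headers.get("traceparent")

# Producer: attach context before publishing
msg_headers = {}
inject_context("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01", msg_headers)

# ...message travels through the broker...

# Consumer: restore context; if this returns None, the trace is broken here
parent = extract_context(msg_headers)
```

The asymmetry is the trap: HTTP instrumentation does this for you automatically, but for message queues you (or a broker-specific instrumentation library) must call inject and extract explicitly on both sides.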

The Guru Lesson: Before you instrument a single line of code, map your request flow and identify every boundary where context could be lost. Fix propagation gaps first. Everything else is noise without connected traces.


Instrumentation Strategy: Where to Start

Don’t instrument everything at once. Start with the critical path:

Priority 1: Entry Points

HTTP handlers, API gateways, queue consumers: wherever requests enter your system.

Priority 2: Database Calls

Query spans are where most production latency hides.

Priority 3: External Dependencies

Third-party APIs, payment providers, anything you don't control and can't debug from the inside.

Priority 4: Inter-Service Communication

Internal RPC and message-queue hops, so traces stay connected end to end.

# Example: Instrument a Flask endpoint with OTel
from flask import Flask
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

# Auto-instrument Flask and outbound HTTP
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

tracer = trace.get_tracer(__name__)

@app.route("/checkout")
def checkout():
    with tracer.start_as_current_span("checkout-flow") as span:
        # current_user and cart come from your auth/session layer
        span.set_attribute("user.id", current_user.id)
        span.set_attribute("cart.items", len(cart))
        # Your business logic here
        result = process_payment()
        return result

The Guru Lesson: Add business context to your spans (user.id, cart.items, order.value). Generic traces are useless for debugging. Rich context turns a trace into a story.


Sampling: The Budget vs. Signal Trade-off

In the lab, you sample 100% of traces. In production, that’s financial suicide.

Head-Based Sampling

The keep-or-drop decision is made up front, at the root span, usually by trace-ID ratio. Cheap and simple, but blind: it can't know whether the trace it just dropped was about to contain an error.
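A minimal sketch of what a head-based, trace-ID-ratio sampler does (the SDK's `TraceIdRatioBased` sampler follows the same idea; the exact bit arithmetic below is illustrative, not the SDK's):

```python
def should_sample(trace_id: int, ratio: float) -> bool:
    """Head-based decision: keep a trace iff the low 64 bits of its ID
    fall below ratio * 2**64. Deterministic, so every service that sees
    the same trace ID makes the same keep/drop decision."""
    bound = int(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound

# ratio=1.0 keeps everything, ratio=0.0 keeps nothing
assert should_sample(0x4BF92F3577B34DA6A3CE929D0E0E4736, 1.0)
assert not should_sample(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0.0)
```

Determinism is the key property: because the decision is a pure function of the trace ID, services don't have to coordinate, yet a trace is either fully kept or fully dropped.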

Tail-Based Sampling

The decision is deferred until the full trace has been collected, so you can keep every error and every slow request while sampling the routine ones. The cost is buffering traces in the Collector:

# Collector config: Tail-based sampling
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: latency-policy
        type: latency
        latency: {threshold_ms: 2000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

The Guru Lesson: Start with head-based sampling (simple, cheap). Graduate to tail-based when you have the infrastructure to support it. Never sample 100% in production unless you enjoy surprise bills.


The Collector: Your Safety Net

The OpenTelemetry Collector is the most important piece of your production pipeline. It sits between your apps and your backends.

Mandatory Processors

These are not optional in production:

processors:
  # Prevents OOM kills
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

  # Reduces export overhead
  batch:
    send_batch_size: 8192
    timeout: 200ms

Without memory_limiter, a traffic spike will crash your Collector pod. Without batch, you’ll overwhelm your backend with individual span exports.

The Guru Lesson: The Collector is powerful but dangerous. Treat it like a production service: monitor it, set resource limits, and test it under load before deploying.


Building an SLO from Traces

One of the most powerful production patterns is deriving Service Level Objectives (SLOs) from trace data:

  1. Define the SLI: “Percentage of checkout requests completing in under 500ms.”
  2. Measure with OTel: Use the checkout-flow span duration.
  3. Alert on the SLO: “If the 30-day error budget drops below 10%, page the on-call.”

This closes the loop: your instrumentation directly drives your reliability targets.
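The arithmetic behind step 3 is simple enough to sketch. The 99.9% target, 500 ms threshold, and sample durations below are made up for illustration:

```python
from typing import List

def error_budget_remaining(durations_ms: List[float],
                           threshold_ms: float = 500.0,
                           slo_target: float = 0.999) -> float:
    """Fraction of the error budget still unspent.
    The budget is the allowed bad fraction (1 - SLO target);
    spending it means serving requests slower than the threshold."""
    bad = sum(1 for d in durations_ms if d >= threshold_ms)
    bad_fraction = bad / len(durations_ms)
    budget = 1.0 - slo_target             # e.g. 0.1% of requests may be slow
    return 1.0 - (bad_fraction / budget)  # 1.0 = untouched, < 0 = blown

# 2 slow requests out of 10,000 against a 99.9% SLO: 10 slow allowed, 2 used
durations = [120.0] * 9998 + [900.0, 1500.0]
remaining = error_budget_remaining(durations)
# remaining ≈ 0.8 → about 80% of the budget left; page when it drops below 0.10
```

Feed this with `checkout-flow` span durations from your trace backend and the paging rule from step 3 falls out directly.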


Takeaways for the Busy SRE

  1. Fix propagation first. Broken traces are worse than no traces.
  2. Instrument the critical path. Don’t boil the ocean.
  3. Add business context. user.id > span.kind.
  4. Sample intelligently. Head-based to start, tail-based when ready.
  5. Protect the Collector. memory_limiter and batch are mandatory.
  6. Derive SLOs from traces. Close the loop between observability and reliability.

You don’t need perfect telemetry. You need useful telemetry. OpenTelemetry gives you the tools. Now go build something that helps you sleep through the night.

