The Assumption to Destroy
“OpenTelemetry is just another observability library.”
If that’s your mental model, everything downstream is wrong — your architecture decisions, your instrumentation strategy, your vendor negotiations. OTel is not a library you pip install and forget about. It’s a standard. And the difference matters more than you think.
Let’s destroy this assumption from first principles.
What Is Provably True
OpenTelemetry is a specification first. Not a library, not an SDK, not a collector binary. A specification — published, versioned, language-agnostic — that defines:
- An API — a stable contract for producing telemetry. No-op by default.
- An SDK — a swappable implementation of that API. Configurable per-process.
- A Collector — an infrastructure component that receives, processes, and exports telemetry.
- OTLP — a wire protocol (gRPC + HTTP/Protobuf) that carries telemetry between components.
These are four different things. They version independently. They deploy independently. You can replace any one of them without touching the others. That’s not how libraries work — that’s how standards work.
Key insight: OTel is to observability what HTTP is to the web. You don’t think of HTTP as a “library.” You think of it as the protocol everything speaks. OTel is building the same thing for telemetry.
The Pre-OTel World: Instrumentation as a Tax
Before we appreciate what OTel gives us, we need to feel the pain of what came before.
Every vendor wanted a piece of your app
Before 2016, if you used Datadog, you instrumented with the Datadog SDK. If you used New Relic, you instrumented with the New Relic SDK. Dynatrace had its own agent. Splunk had its own forwarder. Every vendor had its own wire format, its own data model, its own set of opinions about what a “span” or a “metric” should look like.
Switching vendors didn’t mean changing a configuration file. It meant a code change across every service. Re-instrument. Re-test. Re-deploy. For a large organization with hundreds of services, that’s not a migration — it’s a multi-quarter engineering project.
The hidden cost: instrumentation churn
It got worse at the library level. Say you maintained an internal HTTP framework used by 40 services. You wanted traces from that framework. So you wrote a Datadog integration. Then the platform team adopted Jaeger for some services. Now you needed a Jaeger integration too. Then the data team wanted Prometheus metrics. Another integration.
Every library you wanted to instrument required a different vendor plugin. Every vendor plugin had different APIs, different conventions, different bugs. The cost wasn’t just “write the integration once” — it was maintain N integrations forever.
The N × M problem
Zoom out to the ecosystem level and the math is brutal:
```
N vendors × M frameworks = N × M integrations to maintain

Vendors:    Datadog, New Relic, Dynatrace, Jaeger, Zipkin, Prometheus, Lightstep, ...
Frameworks: Express, Flask, Spring Boot, gRPC, Django, Gin, net/http, ...

Even with just 7 vendors and 7 frameworks: 49 integrations.
Each one maintained by someone. Each one slightly different.
```
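The arithmetic is easy to verify. With a shared standard, each framework instruments once against the API and each vendor ingests the wire format once, so the cost drops from N × M to N + M. A toy calculation using the example vendor and framework lists from above:

```python
# Integration cost with and without a shared standard.
vendors = ["Datadog", "New Relic", "Dynatrace", "Jaeger",
           "Zipkin", "Prometheus", "Lightstep"]
frameworks = ["Express", "Flask", "Spring Boot", "gRPC",
              "Django", "Gin", "net/http"]

# Without a standard: every vendor needs a plugin for every framework.
without_standard = len(vendors) * len(frameworks)

# With a standard: each framework instruments against the API once,
# and each vendor ingests the wire format once.
with_standard = len(vendors) + len(frameworks)

print(without_standard)  # 49
print(with_standard)     # 14
```

And the gap widens as either side grows: adding an eighth vendor costs seven new integrations in the old world, one in the new.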
This is what the pre-OTel architecture looked like:
```
┌─────────────┐
│  Your App   │
├─────────────┤
│ Datadog SDK ├──────────► Datadog
│ NR Agent    ├──────────► New Relic
│ Jaeger SDK  ├──────────► Jaeger
│ Prom Client ├──────────► Prometheus
│ Zipkin SDK  ├──────────► Zipkin
└─────────────┘
```
Every vendor SDK lives inside your process.
Switching vendors = rewriting instrumentation code.
Running multiple vendors = multiple SDKs, multiple overhead costs.
It was a mess. And the industry knew it.
The First Attempts: OpenTracing and OpenCensus
Two projects tried to fix this. Both got part of the answer right. Neither got enough.
OpenTracing (2016)
OpenTracing, a CNCF project, had the right core idea: define a vendor-neutral tracing API. Instrument your code against the API. Let the vendor provide the implementation.
What it got right:
- The abstraction was clean. Library authors could instrument against OpenTracing without depending on any specific vendor.
- It proved the concept: vendor-neutral instrumentation was possible and desirable.
What it got wrong:
- Tracing only. No metrics. No logs. You still needed Prometheus for metrics and something else for logs.
- No data model. OpenTracing defined an API but not what the data looked like on the wire. Each backend still defined its own format. This meant interoperability was limited to the API surface — not the actual data.
OpenCensus (2017)
OpenCensus, driven by Google and Microsoft, expanded the scope: tracing and metrics in a single project. It also shipped its own exporters, so you could send data to multiple backends.
What it got right:
- The scope was correct. Telemetry is traces + metrics + logs, not just one signal.
- It included a data model, not just an API. Data could be exported in a consistent format.
What it got wrong:
- Opinionated data model. OpenCensus had its own view of what metrics and traces should look like. Not every vendor agreed. Not every existing system could map cleanly to it.
- Competing with OpenTracing. Now the ecosystem had two “vendor-neutral” standards. Libraries had to pick one. Some picked OpenTracing. Some picked OpenCensus. Some threw up their hands and picked neither.
Why neither won
Two standards competing to be the standard is worse than having no standard at all. The ecosystem fragmented. Library authors didn’t know which one to back. Vendors had to support both. The problem OpenTracing and OpenCensus set out to solve — “too many ways to instrument” — they accidentally made worse.
The lesson: A standard only works if the ecosystem converges on it. Two competing standards produce three ecosystems: standard A, standard B, and “neither.”
The Merger: OpenTelemetry (2019)
In May 2019, the OpenTracing and OpenCensus teams made a rare and ego-less move. They announced a merger under the CNCF: OpenTelemetry. Both teams stepped back from their own projects to build one unified standard.
This wasn’t just a rebrand. The OTel team took the lessons from both predecessors and made a specific architectural decision that neither project had fully committed to.
The insight: separate the API from the implementation
This is the single most important design decision in OpenTelemetry, and it’s deceptively simple:
The API is a contract. The SDK is an implementation. They are separate packages.
Here’s what that means in practice:
```python
# ---- What a LIBRARY author does ---- #
# Install: pip install opentelemetry-api  (tiny, no dependencies)
from opentelemetry import trace

tracer = trace.get_tracer("my-library", "1.0.0")

def do_work():
    with tracer.start_as_current_span("do_work") as span:
        span.set_attribute("work.type", "important")
        # ... actual work ...
```

```python
# ---- What an APPLICATION owner does ---- #
# Install: pip install opentelemetry-sdk opentelemetry-exporter-otlp
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure the SDK — this is where you decide WHERE data goes
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4317"))
)
trace.set_tracer_provider(provider)

# Now every library that uses opentelemetry-api will produce real spans
```
Notice the separation:
- The library author imports `opentelemetry-api` only. This package is tiny, has no dependencies, and is no-op by default. If no SDK is configured, the tracer does nothing. Zero overhead. Zero side effects. This is why library authors can safely depend on it.
- The application owner imports `opentelemetry-sdk` and configures it once, at process startup. The SDK implements the API, hooks into the tracers that libraries already created, and ships the data wherever you configure it to go.
This separation is what makes OTel work at ecosystem scale. Library maintainers don’t need to know or care which backend you use. They instrument against the API. You, the application operator, wire everything together in one place.
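The mechanics of "no-op by default, real implementation when an SDK is installed" can be modeled in a few lines of plain Python. To be clear, this is a sketch of the pattern, not OTel's actual implementation; every class and function name here is invented for illustration, and the provider/tracer distinction is collapsed for brevity:

```python
# Toy model of the API/SDK split. All names are illustrative only.

# ---- "API" side: a no-op by default ----
class NoOpSpan:
    def set_attribute(self, key, value):
        pass  # does nothing: zero side effects without an SDK

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False

class NoOpTracer:
    def start_as_current_span(self, name):
        return NoOpSpan()

# Global provider; stays no-op unless an "SDK" swaps it out.
# (Real OTel has a provider that creates tracers; we merge the two here.)
_tracer_provider = NoOpTracer()

def get_tracer(name, version=None):
    return _tracer_provider

def set_tracer_provider(provider):
    global _tracer_provider
    _tracer_provider = provider

# ---- "SDK" side: a real implementation of the same interface ----
class RecordingSpan(NoOpSpan):
    def __init__(self, name, sink):
        self.name, self.attributes, self._sink = name, {}, sink

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def __exit__(self, *exc):
        self._sink.append(self)  # "export" when the span ends
        return False

class RecordingTracer:
    def __init__(self, sink):
        self._sink = sink

    def start_as_current_span(self, name):
        return RecordingSpan(name, self._sink)

# ---- Library code: only ever touches the "API" ----
def do_work():
    tracer = get_tracer("my-library")
    with tracer.start_as_current_span("do_work") as span:
        span.set_attribute("work.type", "important")

do_work()                 # no SDK configured: nothing happens
exported = []
set_tracer_provider(RecordingTracer(exported))
do_work()                 # the same library code now produces a span
print(exported[0].name)   # do_work
```

The library function `do_work` is identical in both calls; only the process-level wiring changed. That is the whole trick.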
Why this matters for the exam: The API/SDK separation is fundamental to OTel’s architecture. Expect questions about what the API provides (no-op behavior, tracer/meter/logger providers) vs. what the SDK provides (span processors, exporters, samplers, resource detection).
OTLP: Vendor-Neutral at the Wire Level
OpenTracing defined a vendor-neutral API but not a wire format. OpenCensus defined a data model but one that didn’t map to every backend. OTel needed a protocol that could carry all three signals — traces, metrics, and logs — in a single, well-defined format.
That protocol is OTLP — the OpenTelemetry Protocol.
What OTLP actually is
OTLP is a gRPC and HTTP protocol with a Protobuf schema. It defines exactly how telemetry data is serialized on the wire: what a span looks like, what a metric data point looks like, what a log record looks like. The schema is versioned and backward-compatible.
The practical implication: any OTLP-capable backend can receive data from any OTel SDK, without modification. You don’t need a Datadog exporter, a Jaeger exporter, and a Prometheus exporter in your application. You export OTLP. The backend (or the Collector) handles the rest.
What “vendor-neutral at the wire level” means
Before OTLP, switching from Jaeger to Grafana Tempo meant updating every service’s exporter configuration. Every service needed to know about the new backend’s wire format.
With OTLP and the Collector, it’s one line in one YAML file:
```yaml
# Collector config: switch from Jaeger to Tempo
exporters:
  # Before:
  # jaeger:
  #   endpoint: "jaeger-collector:14250"
  # After:
  otlp:
    endpoint: "tempo:4317"
    tls:
      insecure: true
```
Your services? Unchanged. They still export OTLP to the Collector. The Collector handles the routing. This is what “vendor-neutral at the wire level” actually means in practice — your instrumentation is decoupled from your backend choice at every layer.
The Four Components: How They Fit Together
Now we can see the full picture. OTel is four components, each solving a different problem:
```
┌──────────────────────────────────────────────────────┐
│                      YOUR CODE                       │
│          (services, libraries, frameworks)           │
└──────────────┬───────────────────────────────────────┘
               │ imports
               ▼
┌──────────────────────────────────────────────────────┐
│                       OTel API                       │
│         Stable contract. No-op by default.           │
│      Import this in libraries. Tiny footprint.       │
└──────────────┬───────────────────────────────────────┘
               │ implements
               ▼
┌──────────────────────────────────────────────────────┐
│                       OTel SDK                       │
│     Implementation. Configure once per process.      │
│          Samplers, processors, exporters.            │
└──────────────┬───────────────────────────────────────┘
               │ exports via OTLP
               ▼
┌──────────────────────────────────────────────────────┐
│                    OTel Collector                    │
│   Infrastructure component. Receives, transforms,    │
│     samples, routes. Runs as agent or gateway.       │
└──────────────┬───────────────────────────────────────┘
               │ exports to
               ▼
┌──────────────────────────────────────────────────────┐
│                     ANY BACKEND                      │
│  Grafana / Datadog / Jaeger / Tempo / Dynatrace /    │
│       New Relic / Splunk / your own storage          │
└──────────────────────────────────────────────────────┘
```
Let’s be precise about what each layer does:
OTel API
- What it is: Interfaces and no-op implementations for creating traces, metrics, and logs.
- Who uses it: Library authors, framework maintainers, anyone writing shared code.
- Key property: Adding `opentelemetry-api` to a library has zero runtime cost if no SDK is configured. The API calls resolve to no-ops.
- Stability: The API is the most stable part of OTel. Once it hits 1.0 for a signal, it doesn’t break.
OTel SDK
- What it is: The concrete implementation of the API. This is where the real work happens — creating spans, aggregating metrics, batching exports.
- Who uses it: Application owners. You configure the SDK at the entrypoint of your service.
- Key property: The SDK is where all the knobs are — sampling rates, export intervals, resource attributes, span processors. You own this configuration.
- Swappable: In theory, you could replace the official SDK with a vendor’s SDK that implements the same API. Some vendors do exactly this.
OTel Collector
- What it is: A standalone binary that receives telemetry (via OTLP or other protocols), processes it (filter, transform, sample, enrich), and exports it to one or more backends.
- Who uses it: Platform/infra teams. It runs as a sidecar, a DaemonSet agent, or a standalone gateway.
- Key property: The Collector decouples your services from your backends. Services export to the Collector. The Collector exports to backends. Change backends without touching services.
- Optional but recommended: You can export directly from the SDK to a backend. But the Collector gives you a central point for retries, batching, sampling, and routing.
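Because the SDK exports plain OTLP, moving a service from direct export to export-via-Collector is usually just an endpoint change, with no code modification. A sketch using the standard OTel SDK environment variables (the hostnames are placeholders for your own infrastructure):

```shell
# Direct-to-backend: the SDK exports straight to an OTLP-capable backend.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://my-backend:4317"

# Via Collector (recommended): the SDK exports to a nearby Collector,
# which owns retries, batching, sampling, and backend routing.
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4317"
```

Either way, the application code and its instrumentation stay identical; only the destination of the OTLP stream changes.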
OTLP
- What it is: The wire protocol. gRPC (port 4317) and HTTP/Protobuf (port 4318). Covers all three signals.
- Who uses it: Everything. SDKs export OTLP. The Collector receives and exports OTLP. Backends ingest OTLP.
- Key property: OTLP is the lingua franca. As long as both sides speak OTLP, they can exchange telemetry without knowing anything about each other’s internals.
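The port and path conventions above are worth committing to memory: gRPC on 4317, HTTP/Protobuf on 4318, and one `/v1/<signal>` path per signal for OTLP/HTTP. A small helper to make the convention concrete (the host is a placeholder; `otlp_http_url` is an invented name, not part of any OTel SDK):

```python
# Default OTLP ports: gRPC on 4317, HTTP/Protobuf on 4318.
OTLP_GRPC_PORT = 4317
OTLP_HTTP_PORT = 4318

def otlp_http_url(host: str, signal: str) -> str:
    """Build the default OTLP/HTTP ingest URL for a signal.

    OTLP/HTTP uses a distinct path per signal:
    /v1/traces, /v1/metrics, /v1/logs.
    """
    if signal not in ("traces", "metrics", "logs"):
        raise ValueError(f"unknown signal: {signal}")
    return f"http://{host}:{OTLP_HTTP_PORT}/v1/{signal}"

print(otlp_http_url("collector", "traces"))   # http://collector:4318/v1/traces
print(otlp_http_url("collector", "metrics"))  # http://collector:4318/v1/metrics
```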
The Rebuilt Mental Model
Let’s bring it all together. Replace the old assumption with the correct model:
OTel is a standard — like HTTP — not a library.
- You instrument against the API (stable, safe to depend on).
- The SDK ships the data (configurable, swappable).
- The Collector routes it (infrastructure concern, not application concern).
- OTLP carries it (vendor-neutral wire format).
Each layer is replaceable independently. That’s the whole point.
When someone says “we use OpenTelemetry,” they might mean any combination of these components. A library author using the API is “using OTel.” A platform team running the Collector is “using OTel.” A vendor accepting OTLP is “supporting OTel.” The term covers the entire standard, not a single artifact.
For the exam: When a question says “OpenTelemetry,” pay attention to which component it’s actually asking about. The API, SDK, Collector, and OTLP have different responsibilities, different stability guarantees, and different deployment models. Don’t conflate them.
What’s Next
In Module 1: Signals — The Unified Data Model, we’ll dig into the three telemetry signals — traces, metrics, and logs — and the context model that ties them together. You’ll see why OTel treats them as one correlated stream, not three separate pipelines, and what that means for how you design your instrumentation.