
From Nagios to Agentic SRE: 20 Years of Production Visibility


I’ve been running production systems for a long time. I’ve been paged by Nagios at 2am, drowned in Splunk dashboards at 11pm, stared at Datadog service maps trying to find which of 200 microservices decided to become the bottleneck.

What I’ve watched happen over 20 years isn’t just tooling churn. It’s a fundamental shift in how we think about understanding running software — from “is it up?” to “what is it doing and why?”. And right now, we’re in the middle of a third shift that will change SRE as a discipline.

This is the story of that evolution.


Era 1: Monitoring — Is It Up?

The first generation of production visibility was built around a simple question: is the thing running?

Tools like Nagios (1999), Zabbix, and PRTG operated on a polling model. Every N seconds, a check script ran against a host or service. It returned OK, WARNING, or CRITICAL. If CRITICAL, page the on-call engineer.

```mermaid
graph LR
  A[Nagios poller] -->|ping / check_http| B[Server]
  B -->|OK / CRITICAL| A
  A -->|alert| C[PagerDuty / email]
```
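The check-script contract was deliberately simple. A minimal sketch of a Nagios-style plugin (the thresholds and the `check_http` helper here are illustrative, not a real plugin): the message goes to stdout, and the exit code is the state Nagios acts on — 0 for OK, 1 for WARNING, 2 for CRITICAL.

```python
# Sketch of a Nagios-style check plugin. Nagios reads the first line of
# stdout for display and the process exit code for state:
# 0 = OK, 1 = WARNING, 2 = CRITICAL.
import sys

OK, WARNING, CRITICAL = 0, 1, 2

def check_http(status_code: int, response_ms: float) -> tuple[int, str]:
    """Map an observed HTTP status and latency to a Nagios state.
    Thresholds here are illustrative."""
    if status_code >= 500:
        return CRITICAL, f"CRITICAL - HTTP {status_code}"
    if response_ms > 2000:
        return WARNING, f"WARNING - response took {response_ms:.0f}ms"
    return OK, f"OK - HTTP {status_code} in {response_ms:.0f}ms"

if __name__ == "__main__":
    state, message = check_http(200, 340.0)
    print(message)   # Nagios displays this line next to the service
    sys.exit(state)  # ...and alerts on this exit code
```

Everything the poller knows about the service is compressed into that one exit code — which is exactly the ceiling described next.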

This worked well for the infrastructure of the time: a manageable number of physical servers, monolithic applications, predictable failure modes. When the web server was down, Nagios told you the web server was down.

The ceiling: You could detect that something was wrong. You couldn’t understand why it was wrong, or what specifically inside the application had failed. You knew the web server returned HTTP 500 — but which function call? Which database query? Which downstream dependency?

The tool also required you to know what to monitor before the incident. You wrote checks for things you’d thought of. Novel failure modes — the ones you hadn’t anticipated — were invisible until a user called.


Era 2: Metrics and Dashboards — How Is It Behaving?

As systems grew more complex and cloud infrastructure emerged, monitoring evolved. The question expanded from “is it up?” to “how is it performing?”

Graphite (2006), StatsD, and later Prometheus (2012) introduced time-series metrics at a granularity the check-based monitoring tools couldn’t match. You could track not just up/down but request rate, error rate, latency percentiles, queue depths, and cache hit ratios. The RED method (Rate, Errors, Duration) gave teams a structured vocabulary for service health.

```mermaid
graph TB
  A[Application] -->|StatsD / Prometheus scrape| B[Metrics store]
  B --> C[Grafana dashboards]
  B --> D[Alertmanager]
  D -->|alert rule fired| E[On-call]
```
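Part of what made this era take off was how cheap emitting a metric became. StatsD’s wire protocol is a single plain-text datagram, `<name>:<value>|<type>` over UDP — a sketch of a minimal client (metric names and the helper functions are illustrative):

```python
# Minimal sketch of a StatsD-style client. The line protocol is plain text
# over UDP: "web.requests:1|c" increments a counter, "web.latency:320|ms"
# records a timing.
import socket

def statsd_packet(name: str, value: float, metric_type: str) -> bytes:
    """Encode one metric in the StatsD line protocol."""
    return f"{name}:{value}|{metric_type}".encode()

def send_metric(name: str, value: float, metric_type: str = "c",
                host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send: the application never blocks or fails
    because of the metrics path."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_packet(name, value, metric_type), (host, port))
    finally:
        sock.close()
```

The fire-and-forget UDP design is the point: instrumentation that can’t hurt the request path gets sprinkled everywhere, which is how dashboard culture became possible.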

This was a genuine leap. Dashboard culture arrived. Golden signals became a shared language. Alerting improved because you could alert on error_rate > 1% rather than “web server is down.”

The ceiling: Metrics are aggregates. They tell you something is wrong — error rate spiked, latency p99 jumped — but they don’t tell you which requests were affected, why those specific requests failed, or which path through your distributed system caused the problem. A metric is a summary; it discards the detail that produced it.

Logs filled some of this gap. But unstructured logs at scale meant text search through terabytes of output, correlating timestamps manually across services, and hoping the relevant error was actually logged.

The deeper problem: as systems moved from monoliths to microservices, a single user request started touching 5, 10, 20 services. A metric spike in service A didn’t tell you whether A caused the problem or was downstream of a problem in service C.


Era 3: Observability — What Is It Doing and Why?

The word observability comes from control theory: a system is observable if you can infer its internal state from its external outputs. Applied to software, it means: can you answer arbitrary questions about your production system using the telemetry it emits — including questions you didn’t think to ask when you wrote the code?

The three signals that make up modern observability aren’t new. What changed was understanding them as a unified model rather than separate tools.

```mermaid
graph TD
  subgraph "The Three Signals"
    T[Traces — what happened, in sequence]
    M[Metrics — how often, how fast, how many]
    L[Logs — what was happening at this moment]
  end
  T <-->|correlated by trace ID| L
  M <-->|time-correlated| L
  T <-->|service, operation labels| M
```

Traces give you the end-to-end story of a single request: which services it touched, in what order, how long each step took, what errors occurred at each hop. A distributed trace is the only way to reconstruct causality across service boundaries.

Metrics give you population-level statistics over time: how many requests per second, what fraction were errors, what the latency distribution looked like. They’re cheap to store and query, and perfect for alerting and trend analysis.

Logs give you arbitrary event records at specific moments: what the system was thinking when something happened, captured at human-readable detail. The key insight is that logs attached to a trace ID become queryable in trace context — you can jump from a slow span to its logs without manual timestamp correlation.
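That trace-ID correlation can be sketched in a few lines. The `Span` shape below is a simplified stand-in, not a real SDK — but it shows the mechanism: every log record emitted inside a request is stamped with that request’s trace ID, so finding a slow span’s logs is a filter, not a timestamp hunt.

```python
# Sketch of trace/log correlation: logs stamped with the active trace_id
# become queryable in trace context. (Simplified stand-in for an SDK.)
import secrets
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    trace_id: str
    logs: list = field(default_factory=list)

    def log(self, message: str) -> None:
        # Structured log record carrying the span's trace context.
        self.logs.append({"trace_id": self.trace_id, "message": message})

def start_trace(name: str) -> Span:
    # 128-bit trace ID, hex-encoded, as in real tracing systems.
    return Span(name=name, trace_id=secrets.token_hex(16))

span = start_trace("GET /checkout")
span.log("cart loaded")
span.log("payment authorized")

# "Jump from a slow span to its logs": filter by the span's trace_id.
correlated = [l for l in span.logs if l["trace_id"] == span.trace_id]
```

In a real system the logs live in a separate store, but the join key is the same: the trace ID propagated through the request.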

Google’s Dapper paper (2010) described this model. Zipkin (2012) and Jaeger (2016) brought it to open source. But the missing piece was standardization.

Every tracing vendor invented their own wire format, SDK, and agent. Migrating from Zipkin to Jaeger meant re-instrumenting your application. Instrumenting for both Datadog and your in-house Prometheus setup meant running two SDKs. The instrumentation itself became vendor lock-in.


The Inflection Point: OpenTelemetry

OpenTelemetry (2019) merged the OpenTracing and OpenCensus projects under the CNCF and answered one question: what if there was a single, vendor-neutral way to emit traces, metrics, and logs?

```mermaid
graph LR
  subgraph "Your Application"
    SDK[OTel SDK]
  end
  SDK -->|OTLP| C[OTel Collector]
  C --> D[Datadog]
  C --> E[Grafana / LGTM]
  C --> F[Jaeger]
  C --> G[Honeycomb]
  C --> H[Your own backend]
```

You instrument once. The SDK produces OTLP — the OpenTelemetry Protocol — a standardized wire format over gRPC or HTTP. The Collector receives it and routes it anywhere. Switching backends becomes a config change, not a re-instrumentation project.
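The Collector’s fan-out role is the key architectural move, and it’s small enough to sketch as a toy (the exporter names and shapes here are illustrative, nothing like the real Collector’s internals):

```python
# Toy sketch of the Collector's fan-out: the application emits once,
# and routing to backends is configuration, not code.
from typing import Callable

class Collector:
    def __init__(self) -> None:
        self.exporters: list[Callable[[dict], None]] = []

    def add_exporter(self, exporter: Callable[[dict], None]) -> None:
        # Adding or removing a backend is a "config change" here...
        self.exporters.append(exporter)

    def receive(self, otlp_batch: dict) -> None:
        # ...and the application-facing ingest path never changes.
        for export in self.exporters:
            export(otlp_batch)

received = []
collector = Collector()
collector.add_exporter(lambda batch: received.append(("jaeger", batch)))
collector.add_exporter(lambda batch: received.append(("grafana", batch)))
collector.receive({"resource": "checkout-svc", "spans": ["..."]})
```

One ingest, N destinations: that asymmetry is what turns a backend migration from a re-instrumentation project into a config edit.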

This matters because it decouples two concerns that were previously fused:

- how telemetry is produced — the instrumentation and SDK inside your application;
- where telemetry is sent and analyzed — the backend that stores and queries it.

For the first time, your application’s instrumentation has nothing to do with your observability vendor. This is the architectural inflection point that makes the next era possible.


The Observability Stack as Layers

One of the most important mental models for understanding modern observability is that telemetry comes from different layers of your system — and each layer has different ownership, different tooling, and answers different questions.

```mermaid
graph TB
  subgraph "Application Layer"
    A1[Distributed traces — request context]
    A2[Custom business metrics — claims processed, orders converted]
    A3[Structured logs — application events]
  end
  subgraph "Platform Layer"
    P1[Container metrics — CPU, memory, restarts]
    P2[Service mesh traces — network latency, retries]
    P3[Kubernetes events — pod scheduling, OOM kills]
  end
  subgraph "Infrastructure Layer"
    I1[Host metrics — disk, network, system calls]
    I2[Hypervisor metrics — noisy neighbour, CPU steal]
    I3[Cloud provider metrics — RDS latency, S3 throttling]
  end
  A1 & A2 & A3 --> OTel[OTel Collector]
  P1 & P2 & P3 --> OTel
  I1 & I2 & I3 --> OTel
  OTel --> Backend[Observability Backend]
```

Infrastructure layer — physical or virtual hosts, network fabric, storage. Primarily owned by platform/infra teams. Metrics from node exporters, cloud provider APIs, hypervisors. Tells you: is the hardware/cloud behaving? High CPU steal? Disk saturation? Network packet loss?

Platform layer — Kubernetes, service mesh (Istio, Linkerd), load balancers, managed databases. Owned by platform engineering. Tells you: is the orchestration layer healthy? Are pods being evicted? Is the mesh injecting latency?

Application layer — your code. Owned by development teams. This is where distributed tracing lives, where business-meaningful metrics are emitted, where structured log events capture what the application was doing. Only you know what “a claim submission” means — no infrastructure tool can emit that context for you.

The layered model explains why a full observability solution requires cooperation across teams. An SLO breach might be caused by a slow database query (app layer), a network partition (platform layer), or CPU throttling on the host (infra layer). Correlating signals across layers — the same trace ID in both app logs and service mesh access logs, for instance — is how you identify root cause across ownership boundaries.
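The cross-layer join can be sketched concretely (the log shapes and field names below are illustrative): the same trace ID appears in both the application’s logs and the mesh’s access logs, so one filter gathers every record for a request regardless of which team owns the emitting layer.

```python
# Sketch of cross-layer correlation by trace ID. Field names and values
# are illustrative stand-ins for real app and mesh access logs.
app_logs = [
    {"trace_id": "abc123", "layer": "app", "event": "slow db query", "ms": 900},
]
mesh_logs = [
    {"trace_id": "abc123", "layer": "mesh", "event": "upstream rt", "ms": 950},
    {"trace_id": "def456", "layer": "mesh", "event": "upstream rt", "ms": 12},
]

def correlate(trace_id: str, *log_sources: list) -> list:
    """Gather every record for one request across ownership boundaries."""
    return [rec for source in log_sources
            for rec in source if rec["trace_id"] == trace_id]

incident = correlate("abc123", app_logs, mesh_logs)
```

Here both layers report roughly the same latency for the same trace, which points at the application’s query rather than the network path — the kind of attribution the layered model exists to support.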


Era 4: AIOps — Can the Machine Understand It?

With a decade of structured telemetry at scale, the machine learning question became inevitable: can we use all this signal to automate the work of understanding production?

AIOps (AI for IT Operations) emerged as a category in the mid-2010s. The initial applications were:

- anomaly detection — flagging metric behaviour that deviates from a learned baseline rather than a static threshold;
- alert correlation and deduplication — collapsing the flood of alerts a single incident generates into one actionable event.

The limitation of first-generation AIOps was that the underlying telemetry was fragmented. If your traces are in Jaeger, your metrics in Prometheus, and your logs in Splunk — all with different schemas, different service naming, no correlation IDs — an AI model has a hard time reasoning across them. Garbage in, garbage out.

OpenTelemetry changes this. With a unified data model, correlated trace IDs across all three signals, and semantic conventions that standardize how things like http.request.method or db.system are named, the machine finally has a coherent picture to reason about.
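To see why the conventions matter, consider what a model faces without them. A sketch (the legacy names and mapping table are hypothetical) of the normalization that semantic conventions make unnecessary:

```python
# Sketch of attribute normalization. Without semantic conventions, every
# service invents its own keys and someone must maintain a mapping like
# this; with them, the data arrives already consistent.
LEGACY_TO_OTEL = {
    "http_status_code": "http.response.status_code",
    "http_method": "http.request.method",
    "db_type": "db.system",
}

def normalize(attributes: dict) -> dict:
    """Rename legacy attribute keys to their OTel semantic-convention names,
    passing unknown keys through unchanged."""
    return {LEGACY_TO_OTEL.get(k, k): v for k, v in attributes.items()}

span_attrs = {"http_status_code": 500, "db_type": "postgresql"}
normalized = normalize(span_attrs)
```

Every entry in a mapping table like this is a place for silent drift; semantic conventions delete the table.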


Era 5: Agentic SRE — Can the Machine Act?

We’re now at the leading edge of the next shift. The question is no longer just “can the machine understand it?” but “can the machine respond to it?”

Agentic SRE frameworks combine large language models with tool use — the ability to call observability APIs, run queries, execute runbooks, make rollback decisions — in an autonomous or semi-autonomous loop.

```mermaid
graph LR
  A[Alert fires] --> B[SRE Agent]
  B -->|query Tempo| C[Retrieve affected traces]
  B -->|query Prometheus| D[Get metric context]
  B -->|query Loki| E[Get correlated logs]
  C & D & E --> B
  B -->|hypothesis| F{Confidence?}
  F -->|high| G[Execute runbook / rollback]
  F -->|low| H[Page human with context]
```

The agent doesn’t replace the on-call engineer. It reduces mean time to diagnosis (MTTD) by doing the first 10 minutes of work automatically: pulling the relevant traces, correlating the error with recent deployments, checking whether similar incidents have occurred before, and presenting a structured hypothesis to the human.
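The confidence gate is the crux of that loop. A toy sketch, with every function a hypothetical stand-in for real query and runbook integrations:

```python
# Toy sketch of a confidence-gated triage loop: gather evidence from the
# three signals, form a hypothesis, act only above a confidence threshold.
# All names here are hypothetical stand-ins for real integrations.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    summary: str
    confidence: float  # 0.0 - 1.0

def form_hypothesis(evidence: dict) -> Hypothesis:
    # Placeholder logic; in practice an LLM reasons over the evidence.
    if "deploy" in evidence["logs"]:
        return Hypothesis("rollback last deploy", 0.9)
    return Hypothesis("unknown latency source", 0.4)

def triage(alert: dict, query_traces, query_metrics, query_logs,
           threshold: float = 0.8) -> str:
    """Do the 'first 10 minutes' automatically, then gate on confidence."""
    evidence = {
        "traces": query_traces(alert),
        "metrics": query_metrics(alert),
        "logs": query_logs(alert),
    }
    hypothesis = form_hypothesis(evidence)
    if hypothesis.confidence >= threshold:
        return f"runbook: {hypothesis.summary}"          # act autonomously
    return f"page human with context: {hypothesis.summary}"

action = triage({"alert": "p99 spike"},
                query_traces=lambda a: ["span-1"],
                query_metrics=lambda a: {"p99_ms": 2400},
                query_logs=lambda a: "deploy v2.3 rolled out 14:02")
```

Note that the low-confidence branch still pages a human — but with the evidence already assembled, which is where the time savings actually come from.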

This only works if the telemetry is clean, correlated, and semantically consistent. An agent that has to reason about why http_status_code in one service and http.response.status_code in another might be the same thing will hallucinate before it diagnoses. An agent with OTel-standard telemetry — consistent naming, correlated trace IDs, structured attributes — can navigate the data reliably.

OpenTelemetry is the prerequisite for Agentic SRE. The companies getting value from AI in their ops workflows are the ones who spent the last few years building clean, correlated, standardized telemetry. That work pays compound interest.


Where We Are Now

The evolution looks like this in retrospect:

| Era | Question | Tooling | Limitation |
| --- | --- | --- | --- |
| Monitoring | Is it up? | Nagios, Zabbix, PRTG | No context, known-unknowns only |
| Metrics + Dashboards | How is it behaving? | Prometheus, Graphite, Grafana | Aggregates hide per-request causality |
| Observability | What is it doing and why? | OTel, Jaeger, Tempo, Honeycomb | Requires good instrumentation discipline |
| AIOps | Can the machine understand it? | Anomaly detection, alert correlation | Fragmented data undermines ML accuracy |
| Agentic SRE | Can the machine act on it? | LLM + tool use + runbooks | Requires clean, correlated, standard telemetry |

Each era didn’t replace the prior one — it added a layer. You still need uptime checks. You still need dashboards. You still need logs. What changed is what you can do with the data once you have it.

The reason to care about OpenTelemetry right now isn’t just cleaner dashboards. It’s that OTel-standard telemetry is the foundation that every layer above it depends on. Anomaly detection is only as good as the consistency of your metric names. An SRE agent is only as useful as the quality of the traces it can query.

Instrument properly now. The compound interest starts immediately.


Want to go from concept to practice? The otel.guru course walks you through OTel signals, context propagation, the Collector, and semantic conventions — hands-on, with a real polyglot lab environment.

