The Observability of Observability



Despite the promise of AIOps, the dream of fully automated, self-healing IT environments remains elusive. Generative AI tools may be the solution that finally abstracts away enough of the workload to get there. However, today’s reality is far more complex. Internet performance monitoring firm Catchpoint’s recent SRE Report 2025 found that for the first time ever, and despite—or perhaps because of—the growing reliance on AI tools, “the burden of operational tasks has grown.”

True, AI can smooth out thorny workflows, but doing so may have unexpected knock-on effects. For example, your system may use learned patterns to automatically suppress alerts, but this could cause your teams to miss novel events entirely. And AI won’t magically fix what’s outdated or broken: After implementing an AI solution, “issues often remain because change happens over time, not immediately,” Catchpoint’s Mehdi Daoudi explained to IT Brew. That’s in part because “making correlations between [the] different data types living in different data stores is error-prone and inefficient” even with the assistance of AI-powered tools, write Charity Majors, Liz Fong-Jones, and George Miranda in their forthcoming edition of Observability Engineering. And this is before taking into account the broader worry that overreliance on AI systems and AI agents will lead to the widespread erosion of human expertise.

It’s safe to say AIOps is a double-edged sword, cutting through complex processes with ease while introducing new forms of hidden complexity on the backswing. As with generative AI as a whole, the utility of a solution most often hinges on its reliability. Without insight into how AI tools are arriving at the decisions they make, you can’t be sure those decisions are trustworthy. Michelle Bonat, chief AI officer at AI Squared, calls this “the paradox of AI observability.” In short, as we delegate observability to intelligent systems, we reduce our ability to understand their actions—and, by extension, our monitoring systems themselves. What happens, then, when those systems fail, become unreliable, or misinterpret data? That’s why we need observability of our observability.


Why “Observability of Observability” Matters

IT ops teams are putting more of their trust in automated alerts, AI-driven root cause analysis, and predictive insights, but this confidence is built on shaky ground. There are already concerns about how effective current AI benchmarks are at assessing models, and benchmarks for AI agents are “significantly more complex” (and therefore less reliable). And observability presents its own task-specific complications:

The integrity of your data and data pipeline: If the data sources feeding your observability platform are faulty (e.g., dropped logs, misconfigured agents, high-cardinality issues from new services), or if data transformation pipelines within the observability stack introduce errors or latency, you’re in trouble from the start. You can’t address the problems you don’t see.

Model drift and bias: AI models tend to degrade or “drift” over time because of changes in system behavior or data, new application versions, or growing discrepancies between proxy metrics and actual results. And bias is a frequent problem for generative AI models. This is particularly vexing for observability systems, where properly diagnosing issues demands accurate analysis. You can’t trust the output of an AI model that develops biases or misinterprets signals in the data, but because the LLMs embedded in observability platforms often can’t explain how they reach their conclusions, these issues can be hard to spot without metaobservability.
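You don’t need the model to explain itself to catch some of this: comparing the statistical profile of the signals the model sees now against a trusted baseline window will surface many drift cases. Below is a minimal sketch in Python using a two-sample Kolmogorov–Smirnov test; the window sizes, threshold, and stand-in latency data are illustrative assumptions, not a prescription.

```python
import random

from scipy.stats import ks_2samp  # two-sample Kolmogorov-Smirnov test


def drift_detected(reference, current, p_threshold=0.01):
    """Flag drift when the current window's value distribution differs
    significantly from the reference window the model was tuned on."""
    result = ks_2samp(reference, current)
    return result.pvalue < p_threshold


# Stand-in data: latency samples (ms) from a baseline week vs. the last day.
# In practice both windows would come from your metrics store.
baseline_latencies = [random.gauss(120, 15) for _ in range(5000)]
todays_latencies = [random.gauss(160, 40) for _ in range(1440)]

if drift_detected(baseline_latencies, todays_latencies):
    print("Input distribution drift detected: re-validate the anomaly model")
```

In production you would pull both windows from your metrics store and run the check on a schedule, treating a persistently low p-value as a cue to re-validate or retrain rather than as a hard failure.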

Platform health and performance: Observability platforms are complex distributed systems—they have outages, performance degradation, and resource contention like any other. Keeping your primary source of truth healthy and performing reliably is crucial. But how will you know your monitoring tools are working properly without observability into the observability layer itself?

Your Observability Stack Is a Critical System. Treat It That Way.

The solution is simple enough: Apply the same monitoring principles to your observability tools as you do to your production applications. Of course, the devil’s in the details.

Metrics, logs, and traces: Telemetry data gives you insight into your system’s health and activity. You should be monitoring platform latency, data ingestion rates, query performance, and API error rates as well as AI-focused metrics like resource utilization of agents and collectors, time to first token, intertoken latency, and tokens per second if applicable. Collecting logs from your observability components will help you understand their internal behavior. And you can identify bottlenecks by tracing requests through your observability pipeline.
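As a concrete starting point, here’s a minimal sketch of instrumenting the pipeline itself with the OpenTelemetry Python API and SDK (OpenTelemetry comes up again below). The meter name, metric names, and console exporter are placeholders for whatever naming conventions and backend your stack actually uses.

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Export to the console for this sketch; a real deployment would point the
# reader at your metrics backend instead.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=15_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

# A meter dedicated to the observability pipeline itself (name is illustrative).
meter = metrics.get_meter("observability.pipeline")

ingested_events = meter.create_counter(
    "pipeline.events.ingested", unit="1",
    description="Telemetry events accepted by the ingestion tier",
)
query_latency = meter.create_histogram(
    "pipeline.query.duration", unit="ms",
    description="Latency of queries against the observability backend",
)
time_to_first_token = meter.create_histogram(
    "pipeline.ai.time_to_first_token", unit="ms",
    description="Responsiveness of the AI assistant layered on the platform",
)

# Record measurements wherever the pipeline handles a batch, a query, or an AI call.
ingested_events.add(512, {"source": "api-gateway"})
query_latency.record(84.0, {"query_type": "trace_search"})
time_to_first_token.record(430.0, {"model": "assistant-v1"})
```

The point is less these specific metrics than the habit: the collectors, queues, and query layers of the observability stack should emit telemetry through the same path as everything else, so a gap in the watcher shows up on the same dashboards.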

Data validation and quality checks: Standardizing observability data collection and consolidating your data streams gives stakeholders a unified view of system health—essential for understanding and trusting AI-driven decisions. OpenTelemetry is a particularly good foundation for this, as it offers portability for your data, helps you avoid vendor lock-in, and promotes consistent instrumentation across diverse services; it also enables better explainability by linking telemetry back to the points where decisions originate. But be sure to also implement automated checks on the quality and completeness of the data flowing into your observability tools (number of unique service names, expected metric cardinalities, timestamp drift, etc.), as well as alerts for anomalies in data collection itself (e.g., a sudden drop in log volume from a service). Like AI models themselves, your configuration will drift over time (a problem that fewer than one-third of organizations proactively monitor for). As Firefly’s Ido Neeman notes in The New Stack, “Partial IaC [Infrastructure as Code] adoption mixed with systematic ClickOps basically guarantees configuration divergence.”
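To make those automated checks concrete, here is a minimal sketch in plain Python that validates one collection interval against baseline expectations. The service names, thresholds, and the shape of the summary dict are assumptions you would replace with values derived from your own pipeline.

```python
import time

# Baseline expectations for a hypothetical environment; in practice these
# would be derived from historical data rather than hardcoded.
EXPECTED_SERVICES = {"checkout", "payments", "inventory"}
MIN_LOG_VOLUME_RATIO = 0.5       # alert if volume drops below 50% of baseline
MAX_TIMESTAMP_DRIFT_S = 300      # alert if events arrive >5 minutes skewed


def check_collection_health(summary: dict) -> list[str]:
    """Validate one collection interval's summary against baseline expectations.

    `summary` is an assumed shape:
      {"services_seen": set, "log_volume": int, "baseline_log_volume": int,
       "max_event_timestamp": float}
    Returns a list of human-readable problems (empty means healthy).
    """
    problems = []

    missing = EXPECTED_SERVICES - summary["services_seen"]
    if missing:
        problems.append(f"No telemetry from services: {sorted(missing)}")

    ratio = summary["log_volume"] / max(summary["baseline_log_volume"], 1)
    if ratio < MIN_LOG_VOLUME_RATIO:
        problems.append(f"Log volume dropped to {ratio:.0%} of baseline")

    drift = abs(time.time() - summary["max_event_timestamp"])
    if drift > MAX_TIMESTAMP_DRIFT_S:
        problems.append(f"Newest event timestamp is {drift:.0f}s off wall-clock time")

    return problems


# Example interval showing a silent service and a volume dip.
interval = {
    "services_seen": {"checkout", "payments"},
    "log_volume": 4_200,
    "baseline_log_volume": 11_000,
    "max_event_timestamp": time.time() - 20,
}
for problem in check_collection_health(interval):
    print("ALERT:", problem)
```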

Model monitoring and explainability: Honeycomb’s Austin Parker argues that the speed at which LLM-based observability tools can provide analysis is the real game changer, even though “they might be wrong a dozen times before they get it right.” (He’ll be discussing how observability can match the tempo of AI in more detail at O’Reilly’s upcoming Infrastructure & Ops Superstream.) That speed is an asset—but accuracy cannot be assumed. View results with skepticism. Don’t just trust the AI’s output; cross-reference it with simpler signals, and don’t discount human intuition. Better yet, demand insights into model behavior and performance, such as accuracy, false positives/negatives, and feature importance.1 It’s what Frost Bank CISO Eddie Contreras calls “quality assurance at scale.” Without this, your AI observability system will be opaque—and you won’t know when it’s leading you astray.
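One lightweight way to put numbers behind that skepticism is to score the AI layer’s alerts against what operators ultimately confirmed. The sketch below assumes you can export matching identifiers for AI-raised alerts and confirmed incidents over a review window; the sample IDs are invented for illustration.

```python
def score_ai_alerts(ai_alerts: set[str], confirmed_incidents: set[str]) -> dict:
    """Compare AI-raised alerts against operator-confirmed incidents.

    Both sets hold shared incident/alert identifiers. Returns precision,
    recall, and the raw error counts so trends can be tracked over time.
    """
    true_positives = len(ai_alerts & confirmed_incidents)
    false_positives = len(ai_alerts - confirmed_incidents)
    false_negatives = len(confirmed_incidents - ai_alerts)

    precision = true_positives / max(len(ai_alerts), 1)
    recall = true_positives / max(len(confirmed_incidents), 1)
    return {
        "precision": precision,
        "recall": recall,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }


# Invented review-window data: IDs of alerts the AI raised vs. incidents
# operators confirmed were real.
weekly = score_ai_alerts(
    ai_alerts={"inc-101", "inc-102", "inc-107", "inc-110"},
    confirmed_incidents={"inc-101", "inc-102", "inc-115"},
)
print(weekly)  # e.g. {'precision': 0.5, 'recall': 0.67, ...}
```

Tracked over time, those counts give you a concrete false positive/negative signal even when the model itself can’t explain its reasoning.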

The Evolving Role of the Engineer

AI is adding new layers of complexity and criticality to IT ops, but that doesn’t diminish the software engineer’s role. Ben Lorica contends that the “‘boring’ truth about successful AI” is that “the winners…will be defined not just by the brilliance of their models, but by the quiet efficiency and resilience of the infrastructure that powers them.” Considering this “truth” from another angle, CISO Series host David Spark asks, “Are we creating an AI-on-AI arms race when what we really need is basic engineering discipline, logging, boundaries, and human-readable insight?”

Good engineering practices will always outperform “using AI to solve your AI problems.” As Yevgeniy Brikman points out in Fundamentals of DevOps and Software Delivery, “The most important priorities are typically security, reliability, repeatability, and resiliency. Unfortunately, these are precisely GenAI’s weak areas.” That’s why the quiet reliability Lorica and Spark champion requires continuous, intentional oversight—even of tools that claim to automate oversight itself.2 Engineers are now the arbiters of trust and reliability, and the future belongs to those who can observe not just the application but also the tools we’ve entrusted to watch it.


Start building metaobservability into your systems with O’Reilly 
On August 21, join host Sam Newman and an all-star lineup of observability pros for the Infrastructure & Ops Superstream on AI-driven operations and observability. You’ll get actionable strategies you can use to enhance your traditional IT functions, including automating crucial tasks such as incident management and system performance monitoring. It’s free for O’Reilly members. Save your seat here.

Not a member? Sign up for a free 10-day trial to attend—and check out all the other great resources on O’Reilly.


Footnotes

  1. For a detailed look at what’s required, see Chip Huyen’s chapter on evaluating AI systems in AI Engineering and Abi Aryan’s overview of monitoring, privacy, and security in LLMOps. Aryan will also share strategies for observability at each stage of the LLM pipeline at O’Reilly’s upcoming Infrastructure & Ops Superstream.
  2. Just where humans belong in the loop is an open question: Honeycomb SRE Fred Hebert has shared a useful list of questions to help you figure it out for your specific circumstances.

