
Most AI teams think they have observability. What they actually have is instrumentation.
Latency dashboards. Error rates. Cost per call. These are useful numbers — but they’re averages, and averages lie. They tell you that something changed. They don’t tell you what changed, for whom, or why.
Real observability means understanding the behavioral patterns inside your production data — the user segments where your agent degrades, the query types it handles inconsistently, the failure modes that weren’t anticipated. Most teams never get there. Not because they’re not trying, but because they’re measuring the wrong things.
After working with AI teams across the industry, we’ve identified four distinct stages of observability maturity. Here’s what each one looks like — and where the transitions break down.
You have monitoring. You have dashboards. When your error rate spikes, you know about it.
What you don’t have: any understanding of behavioral variation.
Imagine your agent starts giving worse answers to enterprise customers, but your aggregate quality score barely moves, because enterprise traffic is a small slice of total volume and your SMB customers are fine. You’d never see it. The signal is buried inside the average.
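To make that concrete, here’s a minimal sketch with made-up numbers (the field names, scores, and 90/10 traffic split are all hypothetical) showing how a per-segment breakdown exposes exactly the drop the overall average hides:

```python
from statistics import mean

# Hypothetical per-trace quality scores (e.g. from an LLM-as-judge eval).
# The numbers and the 90/10 traffic split are illustrative only.
traces = (
    [{"segment": "smb", "quality": 0.92} for _ in range(900)]
    + [{"segment": "enterprise", "quality": 0.61} for _ in range(100)]
)

overall = mean(t["quality"] for t in traces)
by_segment = {
    seg: mean(t["quality"] for t in traces if t["segment"] == seg)
    for seg in {t["segment"] for t in traces}
}

print(f"overall quality: {overall:.2f}")  # ~0.89, looks healthy on a dashboard
print(f"per segment:     {by_segment}")   # enterprise sits at 0.61
```

The top-line number still looks fine; only the segment breakdown shows that enterprise answers have fallen off a cliff.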
Teams in the Aggregate Trap typically discover problems one of two ways: users report them, or a blunt threshold alert fires. Both are lagging indicators. By the time you know something is wrong, it’s been wrong for a while.
The core issue isn’t that your monitoring is bad — it’s that aggregate metrics can only detect changes that affect everyone, uniformly, at the same time. Real production AI problems rarely work that way.
This is where most teams with “good” observability actually live — and it’s the stage that’s hardest to recognize from the inside, because things feel like they’re working.
You have tracing. You have evals. You might have custom segments you track — query categories, user cohorts, topic buckets. When something goes wrong, you can pull traces and investigate. You’re proactive about reviewing sampled data.
The problem is the word sampled. And the word predefined.
When you sample traces, you’re making a bet that the interesting patterns are distributed randomly across your data. They’re not. Anomalous behavior clusters — and random sampling will systematically miss low-frequency, high-impact patterns.
When you use predefined segments, you can only find patterns you already suspected existed. The entire category of unknown unknowns — the behavioral clusters that would change how you prioritize your roadmap — stays invisible.
You’re doing real work. You’re just doing it inside a box whose walls you can’t see.
Teams at this stage are doing a lot right. Enriched trace data. Custom NLP metrics. Topic classifications. LLM-as-judge scores. Proactive behavioral investigation.
The challenge is that none of it scales.
Your data science team runs ad hoc analyses. They find patterns, document them, add them to dashboards. Two months later, user behavior has shifted and the patterns they documented no longer reflect what’s actually happening in production. The analysis is always chasing the product.
The other scaling problem: coverage. As trace volume grows, the percentage of your data that any human actually looks at approaches zero. You can have excellent analytical frameworks and still miss the pattern that matters most this week, because it only shows up in 0.3% of traces: thousands of examples at scale, but effectively invisible to random sampling.
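To put a rough number on it, take the 0.3% rate above and assume a manual review budget of 200 traces a week (the budget is an assumption):

```python
# Back-of-the-envelope: how often does manual review surface a pattern that
# appears in 0.3% of traces? The weekly review budget is an assumption.
pattern_rate = 0.003
reviewed_per_week = 200

expected_hits = pattern_rate * reviewed_per_week              # ~0.6 traces
p_at_least_one = 1 - (1 - pattern_rate) ** reviewed_per_week  # ~45%

print(f"expected matching traces in the sample: {expected_hits:.1f}")
print(f"chance the sample contains even one:    {p_at_least_one:.0%}")
```

And even in the weeks where an instance does land in the sample, a single odd trace among hundreds rarely registers as a pattern rather than a one-off.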
The teams who break through this stage are the ones who stop asking “how do we analyze our data better?” and start asking “how do we make pattern discovery continuous and automatic?”
Very few teams operate here. The ones that do share a few characteristics.
First, pattern discovery is automated and continuous. New behavioral signals are surfaced from production data without anyone having to think to look for them. The system discovers the unknown unknowns.
Second, production data drives curation. Instead of randomly sampling traces for fine-tuning or eval sets, they select based on discovered behavioral signals, which means their training data actually reflects the failure modes that matter (see the sketch below).
Third, their understanding of agent behavior compounds over time. Every week they know more about how their product behaves in production than they did the week before, in a systematic way that accumulates rather than churns.
The result is that improvement cycles get faster, not slower, as the product scales.
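What that loop looks like in practice varies, but one minimal sketch goes roughly: embed traces, cluster them to surface behavioral groups nobody predefined, score each discovered cluster, and curate examples from the weakest ones instead of sampling at random. Everything below is illustrative, not a prescribed implementation: the trace data is hypothetical, TF-IDF stands in for a real embedding model, and the per-trace quality score is assumed to already exist.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical traces: text plus a quality score (e.g. from an LLM-as-judge eval).
traces = [
    {"text": "refund request for duplicate enterprise invoice", "quality": 0.4},
    {"text": "how do I reset my password",                       "quality": 0.9},
    {"text": "invoice shows the wrong billing entity",           "quality": 0.5},
    {"text": "password reset link expired",                      "quality": 0.9},
    {"text": "charged twice on the enterprise plan",             "quality": 0.3},
    {"text": "can't log in after password change",               "quality": 0.8},
]

# 1. Embed. TF-IDF stands in for a semantic embedding model here.
X = TfidfVectorizer().fit_transform([t["text"] for t in traces])

# 2. Cluster to surface behavioral groups nobody predefined.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# 3. Score each discovered cluster; low-quality clusters are the signal.
clusters = defaultdict(list)
for trace, label in zip(traces, labels):
    clusters[label].append(trace)

for label, members in sorted(clusters.items()):
    avg_quality = np.mean([m["quality"] for m in members])
    print(f"cluster {label}: avg quality {avg_quality:.2f}, {len(members)} traces")

# 4. Curate from the weakest cluster instead of sampling at random.
worst = min(clusters.values(), key=lambda ms: np.mean([m["quality"] for m in ms]))
eval_candidates = sorted(worst, key=lambda m: m["quality"])[:3]
print("curation candidates:", [c["text"] for c in eval_candidates])
```

A production version would run continuously over new traces, use semantic embeddings and a density-based clusterer, and track clusters week over week. The step that matters is the last one: examples get selected because a discovered signal says they matter, not because a random draw happened to include them.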
The Manual Investigation Plateau is seductive. You feel like you’re on top of your data. You have processes. You’re doing the work.
What makes it hard to leave is that the problems it creates are invisible by definition. You don’t know what patterns you’re missing. You don’t see the user segments you’re not segmenting. The unknown unknowns don’t show up in your dashboards as gaps — they just don’t show up at all.
Closing the gap requires a different kind of tool: one that analyzes behavioral dimensions across your entire production dataset, continuously, without requiring you to define what you’re looking for in advance.
We built a 7-question self-assessment to help AI teams find out which stage they’re actually at — not which stage they think they’re at.
It takes 2 minutes. Each answer reveals an insight about what your current approach is missing. At the end, you get a scored result with a specific recommendation for how to move to the next stage.

