
Distributional Co-Founder & CEO Scott Clark recently led a lightning lesson hosted by Jason Liu as part of his series of talks helping builders successfully develop, deploy, and scale AI.
The observability hierarchy for AI systems has three distinct layers: logging and tracing to understand what happened in a specific session, monitoring and evals to determine whether the system is up and passing defined checks, and behavioral analytics to understand patterns across populations of agents and users. This talk focused on the analytics layer of AI observability. You can watch the lesson recording here:
And here are the lessons that Jason shared with the participants after the session.
Traditional monitoring tells you if your system is up and whether your evals are passing, while logging and tracing let you debug specific sessions. Analytics fills the gap between these two extremes by helping you discover, understand, track, and prioritize hidden behavioral signals across many sessions. Analytics is about discovery, not direct diagnosis. It surfaces candidate patterns that require human judgment to assess importance and guide downstream debugging and product decisions.
Teams typically climb this hierarchy in order: basic logging and tracing to see what's happening, then monitoring to know the system is up and evals are passing, and finally analytics to understand behavioral patterns across the entire user base.
Think of it like product analytics for traditional web apps, but instead of tracking users through funnels, you’re tracking agents as the atomic unit. You want to understand how many different agents across many sessions perform specific tasks and what patterns or sub-behaviors emerge.
This approach completes the AI data flywheel. By knowing what to look for in production data, you can create better evals, better reward functions for fine-tuning or reinforcement learning, and identify specific issues to address through prompt engineering or system improvements. The flywheel works as a continuous loop: observe production behavior, detect emergent patterns, convert those patterns into evals or reward signals, improve system behavior, and repeat.
The system operates as an unsupervised data flywheel. First, trace data from your agentic system gets enriched with behavioral signals. These can be LLM-as-judge evals, classic NLP statistical measures, or signals from other tools. The goal is adding as much enriched signal as possible to whatever data your app naturally produces. Unsupervised methods matter because you cannot label or define failures you do not yet know exist.
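As a deliberately simplified illustration of the enrichment step, the sketch below attaches a few cheap statistical signals to a raw trace record. The field names (`output`, `tool_calls`, `judge_quality`) are hypothetical, not Distributional's actual schema, and a real pipeline would add many more signals, including LLM-as-judge scores filled in by a separate pass.

```python
# Hypothetical sketch of trace enrichment: attach weak behavioral signals
# to each raw trace record. Signal names and heuristics are illustrative.

def enrich_trace(trace: dict) -> dict:
    """Add simple statistical signals to a raw trace record."""
    output = trace.get("output", "")
    words = output.split()
    signals = {
        # Classic NLP-style measures (cheap, no model call needed).
        "output_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "tool_calls": len(trace.get("tool_calls", [])),
        # Placeholder for an LLM-as-judge score filled in by a separate pass.
        "judge_quality": trace.get("judge_quality"),
    }
    return {**trace, "signals": signals}

enriched = enrich_trace({
    "input": "How many PTO days do I have left?",
    "output": "You have 12 days of PTO remaining this year.",
    "tool_calls": [{"tool": "hr_lookup", "args": {"field": "pto"}}],
})
```

Whatever the exact signals, the point is that each enriched trace becomes a structured record that downstream analysis can treat as a vector.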
Every trace gets a behavioral vector in high-dimensional space representing what actually happened. The analysis phase looks at distributions of these vectors across many traces to pull out sub-pockets that represent infrequent behaviors or patterns correlated with cost, latency, or quality issues.
These subclusters get fed into an LLM backend to generate insights. The system uses cheaper, faster models for high-volume behavioral evaluation, then more capable mid-weight models with reasoning capabilities for final insight generation and fix recommendations.
The philosophy is “many weak signals are better than a single strong signal.” Instead of trying to create one perfect eval for quality, you combine multiple signals around frustration, tone, verbosity, reading level, and other dimensions. Through clustering, you extract the strong signal for the performance you actually care about.
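One way to picture how weak signals combine: treat each trace as a point in signal space and surface the pockets that sit far from the bulk of the population. The sketch below is an illustrative stand-in (a simple z-score on distance from the population mean), not the clustering Distributional actually uses, and the signal names are hypothetical.

```python
import math

# Illustrative outlier detection over behavioral vectors. Each dimension is
# one weak signal; a real system would use many more signals and proper
# clustering rather than this single-threshold z-score.

def signal_vector(trace):
    # Hypothetical weak signals, each normalized to roughly 0-1.
    return [
        trace["frustration"],    # e.g. judge-scored user frustration
        trace["verbosity"],      # normalized response length
        trace["reading_level"],  # normalized grade level
    ]

def outlier_traces(traces, z_threshold=2.0):
    vectors = [signal_vector(t) for t in traces]
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    dists = [math.dist(v, means) for v in vectors]
    mu = sum(dists) / len(dists)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in dists) / len(dists)) or 1.0
    # Flag traces whose distance from the population mean is anomalous.
    return [t for t, x in zip(traces, dists) if (x - mu) / sigma > z_threshold]

population = [{"frustration": 0.1, "verbosity": 0.5, "reading_level": 0.5}] * 20
anomaly = {"frustration": 0.9, "verbosity": 0.9, "reading_level": 0.9}
flagged = outlier_traces(population + [anomaly])
```

No single dimension here is a reliable quality eval on its own, but a trace that is simultaneously unusual on several of them is a strong candidate for human review.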
Distributional provides a guided analytics experience designed to collapse weeks of manual data analysis into fast, structured triage. Instead of spending a week doing data science yourself, insights are presented to you so you can quickly assess whether an issue matters, investigate it with specific evidence, and track whether your fixes worked.
One common pattern is redundant and inefficient tool usage. The demo showed an agent making duplicate Google Maps searches for the same information within a single session. The system detected this pattern, provided specific evidence with exact traces, and suggested fixes ranging from simple prompt changes to implementing caching systems with guardrails.
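A minimal version of this duplicate-call check can be written directly against a session's tool-call log. The session shape and tool names below are hypothetical, chosen to mirror the Google Maps example.

```python
from collections import Counter

# Hypothetical detector for the redundant-tool-usage pattern: flag any
# (tool, arguments) pair invoked more than once in a single session.

def duplicate_tool_calls(session):
    counts = Counter(
        (call["tool"], tuple(sorted(call["args"].items())))
        for call in session["tool_calls"]
    )
    return {key: n for key, n in counts.items() if n > 1}

session = {
    "tool_calls": [
        {"tool": "maps_search", "args": {"query": "coffee near SoMa"}},
        {"tool": "maps_search", "args": {"query": "coffee near SoMa"}},  # duplicate
        {"tool": "maps_search", "args": {"query": "parking near SoMa"}},
    ]
}
dupes = duplicate_tool_calls(session)
```

The same (tool, args) key is also the natural key for a cache, which is why a caching layer with guardrails is one of the suggested fixes.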
Expansion-related failures happen frequently when systems scale. A company might develop their agent with a small group where it scores well on evals, then roll it out to the rest of the company or internationally and everything breaks. Users in different regions ask questions differently. What one country calls “parental leave” another calls “maternity leave,” triggering guardrails inappropriately.
Combinatorial complexity issues emerge as you add more tools. An agent might work well with four MCP servers, but when you add both GitHub and Linear, the system fails in non-linear ways because it can’t distinguish between issues and tickets. The order in which MCP tools load can cause odd behavior.
Edge cases in production that offline evals miss are surprisingly common. Companies report rare errors like 403 responses in 0.2% of traffic that weren’t caught in testing. These rough edges accumulate across agentic systems with their directed graphs of tool calls.
Another pattern is agents getting stuck in loops when they can’t return something, calling the same tool repeatedly. Since these directed graphs aren’t always acyclic, they can go off the rails quickly in ways that are hard to anticipate.
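The stuck-in-a-loop pattern can be checked with a similarly simple heuristic: look for runs of identical calls that exceed some repeat budget. This is an illustrative sketch, with hypothetical tool names; in production you would typically interrupt the agent or break the cycle when a run is detected, not just report it.

```python
# Illustrative loop detector: flag runs where the agent makes the same
# call more than `max_repeats` times in a row.

def detect_loops(calls, max_repeats=3):
    loops, run_start = [], 0
    for i in range(1, len(calls) + 1):
        same = i < len(calls) and calls[i] == calls[run_start]
        if not same:
            # Run ended; record it if it exceeded the repeat budget.
            if i - run_start > max_repeats:
                loops.append((calls[run_start], i - run_start))
            run_start = i
    return loops

calls = [("fetch_ticket", "T-101")] * 5 + [("post_comment", "T-101")]
stuck = detect_loops(calls)
```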
Offline evals alone are insufficient because it’s impossible to anticipate everything that can happen in production. Users will use your system in different ways than you expect, and the foundational models underneath are non-stationary and changing continuously.
Getting 100 percent on your evals doesn’t mean you’ll get 100 percent in the real world. You can pass a kindergarten math test with perfect scores, but that tells you nothing about real-world performance. Just like you could overfit a random forest to get 100 percent on a test set years ago, you can create evals that don’t capture actual system behavior.
The value of production analytics is discovering unknown unknowns that you can then convert into known issues. When you find a pattern in production, you can create an eval to catch that behavior going forward or use it as a reward function in reinforcement learning.
This isn’t a replacement for evals. It’s part of the flywheel. You observe what happens in the real world, discover new failure modes, add those to your eval suite, and continuously improve. The only way to see emergent behaviors is by observing production.
For agentic systems, binary pass or fail percentages don’t even make sense. You’re operating in a continuous behavioral space that you’re trying to guide your agent through. Many weak behavioral signals combined give you better understanding than any single strong eval could provide.
The need escalates as system complexity increases. If you’re building a simple chatbot with one-shot questions, basic monitoring might suffice. But as you move from chatbots to RAG systems to actual agents performing work, you need to climb up the observability hierarchy.
Companies building internal AskHR or AskIT-style systems increasingly add tooling so the system can fix problems directly rather than just pointing users to documentation. When you ask about PTO, it processes the request instead of linking to a website. These systems become combinatorially complex extremely rapidly. Even the router deciding which tool to call becomes interesting to analyze.
Regulated industries particularly value the on-premise, secure-first approach where you own the data and models. This matters for companies that can’t send production data to external services.
The analytics become essential when you care about performance, quality, and behavioral understanding at scale. If you’re just trying to get something to work initially, you don’t need this yet. But when you want to scale and be best in class, you need to understand behavioral patterns.
One clear indicator is when you start seeing unexplainable degradation. A team focused on making email search excellent discovered through observability that 30 percent of search queries were actually for photos, such as screenshots of purchase orders taken on phones. They spent a month optimizing the wrong thing because they lacked visibility into actual usage patterns.
The system integrates through multiple paths. Many agent frameworks already have OpenTelemetry built in, so you can route traces to Distributional the same way you’d route to Datadog or CloudWatch using a write-once, send-many approach.
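In practice, the write-once, send-many pattern often looks like an OpenTelemetry Collector pipeline with one receiver fanning out to several exporters. The fragment below is a sketch, not a verified Distributional setup: the Distributional endpoint is a placeholder, and the Datadog exporter ships in the collector-contrib distribution.

```yaml
# Write-once, send-many: one OTLP receiver, multiple trace backends.
# Endpoints and keys are placeholders, not real values.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlphttp/distributional:
    endpoint: http://distributional.internal:4318  # hypothetical in-cluster address
  datadog:
    api:
      key: ${env:DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/distributional, datadog]
```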
Companies with robust ETL pipelines can ingest data through Parquet files, sit on top of Iceberg tables, or use SQL ingestion if events exist within larger databases. The fundamental goal is taking the richest possible version of your data. If you only have inputs and outputs, some insight is possible, but adding tracing data, user feedback, session-level events, and evals from other tools enables deeper analysis.
The product is free and open source. It's distributed as a Kubernetes cluster for full deployment, or as a K3D cluster within a single Docker image for the sandbox version. The sandbox can run on your laptop and be operational in under an hour.
Distributional never sees your underlying data. Everything runs on-premise in your cloud or bare metal infrastructure. This architecture matters for companies in regulated industries or those with strict data governance requirements.
The system is agnostic to your agent backend, cloud provider, and existing tooling. It bolts on top of whatever traces you’re already writing for logging and monitoring.
History shows that every major paradigm shift in software (web, microservices, mobile) follows the same pattern. Building comes first, then logging to see what happens, then monitoring, and finally analytics to squeeze out maximum value and create amazing products.
The mindset shift is from “I don’t see any problems, therefore they don’t exist” to “I want to know about unknown unknowns.” Stopping testing for a disease doesn’t mean the disease is cured. You need active investigation to understand what’s actually happening.
The ideal customer is a product owner who treats their product as something they care about and want to improve, not as a checklist where everything is fine as long as PagerDuty isn't alerting. As agentic systems provide more enterprise value, more people will need to take on that agency and care.
The goal is helping you play whack-a-mole with issues. There’s always another problem that comes up, whether in parenting, fighting disease, or managing AI systems. Analytics provides the flashlight so you’re not wandering in the dark or only looking at your evals while ignoring the rest of the room.
You optimize what you measure, but you can only measure what you know to look for. Analytics helps you see more, measure more, and hopefully optimize for the right things rather than local maxima in your eval suite.
The final message: don’t be pigeonholed into only the specific evals you’re looking at. Look around and try to find as many unknowns as possible, because the big solutions are usually behavioral investments in tooling and understanding rather than tweaking words in system prompts.
Distributional is a free, open, and installable platform for agent analytics. Try it today and quickly learn how it complements your existing agent observability stack. We are also always happy to learn more about your use case and enterprise needs, so reach out to contact@distributional.com with any questions.

