Examples of issues that break agents in production
Written by
Nick Payton
When we discuss how our production agent analytics work, one of the first questions we normally get is, “Will you give me a concrete example?” The question comes up because we are usually distinguishing between monitoring and analytics.
Monitoring is useful for checking thresholds on known issues, opportunities, or evals in near real time. Analytics are useful for finding signals of unknown issues or opportunities that are hidden in production agent traces. AI product teams use analytics to know what to monitor.
In this post, we’ll share a few examples of issues that analytics would help identify. These issues were unknown before the analysis, involve relatively nuanced correlations across multiple metrics, or both. The goal is to give you more intuition on what you’d get from using DBNL on your production traces.
This post was heavily inspired by a conversation between Scott Clark, Distributional co-founder and CEO, and Jason Liu of the Developer Experience team at OpenAI. As Scott put it in that discussion, “If you don’t run the test for a disease, it doesn’t mean you don’t have it.” Analytics is about finding those issues (or opportunities) you don’t know you have. The recording of the conversation is worth watching, especially minutes 30-40.
Examples of production issues that analytics can catch
Offline, pre-deployment evals that fail to perform out of sample in production
Your agent gets a 100% pass rate on an eval, but the eval is a kindergarten-level math test. When users ask questions beyond kindergarten-level math, the agent underperforms.
Your agent is evaluated on the rate at which it catches negativity in an input and passes it to a specific guardrail. When it scales in production, however, it also tends to classify misspellings as negative, which sends far too many inputs to the guardrail and creates a poor user experience.
You design your agent around a concentrated group of users in a few countries. Then you scale your agent internationally and discover edge cases that degrade performance in ways your pre-production evals did not anticipate.
You correlate low feedback scores with topics and find that there is a subset of tasks the agent fails to perform well. You re-engineer the tools to account for this, and add these tasks to your eval set for future development and monitoring.
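To make the last example more concrete, here is a minimal sketch of correlating feedback scores with topics. It assumes each trace has already been labeled with a topic and carries a numeric feedback score; the field names `topic` and `feedback`, the threshold, and the minimum count are all hypothetical choices for illustration, not part of DBNL.

```python
from collections import defaultdict
from statistics import mean


def low_scoring_topics(traces, threshold=3.0, min_count=2):
    """Return {topic: mean feedback} for topics whose mean feedback is below threshold.

    Topics with fewer than min_count traces are skipped so that a single
    bad rating does not flag a topic on its own.
    """
    scores_by_topic = defaultdict(list)
    for trace in traces:
        scores_by_topic[trace["topic"]].append(trace["feedback"])
    return {
        topic: mean(scores)
        for topic, scores in scores_by_topic.items()
        if len(scores) >= min_count and mean(scores) < threshold
    }


# Hypothetical traces: topic labels and 1-5 feedback scores are made up.
traces = [
    {"topic": "billing", "feedback": 1},
    {"topic": "billing", "feedback": 2},
    {"topic": "password reset", "feedback": 5},
    {"topic": "password reset", "feedback": 4},
    {"topic": "refunds", "feedback": 1},
]
flagged = low_scoring_topics(traces)
print(flagged)
```

Topics flagged this way are candidates for the eval set, which is how an unknown production issue becomes something you can monitor.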
Insights in production on user topics, intent, and input patterns
You’ve scaled an internal multi-turn chat RAG application across multiple functional areas (e.g., HR, IT, finance). Now you lack visibility into how people are using this system. Without insights on topics, it is hard to add support for new kinds of queries or to evaluate how the agent is performing on them.
Your agent goes viral in Turkey and the majority of prompts are now in Turkish. Because you wrote the system prompt in English, however, the agent responds in Turkish only about half the time and in English the rest of the time, creating a bad user experience.
You launch an agent with a marketing campaign targeting developers as the expected user base. Instead, managers find the agent more valuable and start to use it more than developers do. Due to this shift, you need to rework the prompt, context, and tools to cater to this manager-level user base.
You build an agent for productivity software workflows based largely on the assumption that email is the most important medium. This includes a lot of work iterating on email search tooling to make sure it is a strong aspect of the product. In production, you discover that more than 30% of search queries are based on photos (someone snapping a shot of a receipt, for example) and not email. The agent is not designed to perform in this scenario, resulting in a poor user experience.
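As an illustration of the Turkish example above, the sketch below measures how often the response language differs from the input language, broken down by input language. It assumes a language code has already been attached to each trace by some upstream detector; the field names `input_lang` and `output_lang` are hypothetical.

```python
from collections import Counter


def language_mismatch_rates(traces):
    """Return {input language: share of traces answered in a different language}."""
    totals = Counter()
    mismatches = Counter()
    for trace in traces:
        totals[trace["input_lang"]] += 1
        if trace["input_lang"] != trace["output_lang"]:
            mismatches[trace["input_lang"]] += 1
    return {lang: mismatches[lang] / totals[lang] for lang in totals}


# Hypothetical traces: Turkish prompts answered in English half the time.
traces = [
    {"input_lang": "tr", "output_lang": "tr"},
    {"input_lang": "tr", "output_lang": "en"},
    {"input_lang": "tr", "output_lang": "tr"},
    {"input_lang": "tr", "output_lang": "en"},
    {"input_lang": "en", "output_lang": "en"},
    {"input_lang": "en", "output_lang": "en"},
]
rates = language_mismatch_rates(traces)
print(rates)
```

The same counting pattern applies to the other examples in this group, such as the share of queries by user role or by input medium (photo versus email).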
Issues in the complexity of agent behavior
You develop an agent with a variety of LLM calls and access to a diverse set of tools. You organize the agent in a workflow with a relatively structured framework. When you push to production, you notice 403 errors whenever a specific prompt leads the agent to call a specific tool. You realize you didn’t include a retry step in that process, so the call consistently fails on the first attempt, driving the error.
You correlate tool calls, token cost, topics, and response quality metrics, and discover that there is a specific topic where the agent is inefficiently calling a tool that is too expensive.
You find a spike in user frustration. Once you correlate this with topics, you discover that users are frustrated because their queries are being routed to a guardrail when they should actually be answered. You redesign the guardrail to avoid overcorrection.
You find interesting queries that result in interesting agent behavior, and use these traces to evolve your reward function for reinforcement learning.
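The cost example above (correlating tool calls, token cost, and topics) can be sketched as a simple aggregation. This assumes each tool-call span has been tagged with a topic and a token count; the field names `topic`, `tool`, and `tokens`, and the tool names in the sample data, are hypothetical.

```python
from collections import defaultdict


def cost_by_topic_and_tool(spans):
    """Rank (topic, tool) pairs by mean tokens per call, most expensive first.

    Each returned row is (topic, tool, number of calls, mean tokens per call).
    """
    totals = defaultdict(lambda: {"calls": 0, "tokens": 0})
    for span in spans:
        key = (span["topic"], span["tool"])
        totals[key]["calls"] += 1
        totals[key]["tokens"] += span["tokens"]
    rows = [
        (topic, tool, agg["calls"], agg["tokens"] / agg["calls"])
        for (topic, tool), agg in totals.items()
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical spans: one topic triggers an unusually expensive tool call.
spans = [
    {"topic": "invoices", "tool": "web_search", "tokens": 9000},
    {"topic": "invoices", "tool": "web_search", "tokens": 11000},
    {"topic": "invoices", "tool": "sql_lookup", "tokens": 500},
    {"topic": "onboarding", "tool": "web_search", "tokens": 1000},
]
ranked = cost_by_topic_and_tool(spans)
for row in ranked:
    print(row)
```

Grouping on the pair rather than on the tool alone is the point: here the same tool is cheap for one topic and expensive for another, which a per-tool average would hide.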
Next
AI product teams use Distributional to address these problems, and many more. Distributional is a free, open, and installable platform for agent analytics. Try it today and quickly learn how it complements your existing agent observability stack. We are also always happy to learn more about your use case and enterprise needs, so reach out to contact@distributional.com with any questions.