“We scaled our chat experience to 15,000 internal users across dozens of use cases. We introduced an agentic router to boost this user experience, but have gotten mixed feedback on it. Most importantly, we don’t have visibility into the performance of the router—or the underlying AI use cases in general.” Sound familiar?
AI products are unreliable in production. They break in ways that are hard to debug. They have multiple components, tools, context sources, and retrieval steps, each of which introduces its own variables. They are inconsistent, answering the same question with different responses from one run to the next, and they may go off the rails entirely. Their performance is hard to measure. All of this makes it challenging to know whether you are happy with your AI product, or how to improve it over time.
Most AI teams take some measures to overcome these challenges. But how do you know when these measures fall short? How do you know you have an AI product improvement problem? How do you know your AI product feedback loop is broken? Here are a few signs.
As you developed your AI product, you built and optimized against evals that represented expected product usage, but these broke the moment you pushed to production. As you scaled usage in production, the gap between your evals and the real user experience kept growing, making it hard to know how to improve your product or avoid degrading the experience. You stood up monitoring on some of these evals as performance checks, but those summary statistics hid issues that later surfaced as an uptick in negative user feedback. You applied product analytics to understand usage, but that tooling lacked insight into sessions, tool calls, data sources, inputs, and model behavior. Your data science team produced valuable insights on product usage, but the episodic nature of that analysis meant you missed important shifts in trends. Your stack may be robust, but it is incomplete.
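To make the point about summary statistics concrete, here is a minimal sketch in Python, using hypothetical per-session eval scores and an illustrative failure threshold rather than any specific monitoring tool, of how a mean-based check can look stable while a cluster of failing sessions grows underneath it:

```python
import statistics

# Hypothetical per-session eval scores for two weeks of production traffic.
# The values, threshold, and variable names are illustrative only.
last_week = [0.90, 0.88, 0.92, 0.91, 0.89, 0.90, 0.93, 0.87, 0.91, 0.89]
this_week = [0.98, 0.97, 0.99, 0.96, 0.98, 0.97, 0.99, 0.55, 0.52, 0.58]

# The summary statistic a dashboard typically alerts on barely moves...
print(f"mean last week: {statistics.mean(last_week):.2f}")  # 0.90
print(f"mean this week: {statistics.mean(this_week):.2f}")  # 0.85

# ...while a distributional check surfaces a new cluster of failing sessions.
def failure_rate(scores, threshold=0.7):
    return sum(s < threshold for s in scores) / len(scores)

print(f"failure rate last week: {failure_rate(last_week):.0%}")  # 0%
print(f"failure rate this week: {failure_rate(this_week):.0%}")  # 30%
```

Checking the full distribution of scores, rather than a single average, is what turns negative feedback from a surprise into something you can catch early.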
You started with an enterprise chat experience, then added RAG to elevate response quality for some of the highest-priority prompts. Your usage has since scaled organically, but you lack insight into which use cases are most prominent, best performing, and most liked by your user base. You need product insights to decide which use cases to prioritize for full support from a centralized AI platform or data team, but you lack a data-driven way to perform this analysis. You want to develop these top-priority AI products for internal use, backed by cost or revenue business cases, and then scale them out to external customers once they are proven.
You have a variety of AI applications in place, each with its own bespoke monitoring to track basic metrics like token usage and human feedback. Occasionally, you also sample traces to try to understand behavior in greater depth. This is starting to break as you scale in two ways: usage of each application and the number of applications. You need a centralized, standard view into the behavior and usage patterns of each application so you can run an apples-to-apples comparison. This matters for product decisions, cross-team resource allocation, and regulatory purposes.
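As one way to picture that centralized view, here is a minimal sketch assuming a hypothetical shared schema (all field and application names invented for illustration) that every application reports into, so the same fields can be compared across applications:

```python
from dataclasses import dataclass

# A hypothetical shared usage record that every application emits, so the
# same fields can be compared across apps. Field names are illustrative.
@dataclass
class AppUsageRecord:
    app_name: str
    period: str            # reporting period, e.g. a month
    sessions: int
    tokens_in: int
    tokens_out: int
    thumbs_up: int
    thumbs_down: int
    sampled_traces: int    # traces kept for deeper behavioral review

    @property
    def feedback_score(self) -> float:
        total = self.thumbs_up + self.thumbs_down
        return self.thumbs_up / total if total else 0.0

    @property
    def tokens_per_session(self) -> float:
        return (self.tokens_in + self.tokens_out) / self.sessions

# With every app emitting the same record, comparison becomes a simple sort
# rather than a reconciliation of bespoke dashboards.
records = [
    AppUsageRecord("support-chat", "2024-06", 12000, 9_500_000, 2_100_000, 900, 140, 300),
    AppUsageRecord("doc-summarizer", "2024-06", 3500, 4_200_000, 800_000, 310, 95, 120),
]
for r in sorted(records, key=lambda r: r.feedback_score, reverse=True):
    print(f"{r.app_name}: feedback {r.feedback_score:.0%}, "
          f"{r.tokens_per_session:,.0f} tokens/session")
```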
You have successfully scaled a subset of AI applications and are now trying to make them more robust by upgrading various components, whether adding data sources, an agentic router, or new tooling, swapping out the model itself, or changing other functionality. As you go through this process and try to evaluate each change, you notice that users complain about issues that aren't represented in your evals. This makes it hard to ship any of these changes and know whether the user experience has actually improved or degraded.
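One common way to close that gap, sketched below with hypothetical data structures rather than any particular eval framework, is to turn user-flagged production sessions into regression cases that every component upgrade is checked against:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    reference: str      # expected or previously accepted answer
    source: str         # "curated" or "user_reported"

curated_cases = [
    EvalCase("Summarize the attached policy.", "A concise policy summary.", "curated"),
]

def harvest_flagged_sessions(sessions):
    """Turn negatively rated production sessions into eval cases."""
    return [
        EvalCase(s["prompt"], s["accepted_answer"], "user_reported")
        for s in sessions
        if s.get("feedback") == "thumbs_down" and s.get("accepted_answer")
    ]

# Hypothetical flagged sessions pulled from production logs.
flagged = [
    {"prompt": "Route this billing question.", "feedback": "thumbs_down",
     "accepted_answer": "Hand off to the billing agent with account context."},
]

eval_suite = curated_cases + harvest_flagged_sessions(flagged)
# Run eval_suite before and after the router, model, or tooling change,
# and compare results case by case rather than only in aggregate.
```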
You regularly discover issues with your AI product, mostly from user feedback and occasionally through evals. But when you try to understand what is driving an issue, you run into dead ends. It can take days to resolve even simple issues, consuming time that could otherwise be spent building a better product.
Any of these is a sign that you have a broken AI product feedback loop. The complexity, non-determinism, and non-stationarity of AI products leave gaps in your cycle. These attributes are what make these systems so powerful, but also so hard to understand, improve, and scale.
If you have any of these problems, install our full product for free today to start building a clear AI product feedback loop. And I’m always happy to discuss more, so reach out at nick-dbnl@distributional.com.