For teams scaling AI applications in production, adaptive testing is critical for consistency and reliability. But testing these applications is not trivial. Because of the nature of the embedding process and the subsequent recovery of text from embedding space, these apps behave stochastically and require statistical analysis and testing. Additionally, once an app is running in production, testing automation becomes necessary, since teams can no longer manually test every possible usage and analyze every possible response.
Distributional provides an adaptive testing platform that is uniquely designed to address these needs. Distributional's testing strategy is based on analyzing recent production usage to test for the consistency of app behavior as a whole, while also providing mechanisms to alert users to behavioral deviations and to surface interpretable evidence that helps users understand what has occurred.
In this article, we’ll dive into how Distributional tests for consistent app behavior in production.
At its core, the Distributional platform uses a testing strategy that embraces the idea that every time an AI app is used, both the prompt (input) and response (output) are random variables drawn from some distribution. The goal is to analyze the behavioral consistency of the app. For example, has the app responded to questions on a specific topic differently today than yesterday?
The platform then surfaces notable deviations to users in an unsupervised way. By design, it does not pass judgment on whether the responses are correct or not—rather, Distributional gives users the ability to introspect and apply supervision by passing judgment based on their own expertise and their use cases. The platform is able to analyze input/output consistency from an app, as well as consistency across intermediate data generated by the app.
To formalize this testing strategy, capital letters represent random variables and lowercase letters represent realizations of those random variables or other deterministic quantities; for example, I denotes the input to the app and t denotes a fixed time window. Conditional notation such as I | t, the distribution of inputs observed during the time window t, facilitates the analysis in Distributional.
Distributional's core testing functionality studies a null hypothesis, H0, that the app's behavior, taken as the joint distribution of its inputs and outputs, is the same across two time windows, t_b and t_e.
In general, t_e is a recent time period and t_b is a fixed time window from the past; for example, comparing usage from the past 24 hours to usage from last Monday.
Distributional then helps users understand whether they should reject H0 by presenting relevant evidence of behavioral deviations based on logged app usage during t_b and t_e. If the user finds this evidence compelling, notifications can be created to identify such behavior in the future.
This evidence can also be used to consider alternate, more targeted null hypotheses: one that asks only whether the inputs to the app have changed, and another that asks only whether the outputs have changed given the input distribution. These are sketched in notation below.
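One rough way to write these hypotheses in the notation above, using O for the app's output (a symbol this post does not define) and ~ for equality in distribution, is:

H0: (I, O) | t_b ~ (I, O) | t_e (the app's overall behavior is unchanged)

H0: I | t_b ~ I | t_e (only the inputs are unchanged)

H0: O | I, t_b ~ O | I, t_e (the outputs given the inputs are unchanged)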
Under special circumstances, distributional consistency could be analyzed with classical methods. For example, the t-test would be a logical strategy for assessing the consistency of normally distributed data. If we were studying only the inputs I | t, and they consisted of a single numerical value that was normally distributed, then we could analyze whether E[I | t_b] ≠ E[I | t_e] with a t-test. But in the text-first world of generative AI, it is unlikely that such parametric analysis will ever be sufficient.
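As a concrete illustration of that special case, here is a minimal sketch (not Distributional's implementation) that uses scipy to compare a single numeric input value logged during the two windows; the variable names and data are stand-ins.

```python
# A minimal sketch, not Distributional's implementation: a two-sample t-test on
# one normally distributed numeric input, comparing a baseline window t_b
# against a recent window t_e. Variable names here are illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-ins for a single numeric input value logged during each time window.
inputs_tb = rng.normal(loc=100.0, scale=15.0, size=500)  # baseline window t_b
inputs_te = rng.normal(loc=108.0, scale=15.0, size=480)  # recent window t_e

# Welch's t-test: is there evidence that E[I | t_b] differs from E[I | t_e]?
t_stat, p_value = stats.ttest_ind(inputs_tb, inputs_te, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Evidence against H0: the mean of this input appears to have shifted.")
else:
    print("No strong evidence against H0 for this input.")
```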
Nonparametric analysis of distributional similarity is well addressed for a single continuous or discrete random variable by tools such as the Kolmogorov-Smirnov statistic or the chi-squared statistic, respectively. Tools such as the Kullback-Leibler (KL) divergence provide a way to measure dissimilarity between random variables when their distributions are known.
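For intuition, the sketch below (again illustrative, not the platform's code) shows how these classical tools are typically applied with scipy: a Kolmogorov-Smirnov test for a continuous quantity, a chi-squared test for a categorical one, and a KL divergence between two estimated discrete distributions.

```python
# Illustrative use of classical distributional-similarity tools with scipy;
# none of these variable names or metrics come from Distributional's API.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# 1) Kolmogorov-Smirnov: compare samples of one continuous quantity across windows.
latency_tb = rng.gamma(shape=2.0, scale=0.5, size=400)
latency_te = rng.gamma(shape=2.0, scale=0.7, size=400)
ks_stat, ks_p = stats.ks_2samp(latency_tb, latency_te)
print(f"KS statistic = {ks_stat:.3f}, p = {ks_p:.4f}")

# 2) Chi-squared: compare counts of one categorical quantity (e.g., topic labels).
counts_tb = np.array([120, 80, 50])  # observed topic counts in t_b
counts_te = np.array([90, 110, 60])  # observed topic counts in t_e
chi2, chi_p, dof, _ = stats.chi2_contingency(np.vstack([counts_tb, counts_te]))
print(f"chi-squared = {chi2:.2f}, dof = {dof}, p = {chi_p:.4f}")

# 3) KL divergence: requires known (or estimated) distributions, here taken as
#    normalized category frequencies from each window.
p = counts_tb / counts_tb.sum()
q = counts_te / counts_te.sum()
kl = stats.entropy(p, q)  # KL(p || q)
print(f"KL divergence = {kl:.4f}")
```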
However, these tools alone are generally insufficient to analyze the consistency of observed app behavior: they apply to a single continuous or discrete random variable at a time, and divergence measures such as KL require the underlying distributions to be known, neither of which holds for the high-dimensional, unstructured data that generative AI apps produce. Nevertheless, Distributional does incorporate these quantities as facets that help quantify how severely an app's recent behavior has deviated from previously observed behavior.
Distributional is designed to empower users to make the final judgment about whether a significant or worrisome behavioral deviation has occurred. This means the platform must present clear, understandable evidence of deviations to users.
The analysis of H0 is powered by interpretable evaluation (eval) metrics. Distributional provides a set of built-in evals with a range of structures.
Furthermore, any additional quantities that users already compute can be sent to Distributional to be incorporated into the analysis of H0.
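To make this concrete, here is a small sketch of the kind of interpretable, user-computed quantities that could be logged for each response and then compared across the two time windows. The metric definitions and function names below are purely illustrative; they are not Distributional's built-in evals or API.

```python
# Illustrative eval metrics computed over logged responses from two time windows.
# These metric definitions are examples only, not Distributional's built-in evals.
from typing import Dict, List
from scipy import stats


def eval_metrics(response: str) -> Dict[str, float]:
    """Compute a few simple, interpretable quantities for one app response."""
    words = response.split()
    return {
        "word_count": float(len(words)),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "contains_apology": float("sorry" in response.lower()),
    }


def compare_windows(responses_tb: List[str], responses_te: List[str]) -> None:
    """Compare each metric's distribution between the baseline and recent window."""
    metrics_tb = [eval_metrics(r) for r in responses_tb]
    metrics_te = [eval_metrics(r) for r in responses_te]
    for name in metrics_tb[0]:
        a = [m[name] for m in metrics_tb]
        b = [m[name] for m in metrics_te]
        stat, p_value = stats.ks_2samp(a, b)
        print(f"{name}: KS = {stat:.3f}, p = {p_value:.4f}")


# Toy usage with stand-in logs for the baseline and recent windows.
compare_windows(
    responses_tb=["The order ships Monday.", "Sorry, I cannot help with that."] * 50,
    responses_te=["Your order will arrive within five business days."] * 100,
)
```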
Distributional is building the modern enterprise platform for adaptive testing to make AI safe, secure and reliable. As the power of AI applications grows, so does the risk of harm. By taking a proactive, adaptive testing approach with Distributional, AI teams can deploy AI applications with more confidence and catch issues before they cause significant damage in production.
Learn more about Distributional’s testing strategy by downloading our tech paper on Distributional’s Approach to AI Testing.