
How Distributional tests for consistent app behavior in production

Written by 
Harvey Cheng

For teams scaling AI applications in production, adaptive testing is critical for consistency and reliability. But testing these applications is not trivial. Because text is mapped into embedding space and then generated back out of it, app behavior is inherently nondeterministic, so these apps require statistical analysis and testing. Additionally, once an app is running in production, testing automation becomes necessary: teams can no longer manually test every possible usage and analyze every possible response.

Distributional provides an adaptive testing platform that is uniquely designed to address these needs. Distributional’s testing strategy is based on analyzing recent production usage to test for consistency of app behavior as a whole, while also providing mechanisms to both alert users to behavioral deviations and provide interpretable evidence for users to understand what has occurred.

In this article, we’ll dive into how Distributional tests for consistent app behavior in production.

Strategy to test for consistent app behavior in production

At its core, the Distributional platform uses a testing strategy that embraces the idea that every time an AI app is used, both the prompt (input) and response (output) are random variables drawn from some distribution. The goal is to analyze the behavioral consistency of the app. For example, has the app responded to questions on a specific topic differently today than yesterday? 

The platform then surfaces notable deviations to users in an unsupervised way. By design, it does not pass judgment on whether the responses are correct or not—rather, Distributional gives users the ability to introspect and apply supervision by passing judgment based on their own expertise and their use cases. The platform is able to analyze input/output consistency from an app, as well as consistency across intermediate data generated by the app.

To formalize this testing strategy, let capital letters represent random variables and lowercase letters represent realizations of those random variables or other deterministic quantities:

  • 𝐼 – the distribution of possible inputs (prompts) to the app
  • 𝑖 – an observed set of prompts
  • 𝑂 – the distribution of possible outputs (responses) from the app
  • 𝑜 – an observed set of outputs
  • 𝑡 – time span over which inputs and outputs are considered

Next, some conditional notation to facilitate analysis in Distributional: 

  • I | t1 – the possible inputs over time span t1
  • I, O | t1 – the possible inputs and outputs over time span t1
  • O | t1, I – the possible outputs over time span t1, given the inputs I
  • tb – a baseline time period against which recent usage is compared
  • te – a recent (evaluation) time period whose usage is tested against the baseline

Distributionalʼs core testing functionality studies the following hypothesis test:

  • H0 – I, O | te and I, O | tb are the same distribution
  • H1 – They are not the same distribution

In general, te is a recent time period and tb is a fixed time window from the past; for example, usage from the past 24 hours might be compared to usage from last Monday.

Distributional then helps users understand whether they should reject H0 by presenting relevant evidence of behavioral deviations based on logged app usage between tb and te. If the user finds this evidence compelling, notifications can be created to identify such behavior in the future.

This evidence can also be used to consider alternate, more targeted null hypotheses, such as:

  • H0: I | te and I | tb are the same distribution

which asks only whether the inputs to the app have changed or

  • H0: O | I, te and O | I, tb are the same distribution

which asks only whether the outputs have changed given the input distribution.
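To make the framing concrete, here is a minimal sketch (not Distributional's implementation) of how such a hypothesis test could be run over logged usage: each logged call is reduced to a single numeric eval metric, and the baseline and recent samples are compared with a two-sample Kolmogorov-Smirnov test. The record fields, the response-length metric, and the choice of test are all illustrative assumptions.

    # Minimal sketch, not Distributional's implementation.
    # Each logged call is reduced to one numeric eval metric (here, response
    # length in words); a two-sample Kolmogorov-Smirnov test then asks whether
    # the baseline and recent samples plausibly share the same distribution.
    from scipy.stats import ks_2samp

    def eval_metric(record: dict) -> float:
        # Hypothetical per-record eval: response length in words.
        return float(len(record["response"].split()))

    def test_consistency(baseline_logs: list[dict], recent_logs: list[dict],
                         alpha: float = 0.01) -> dict:
        baseline = [eval_metric(r) for r in baseline_logs]  # sample from I, O | tb
        recent = [eval_metric(r) for r in recent_logs]      # sample from I, O | te
        statistic, p_value = ks_2samp(baseline, recent)
        return {
            "ks_statistic": statistic,
            "p_value": p_value,
            "reject_h0": p_value < alpha,  # flag a potential behavioral deviation
        }

The same pattern covers the more targeted hypotheses: restrict the metric to input-only quantities to probe I | t, or hold the set of inputs fixed to probe O | I, t.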

Background on distributional consistency

Distributional consistency could be naturally analyzed under special circumstances. For example, the classic method of the t-test would be a logical strategy for considering consistency of normally distributed data. If we were studying only the inputs I | t and they consisted of only a single numerical value that was normally distributed, then we could analyze whether E[I | tb] ≠ E[I | te] with a t-test. But in the text-first world of generative AI, it is unlikely that such parametric analysis will ever be sufficient.
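As a minimal sketch of that special case (the numeric values below are made up for illustration), the comparison is a standard two-sample t-test:

    # Special-case sketch: a single, roughly normal numeric input quantity
    # compared across windows with a classic two-sample t-test.
    from scipy.stats import ttest_ind

    baseline_inputs = [412.0, 388.5, 401.2, 395.0, 420.3]  # I | tb (made-up values)
    recent_inputs = [455.1, 448.0, 462.7, 450.9, 458.3]    # I | te (made-up values)

    t_stat, p_value = ttest_ind(baseline_inputs, recent_inputs)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p suggests the means differ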

A nonparametric analysis of distributional similarity has been well addressed for a single continuous or discrete random variable by tools such as the Kolmogorov-Smirnov statistic or Chi-squared statistic, respectively. Tools such as the Kullback-Leibler (KL) divergence provide a strategy to measure dissimilarity between random variables when the distribution of those random variables is known.
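As a rough sketch of those tools (the category counts below are illustrative), a chi-squared test can compare a categorical eval across windows, and the KL divergence can score the dissimilarity of two known discrete distributions:

    # Illustrative sketch of the nonparametric tools mentioned above.
    import numpy as np
    from scipy.stats import chi2_contingency, entropy

    # Counts of a categorical eval (e.g., a response-topic label) in each window.
    baseline_counts = np.array([120, 45, 35])  # tb (made-up counts)
    recent_counts = np.array([80, 70, 50])     # te (made-up counts)

    # Chi-squared test of homogeneity between the two windows.
    chi2, p_value, _, _ = chi2_contingency(np.vstack([baseline_counts, recent_counts]))

    # KL divergence between the two empirical category distributions.
    kl = entropy(baseline_counts / baseline_counts.sum(),
                 recent_counts / recent_counts.sum())
    print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, KL = {kl:.4f}")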

However, these tools alone are generally insufficient to analyze the consistency of observed app behavior since:

  • the nature of text is neither numerical nor categorical;
  • we desire to study multiple variables representing characteristic behavior of the text simultaneously; and
  • we are unable to proactively sample data (e.g., for the KL divergence) given fixed historical logs.

Nevertheless, Distributional does incorporate these quantities as facets that help define how severely an app’s recent behavior has deviated from previously observed behavior.

Evidence of a perceived deviation in behavior may be summarized with a statement such as “Distribution moderately drifted to the left,” but users can dig deeper to interrogate that evidence and judge if this change in behavior is worrisome.

Incorporating interpretable evaluation metrics

Distributional is designed to empower users to ultimately make the judgment whether a significant or worrisome behavioral deviation has occurred. This means the platform must clearly provide understandable evidence of deviations to users.

The analysis of H0 is powered by interpretable evaluation (eval) metrics. Distributional provides a set of built-in evals of different structures:

  • Locally computed classical NLP quantities such as reading level
  • LLM-as-judge style quantities leveraging the user’s choice of model
  • Hooks for users to create their own LLM-as-judge quantities for submission to Distributional
  • RAG-specific quantities to help analyze the behavior of the retrieval process

Furthermore, any additional quantities that users already compute can be sent to Distributional to be incorporated into the analysis of H0.
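As an illustration of what such locally computed evals might look like (a sketch only, not Distributional's built-in implementation; textstat is an assumed third-party library for reading level), each response can be reduced to a handful of interpretable numbers that then feed the distributional comparisons above:

    # Sketch of locally computed, interpretable eval metrics per response.
    # textstat is an illustrative third-party choice for reading level,
    # not necessarily what Distributional uses internally.
    import textstat

    def local_evals(response: str) -> dict:
        # Two simple, interpretable facets of a response.
        return {
            "word_count": len(response.split()),
            "reading_level": textstat.flesch_kincaid_grade(response),
        }

    responses = [
        "Our refund policy allows returns within 30 days of purchase.",
        "Please contact support to reset your account password.",
    ]  # recent logged responses (illustrative)
    recent_metrics = [local_evals(r) for r in responses]

Each such metric then becomes one more facet whose baseline and recent distributions are compared.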

Download the full tech paper

Distributional is building the modern enterprise platform for adaptive testing to make AI safe, secure and reliable. As the power of AI applications grows, so does the risk of harm. By taking a proactive, adaptive testing approach with Distributional, AI teams can deploy AI applications with more confidence and catch issues before they cause significant damage in production.

Learn more about Distributional’s testing strategy by downloading our tech paper on Distributional’s Approach to AI Testing.
