As AI systems grow more complex, the methodologies used to evaluate them must evolve accordingly. Traditional machine learning models are largely deterministic: the same input yields the same output. Machine learning also has a long history of model evaluation and benchmarking, making it relatively straightforward to evaluate the performance of production ML systems over time. Assessing the performance of generative AI systems, on the other hand, introduces new complexities.
Modern AI applications, which may contain GenAI models such as GPT-4o or Claude 4 Sonnet, have a number of complex properties, such as:
The evaluation metrics for GenAI systems look very different from those in ML. They are functions that take text as input, such as BLEU or ROUGE, though they seek to answer many of the same questions:
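As a concrete illustration of such text-based metrics, here is a minimal sketch that computes BLEU and ROUGE for a single generated response. It assumes the nltk and rouge-score Python packages are installed; the reference and candidate strings are purely illustrative.

```python
# A minimal sketch of text-based metrics, assuming the nltk and rouge-score
# packages are installed; the reference and candidate strings are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# BLEU: n-gram overlap between the candidate and one or more references.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented overlap (rouge1 = unigrams, rougeL = longest common subsequence).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```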
Another way evaluation of GenAI systems differs is that humans (and their opinions) are often involved through the creation of golden, or validation, datasets. These are limited: a golden dataset assembled during development may not reflect the true distribution of inputs (how users actually interact with the system) over time, and it tends to capture only the conditions that were anticipated, while real-world usage can deviate substantially from them.
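As a rough sketch of what golden-dataset evaluation looks like in practice, the snippet below scores system outputs against a tiny curated set using ROUGE-L. The generate function and the prompt/expected pairs are hypothetical placeholders, not a real integration.

```python
# A rough sketch of evaluating against a golden (validation) dataset.
# `generate` and the prompt/expected pairs below are hypothetical placeholders.
from rouge_score import rouge_scorer

golden_set = [
    {"prompt": "Summarize the ticket: printer offline after update.",
     "expected": "The printer went offline after a recent update."},
    {"prompt": "Summarize the ticket: login fails with 2FA enabled.",
     "expected": "Users cannot log in when two-factor authentication is on."},
]

def generate(prompt: str) -> str:
    # Stand-in for a call to the GenAI system under evaluation.
    return "The printer went offline following an update."

# Score each generated output against the curated expected answer.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = [
    scorer.score(row["expected"], generate(row["prompt"]))["rougeL"].fmeasure
    for row in golden_set
]
print(f"Mean ROUGE-L over the golden set: {sum(scores) / len(scores):.3f}")
```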
GenAI systems need to be evaluated through distributional testing. In contrast to ML, where tracking the distribution of a single predicted value is straightforward, the output of LLMs is text-based (structured outputs can also contain other metadata), and there are many metrics derived from LLM inputs and outputs that can be tracked over time. Some of these metrics include word count, toxicity, and ROUGE and BLEU scores (check out a list of standard off-the-shelf LLM eval metrics here).
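One simple way to sketch this kind of distributional check: compute a derived metric (here, word count) over two windows of outputs and compare the distributions with a two-sample Kolmogorov-Smirnov test from scipy. The baseline and recent output lists below are hypothetical.

```python
# A sketch of distribution-level monitoring for one derived metric (word count),
# comparing a baseline window to a recent production window with a two-sample
# Kolmogorov-Smirnov test. The output lists below are hypothetical.
from scipy.stats import ks_2samp

baseline_outputs = [
    "Here is a concise answer to your question.",
    "The report covers three quarters of revenue data.",
    "Please restart the service and check the logs.",
]
recent_outputs = [
    "Sure.",
    "Done.",
    "I restarted the service, checked the logs, reviewed the config, and more.",
]

def word_counts(outputs):
    return [len(text.split()) for text in outputs]

stat, p_value = ks_2samp(word_counts(baseline_outputs), word_counts(recent_outputs))
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.3f}")
# A small p-value flags a shift in the word-count distribution; the same pattern
# applies to toxicity, ROUGE/BLEU, and other metrics derived from LLM I/O.
```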
Distributional helps teams test input/output distributions over time, gaining insight into exactly how, where, and why their GenAI systems have shifted. Get in touch with our team to learn more about Distributional.
To learn more about model evaluation, watch the full live talk recording below.