
Model Evaluation: From ML to GenAI

Written by 
Erin LeDell

As AI systems grow more complex, the methodologies used to evaluate them must evolve accordingly. Traditional machine learning models are largely deterministic: the same input yields the same output. Machine learning also has a long history of model evaluation and benchmarking, which makes it relatively straightforward to track the performance of production ML systems over time. Assessing the performance of generative AI systems, on the other hand, introduces new complexities.

Modern AI applications, which may contain GenAI models such as GPT-4o or Claude 4 Sonnet, have a number of complex properties, such as:

  • AI applications are multi-component systems where changes in one part can affect others in unexpected ways. For instance, a change in the vector database could affect the LLM’s responses, or updates to a feature pipeline could impact the machine learning model’s predictions.
  • AI applications are non-stationary, meaning their behavior changes over time even if the code doesn’t change. This happens because the world they interact with changes—new data comes in, language patterns evolve, and third-party models get updated. A test that passes today might fail tomorrow, not because of a bug, but because the underlying conditions have shifted.
  • AI applications are non-deterministic. Even with the exact same input, they might produce different outputs each time. Think of asking an LLM the same question twice: you might get two different, but equally valid, responses. This makes it impossible to write traditional tests that expect exact matches, as the sketch below illustrates.
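
To make the last point concrete, here is a minimal sketch of the problem and one common workaround: asserting on properties of the output rather than on the exact string. The `generate` function is a hypothetical stand-in for a call to an LLM, not part of any particular library.

```python
# Hypothetical stand-in for a call to a non-deterministic LLM.
def generate(prompt: str) -> str:
    raise NotImplementedError("replace with a real model call")

# A traditional test: brittle, because the same prompt can produce a
# different (but equally valid) answer on every run.
def test_exact_match():
    answer = generate("What is the capital of France?")
    assert answer == "The capital of France is Paris."  # fails intermittently

# A property-based check: assert on invariants of the output rather than
# the exact string, which tolerates benign variation in phrasing.
def test_output_properties():
    answer = generate("What is the capital of France?")
    assert "Paris" in answer          # the key fact must appear
    assert len(answer.split()) < 100  # response stays reasonably concise
```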

Evaluating GenAI systems

The evaluation metrics for GenAI systems look very different from those used in ML: they are functions that take text as input, such as BLEU or ROUGE (a short example follows the list below). Still, they seek to answer many of the same questions:

  • Accuracy: Does it give the correct answer?
  • Factuality: Are its claims true?
  • Helpfulness: Is it useful to the user?
  • Safety: Does it avoid harmful or biased outputs?
  • Generalization: Can it handle tasks it wasn't explicitly trained on?
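
As a concrete illustration of a text-based metric, the sketch below computes ROUGE scores for a single candidate response against a reference answer, using the open-source `rouge_score` package. The reference and candidate strings are illustrative only.

```python
# Minimal sketch of a reference-based text metric (ROUGE),
# using the open-source `rouge_score` package (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "The Eiffel Tower is located in Paris, France."  # illustrative
candidate = "The Eiffel Tower can be found in Paris."        # illustrative

# ROUGE-1 measures unigram overlap; ROUGE-L measures the longest
# common subsequence between candidate and reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f}, "
          f"recall={score.recall:.2f}, f1={score.fmeasure:.2f}")
```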

GenAI evaluation also differs in that humans (and their opinions) are often involved through the creation of golden, or validation, datasets. These datasets have limits: a golden dataset built during development may not reflect the true distribution of inputs (how users actually interact with the system) over time, and it typically covers only the conditions that were anticipated, while real-world usage can deviate wildly from them.

Distributional testing

GenAI systems need to be evaluated through distributional testing. In contrast to ML, where tracking the distribution of a single predicted value is straightforward, the output of LLMs is text-based (structured outputs can contain other metadata), and there are many metrics derived from LLM inputs and outputs that can be tracked over time. These include word count, toxicity, and ROUGE and BLEU scores (check out a list of standard off-the-shelf LLM eval metrics here), and their distributions can be compared across batches, as sketched below.
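
As a rough sketch of what this can look like in practice, the example below derives one scalar metric (word count) from two batches of LLM outputs and compares their distributions with a two-sample Kolmogorov-Smirnov test from SciPy. The batches here are tiny and illustrative; in practice you would compare a baseline batch against a recent batch of production traffic.

```python
# Minimal sketch of distributional testing on one derived metric (word count).
from scipy.stats import ks_2samp

baseline_outputs = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France's capital city is Paris.",
]
recent_outputs = [
    "Well, the capital of France is Paris, a city famous for the Eiffel Tower.",
    "That would be Paris, the capital and largest city of France.",
    "Paris is the capital of France and one of the most visited cities in the world.",
]

# Derive a scalar metric per response; any per-response metric
# (toxicity, ROUGE, latency, ...) can be tracked the same way.
baseline_word_counts = [len(text.split()) for text in baseline_outputs]
recent_word_counts = [len(text.split()) for text in recent_outputs]

# A small p-value suggests the metric's distribution has drifted
# since the baseline was collected.
result = ks_2samp(baseline_word_counts, recent_word_counts)
print(f"KS statistic={result.statistic:.2f}, p-value={result.pvalue:.3f}")
```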

Distributional helps teams test input and output distributions over time, giving them insight into exactly how, where, and why their GenAI systems have shifted. Get in touch with our team to learn more about Distributional.

To learn more about model evaluation, watch the full live talk recording below.
