Evaluating the Evaluators: LLM Assessments in Practice

LLM applications are easy to prototype. Most of the complexity in a custom chatbot or retrieval augmented generation (RAG) system is in the LLM – and those LLMs have already been trained and packaged up into APIs. However, while LLMs’ prose is human-seeming, they sometimes act in unusual and inexplicable ways.

While an afternoon can be enough to get an LLM app demo working, it can take much longer to characterize and curtail unexpected LLM behavior. The process of making an LLM app reliable is mostly trial and error – the iteration cycle has three loops: 

  1. Innermost, the developer makes tweaks to prompts, models, and data, following intuition and expertise. This loop is rapid but non-systematic, and evaluation consists of spot-checking a small sample of LLM app output.

  2. Outermost, (often non-technical) product owners review the output of the LLM application. These reviews serve as ground truth assessments of product quality, but are necessarily infrequent, as systematic human review of LLM outputs is time-intensive.

  3. Between these two layers is auto-evaluation – LLM evaluation of the LLM application. LLM calls are cheap and fast, and LLMs can be instructed to reason about the quality of LLM app output, so why not have ChatGPT check your work?

Auto-evaluation was introduced with the GPTScore paper, which demonstrated the use of GPT-3 to evaluate the quality of algorithmically generated text. It’s been a year since the paper came out, and by now it’s widely appreciated that this middle layer of LLM evaluators needs evaluating itself.

Generally, auto-evaluation entails the following steps:

  1. Sample: Collect ~100 representative LLM app inputs, paired with the corresponding output.

  2. Rubric: Design guidelines for LLM application output.

  3. Benchmark: Have human raters apply the rubric to the collected input-output pairs.

  4. Calibrate: Engineer a prompt to evaluate input-output pairs, and tweak until its evaluations match the humans’ – this is the auto-evaluator.

  5. Generate: Ask an LLM to generate synthetic inputs similar to the benchmark inputs.

  6. Evaluate: Score the synthetic input-output pairs with the auto-evaluator.

To be clear – LLMs create the inputs, they create the outputs, and they evaluate the outputs.
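To make steps 4 and 6 concrete, here is a minimal sketch in Python. It assumes the OpenAI client purely for illustration; the rubric wording, the model name, and the benchmark field names are placeholders rather than anything we use in production.

```python
from openai import OpenAI

client = OpenAI()  # any LLM client would do; used here only for illustration

RUBRIC = (
    "Rate the reply to the user's question from 1 (unusable) to 5 (ready to send), "
    "considering accuracy, relevance, and tone. Answer with the number only."
)

def auto_evaluate(user_input: str, app_output: str) -> int:
    """The auto-evaluator: an LLM applying the rubric to one input-output pair."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"{RUBRIC}\n\nQuestion:\n{user_input}\n\nReply:\n{app_output}",
        }],
    )
    return int(response.choices[0].message.content.strip())

def agreement_with_humans(benchmark: list[dict]) -> float:
    """Fraction of benchmark pairs where the auto-evaluator matches the human
    rating. Each item has 'input', 'output', and a human-assigned 'score'."""
    hits = sum(
        auto_evaluate(item["input"], item["output"]) == item["score"]
        for item in benchmark
    )
    return hits / len(benchmark)
```

Step 4 is the loop of editing RUBRIC (and swapping models) until agreement_with_humans is acceptably high on the human-rated benchmark; step 6 is then just running auto_evaluate over the synthetic input-output pairs.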

At Hop, we’ve spent much of the past year working with auto-evaluation and feel that there’s a rich set of design decisions that aren’t regularly discussed. Here are some of the things we’ve been thinking about along the way.

Rating Versus Ranking

A central choice in designing an auto-evaluation process is whether to ask the LLM to rate or to rank. Rating assesses the quality of output according to an absolute scale; ranking compares the quality of multiple outputs to each other.

Rating and ranking each have pros and cons, and the decision between them depends on the product being developed. Here are some of the heuristics we use.

Rating is effective when there is an explicit rubric to satisfy

For some LLM applications, it’s possible to explicitly state and to formally verify success criteria.

  • For summarization, LLMs are typically capable of checking that a summary contains all the entities listed in the original text (see the sketch after this list).

  • For data normalization, LLMs and chain-of-thought prompting can check the correspondence between structured data output and the contents of the unstructured original.

  • For more qualitative targets, it’s sometimes possible to write a clear rubric and use few-shot prompting to get reliable LLM ratings.
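Here’s a minimal sketch of the verifiable, entity-coverage kind of check from the first bullet, again assuming the OpenAI client purely for illustration; the prompt wording, the model name, and the NONE convention are our own placeholders, not anything standard.

```python
from openai import OpenAI

client = OpenAI()  # illustration only

def missing_entities(original: str, summary: str) -> list[str]:
    """Ask the LLM which entities from the original are absent from the summary.
    Returns an empty list when the summary covers everything."""
    prompt = (
        "List every named entity (people, organizations, places, dates, amounts) "
        "that appears in the ORIGINAL text but not in the SUMMARY, one per line. "
        "If nothing is missing, reply with NONE.\n\n"
        f"ORIGINAL:\n{original}\n\nSUMMARY:\n{summary}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content.strip()
    return [] if text.upper() == "NONE" else text.splitlines()

def entity_coverage_ok(original: str, summary: str) -> bool:
    """Pass/fail rating: the summary passes if no entities are missing."""
    return len(missing_entities(original, summary)) == 0
```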

However, not all qualities are amenable to LLM scoring. For example, when developing a chatbot for difficult conversations, we needed it to be empathetic around sensitive topics. But even with a rubric and several examples in the prompt, the LLM wasn’t able to assign meaningful empathy scores.

Ranking is more discriminating than rating

LLMs are more effective at distinguishing fine-grained differences when asked to rank input-output pairs side-by-side.

Pairwise comparison is natural when evaluating changes to an LLM application: for each input in a sample, the old output is paired with the new one, and the number of times the new output is judged better is counted.
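A sketch of that counting, with the judge left as a callable so any comparison prompt can be plugged in; judge is a hypothetical function that takes the input and two candidate outputs and returns "A" or "B".

```python
from typing import Callable

def new_version_win_rate(
    samples: list[dict],
    judge: Callable[[str, str, str], str],
) -> float:
    """Fraction of inputs where the new output beats the old one.
    Each sample has 'input', 'old_output', and 'new_output'; `judge` returns
    "A" if the first candidate shown is better, "B" if the second is."""
    wins = 0
    for s in samples:
        verdict = judge(s["input"], s["old_output"], s["new_output"])
        if verdict == "B":  # the new output was presented second
            wins += 1
    return wins / len(samples)
```

Because LLM judges are sensitive to presentation order, each pair should really be judged in both orders; we return to this below.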

The main downside to ranking is the lack of reference to an external measure of quality. If a tweak to an LLM application improves 40% of input-output pairs, that alone doesn’t say how many of the outputs are acceptable to put in front of users. In contrast, when rating, learning that the percentage of input-output pairs scoring 4/5 on a well-designed rubric increased by 12% translates directly into expected product experience.

Calibration

Whether auto-evaluation uses rating or ranking to guide LLM application development, it should be calibrated against human evaluations. 

Calibrate to multiple human evaluators

A useful intuition pump for auto-evaluation calibration is that it’s analogous to psychometric testing. In psychometric testing, inventory questions are given to individuals, whose responses reveal deeper information about, e.g., their personality; in auto-evaluation calibration, benchmark input-output pairs are given to auto-evaluators, and their evaluations reveal their abilities as evaluators.

What this analogy highlights for us is that 1) some benchmark questions are more informative than others, and 2) in order to tell which, you need to calibrate to multiple human evaluators. If many human evaluators agree about the quality of an input-output pair, an auto-evaluator disagreeing shows that it’s not human-like; if the human evaluators disagree among themselves, the auto-evaluator’s verdict on that pair tells you much less.

In addition to guiding auto-evaluator design, assessing the inter-human variation on the benchmark has several additional benefits.

First, high levels of disagreement about evaluations can indicate deeper disagreements about product vision. Because of the nature of their output, it can be difficult to align expectations around the behavior of LLM applications. Having multiple stakeholders look at concrete outputs clarifies these differences.

Second, if evaluators are aligned on desired LLM application behavior, large variation between human evaluations can mean that the evaluation criteria aren’t nuanced enough. In the case of ranking evaluations, it can also reveal that the application versions being compared aren’t meaningfully different from each other.

The final benefit of calibrating to multiple humans is that it sets a target for auto-evaluator performance. If expert human evaluators disagree with the consensus 20% of the time, it’s more understandable when the LLM evaluator does too.
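One way to put this into practice is to score the auto-evaluator only on benchmark pairs where the humans reach consensus, and to track how often that consensus exists at all. A minimal sketch, assuming each benchmark item carries a list of human ratings and the auto-evaluator’s rating (the field names and the 70% consensus threshold are arbitrary choices for illustration):

```python
from collections import Counter

def consensus(ratings: list[int], threshold: float = 0.7) -> int | None:
    """Majority rating if at least `threshold` of the humans agree, else None."""
    rating, count = Counter(ratings).most_common(1)[0]
    return rating if count / len(ratings) >= threshold else None

def calibration_report(benchmark: list[dict]) -> dict:
    """Each benchmark item has human 'ratings' (one per evaluator) and the
    auto-evaluator's 'auto' rating. Items without human consensus are treated
    as uninformative and skipped when scoring the auto-evaluator."""
    labelled = [(consensus(item["ratings"]), item["auto"]) for item in benchmark]
    informative = [(c, a) for c, a in labelled if c is not None]
    return {
        # how often the humans themselves reach consensus...
        "human_consensus_rate": len(informative) / len(benchmark),
        # ...and how often the auto-evaluator matches that consensus
        "auto_match_rate": (
            sum(c == a for c, a in informative) / len(informative)
            if informative else float("nan")
        ),
    }
```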

Assess the discriminant power of ranking evaluations

Especially for rankings, the distribution of ranks determines how difficult it is to detect a change in model performance. One idiosyncratic feature of LLM pairwise comparisons is that the preference sometimes depends on the order of presentation – presented with a pair of items in each order, the LLM might pick the first (or last) item both times instead of picking the same item regardless of its position, essentially expressing a non-preference.

Because of this, pairs of items should be presented to the auto-evaluator in both orders. Unfortunately, the more often the auto-evaluator is indifferent between two input-output pairs, the harder it is to detect a statistically significant difference between the two versions of the LLM application.

Ranking auto-evaluators should be designed to minimize non-preference, which can be done by changing the LLM or the prompt. A simple measure is to take a handful of items, say $n$, ask the auto-evaluator to rank each pair in both orders, and then assign each item a score equal to the number of pairings it won. For a maximally discriminating ranker, these scores are evenly spread from 0 to $2(n-1)$ – and for a general-purpose measure of the strength of a ranker, compute the statistical distance¹ between the distribution of scores and the uniform distribution from 0 to $2(n-1)$.
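A sketch of that measure, with judge again a hypothetical callable returning "A" or "B". The distance at the end is the Kolmogorov-Smirnov statistic against a uniform distribution on [0, 2(n-1)], in the spirit of the footnote; treating discrete win counts as draws from a continuous uniform is a simplification.

```python
from itertools import permutations
from typing import Callable
from scipy.stats import kstest

def discrimination_distance(
    items: list[str],
    judge: Callable[[str, str], str],
) -> float:
    """Judge every pair of items in both orders, count each item's wins, and
    return the KS distance between the win counts and a uniform spread over
    [0, 2(n-1)]. Smaller distance suggests a more discriminating ranker."""
    n = len(items)
    wins = {i: 0 for i in range(n)}
    for i, j in permutations(range(n), 2):  # each unordered pair, both orders
        verdict = judge(items[i], items[j])  # "A" = first shown wins, "B" = second
        wins[i if verdict == "A" else j] += 1
    scores = list(wins.values())
    # compare the empirical score distribution to uniform on [0, 2(n-1)]
    return kstest(scores, "uniform", args=(0, 2 * (n - 1))).statistic
```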

Limits of LLM Evaluation Informativeness

It can be tempting to believe that auto-evaluation can always tell you which of two LLM application versions is better. The more input-output pairs you have, the more precise the auto-evaluator’s estimate becomes – and arbitrarily many inputs can be made with generative models – so, theoretically, there should be no limit to that precision.

This is wrong! While we may be able to make the variance of the estimate arbitrarily small, auto-evaluations are always biased.

Imperfect synthetic input-output pairs

High-precision auto-evaluation estimates require a great many input-output pairs. A common technique to source them is to generate synthetic inputs by prompting an LLM with examples of typical inputs. However, LLM-generated content is often highly stereotyped: each LLM has its own distinctive voice, and generated content tends to be very self-similar.

The result is a gap between the distribution of input-output pairs generated by the LLM and the distribution the live application encounters. Since LLMs are well documented to be sensitive to small changes in input, this distribution shift introduces an unknown bias into auto-evaluation.
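The bias can’t be removed entirely, but the self-similarity of the synthetic inputs can at least be measured before trusting estimates built on them. One crude check is average pairwise token overlap, compared between real and synthetic inputs; a real pipeline might use embedding similarity instead, and the helper below is only a rough heuristic:

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def mean_pairwise_similarity(inputs: list[str]) -> float:
    """Average similarity over all pairs; a value much higher than the same
    statistic on real user inputs suggests the synthetic set is stereotyped."""
    pairs = list(combinations(inputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```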

Calibration uncertainty

The output of a calibration process is an estimate of the sensitivity and specificity of the auto-evaluator. These statistics describe the ability of the auto-evaluator to correctly classify input-output pairs of various kinds. Like all statistics, the sensitivity and specificity have uncertainty associated with their estimation. It’s possible to reduce this uncertainty by using large calibration sets, but for most projects this is infeasible, for two reasons.

First, as the LLM application is tweaked, the distribution of input-output pairs also changes. While the auto-evaluator itself doesn’t change, applying it to a different distribution of input-output pairs changes its sensitivity and specificity. This is true for both rating and ranking auto-evaluators, and it is especially intuitive for rankers: 

  • As the LLM application is improved, there are diminishing returns to changes.

  • As marginal improvements diminish, the auto-evaluator is asked to compare ever-similar collections of input-output pairs.

  • An auto-evaluator asked to rank more similar items will make more mistakes (i.e., have worse sensitivity and specificity).

The upshot of this line of reasoning is that calibration goes stale and so the calibration benchmark should be adjusted periodically. Unfortunately, we don’t know of a formal criterion for when to recalibrate. Like designing the evaluation criteria in the first place, it must be done by feel.

Second, the standard error of the estimate shrinks as $1/\sqrt{n}$, where $n$ is the number of input-output pairs in the benchmark. In order to halve the standard error, the benchmark has to be quadrupled, and so increasing precision of calibration quickly becomes untenable. Given that calibration has to be refreshed anyway, after a certain point, it makes more sense to allocate human effort directly to evaluating rather than to calibrating the auto-evaluator.
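For intuition, treat the calibration statistic as a simple proportion (say, agreement with human consensus); then the scaling is easy to see. The numbers below are illustrative only:

```python
import math

def standard_error(p: float, n: int) -> float:
    """Standard error of a proportion estimated from n benchmark pairs."""
    return math.sqrt(p * (1 - p) / n)

# Halving the standard error requires quadrupling the benchmark:
for n in (100, 400, 1600):
    print(n, round(standard_error(0.8, n), 3))
# 100 0.04
# 400 0.02
# 1600 0.01
```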

Wherever it originates, uncertainty in the auto-evaluator’s calibration acts as a noise floor: no matter how many input-output pairs are auto-evaluated, the true precision can’t exceed what the calibration allows.

A Look to the Future

For the reasons outlined in the previous section, auto-evaluation via LLMs is currently only suitable as a stopgap. It can bridge a prompt engineer’s spot-checking with infrequent systematic review, but distribution shift and calibration-related uncertainty limit its accuracy. We still endorse its use – it’s an important part of the LLM app developer’s toolkit – but it’s not a panacea.

Like all things related to LLMs, this is liable to change: we expect both the models underlying auto-evaluation and auto-evaluation methodology to improve, making auto-evaluation more valuable and easier to apply. On the other hand, improving foundation models may diminish the need for auto-evaluation. It’s impossible to predict exactly how events will develop, but in the meantime, the considerations above offer some guidance on how to incorporate auto-evaluation into the LLM application development process.

— Liban Mohamed-Schwoebel, ML Researcher @ Hop


¹ I like Kolmogorov-Smirnov for this purpose.