LLM evals are the production test suite for AI features — a structured set of input/output pairs scored by graders that catches model regressions, prompt drift, and edge-case failures before they reach users. Without them, every prompt change is a roll of the dice; with them, shipping AI features feels like shipping normal software.
This is the eval framework we run across every LLM-backed product we ship: what to test, how to score it, and the architecture mistakes that make most eval suites useless within a month. If you're building anything on top of Claude, GPT, or Gemini and you don't have a CI-runnable eval suite, you're flying blind. Every engagement in our AI integration services work starts by telling clients the same thing: the eval suite is the product spec.
Why evals matter more than traditional tests
Traditional unit tests assert that a function returns an exact value. Free-form LLM outputs are non-deterministic, so equality assertions on the full text are useless. A summarization prompt might phrase the same idea fifty different ways across runs, all of them correct.
Evals replace exact-match assertions with graded scoring. Each test case has an input, an expected behavior (not an expected string), and a grader that returns a score between 0 and 1. The grader can be a regex, a structural check, a deterministic comparison, or — most often — another LLM call evaluating whether the output satisfies the criteria.
The shift is conceptual: you're not testing strings, you're testing behaviors. "Does the summary preserve the key dates mentioned in the source?" is an eval. "Does the function return 'Hello World'?" is a test.
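To make the input / expected behavior / grader triple concrete, here is a minimal sketch in Python. The names `EvalCase`, `run_case`, and `preserves_key_dates` are illustrative rather than taken from any particular framework, and the model is a stand-in function so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable

# A grader takes the model output and returns a score between 0 and 1.
Grader = Callable[[str], float]


@dataclass
class EvalCase:
    name: str
    input: str
    expected_behavior: str  # a description of behavior, not an expected string
    grader: Grader


def run_case(case: EvalCase, model: Callable[[str], str]) -> float:
    """Run the model on one case and clamp the grader's score into [0, 1]."""
    output = model(case.input)
    return max(0.0, min(1.0, case.grader(output)))


def preserves_key_dates(output: str) -> float:
    """Fraction of the source's key dates that survive in the summary."""
    key_dates = ["2024-03-01", "2024-04-15"]  # illustrative values
    return sum(d in output for d in key_dates) / len(key_dates)


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: swap in your own client)."""
    return "Contract runs from 2024-03-01 to 2024-04-15."


case = EvalCase(
    name="summary_keeps_dates",
    input="Summarize: the contract starts 2024-03-01 and ends 2024-04-15.",
    expected_behavior="The summary preserves the key dates mentioned in the source.",
    grader=preserves_key_dates,
)

score = run_case(case, fake_model)
```

The grader never compares the output to a reference string. It only checks the behavior the case cares about, so any phrasing that keeps both dates scores the same.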
The four grader types we actually use
Most eval frameworks ship with dozens of grader primitives. In practice, we use four and they cover ~95% of real cases.
Exact-match graders handle structured outputs — JSON keys, classification labels, tool-call names. If the LLM is supposed to return {"intent": "refund"}, an exact match on the intent field is the right grader. These are fast, free, and deterministic.
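A sketch of an exact-match grader for the intent example, assuming the model is asked to return a single JSON object. The function name `grade_intent` is hypothetical. Treating unparseable output as a score of 0 rather than an exception is a design choice, not a requirement.

```python
import json


def grade_intent(output: str, expected_intent: str) -> float:
    """Return 1.0 if the output is a JSON object whose "intent" matches exactly, else 0.0."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output counts as a failed case
    if not isinstance(parsed, dict):
        return 0.0  # valid JSON, but not the object shape we asked for
    return 1.0 if parsed.get("intent") == expected_intent else 0.0
```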
Regex/contains graders handle outputs that must include specific entities — an order ID, a date, a SKU. These catch the failure mode where the model paraphrases away critical data. A regex like \bORD-\d{6}\b is the right grader for "did the response preserve the order number".
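The order-number check could look something like the following. This sketch assumes the IDs to preserve can be extracted from the source text with the same pattern. The function name and the partial-credit scoring are illustrative choices.

```python
import re

# Same pattern as in the text: "ORD-" followed by exactly six digits.
ORDER_ID = re.compile(r"\bORD-\d{6}\b")


def grade_preserves_order_id(source: str, output: str) -> float:
    """Fraction of order IDs in the source that also appear in the output."""
    expected = set(ORDER_ID.findall(source))
    if not expected:
        return 1.0  # nothing to preserve, so nothing can be lost
    found = set(ORDER_ID.findall(output))
    return len(expected & found) / len(expected)
```

The word boundaries matter: a response that mangles the ID into `ORD-1234567` does not match, so the grader scores it as a lost order number.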
LLM-as-judge graders handle subjective quality — tone, completeness, factual alignment with source material. We use a separate model call (usually a smaller, cheaper model) with a strict rubric: "Score 0-1 based on whether the response addresses all three of: refund eligibility, timeline, and next steps." The judge prompt itself goes through its own eval suite — yes, evals for the eval.
Semantic similarity graders compare embeddings of the output against a reference answer. We use these sparingly — they're noisy and tend to reward verbose, generic responses. Useful as a sanity-check signal, not as a primary grader.
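For reference, the scoring step of a semantic similarity grader is usually cosine similarity between two embedding vectors. The sketch below assumes the vectors have already been produced by whatever embedding model you use (that call is deliberately left out), and clamping negative similarities to 0 is one possible convention.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def grade_semantic(output_vec: Sequence[float], reference_vec: Sequence[float]) -> float:
    """Map cosine similarity into [0, 1] by clamping."""
    return max(0.0, min(1.0, cosine_similarity(output_vec, reference_vec)))
```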
How to build an eval dataset that actually catches bugs
The number one mistake we see is teams writing evals that only cover the happy path. Twenty test cases of "user asks normal question, bot answers correctly" tell you nothing, because the model passes those trivially. The eval dataset has to be adversarial by construction.
Our split is roughly: 20% happy path, 40% known edge cases from production logs, 30% red-team prompts (prompt injection, refusal probes, ambiguous inputs), and 10% regression cases — every bug we've ever fixed gets a test case so it can never regress silently.
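One way to keep a suite honest about that split is to tag each case with a category and check the mix mechanically. The category names and the 5-point tolerance below are assumptions made for the sketch. Only the target percentages come from the split above.

```python
from collections import Counter

# Target shares from the split described above.
TARGET_MIX = {
    "happy_path": 0.20,
    "production_edge_case": 0.40,
    "red_team": 0.30,
    "regression": 0.10,
}


def mix_report(categories: list[str], tolerance: float = 0.05) -> dict[str, tuple[float, bool]]:
    """For each category, return (actual share, whether it is within tolerance of target)."""
    counts = Counter(categories)
    total = len(categories)
    report = {}
    for category, target in TARGET_MIX.items():
        share = counts[category] / total if total else 0.0
        report[category] = (share, abs(share - target) <= tolerance)
    return report
```

Run against a happy-path-only suite, every category fails the check, which is the point: the report makes the imbalance visible before anyone trusts the pass rate.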
The 40% from production logs is the highest-leverage chunk. We sample real user queries weekly, manually grade a subset, and any case where the model performs poorly gets added to the eval suite with the correct behavior annotated. The suite grows with the product.
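The promotion step from graded production samples into the suite can be sketched as below. The `GradedSample` fields, the 0.5 cutoff for "performs poorly", and the output dictionary shape are all assumptions made for illustration. Use whatever your own review tooling records.

```python
from dataclasses import dataclass


@dataclass
class GradedSample:
    query: str
    model_output: str
    human_score: float     # 0..1, assigned by a human reviewer
    correct_behavior: str  # annotation written by the reviewer


def promote_failures(samples, existing_queries, max_score=0.5):
    """Turn poorly graded production samples into new eval cases, skipping duplicates."""
    seen = set(existing_queries)
    new_cases = []
    for s in samples:
        if s.human_score <= max_score and s.query not in seen:
            new_cases.append({
                "input": s.query,
                "expected_behavior": s.correct_behavior,
                "category": "production_edge_case",
            })
            seen.add(s.query)  # also dedupes repeats within this batch
    return new_cases
```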
We chose this over the synthetic-dataset approach because synthetic test cases drift from real usage fast. A model can score 95% on a synthetic suite while failing 30% of real user queries — we ran into exactly this on a support automation pipeline where the synthetic evals all passed but the model was confidently misrouting roughly one in three real tickets.
When LLM-as-judge graders are worth the cost
LLM-as-judge graders are powerful but they're not free, and they introduce a second non-deterministic layer into your eval pipeline. Used carelessly, you end up grading a flaky model with another flaky model and trusting the score.
The pattern that works: use a strict, structured rubric with explicit pass/fail criteria, not vibes. Bad judge prompt: "Rate the response quality from 1-10." Good judge prompt: "Return JSON with three boolean fields — addresses_refund_eligibility, mentions_timeline, provides_next_steps. Score is the count of true fields divided by 3."
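On the scoring side, the structured rubric reduces to parsing the judge's JSON and counting true fields. The sketch below covers only that step: the judge call itself is omitted, and rejecting malformed output with a `ValueError` (rather than scoring it 0) is an assumption about how you want judge failures surfaced.

```python
import json

# The three boolean fields named in the judge prompt above.
RUBRIC_FIELDS = (
    "addresses_refund_eligibility",
    "mentions_timeline",
    "provides_next_steps",
)


def score_judge_output(raw: str) -> float:
    """Score = number of true rubric fields / 3. Raises ValueError on malformed judge output."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("judge did not return valid JSON") from exc
    if not isinstance(verdict, dict):
        raise ValueError("judge output is not a JSON object")
    for field in RUBRIC_FIELDS:
        # Require real booleans so "yes", 1, or a missing key cannot slip through.
        if not isinstance(verdict.get(field), bool):
            raise ValueError(f"missing or non-boolean field: {field}")
    return sum(verdict[f] for f in RUBRIC_FIELDS) / len(RUBRIC_FIELDS)
```

Failing loudly on a malformed verdict keeps judge breakage separate from model breakage, so a flaky judge shows up as an infrastructure error instead of a quiet score drop.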
Structured rubrics make the judge's output auditable and dramatically reduce variance between runs. There is also outside evidence that judge models can be trusted when used carefully: the MT-Bench paper on LLM-as-judge (Zheng et al., 2023) found that a strong judge model such as GPT-4 agrees with human preferences more than 80% of the time, roughly the same level of agreement as between human evaluators themselves.
We also pin the judge model version explicitly and re-run the full eval suite when we update it. The judge is part of your testing infrastructure — version it like code.
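One possible way to version the judge like code is to fingerprint the pinned model identifier together with the rubric prompt, store that fingerprint next to each run's results, and trigger a full re-run when it changes. The function names here are hypothetical, and the model identifier is whatever exact version string your provider exposes.

```python
import hashlib
import json
from typing import Optional


def judge_fingerprint(model_id: str, rubric_prompt: str) -> str:
    """Short, stable hash of the judge configuration (pinned model + rubric text)."""
    payload = json.dumps({"model": model_id, "rubric": rubric_prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def needs_full_rerun(current: str, last_recorded: Optional[str]) -> bool:
    """Re-run the full suite when the judge changed or was never recorded."""
    return current != last_recorded
```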
Wiring evals into CI without bankrupting yourself
Running the full eval suite on every commit is overkill and expensive. Our setup runs three tiers.
Smoke evals — 10-20 fast deterministic cases that run on every PR. These catch the obvious "you broke the prompt" failures in under 30 seconds. No LLM-as-judge, no semantic grading, just exact-match and regex on critical paths.
Full eval suite — 200-500 cases including LLM-as-judge graders. Runs on merge to main and on prompt-file changes. Takes 5-15 minutes and costs a few dollars per run.
Production replay evals — weekly run against the latest 1000 real user queries, scored by LLM-as-judge with a human review of the bottom 10% by score. This is what catches drift — the model might pass your static suite while quality silently degrades on real traffic.
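Selecting the bottom 10% for human review is a small sort-and-slice. In this sketch the input is assumed to be a list of `(query, score)` pairs already scored by the judge. Rounding the slice size up and always reviewing at least one item are illustrative choices.

```python
def bottom_slice(scored: list[tuple[str, float]], percent: int = 10) -> list[tuple[str, float]]:
    """Return the lowest-scoring `percent` of (query, score) pairs, worst first."""
    if not scored:
        return []
    # Integer ceiling of len * percent / 100, with a floor of one item.
    k = max(1, (len(scored) * percent + 99) // 100)
    return sorted(scored, key=lambda pair: pair[1])[:k]
```

For a 1,000-query replay this yields the 100 worst-scoring queries for a reviewer to read.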
Each tier has a threshold. Smoke evals are 100% pass-required. Full suite must hit ≥92% on critical evals and ≥85% overall. Production replays alert if scores drop more than 5 points week-over-week. A failing eval blocks merge — same as a failing unit test.
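The three thresholds can be expressed as small gate functions that a CI job calls. This sketch makes two assumptions the text leaves open: that "≥92%" and "≥85%" refer to the mean grader score, and that replay scores are compared in percentage points on a 0-100 scale. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Result:
    score: float    # 0..1 from the grader
    critical: bool  # whether the case is tagged as a critical eval


def _mean(scores: list[float]) -> float:
    return sum(scores) / len(scores) if scores else 0.0


def smoke_gate(scores: list[float]) -> bool:
    """Smoke tier: every case must pass outright."""
    return all(s == 1.0 for s in scores)


def full_suite_gate(results: list[Result]) -> bool:
    """Full tier: mean >= 0.92 on critical cases and mean >= 0.85 overall."""
    critical = [r.score for r in results if r.critical]
    overall = [r.score for r in results]
    critical_ok = not critical or _mean(critical) >= 0.92
    return critical_ok and _mean(overall) >= 0.85


def replay_alert(last_week_pct: float, this_week_pct: float, max_drop_points: float = 5.0) -> bool:
    """Replay tier: alert when the score fell by more than max_drop_points (0-100 scale)."""
    return (last_week_pct - this_week_pct) > max_drop_points
```

In CI, the smoke and full-suite gates would typically map to the job's exit code, while `replay_alert` would feed a notification rather than block anything.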
What not to evaluate (and why over-evaluating slows you down)
Not everything needs an eval. Resist the urge to write a test case for every imaginable input — you'll end up with a suite that takes an hour to run, costs $50 a pop, and nobody bothers fixing failures because half of them are nitpicks.
Skip evals for: stylistic preferences with no business impact, behaviors that are already enforced by structured output schemas, and edge cases so rare they'd never affect more than a single user. Spend your eval budget on the failure modes that would embarrass you or cost the customer money.
We've seen teams build 2,000-case eval suites that take 40 minutes to run and provide less signal than a tight 150-case suite focused on actual risk. Eval quality beats eval quantity by a large margin.
The complete LLM eval pattern at a glance
A working LLM eval suite has four properties: graded scoring instead of equality assertions, an adversarial dataset built from real production failures, structured rubrics for any LLM-as-judge graders, and a tiered CI integration that runs fast checks on every commit and full suites on merge. Without all four, you're not running evals — you're running a vibe check that gives false confidence.
Treat the LLM eval suite as the contract between your product and your model. When the contract is explicit, model upgrades become routine, prompt changes become reviewable, and production regressions become catchable. That's the difference between shipping AI features and shipping AI roulette.