What is the Evaluator Gate?
The checkpoint in an agent workflow where a grader scores the output against a rubric and must pass it before downstream steps run — the difference between a demo and a production system.
TL;DR
An Evaluator Gate is a mandatory quality check in an agent workflow: a separate evaluator agent scores the output of a specialist against a rubric, and only passing outputs move forward. It is how serious multi-agent systems avoid shipping broken work.
In chat products, the human is the evaluator — you read the reply, decide if it's good, ask for another. In an agent system, nobody is watching each sub-task. The agents keep going whether the work is good or not. The Evaluator Gate fixes that. Short definition at glossary/evaluator-gate. Long form below.
The precise definition
An Evaluator Gate is an explicit checkpoint in a multi-agent workflow at which a dedicated evaluator — usually a separate agent with its own prompt and rubric — scores the output of the preceding specialist against defined criteria, and returns either a pass (work proceeds) or a reject with reasons (work is returned for revision). The evaluator is structurally separated from the work producer to avoid the systematic leniency of self-assessment.
In plain English
A good newsroom doesn't publish a reporter's first draft. It goes through an editor. The editor has a checklist — accuracy, voice, structure, legal exposure, headline quality. Drafts that don't clear the checklist come back with notes. Only passing drafts hit the print queue.
The Evaluator Gate is that editor, in software. A coding specialist writes a function; the evaluator checks it against "tests pass, no obvious security issues, matches the project's style guide" before it gets committed. A content specialist drafts a landing page; the evaluator checks it against "voice match, claim accuracy, accessibility, legal disclaimers" before it goes to the preview URL. When the gate rejects, the CEO agent routes the work back for revision. When the gate passes, downstream steps run.
Without the gate, the system ships whatever the specialist produced. Sometimes great. Sometimes hallucinated claims, broken code, off-brand copy, or regulatory landmines. The gate is the seat-belt — mostly invisible, enormously valuable the day it matters.
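In orchestration code, the whole pattern is a small loop: draft, evaluate, and either proceed or route back with notes. Here is a minimal sketch in Python, assuming the LLM call is injected as a plain call_llm(system_prompt, user_prompt) function, a placeholder rather than any particular SDK:

from typing import Callable
import json

MAX_RETRIES = 3

def run_with_gate(task: str, rubric: str,
                  call_llm: Callable[[str, str], str]) -> str:
    # call_llm(system_prompt, user_prompt) -> completion text; injected so the
    # sketch stays provider-agnostic rather than naming a real API.
    feedback = ""
    for _ in range(MAX_RETRIES):
        # Specialist drafts (with the evaluator's notes, if this is a retry).
        draft = call_llm(
            "You are the specialist. Produce the requested output.",
            task + (f"\n\nReviewer notes from the last attempt:\n{feedback}" if feedback else ""),
        )
        # A separate evaluator scores the draft against the rubric only.
        # A real system would validate this JSON instead of trusting json.loads.
        verdict = json.loads(call_llm(
            "You are the evaluator. Score the output against the rubric. "
            'Reply only with JSON: {"passes": true/false, "reasons": ["..."]}',
            f"Rubric:\n{rubric}\n\nOutput to evaluate:\n{draft}",
        ))
        if verdict["passes"]:
            return draft                              # gate open: downstream steps run
        feedback = "\n".join(verdict["reasons"])      # gate closed: route back with notes
    raise RuntimeError(f"Output failed the evaluator gate after {MAX_RETRIES} attempts")

The retry ceiling matters: without it, a persistently failing specialist loops forever; with it, the orchestrator can escalate to a human instead.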
The history
Evaluator patterns in LLM work have three roots. First, the LLM-as-judge benchmarking pattern from MT-Bench (Zheng et al., 2023) and AlpacaEval, which used GPT-4 to score other model outputs against rubrics. Second, Constitutional AI (Bai et al., Anthropic, 2022), where a critic model critiques and revises outputs against principles. Third, the Reflexion paper (Shinn et al., 2023), which showed that self-reflection loops — agent critiques its own output and retries — improve performance on coding and reasoning benchmarks.
Production agent systems synthesized these into the Evaluator Gate pattern: a separate evaluator (not the worker), a structured rubric (not a vague prompt), a binary gate (pass or retry, not a suggestion), wired into the orchestration graph. Anthropic's "Building Effective Agents" post names this the "evaluator-optimizer" workflow pattern. By 2026, any multi-agent system aiming for production reliability has an evaluator gate somewhere in the flow.
Why evaluators work
Three reasons the gate beats letting the specialist self-judge:
- Separation of concerns. The specialist is optimizing for producing a good output. The evaluator is optimizing for catching bad outputs. Different objectives, different prompts, different blind spots.
- Fresh context. The specialist's context is full of its attempts, reasoning, and partial work. The evaluator's context is fresh — just the final output and the rubric. Clearer signal (see the sketch after this list).
- Rubric focus. Rubrics compress the accumulated knowledge about failure modes. "Don't claim features that aren't in the product," "always include alt text," "match the brand's tone." The evaluator's prompt is a concentrated filter for exactly these.
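That fresh-context point is easy to enforce mechanically: build the evaluator's prompt from the finished artifact and the rubric alone, never from the specialist's working trace. A sketch with illustrative names (SpecialistResult is not a fixed API):

from dataclasses import dataclass

@dataclass
class SpecialistResult:
    final_output: str    # the artifact to be judged
    working_trace: str   # the specialist's attempts, reasoning, partial work

def build_evaluator_prompt(result: SpecialistResult, rubric: str) -> str:
    # Deliberately omit result.working_trace so the evaluator's judgment
    # isn't anchored by the specialist's own reasoning.
    return (
        "Score the output below against each rubric criterion.\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"Output:\n{result.final_output}"
    )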
How to design a good rubric
- Cover the real failure modes. Audit past rejections. Write criteria for each recurring failure.
- Keep it short. Five to seven criteria is the sweet spot. Past ten, the evaluator's attention thins.
- Binary when possible. "Does every claim cite a source?" is better than "Is the sourcing good?" 1–5 scales work too but need explicit anchors.
- Include pass/fail examples. Two worked examples — one that should pass, one that should fail — calibrates the evaluator dramatically.
- State the output format. Structured JSON with a passes boolean, a reasons array of strings, and a severity enum. Parsable by the orchestrator (see the sketch after this list).
- Revise on a cadence. Rubrics drift as the product evolves. Fold revisions into the meta-loop.
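A sketch of what parsing that verdict might look like on the orchestrator side. The field names (passes, reasons, severity) follow the bullet above; the severity levels are illustrative, not a fixed scale:

import json
from enum import Enum

class Severity(str, Enum):
    BLOCKER = "blocker"   # assumed levels; pick whatever your routing needs
    MAJOR = "major"
    MINOR = "minor"

def parse_verdict(raw: str) -> dict:
    # Parse and validate the evaluator's JSON so the orchestrator can branch
    # on it instead of pattern-matching free text. Raises on malformed output.
    verdict = json.loads(raw)
    if not isinstance(verdict.get("passes"), bool):
        raise ValueError("verdict missing boolean 'passes'")
    reasons = [str(r) for r in verdict.get("reasons", [])]
    severity = Severity(verdict.get("severity", "minor"))
    return {"passes": verdict["passes"], "reasons": reasons, "severity": severity}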
When to place the gate
Not every specialist output needs a gate. Over-gating makes the system slow and expensive. Rules of thumb (a small decision sketch follows the list):
- Gate outputs that cross a trust boundary. Anything that will be seen by a customer, published externally, written to production, or used as input to a high-stakes downstream step.
- Gate outputs where failure is expensive. Code that will be deployed. Emails that will be sent to a list. Legal claims that could create liability.
- Skip gates for disposable intermediate work. A research synthesis that only the content specialist will read probably doesn't need a full evaluator pass — a quick schema check is enough.
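Those rules of thumb reduce to a small routing decision. A sketch with illustrative flag names; a real system would derive these from task metadata:

def gate_level(crosses_trust_boundary: bool, failure_is_expensive: bool) -> str:
    # Full rubric-based evaluation where exposure or cost is high;
    # a cheap structural check is enough for disposable intermediate work.
    if crosses_trust_boundary or failure_is_expensive:
        return "full_evaluator_gate"
    return "schema_check"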
Real-world example
Content specialist drafts a landing page. The Evaluator Gate runs this rubric:
Rubric: Landing-page evaluation
1. Voice match: output matches brand voice samples (pass/fail)
2. Claim accuracy: every product claim reflects features
listed in the product manifest (pass/fail)
3. Call-to-action: one primary CTA above the fold (pass/fail)
4. Accessibility: alt text, heading hierarchy, color
contrast in tokens (pass/fail)
5. Legal: no unsupported performance claims, no testimonials
without attribution (pass/fail)
6. SEO basics: title < 60 chars, description 120–155 chars (pass/fail)
Output:
passes (bool), failures (array of criterion + reason + severity)
On a draft that's missing alt text and makes a claim the product manifest doesn't support, the evaluator returns passes: false with two failures. The CEO routes the draft back to Content with the specific notes. Second pass clears. The page goes to preview. No customer sees the broken draft.
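For that draft, the verdict the orchestrator receives might look like the following. The shape follows the rubric's output spec above; the reason strings are illustrative:

{
  "passes": false,
  "failures": [
    {
      "criterion": "Claim accuracy",
      "reason": "Draft claims a feature not listed in the product manifest",
      "severity": "blocker"
    },
    {
      "criterion": "Accessibility",
      "reason": "Images are missing alt text",
      "severity": "major"
    }
  ]
}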
How Black Box implements the Evaluator Gate
Black Box ships a dedicated Evaluator specialist. Its constitution file defines standard rubrics for each specialist's output shape (coding, content, research, design, business ops). Skill Packs extend the rubric catalog with domain-specific criteria — the Newsletter Skill Pack adds deliverability checks; the Real Estate Skill Pack adds fair-housing compliance.
Every evaluator verdict — pass or reject, with reasons — appears in the Action Feed, so owners can inspect the quality checks the system ran on their behalf. When the evaluator systematically mis-calls a class of outputs, the meta-loop catches it and revises the rubric. See features for the Evaluator's live role in sessions.
Key takeaways
- The Evaluator Gate is a mandatory quality check at a workflow boundary.
- Separation from the producer beats self-assessment on catch rate.
- Rubric design — short, binary, with examples — is most of the engineering work.
- Gate where failure is expensive; skip where intermediate work is disposable.
- Black Box ships a dedicated Evaluator specialist; rubrics extend via Skill Packs and the meta-loop.
Frequently asked questions
What's LLM-as-judge?
Using a model to evaluate the output of another model against a rubric. The base technique behind evaluator agents.
Why not self-judge?
Self-marking is lenient. Separation beats it on catch rate.
How do I write a rubric?
5–7 binary criteria covering real failure modes, with pass/fail examples.
Same model as the specialist?
Not required. Best practice is a different model or at least a different prompt.
What if the evaluator is wrong?
Surfaces in the Action Feed. Owners override. Meta-loop revises the rubric.
Related reading
Quality gates built in
Black Box runs Evaluator Gates on every specialist output. You see the verdicts; broken work stops at the gate.
By Web4Guru · Published April 23, 2026