The Evaluator Gate
Every finished deliverable passes an independent review before it reaches you. Enforced in code, not just in the prompt, with an audit trail for every verdict and every bypass.
How it works
Cards of type report_ready represent finished
deliverables that go to the owner — a live landing page, a
shipped email, a drafted contract. Unlike trace events or
milestone notifications, these carry stakes. The evaluator
gate is the safeguard: a report_ready card
cannot leave the engine unless the associated deliverable
has been reviewed by the Evaluator specialist and received
a PASS verdict.
The enforcement lives at the bottom of
apps/engine/src/ceo/tools.ts, in the
emit_owner_card handler. Every report card
must include a metadata.deliverableId — an
identifier the CEO previously passed into
review_deliverable_with_evaluator. That tool's
handler calls recordEvaluatorVerdict(deliverableId, verdict),
which writes the verdict into a module-scoped Map. When
emit_owner_card fires with
type: "report_ready", the handler checks the
Map. Missing ID or non-PASS verdict → the tool call is
rejected with a structured error telling the CEO what's
wrong.
Every check is written to
~/.blackbox/memory/evaluator-audit.jsonl with
the card id, deliverable id, verdict, and bypass flag.
Bypasses (metadata.evaluatorBypass: true) are
allowed, but they log a WARN line to stderr too, so a bypass
is loud by design. You can trust that a report card you see
in your Inbox either passed review or is explicitly marked
as unreviewed.
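The audit line itself is plain JSONL. Here is a minimal sketch of what one record and its append path might look like; the four fields named above (card id, deliverable id, verdict, bypass flag) come from this document, while the timestamp field and the helper names (formatAuditLine, appendAuditRecord) are assumptions:

```typescript
// Sketch only: field and function names beyond the four documented
// fields (cardId, deliverableId, verdict, bypass) are assumptions.
import { appendFileSync, mkdirSync } from 'node:fs';
import { homedir } from 'node:os';
import { dirname, join } from 'node:path';

interface EvaluatorAuditRecord {
  ts: string;                 // ISO timestamp (assumed field)
  cardId: string;
  deliverableId: string | null;
  verdict: 'PASS' | 'FAIL' | 'NEEDS_REVISION' | null;
  bypass: boolean;
}

// One JSON object per line: the JSONL convention the audit file uses.
function formatAuditLine(record: EvaluatorAuditRecord): string {
  return JSON.stringify(record);
}

function appendAuditRecord(record: EvaluatorAuditRecord): void {
  const file = join(homedir(), '.blackbox', 'memory', 'evaluator-audit.jsonl');
  mkdirSync(dirname(file), { recursive: true });
  appendFileSync(file, formatAuditLine(record) + '\n');
  if (record.bypass) {
    // Bypasses are loud by design: mirrored to stderr as a WARN line.
    console.error(`WARN: evaluator gate bypassed for card ${record.cardId}`);
  }
}
```

Because each record is a single self-describing line, the log can be grepped or tailed without any tooling.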
This is a rare thing in agent frameworks: a code-enforced gate between the agent's output and the human's inbox. Prompts alone can be ignored, compressed out, or hallucinated around. The map lookup cannot.
What you see in the UI
Nothing, usually — that's the point. A passing deliverable
surfaces as a clean report_ready card. But if
the CEO tried to ship something without review, the failure
appears as a system error in the Diagnostics view, and the
CEO is instructed to route back through the Evaluator. No
half-reviewed reports make it to your Inbox.
The audit log is accessible to you — open Diagnostics to see every verdict the Evaluator has given and every bypass the CEO requested. A useful forensic tool when you're evaluating whether to trust the system with more autonomy.
A concrete example
The Landing Page Bootstrap playbook builds a site, deploys
it to Railway, and calls
review_deliverable_with_evaluator with
deliverableId: "landing-page-bootstrap-1714000000000"
and a specific criteria list (HTTP 200, keyword present,
no TODO placeholders, mobile-viewport usability, semantic
markup). The Evaluator returns PASS. The CEO then calls
emit_owner_card with type: "report_ready"
and metadata.deliverableId set to the same
string. The code checks the Map, finds PASS, lets the card
through. You see "Your landing page is live!" — and you know
it actually passed the checks.
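That happy path can be sketched as two tool calls that share one identifier. The payload shapes below are illustrative assumptions, not the real schemas, which live in apps/engine/src/ceo/tools.ts:

```typescript
// Illustrative payloads only; the real tool schemas may differ in detail.
const deliverableId = 'landing-page-bootstrap-1714000000000';

// Step 1: the CEO asks the Evaluator to review against explicit criteria.
const reviewCall = {
  tool: 'review_deliverable_with_evaluator',
  args: {
    deliverableId,
    criteria: [
      'HTTP 200 at the deployed URL',
      'businessKind keyword present in the body',
      'no TODO or LOREM placeholders',
      'usable at a mobile viewport',
      'semantic markup',
    ],
  },
};

// Step 2: after a PASS verdict, the CEO ships the card. The gate opens only
// because metadata.deliverableId matches the identifier that was reviewed.
const cardCall = {
  tool: 'emit_owner_card',
  args: {
    type: 'report_ready',
    title: 'Your landing page is live!',
    metadata: { deliverableId },
  },
};
```

The shared constant is the whole trick: if the CEO invents a different identifier for the card, the Map lookup finds no PASS and the card is rejected.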
Technical details
// apps/engine/src/ceo/tools.ts
const deliverableIdToEvaluatorVerdict =
  new Map<string, 'PASS' | 'FAIL' | 'NEEDS_REVISION'>();

// Inside emit_owner_card handler, when type === 'report_ready':
if (!evaluatorBypass) {
  if (typeof deliverableId !== 'string' || !deliverableId) {
    return {
      isError: true,
      content: [{
        type: 'text',
        text:
          'report_ready card rejected: no deliverableId in metadata. ' +
          'Include metadata.deliverableId (the identifier from your ' +
          'review_deliverable_with_evaluator call).',
      }],
    };
  }
  if (!isDeliverableEvaluatorPassed(deliverableId)) {
    // …reject with guidance to call review_deliverable_with_evaluator first…
  }
}
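The two helpers the excerpt references are not reproduced in this document; a minimal sketch of how they could fit together, assuming only the Map semantics described above:

```typescript
// Sketch only: the real implementations live in apps/engine/src/ceo/tools.ts.
type EvaluatorVerdict = 'PASS' | 'FAIL' | 'NEEDS_REVISION';

const deliverableIdToEvaluatorVerdict = new Map<string, EvaluatorVerdict>();

// Called by the review_deliverable_with_evaluator handler.
function recordEvaluatorVerdict(
  deliverableId: string,
  verdict: EvaluatorVerdict,
): void {
  deliverableIdToEvaluatorVerdict.set(deliverableId, verdict);
}

// Called by the emit_owner_card handler: only a recorded PASS opens the gate.
function isDeliverableEvaluatorPassed(deliverableId: string): boolean {
  return deliverableIdToEvaluatorVerdict.get(deliverableId) === 'PASS';
}
```

A NEEDS_REVISION or FAIL verdict, or no verdict at all, all resolve the same way at the gate: the lookup does not return PASS, so the card is rejected.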
The Evaluator itself is barred from loading Skill Packs
(see the exclusion in
SKILL_PACK_TARGET_SPECIALISTS). It must stay an
independent skeptic — if customers could extend it, the
incentive to pressure-test would weaken.
Related features
- The 18 specialists — the Evaluator is specialist #19, deliberately kept separate.
- Landing Page Bootstrap — the hero playbook that uses the gate.
- Circuit breaker — the other layer of the safety story.
Related concepts
FAQ
Why enforce in code and not just in the prompt?
Prompts can be ignored. Code can't. The orchestration IP is load-bearing, and "the CEO forgot to call the evaluator" is not an acceptable failure mode for a card marked report_ready. Every other event type is unrestricted; report_ready is the one with stakes.
Can the CEO bypass the gate?
Yes — with an explicit metadata.evaluatorBypass: true flag. It's logged to ~/.blackbox/memory/evaluator-audit.jsonl with a WARN stderr line so a bypass can't be hidden. Designed for rare cases where the owner explicitly waives review.
What does the evaluator actually check?
Depends on the deliverable. For Landing Page Bootstrap: HTTP 200 at the URL, businessKind keyword in the body, no TODO/LOREM placeholders, mobile-viewport usability, single <h1>, <main> landmark, alt text on images. The CEO passes criteria into review_deliverable_with_evaluator; the Evaluator returns PASS, NEEDS_REVISION, or FAIL.
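A few of the markup checks above are mechanical enough to sketch. This illustrates the kind of check described, not the Evaluator's actual code; checkLandingPageHtml is a hypothetical name:

```typescript
// Hypothetical sketch of placeholder and semantic-markup checks.
// Returns a list of failure messages; an empty list means the page passes.
function checkLandingPageHtml(html: string): string[] {
  const failures: string[] = [];
  if (/TODO|LOREM/i.test(html)) {
    failures.push('placeholder text found');
  }
  const h1Count = (html.match(/<h1[\s>]/gi) ?? []).length;
  if (h1Count !== 1) {
    failures.push(`expected exactly one <h1>, found ${h1Count}`);
  }
  if (!/<main[\s>]/i.test(html)) {
    failures.push('missing <main> landmark');
  }
  // Every <img> should carry an alt attribute.
  for (const img of html.match(/<img\b[^>]*>/gi) ?? []) {
    if (!/\balt=/i.test(img)) failures.push('img without alt text');
  }
  return failures;
}
```

The URL-level criteria (HTTP 200, keyword presence) would need a fetch against the deployed site and are omitted here.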
What happens on FAIL?
The CEO re-delegates to the producing specialist with the evaluator's feedback embedded in the brief. Playbooks cap this at 2 retries — a third failure escalates with error_alert. The retry loop is deterministic, not improvised.
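That bounded loop can be sketched directly. Only the 2-retry cap and the error_alert escalation come from the text; the function shape and every name below are assumptions:

```typescript
// Sketch of the deterministic retry loop: initial attempt plus up to
// MAX_RETRIES re-delegations, then escalation. Names are illustrative.
type ReviewVerdict = 'PASS' | 'FAIL' | 'NEEDS_REVISION';

const MAX_RETRIES = 2;

function produceWithReview(
  produce: (feedback?: string) => string,
  review: (deliverable: string) => { verdict: ReviewVerdict; feedback: string },
  escalate: (reason: string) => void,
): string | null {
  let feedback: string | undefined;
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    const deliverable = produce(feedback);
    const result = review(deliverable);
    if (result.verdict === 'PASS') return deliverable;
    // Re-delegate with the evaluator's feedback embedded in the brief.
    feedback = result.feedback;
  }
  // A third failure escalates instead of improvising more retries.
  escalate('error_alert: deliverable failed evaluator review after 2 retries');
  return null;
}
```

The loop bound, not the model, decides when to stop retrying, which is what makes the failure mode predictable.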
Try Black Box
The finished work you see in your Inbox has already been reviewed. That's the default, not the premium feature.