Web4Guru AI Operations

The Evaluator Gate

Every finished deliverable passes an independent review before it reaches you. The gate is enforced in code, not just in the prompt, with an audit trail for every verdict and every bypass.

How it works

Cards of type report_ready represent finished deliverables that go to the owner — a live landing page, a shipped email, a drafted contract. Unlike trace events or milestone notifications, these carry stakes. The evaluator gate is the safeguard: a report_ready card cannot leave the engine unless the associated deliverable has been reviewed by the Evaluator specialist and received a PASS verdict.
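The card types above suggest a shape like the following. This is a hypothetical sketch, not the engine's real schema: only `type: "report_ready"`, `metadata.deliverableId`, and `metadata.evaluatorBypass` are named in the text; the other fields and type names are assumptions.

```typescript
// Hypothetical owner-card shape; field names other than the ones the
// docs name explicitly are illustrative assumptions.
interface OwnerCard {
  type: 'trace' | 'milestone' | 'report_ready' | 'error_alert';
  title: string;
  metadata?: {
    deliverableId?: string;   // required in practice for report_ready
    evaluatorBypass?: boolean; // explicit, logged opt-out of the gate
  };
}

const card: OwnerCard = {
  type: 'report_ready',
  title: 'Your landing page is live!',
  metadata: { deliverableId: 'landing-page-bootstrap-1714000000000' },
};
```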

The enforcement lives at the bottom of apps/engine/src/ceo/tools.ts, in the emit_owner_card handler. Every report card must include a metadata.deliverableId — an identifier the CEO previously passed into review_deliverable_with_evaluator. That tool's handler calls recordEvaluatorVerdict(deliverableId, verdict), which writes the verdict into a module-scoped Map. When emit_owner_card fires with type: "report_ready", the handler checks the Map. Missing ID or non-PASS verdict → the tool call is rejected with a structured error telling the CEO what's wrong.
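The bookkeeping described here can be sketched in a few lines. A minimal sketch, assuming recordEvaluatorVerdict and isDeliverableEvaluatorPassed are thin wrappers around the module-scoped Map (their exact bodies are not shown in the excerpt below):

```typescript
// Sketch of the verdict bookkeeping: a module-scoped Map keyed by
// deliverable id, written by the review tool and read by emit_owner_card.
type Verdict = 'PASS' | 'FAIL' | 'NEEDS_REVISION';

const deliverableIdToEvaluatorVerdict = new Map<string, Verdict>();

function recordEvaluatorVerdict(deliverableId: string, verdict: Verdict): void {
  deliverableIdToEvaluatorVerdict.set(deliverableId, verdict);
}

function isDeliverableEvaluatorPassed(deliverableId: string): boolean {
  // Only an explicit PASS lets a report_ready card through;
  // an unknown id and a NEEDS_REVISION verdict both fail the check.
  return deliverableIdToEvaluatorVerdict.get(deliverableId) === 'PASS';
}
```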

Every check is written to ~/.blackbox/memory/evaluator-audit.jsonl with the card id, deliverable id, verdict, and bypass flag. Bypasses (metadata.evaluatorBypass: true) are allowed, but they log a WARN line to stderr too, so a bypass is loud by design. You can trust that a report card you see in your Inbox either passed review or is explicitly marked as unreviewed.
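The audit record described above — card id, deliverable id, verdict, bypass flag — maps naturally onto one JSON line per check. A sketch under those assumptions (the field names and helper names here are illustrative, not the engine's actual code):

```typescript
import { appendFileSync } from 'node:fs';
import { join } from 'node:path';
import { homedir } from 'node:os';

// Hypothetical audit-record shape inferred from the fields the docs list.
interface AuditEntry {
  cardId: string;
  deliverableId: string | null;
  verdict: string | null;
  bypass: boolean;
}

// Serialization is split out so the JSONL shape is easy to verify.
function formatAuditLine(entry: AuditEntry): string {
  return JSON.stringify(entry);
}

function appendAuditLine(entry: AuditEntry): void {
  const path = join(homedir(), '.blackbox', 'memory', 'evaluator-audit.jsonl');
  appendFileSync(path, formatAuditLine(entry) + '\n');
  if (entry.bypass) {
    // A bypass is loud by design: mirror it to stderr as a WARN line.
    console.error(`WARN: evaluator bypass for deliverable ${entry.deliverableId}`);
  }
}
```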

This is a rare thing in agent frameworks: a code-enforced gate between the agent's output and the human's inbox. Prompts alone can be ignored, compressed out, or hallucinated around. The map lookup is not.

What you see in the UI

Nothing, usually — that's the point. A passing deliverable surfaces as a clean report_ready card. But if the CEO tried to ship something without review, the failure appears as a system error in the Diagnostics view, and the CEO is instructed to route back through the Evaluator. No half-reviewed reports make it to your Inbox.

The audit log is accessible to you — open Diagnostics to see every verdict the Evaluator has given and every bypass the CEO requested. A useful forensic tool when you're evaluating whether to trust the system with more autonomy.

A concrete example

The Landing Page Bootstrap playbook builds a site, deploys it to Railway, and calls review_deliverable_with_evaluator with deliverableId: "landing-page-bootstrap-1714000000000" and a specific criteria list (HTTP 200, keyword present, no TODO placeholders, mobile-viewport usability, semantic markup). The Evaluator returns PASS. The CEO then calls emit_owner_card with type: "report_ready" and metadata.deliverableId set to the same string. The code checks the Map, finds PASS, lets the card through. You see "Your landing page is live!" — and you know it actually passed the checks.
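The happy path above can be sketched as two tool-call payloads. The payload shapes are assumptions for illustration — only the tool names, the deliverableId value, and the criteria come from the example:

```typescript
// Sketch of the two tool calls in the Landing Page Bootstrap example.
// The `tool`/`args` wrapper is an assumed shape, not the engine schema.
const deliverableId = 'landing-page-bootstrap-1714000000000';

const reviewCall = {
  tool: 'review_deliverable_with_evaluator',
  args: {
    deliverableId,
    criteria: [
      'HTTP 200 at the deployed URL',
      'businessKind keyword present in the body',
      'no TODO placeholders',
      'usable at a mobile viewport',
      'semantic markup',
    ],
  },
};

// Only after the Evaluator returns PASS does the CEO emit the card,
// reusing the exact same deliverableId so the gate's Map lookup succeeds.
const reportCall = {
  tool: 'emit_owner_card',
  args: {
    type: 'report_ready',
    title: 'Your landing page is live!',
    metadata: { deliverableId },
  },
};
```

The crucial detail is that both calls carry the identical deliverableId string — a mismatch would make the gate reject the card as unreviewed.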

Technical details

// apps/engine/src/ceo/tools.ts
const deliverableIdToEvaluatorVerdict =
  new Map<string, 'PASS' | 'FAIL' | 'NEEDS_REVISION'>();

// Inside emit_owner_card handler, when type === 'report_ready':
if (!evaluatorBypass) {
  if (typeof deliverableId !== 'string' || !deliverableId) {
    return { isError: true, content: [{ type: 'text', text:
      'report_ready card rejected: no deliverableId in metadata. ' +
      'Include metadata.deliverableId (the identifier from your ' +
      'review_deliverable_with_evaluator call).' }] };
  }
  if (!isDeliverableEvaluatorPassed(deliverableId)) {
    // …reject with guidance to call review_deliverable_with_evaluator first…
  }
}

The Evaluator itself is barred from loading Skill Packs (see the exclusion in SKILL_PACK_TARGET_SPECIALISTS). It must stay an independent skeptic — if customers could extend it, the incentive to pressure-test would weaken.
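The exclusion can be pictured as a membership check against the target list. A hedged sketch — the constant name SKILL_PACK_TARGET_SPECIALISTS comes from the text, but the specialist names and the Set shape are illustrative assumptions:

```typescript
// Illustrative sketch: the Evaluator is deliberately absent from the
// set of specialists that Skill Packs may target. Names are placeholders.
const SKILL_PACK_TARGET_SPECIALISTS = new Set<string>([
  'builder',
  'marketer',
  'writer',
]);

function canLoadSkillPack(specialist: string): boolean {
  // Because 'evaluator' is never in the target set, no customer-supplied
  // pack can extend the Evaluator — it stays an independent skeptic.
  return SKILL_PACK_TARGET_SPECIALISTS.has(specialist);
}
```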

FAQ

Why enforce in code and not just in the prompt?

Prompts can be ignored. Code can't. The orchestration IP is load-bearing, and "the CEO forgot to call the evaluator" is not an acceptable failure mode for a card marked report_ready. Every other event type is unrestricted; report_ready is the one with stakes.

Can the CEO bypass the gate?

Yes — with an explicit metadata.evaluatorBypass: true flag. It's logged to ~/.blackbox/memory/evaluator-audit.jsonl with a WARN stderr line so a bypass can't be hidden. Designed for rare cases where the owner explicitly waives review.

What does the evaluator actually check?

Depends on the deliverable. For Landing Page Bootstrap: HTTP 200 at the URL, businessKind keyword in the body, no TODO/LOREM placeholders, mobile-viewport usability, single <h1>, <main> landmark, alt text on images. The CEO passes criteria into review_deliverable_with_evaluator; the Evaluator returns PASS, NEEDS_REVISION, or FAIL.

What happens on FAIL?

The CEO re-delegates to the producing specialist with the evaluator's feedback embedded in the brief. Playbooks cap this at 2 retries — a third failure escalates with error_alert. The retry loop is deterministic, not improvised.
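The deterministic retry loop can be sketched as follows — a minimal sketch under stated assumptions: the function and parameter names are illustrative stand-ins, and only the cap of 2 retries (3 attempts total) and the escalation on the third failure come from the text:

```typescript
// Sketch of the bounded produce → review loop: initial attempt plus at
// most 2 retries, each retry re-briefed with the evaluator's feedback.
type EvalVerdict = 'PASS' | 'FAIL' | 'NEEDS_REVISION';

const MAX_RETRIES = 2;

function runWithEvaluatorLoop(
  produce: (feedback?: string) => string,
  review: (work: string) => { verdict: EvalVerdict; feedback?: string },
): { ok: boolean; attempts: number } {
  let feedback: string | undefined;
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    const work = produce(feedback);
    const result = review(work);
    if (result.verdict === 'PASS') {
      return { ok: true, attempts: attempt + 1 };
    }
    // Re-delegate with the evaluator's feedback embedded in the brief.
    feedback = result.feedback;
  }
  // Third failure: stop retrying; the caller escalates with error_alert.
  return { ok: false, attempts: MAX_RETRIES + 1 };
}
```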

Try Black Box

The finished work you see in your Inbox has already been reviewed. That's the default, not the premium feature.