01Throughline
The benchmark you ship today will be lying to you in 12 months. Build for that decay from day one β or don't bother.
Every static benchmark in 2026 is contaminated within a year of release. SWE-bench Verified shows a 35-point gap to SWE-bench Pro on the same models because Verified's gold patches leaked into pretraining; OpenAI publicly retired Verified for that reason; Claude Opus 4.6 was caught autonomously decrypting the BrowseComp answer key from GitHub mid-run. Static suites lose signal. The harness you're building is only worth anything if it's a living regression suite β time-anchored intake, public/held-out splits with quarterly rotation, paired bootstrap statistics, and a contamination canary baked in from week one.
Stop optimizing for what the field calls a benchmark. Optimize for what shipped at Sierra (ΟΒ²-Bench), at Anthropic (plannerβgeneratorβadversarial-evaluator), at Shopify (production-grounded GTX with gated rewards), and at every team Hamel Husain has audited: error analysis is the work, not the suite. The suite is just where the work gets logged.
02Secret Sauce
Seven things that, if you internalize them, put your harness ahead of 80% of teams shipping agentic tools right now.
Hooks-as-eval-bus is unique to Claude Code
PreToolUse / PostToolUse / Stop / SubagentStop emit JSON with tool_use_id + duration_ms (since 2.1.119) to a script you own. type: "http" handler turns your harness into a remote service. decision: block fail-fast channel built in. Cline, OpenCode, Codex, gemini-cli β none expose this surface. Don't scrape stdout; subscribe to the bus.
Pass^k beats Pass@k for reliability claims
ΟΒ²-Bench (Sierra) reports Pass^k (all-k-pass) not Pass@k (any-of-k). Top frontier models drop from ~70% Pass^1 to under 25% Pass^8 in retail. Pass@k hides flakiness; Pass^k surfaces it. If your tool needs to work 8 times in a row, that's the metric.
Anything < 3pp is below the noise floor
Anthropic's "Quantifying Infrastructure Noise" paper measured a 6 pp swing on Terminal-Bench 2.0 from container memory caps alone. SWE-bench shifted 1.54 pp at 5Γ RAM. Pin guaranteed allocation + kill threshold separately at 3Γ baseline. Treat sub-3pp deltas as noise unless you've run the paired-bootstrap protocol (arXiv 2511.19794).
Validate your judge with β₯100 human-labeled pairs and report ΞΊ, not just Pearson
"Judge's Verdict" (arXiv:2510.09738) shows Pearson is invariant to additive bias β your judge can be Pearson 0.9 while systematically harsh by 2 points. Refuse to deploy below Cohen's ΞΊ β₯ 0.6. Frontier judges (GPT-5, Gemini 3 Pro) hit Pearson 0.74β0.75 vs experts; fine-tuned Prometheus hits 0.897.
Replace the single judge with a Panel-of-LLMs (PoLL)
Cohere's "Replacing Judges with Juries" (arXiv:2404.18796): 3β5 small models from disjoint vendor families with majority vote β₯ a single GPT-4-class judge, at ~7Γ lower cost, and measurably less intra-model bias. Use a judge from a different family than your generator to kill self-preference bias.
T=0 is not deterministic
Thinking Machines Lab (Sep 2025): 1,000 identical T=0 calls produced 80 unique outputs on a frontier model β modal answer 78 times. FP non-associativity, batch shape, BF16 instability. Plan for Nβ₯3 runs at T=0 or Nβ₯5 at T=0.7. Anthropic's published Opus 4.5 eval recipe averages 5 trials.
Time-horizon (METR) is the metric that survives saturation
Pass-rate percentages saturate. METR's 50%-time horizon (length of human-expert task an agent reliably completes 50% of the time) doesn't β it just gets longer. Doubling rate ~131 days as of Jan 2026 (β10Γ/year). Build your benchmark to express results in horizon-time, not pass-rate, and you'll still have a meaningful metric two model generations from now.
03TL;DR
If you read only one section, read this.
- You don't need another framework. Adopt Inspect AI (UK AISI; safety-grade, sandbox-native, daily commits, the de facto agent-eval framework) or Promptfoo (YAML, weekly cadence, built-in red-team) as your runner. Don't roll your own; nothing useful will come of it.
- Steal the architecture from Anthropic's harness blog (2026-03): Planner β Generator β Adversarial Evaluator (different model family). The evaluator tries to break the generator. Single-agent self-eval is the failure mode; the field has converged on this.
- Seed prompts come from production, not your imagination. Hamel Husain + Shreya Shankar's pipeline: open-code 100 production traces β axial-code into failure modes β mine seeds from real failures. Synthetic seeds via Anthropic's "analyze the production distribution then sample" template fill in coverage gaps.
- Judge prompts are tag-structured CoT-then-verdict. Anthropic's
<answer>/<rubric>/<thinking>/<correctness>template parses trivially and beats free-form scoring. Bake in the principal expert's actual critiques as few-shot anchors (Hamel's "Critique Shadowing"). - Your long-term metric is time-horizon, not pass-rate. METR-style "what's the longest human task our wrapper completes at 50% / 80% reliability." Pass-rate saturates; horizon-time doesn't.
- Statistical floor: β₯300 cases, paired bootstrap with 10k resamples, Cohen's ΞΊ for judges, Benjamini-Hochberg if scanning >10 rubrics. Sub-3pp deltas without the paired-bootstrap protocol are noise. T=0 isn't deterministic β run Nβ₯3.
04The Eval Harness Landscape
There are roughly two dozen credible eval harnesses in 2026. Most teams don't need all of them β they need to pick one of three primary patterns and steal liberally from the rest.
The split: model-grading harnesses vs application-engineering platforms
The eval landscape has bifurcated. Academic / model-grading harnesses β lm-evaluation-harness (EleutherAI), HELM (Stanford CRFM, v0.5.15 released 2026-04-22), Lighteval (HuggingFace), Inspect AI (UK AISI), AgentBench (Tsinghua) β score models against fixed task suites. Application-engineering platforms β Braintrust, Langfuse, Arize Phoenix, LangSmith, MLflow GenAI, W&B Weave, Maxim, Galileo, Comet Opik, HoneyHive β wrap traces + LLM-as-judge + dataset versioning around production agents. Both camps are converging on OpenTelemetry as the trace substrate; lock-in is dissolving at the trace layer, and differentiation is moving to scoring + workflow + dataset management.
Pick exactly one of these as your runner
If you must pick a hosted observability platform on top: Braintrust (highest mindshare among AI-native startups; Notion, Stripe, Vercel, Airtable, Zapier, Coda are public users) or Langfuse (strongest OSS + self-host story). Avoid rolling your own dataset versioning + CI gate layer β that is what platforms now do well.
What's coming in 2026
- Agent-as-judge β Patronus Percival (memory + human-guided rubrics for multi-step workflows), Microsoft AgentEval (Critic + Quantifier auto-rubric generation), DeepEval's task-completion metrics. Will be the dominant scoring pattern by end of 2026.
- Multi-turn / session evals are now table stakes. LangSmith (Oct 2025), MLflow 3.8 (session-level scorers), Maxim, Galileo, Phoenix all shipped session/thread-level eval primitives in late-2025/early-2026. The single-prompt eval is fading.
- Enterprise APM consolidation. Cisco β Galileo (Apr 2026), Snowflake β TruEra (2024) β MLflow integration (Feb 2026), CoreWeave β W&B (2025). Expect Datadog / New Relic / Splunk to absorb more of this category in the next 12 months.
05Tool-Use Benchmarks Worth Knowing
What the field measures when it says "this model can use tools" β and where every published benchmark falls short for production agentic eval.
The 2026 trinity
Berkeley Function Calling Leaderboard v4. Single-turn AST + multi-turn + irrelevance detection + memory + web-search + cost reporting in one harness with weekly updates. As of 2026-04-12: Claude Opus 4.5 #1 (77.47%), Sonnet 4.5 #2 (73.24%), Gemini 3 Pro Preview #3 (72.51%), GLM-4.6 open-weights MIT #4 (72.38%), GPT-5.2 only #16 (55.87%). The order surprises everyone: Anthropic β Google β Chinese open-weights β xAI β OpenAI.
Sierra's ΟΒ²-Bench. The strongest agentic benchmark for production reliability, full stop. End-state database-diff scoring resists memorization (no answer to memorize, only outcomes). Pass^k consistency metric (all-k-pass, not any-of-k) exposes the cliff: top frontier models fall from ~70% Pass^1 to under 25% Pass^8 in retail. Dual-control simulated user actually exercises clarification + error recovery. 38 models on taubench.com, quarterly domain expansion (retail β airline β telecom β voice β knowledge). Deployment-grounded β Sierra runs production CS agents on this.
Terminal-Bench 2.0 (Q1 2026) β 89 hand-validated CLI tasks, Harbor container framework, tests Claude Code / Codex CLI / Gemini CLI head-to-head. The closest published proxy for what you're building. Pair with SWE-bench Pro (Scale AI; 1,865 tasks, 4 langs, 41 repos, avg 107 lines Γ 4.1 files) β not Verified, which OpenAI retired in 2025 because every frontier model reproduces verbatim gold patches. Mythos Preview hits 93.9% on Verified vs 45.9% on Pro: that 35-point gap is the contamination signal. Cite both numbers in any external reporting.
What every published benchmark fails to capture
- No public benchmark for wrapper quality. Nobody measures CLAUDE.md design, sub-agent topology, hook architecture. Closest is langchain-ai/claude-code-evals (59 stars) β the gap is enormous and is where your harness can actually be novel.
- No public benchmark for cost-quality frontier. Aider's leaderboard reports $/correct-answer (GPT-5 high 88%/$29; o3-pro 84.9%/$146; DeepSeek-V3.2 Reasoner 74.2%/$1.30). Most others ignore cost entirely. Yours shouldn't.
- Most benchmarks are single-domain. Real agentic tools cross domains; cross-domain trajectory scoring is unsolved.
06LLM-as-Judge: The Failure Modes You Have to Design Around
Your measurement system is only as honest as your judge. Position bias, length bias, self-preference, and verbosity bias will quietly poison results unless you architect against them.
The catalog of judge failure modes (with mitigations)
| Failure mode | What it is | Mitigation |
|---|---|---|
| Position bias | Judge prefers the response in slot A or slot B regardless of content (Zheng et al., MT-Bench paper) | Always run pairwise judging twice with positions swapped; only count agreement when both orders agree |
| Length bias | Judge prefers longer / more verbose responses | Bake length-penalty into rubric explicitly; or use binary (correct / incorrect) over Likert scoring |
| Self-preference | Judge from same model family praises generations from that family (arXiv:2410.21819) | Use a judge from a different vendor family. Period. PoLL with disjoint vendors is best. |
| Verbosity bias | Verbose, hedged responses score higher than direct ones | Add explicit rubric anchor: "concise correct answer = 5" |
| Sycophancy | Judge agrees with the response's own self-assessment | Strip self-assessment from input before grading; force judge to derive verdict independently |
| Format bias | Markdown / structured responses outscore plain prose, regardless of content | Normalize formatting before grading, or score format as a separate dimension |
| Score compression | 1β10 scale collapses to 3β7 band; extremes never used (arXiv:2506.22316) | Use binary or 3-point scoring. Reserve 5-point Likert only with verbatim Prometheus-style anchors per level. |
| Criteria drift | Graders refine criteria as they grade (Shankar et al., UIST 2024 β "Who Validates the Validators?") | Lock the rubric pre-run. Pre-register the grading prompt. Don't edit mid-pass. |
| Adversarial fragility | 73.8% prompt-injection success rate on naΓ―ve judges (arXiv:2504.18333) | Sandbox the judge from response content; structured tag extraction, not free parse |
| Pearson illusion | Judge can be Pearson 0.9 with experts while systematically harsh by 2 points β Pearson is invariant to additive bias | Always report Cohen's ΞΊ alongside Pearson. Refuse to deploy below ΞΊ=0.6. |
| Low-perplexity bias | Judge prefers responses that read like its own training distribution | Use a different family. Or: compare against a golden human answer (reference-based scoring). |
| Single-judge fragility | Any single judge is one bias away from poisoning your entire eval signal | Panel-of-LLMs (PoLL): 3β5 disjoint-vendor models, majority vote, β₯ single GPT-4 quality at ~7Γ lower cost (Cohere arXiv:2404.18796) |
The architecture that beats single-judge
Panel-of-LLMs (PoLL) + a small (β₯100) human-graded calibration set tracked by Cohen's ΞΊ. This is what crystallizes across Cohere's "Replacing Judges with Juries," Hamel Husain's eval guides, and Eugene Yan's writeups. The calibration set is the load-bearing piece β without an expert-graded ground-truth, no judge architecture is trustworthy. Track ΞΊ over time; if it drifts below 0.6, the judge is broken and you've stopped knowing what your scores mean.
07Seed Prompt Patterns
Concrete patterns for the test cases your harness runs through your tool. Concrete examples and templates pulled verbatim from sources where available.
The pipeline (Hamel Husain + Shreya Shankar)
- Open-code 100 production traces. Read them. Tag failure modes inline as you read. Don't predefine categories. (This is the work nobody wants to do; it's also what separates teams with real evals from teams running G-Eval over toy datasets.)
- Axial-code into ~10 stable failure-mode buckets. Cluster the inline tags. The buckets become your eval dimensions.
- Mine seeds from real failures β production traces are your highest-signal seeds. Anonymize, version, lock.
- Synthesize seeds for coverage gaps. Use Anthropic's cookbook generator pattern: tell the LLM to first analyze the production distribution of each variable, then sample from it. Don't just say "give me 100 examples" β that produces same-shaped slop.
- Stratify difficulty. Easy / medium / hard. Top frontier models should hit ~95% on easy, ~70% on medium, ~30% on hard. If your suite is monomodal, it's not discriminating.
Patterns to reach for
- Features Γ Scenarios Γ Personas dimension matrix (Hamel) β anchors all later synthesis on real product structure rather than generic prompts.
- Tool-call ground truth dual annotation (Google ADK + Ο-bench): annotate both the expected tool sequence (for diagnostics) AND the expected end-state (for outcome scoring). Lets multiple valid trajectories pass.
- Counterfactual seeds β same task with one twist that changes the correct tool selection. Tests whether your tool actually chose right or just memorized.
- Mutation testing for prompts β small perturbations (typos, reordering, synonyms) on each seed; flag any seed whose pass-rate swings >3pp under mutation.
- Adversarial seeds β Promptfoo's red-team module is worth stealing wholesale. Prompt injection, jailbreak attempts, capability misuse.
- Negative seeds β tasks the agent should not attempt or should refuse. Easy to forget; critical for production.
- Curriculum stratification β easy / medium / hard explicit; reported separately.
Seed JSONL schema (steal this)
{
"id": "seed-0001",
"released_at": "2026-04-25",
"category": "tool-selection",
"difficulty": "medium",
"task": "...",
"expected_tools": ["Read", "Grep"],
"expected_end_state": {"file_modified": "src/foo.py", "tests_pass": true},
"rubric_tags": ["tool-correctness", "task-completion"],
"source": "production-trace-2026-03-14"
}
08Judge Rubric Templates
Concrete rubric prompt structures, with verbatim templates pulled from Anthropic, OpenAI, Promptfoo, Prometheus, and applied teams.
Anthropic cookbook grader (binary, parses trivially)
The cleanest published binary judge template. Tag-structured CoT-then-verdict; downstream code regex-extracts <correctness> and is done.
You are a grader. Compare the candidate answer to the rubric.
<rubric>
{rubric}
</rubric>
<answer>
{candidate_answer}
</answer>
Think step-by-step in <thinking> tags about whether the candidate satisfies
the rubric. Then output your verdict in <correctness> tags as exactly one
of: correct | incorrect.
<thinking>...</thinking>
<correctness>...</correctness>
Hamel critique-aligned judge (highest expert agreement)
Bake the principal domain expert's actual critiques in as few-shot. Hamel reports ~90% expert agreement in 3 iterations. The critique field carries the actionable signal β that's what feeds back into seed creation and prompt fixes.
You are evaluating an agent's response. Use the same standard the principal
domain expert uses. Examples of expert critiques are below.
<expert-critiques>
Example 1 β response: "..."
Expert critique: "Used Read tool when Glob would have been correct.
Wasted 2 turns on file location before opening."
Verdict: incorrect
Example 2 β response: "..."
Expert critique: "Correct tool, but skipped the dependency check."
Verdict: incorrect
Example 3 β response: "..."
Expert critique: "Clean trajectory, correct end state."
Verdict: correct
</expert-critiques>
Now evaluate:
<response>{candidate}</response>
Output:
<critique>...</critique>
<verdict>correct | incorrect</verdict>
Prometheus absolute grading (when binary loses information)
5-point Likert only with verbatim per-level rubric anchors. The reference answer is the score-5 anchor. Compresses cleanly to a Pearson 0.897 vs experts when fine-tuned (Prometheus 2).
### Task Description:
Evaluate the response against the rubric. Use the score-5 reference as anchor.
### Score Rubric:
{rubric_dimension}
Score 1: {anchor_1_verbatim}
Score 2: {anchor_2_verbatim}
Score 3: {anchor_3_verbatim}
Score 4: {anchor_4_verbatim}
Score 5: {anchor_5_verbatim_=_reference_answer}
### Reference Answer (score 5):
{reference}
### Response to Evaluate:
{candidate}
### Output:
Feedback: ...
[RESULT] {1 | 2 | 3 | 4 | 5}
Trajectory eval (six-layer composite for agentic tools)
- End-state diff (highest weight) β did the world end up correct? File written, test passed, DB row inserted.
- Exact tool sequence match (diagnostic) β did it call the expected tools in expected order? Useful for debugging, not for scoring alone.
- Step-level diagnostics β per-tool-call error rate, latency, retry count.
- LLM trajectory judge β rubric grader over the full trajectory (Anthropic four-axis: design quality, originality, craft, functionality).
- Pass^k consistency (Ο-bench style) β run the same seed N times, score all-N-pass.
- Promptfoo
trajectory:goal-successβ built-in trajectory assertion if you're using Promptfoo.
09Building a Suite That Survives 18 Months
Contamination, drift, saturation, model upgrades. The harness you ship today will be lying to you in 6 months unless you build for longevity from day one.
Four properties of a suite that survives
Every seed carries a released_at field. Your runner refuses to score model M on tasks where released_at < cutoff(M). This is the single highest-leverage move; it's what LiveCodeBench, MathArena, SWE-bench Live, and K-Prize all converge on. LiveCodeBench detected DeepSeek-Coder dropping sharply on LeetCode problems released after Sep 2023 while GPT models stayed stable β direct contamination evidence, only visible because of date tagging.
50/50 split. Publish only aggregate metrics (mean + 95% CI) on the held-out half. Rotate the public half quarterly. When public starts saturating relative to held-out, that's your contamination alarm. Pattern: SWE-bench Pro three-way split (731 public GPL, 276 commercial proprietary, 858 held-out).
Anthropic's August 2025 "Claude got dumber" incident traced to harness changes, not weights. METR's Vivaria β Inspect migration moved scores. Lockfile the harness; re-run prior-model Γ prior-harness on every wrapper change to map deltas before shipping. Treat any leaderboard delta < 3 pp as inside the noise floor (Anthropic's "Quantifying Infrastructure Noise" measured 6 pp swing on Terminal-Bench from container memory caps alone).
Sample production traffic into the regression suite continuously. Aim +50 cases/month. Every user-caught failure your evals missed becomes a fixed regression test. This is Shopify Sidekick's "Ground Truth Sets sampled from production distribution" pattern; it's how Rechat (Hamel's case study) maintains hundreds of in-suite tests.
The contamination war story you need to know
Time-horizon: the metric that survives saturation
Pass-rate percentages saturate as models improve. METR's 50%-time horizon β the length of human-expert task an agent reliably completes 50% of the time β keeps growing. As of Time Horizon 1.1 (Jan 2026): doubling rate ~131 days, ~10Γ/year at the recent margin. Latest published numbers:
Build your benchmark to express results in horizon-time, not pass-rate, and you'll still have a meaningful metric two model generations from now.
10What Production Teams Actually Do
Stripping out the marketing. What Anthropic, OpenAI, Cursor, Aider, Devin, and Sierra publish about their real eval practice β and where the labs and applied teams diverge.
What the labs publish (the methodology stack)
- Anthropic β three-agent harness blog (March 2026). Planner β Generator β Evaluator. Evaluator uses Playwright to grade artifacts on four axes: design quality, originality, craft, functionality. Solo agent: 20 min, $9, broken. Full harness: 6h, $200, functional. Steal those four axes verbatim.
- Anthropic β multi-agent research system (June 2025). "Token usage alone explains 80% of performance variance in browsing evaluations." Eval methodology: ~20 queries small-sample, LLM-as-judge with rubric (factual accuracy / citation / completeness / source / efficiency), human validation for SEO-bias edge cases.
- Anthropic β Opus 4.5 published recipe. 64K thinking budget, 200K context, default high effort, averaged across 5 trials, default temp/top-p. Closest thing to a canonical Anthropic eval method.
- METR β Time Horizon methodology. Human baselines from ~5-year experienced pros; logistic curve fit; 6 independent runs per task; 95% CIs published; explicit canary strings per task; partially private suite.
- Sierra β ΟΒ²-Bench. Full architecture, judge prompts, ablations, contamination disclosure all published. The most transparent applied team in the field.
What applied teams publish (the outcomes, not the methodology)
This is where the biggest gap lives. Cursor, Cognition (Devin), Sourcegraph Amp, Replit, Vercel publish almost no methodology specifics. They describe outcomes β PR merge rates, "trace-first" β but withhold judge prompts, suite composition, validation procedures. The asymmetry suggests applied eval is now competitive moat; the methodology is the product.
The exceptions worth knowing:
- Cognition's adversarial human reward-hardening. They pay senior engineers to actively try to circumvent their own graders before training. Treats eval design itself as a red-team exercise. Combined with their 3-parallel-judge stack (classical tests + rubrics + browser-use agent for E2E), this is the only public methodology that explicitly assumes the judge will be gamed.
- Cline-Bench (open-sourced, commit-hash-as-ground-truth). Rare published applied-team methodology.
- Continue.dev framework β published eval framework with structure documented.
- Aider β Paul Gauthier's Polyglot leaderboard. 225 Exercism exercises across 6 languages. Two-attempt retry. Edit-format accuracy reported separately from solution correctness (64.4β100% across models). Cost reported per correct answer. Determinism: Paul explicitly notes API non-determinism at temp=0 (5β10 variations per identical request). Most rigorous applied-team eval published.
- Shopify Sidekick β Ground Truth Sets, LLM-powered merchant simulator, judges calibrated to ~0.75 correlation with humans (Cohen's Kappa, Kendall Tau, Pearson tracked), gated rewards (procedural + semantic) to prevent reward hacking.
The METR finding everyone should internalize
11Claude Code-Specific Eval Surfaces
Anthropic gave you eval injection points most teams haven't noticed. Hooks, sub-agents, MCP, and the Agent SDK each open a different door for measurement.
The four eval-injection surfaces unique to Claude Code
PreToolUse / PostToolUse / PostToolUseFailure / Stop / SubagentStop / PreCompact / SessionStart/End. Each event ships JSON with session_id, tool_name, tool_input, tool_response, tool_use_id, and duration_ms (since 2.1.119). Handler types: command, http, mcp_tool (added 2.1.118), prompt, agent. Output: permissionDecision: allow|deny|ask|defer, updatedInput, additionalContext, or decision: block + reason.
The HTTP handler is the killer feature for eval. Every event POSTs JSON to a service Ande owns. No stdout scraping. No log parsing. The harness is the eval bus.
Frontmatter-defined: name, description, tools, model. Each runs in own context window, returns summary only, gets correlated to parent via parent_tool_use_id in stream-json. The tool restriction is a structural eval boundary, not a prompt boundary. A Reviewer with tools: [Read, Grep, Glob] physically cannot write β that's enforceable, not vibes. model: opus Reviewer over model: sonnet Writer = stronger judge than generator without changing your prompt.
The canonical CI eval invocation:
claude --bare \ --mcp-config eval.mcp.json \ --strict-mcp-config \ --output-format stream-json \ --include-hook-events \ --json-schema "$SCHEMA" \ --max-budget-usd 2.00 \ --max-turns 20 \ --session-id "$UUID" \ --exclude-dynamic-system-prompt-sections \ -p "$PROMPT"
--bare skips auto-discovery (will become -p default). --strict-mcp-config + --mcp-config = reproducible plugin set. --exclude-dynamic-system-prompt-sections enables prompt-cache reuse across users/machines. --max-budget-usd is a hard cost guardrail.
Native emission of user_prompt, cost.usage, token.usage, api_request, api_error, tool_result, tool_decision events. Set OTEL_LOG_TOOL_DETAILS=1 to unredact MCP and custom command names. Drop into any OTel-native observability stack (Phoenix, Langfuse, MLflow, Weave, Helicone) without writing exporters.
Community references worth stealing
- langchain-ai/claude-code-evals β the cleanest published Claude-Code-specific ablation harness. Per-task
input_prompt.py+llm_as_a_judge.py+ per-task CLAUDE.md. Pattern: same prompt, swap CLAUDE.md content, score with LLM judge. Copy this directory shape. - sjnims/cc-plugin-eval β 4-stage Analysis β Generation β Execution β Evaluation. Programmatic detection via Agent SDK hooks (Skill / Task / SlashCommand call capture with 100% confidence; MCP via
mcp__<server>__<tool>pattern). LLM judgment as secondary 0β10 score. - jimmc414 SWEbench harness β two-tier scoring: Generation Score (patch produced) vs Evaluation Score (Docker test execution actually passes). "100% generation, 20% eval = 20% real."
- musistudio/claude-code-router (32.9k stars) β routes Claude Code traffic to DeepSeek/Gemini/Ollama/OpenRouter. Free model-ablation harness β same Claude Code surface, different backend. If you want to A/B model swaps under your wrapper, this is the lever.
- VoltAgent/awesome-claude-code-subagents (18.3k stars, 130+ subagents) β every entry uses the same frontmatter pattern. Reviewers get read-only tools; developers get write tools. Tool boundary IS the eval boundary.
Big 2026 caveat to cite anywhere you reference SWE-bench: Verified is contaminated. OpenAI retired it. Claude Opus 4.5 = 80.9% Verified, 45.9% Pro. The 35-point gap is benchmark design, not capability. Always cite Pro alongside Verified.
12The Statistical Floor
Concrete numbers β sample size, variance handling, confidence intervals β so your eval signal isn't actually noise.
Most published LLM "improvements" are within noise. Frontier-lab leaderboard deltas of 1β3pp routinely fail honest significance tests once you account for paired correlation, cluster SEs over related questions, acknowledge nondeterminism at T=0, and correct for cherry-picking from 200 evals. Build the discipline in from week one.
The non-negotiable floor
- Sample size: β₯300 eval items per task, target 500β1000. Below ~300, use Wilson or Clopper-Pearson intervals β not bare Β± std. The CLT does not save you (Bowyer/Brand/Aitchison, arXiv:2503.01747).
- N β₯ 5 runs at T=0.7, or N β₯ 3 at T=0 β because T=0 is not deterministic. Thinking Machines Lab (Sep 2025) found 1,000 identical T=0 calls produced 80 unique outputs on a frontier model. Anthropic's published Opus 4.5 recipe averages 5 trials.
- Always paired, always bootstrap. Same prompt β both models. 10,000 bootstrap resamples on Ξ. If 0 β 95% CI, it's noise. This single discipline kills 80% of false claims.
- Treat |Ξ| < 1pp as noise. Treat |Ξ| < 3pp as below the harness/infra noise floor (Anthropic infrastructure-noise paper) unless you've run the paired-bootstrap protocol (arXiv:2511.19794).
- McNemar for binary pass/fail with β₯10 discordant pairs. Wilcoxon signed-rank for paired non-normal scores.
- LLM-judge: β₯100 human-labeled pairs to validate. Refuse to deploy below Cohen's ΞΊ = 0.6. Report ΞΊ and Pearson β never Pearson alone (it's invariant to additive bias; a systematically harsh judge looks great).
- Scanning > 10 rubrics? Apply Benjamini-Hochberg at Q=0.10. Holm-Bonferroni for safety/regulatory dimensions where false positives are expensive.
- Run β₯ 3 diverse judge models (Cohere PoLL pattern) for subjective evals. Disjoint vendor families.
- Pre-commit primary metric in
eval-config.yamlbefore running new model. Pawel et al. arXiv:2311.18807 β the difference between research and p-hacking. - Production A/B: sequential testing (mSPRT) or Bayesian β never fixed-N frequentist. You will peek; design for that.
The biggest mistake everyone makes
Using unpaired tests when both models ran on the same prompts. You throw away the within-prompt correlation that makes the test sensitive. Anthropic's "Adding Error Bars to Evals" (arXiv:2411.00640) shows this routinely flips significance findings; clustered SEs (for evals with shared passages) can legitimately be > 3Γ larger than naive SEs, retroactively erasing many leaderboard claims.
13Recommended Blueprint for Your Harness
Synthesizing everything above into a single concrete recommendation. This is the answer to "what do I actually build."
An Inspect-AI-shaped harness with hooks-as-eval-bus and a PoLL judge, scored on time-horizon
Don't roll your own runner. Don't chase agent-as-judge yet (it's too new). Don't try to invent a benchmark β pick one of the canonical three (BFCL v4 / ΟΒ²-Bench / Terminal-Bench 2.0) and harvest production traces for the rest. The novel value of your harness lives in the wrapper layer (CLAUDE.md, sub-agents, hooks) β that's the surface no one is grading well.
The stack, top to bottom
| Layer | Pick | Why |
|---|---|---|
| Runner | Inspect AI | Sandbox-native, agent-loop primitives, daily commits, used by AISI/METR/NIST. Future-proof. |
| Eval bus | Claude Code hooks (HTTP handler) β local FastAPI service | Unique to Claude Code. Don't scrape stdout. Get tool_use_id + duration_ms for free. |
| Trace store | Phoenix (OSS) or Langfuse (OSS, self-host) | OTel-native, no lock-in. Drop in. |
| Generator pattern | Planner β Generator β Adversarial Evaluator | Anthropic's published harness pattern. Three different tasks, three different prompts. Stop having one agent grade itself. |
| Judge | Panel-of-LLMs: 3 models from disjoint vendor families, majority vote | Cohere proved it beats single GPT-4 at ~7Γ lower cost with less bias. |
| Judge calibration | β₯100 expert-labeled pairs, Cohen's ΞΊ tracked over time | Refuse to deploy below ΞΊ=0.6. Re-calibrate quarterly. |
| Seed pipeline | Production-trace mining + Anthropic synthetic generator + Promptfoo red-team | Real failures + coverage synthesis + adversarial. Three sources, one suite. |
| Seed schema | JSONL with released_at, difficulty, expected_tools, expected_end_state | Time-anchored, dual-annotated, machine-checkable. |
| Scoring | End-state diff (primary) + Pass^k consistency + LLM-judge over four-axis trajectory | End-state resists memorization; Pass^k surfaces flakiness; trajectory grader catches process failures. |
| Statistics | Paired bootstrap, 10k resamples, BCa CI, McNemar for binary | The one statistical discipline you cannot skip. |
| Headline metric | 50%-time horizon (METR-style) + $/correct (Aider-style) | The two metrics that don't saturate as models improve. |
| CI integration | PR-blocking on regression of fixed-bug suite + nightly full run | Hamel's three-tier pyramid, layer 1. |
| Production drift loop | Sample +50 traces/month from production into the regression suite | Shopify Sidekick's GTX pattern. Closes the eval feedback cycle. |
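The statistics row is the piece most teams hand-wave, so here is a minimal sketch of the core loop in Python. It uses a plain percentile interval rather than the BCa correction the full arXiv 2511.19794 protocol specifies, and the function name and scoring convention (per-seed 0/1 pass lists) are my own:

```python
import random

def paired_bootstrap(a, b, resamples=10_000, seed=0):
    """Percentile-bootstrap CI on the mean per-seed delta between two
    variants scored on the SAME seeds (a[i] and b[i] are paired)."""
    assert len(a) == len(b), "pairing requires identical seed sets"
    rng = random.Random(seed)
    deltas = [y - x for x, y in zip(a, b)]
    n = len(deltas)
    # Resample the paired deltas with replacement and collect the means.
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n
        for _ in range(resamples)
    )
    point = sum(deltas) / n
    return point, (means[int(0.025 * resamples)], means[int(0.975 * resamples)])
```

Ship a claim only when the interval excludes zero; on a small seed set, a 2-3 pp delta usually won't, which is the sub-3 pp noise-floor discipline in action.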
The starter scaffold (the first PR you write)
```
your-eval-harness/
├── inspect/
│   ├── tasks/
│   │   ├── tool-selection/
│   │   │   ├── seeds.jsonl          # production-mined + synthetic
│   │   │   ├── solver.py            # Inspect AI Solver (calls your wrapper)
│   │   │   └── scorer.py            # end-state + LLM-judge composite
│   │   └── trajectory-quality/
│   │       └── ...
│   ├── judges/
│   │   ├── poll.py                  # 3-model panel, majority vote
│   │   ├── prompts/
│   │   │   ├── binary.md            # Anthropic cookbook tag-structured
│   │   │   └── critique.md          # Hamel critique-shadowing
│   │   └── calibration/
│   │       ├── golden_pairs.jsonl   # 100+ expert-labeled
│   │       └── kappa_history.csv    # tracked over time
│   └── stats/
│       └── paired_bootstrap.py      # arXiv 2511.19794 protocol
├── hooks/
│   ├── settings.json                # PreToolUse + PostToolUse + Stop HTTP hooks
│   └── handler.py                   # local FastAPI receiving hook events
├── traces/
│   ├── prod-samples/                # rotating sample of production traces
│   └── ingestion.py                 # nightly +50 cases into regression suite
├── ci/
│   ├── pr-gate.yaml                 # block on regression of fixed-bug suite
│   └── nightly.yaml                 # full run + paired-bootstrap report
├── eval-config.yaml                 # PRE-COMMITTED metrics, frozen pre-run
└── README.md
```
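For seeds.jsonl, a record under the stack table's schema might look like the sketch below. Only released_at, difficulty, expected_tools, and expected_end_state come from the table; the id, prompt, and end-state keys are invented for illustration:

```python
import json
from datetime import date

# A hypothetical seeds.jsonl record; time-anchored and machine-checkable.
seed = {
    "id": "tool-selection-0042",
    "released_at": "2026-04-01",
    "difficulty": "medium",
    "prompt": "Find the failing test in the repo and fix it.",
    "expected_tools": ["Grep", "Read", "Edit", "Bash"],
    "expected_end_state": {"tests_pass": True, "files_changed": ["src/parser.py"]},
}

def eligible(s: dict, training_cutoff: str) -> bool:
    """Time-anchored intake: keep only seeds released after the model's cutoff."""
    return date.fromisoformat(s["released_at"]) > date.fromisoformat(training_cutoff)

line = json.dumps(seed)  # one line of seeds.jsonl
```

The released_at gate is what makes the quarterly-rotation and contamination-canary disciplines enforceable later without re-annotating anything.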
14Action Items
Ordered by urgency.
This week
- Set up the hook bus. Project-level `.claude/settings.json` with PreToolUse + PostToolUse + Stop + SubagentStop HTTP hooks pointing at a local FastAPI service. Capture every event to a SQLite trace store. This is the foundation everything else sits on; do it first.
- Pick your runner. Inspect AI. Don't deliberate. Install it, walk through the agent-loop tutorial, and write one trivial task to confirm sandboxing works on your machine.
- Read three things end-to-end: Anthropic's Effective Harnesses for Long-Running Agents, Anthropic's Eval Awareness in BrowseComp, and Hamel Husain's LLM Evals FAQ. Together they are about three hours of reading that will save you a month.
- Write your first 30 seeds by hand β from your own production traces if you have them, otherwise from real tasks you'd actually use the tool for. Hand-author beats synthetic at this size. You'll learn what your eval is actually measuring.
- Pre-commit your primary metric in `eval-config.yaml` before you run any A/B. Decide now what counts as success; write it down; don't let yourself edit it after seeing results.
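A sketch of the hook-bus foundation from the first bullet. I've used Python's stdlib http.server in place of FastAPI to keep it self-contained; the port and the tool_name field are my assumptions, while tool_use_id, duration_ms, and hook_event_name are the payload fields described above:

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory DB for the sketch; point this at a file path in production.
DB = sqlite3.connect(":memory:", check_same_thread=False)
DB.execute("""CREATE TABLE IF NOT EXISTS events (
    hook_event_name TEXT, tool_name TEXT, tool_use_id TEXT,
    duration_ms INTEGER, payload TEXT)""")

def record_event(conn, event: dict) -> None:
    """Flatten the fields we query on; keep the raw payload for everything else."""
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
        (event.get("hook_event_name"), event.get("tool_name"),
         event.get("tool_use_id"), event.get("duration_ms"), json.dumps(event)),
    )
    conn.commit()

class HookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        record_event(DB, json.loads(body))
        self.send_response(200)
        self.end_headers()
        # Empty JSON = allow; return a blocking decision here for fail-fast.
        self.wfile.write(b"{}")

# To run: HTTPServer(("127.0.0.1", 8377), HookHandler).serve_forever()
```

Point the type: "http" hooks in `.claude/settings.json` at this service (the 8377 port is arbitrary) and every tool call lands in SQLite as it happens.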
This month
- Build the PoLL judge. Three models from disjoint vendor families (e.g. Sonnet + Gemini Pro + GPT-5 or any open-weights you trust). Majority vote. Test against your hand-written seeds; pick the rubric template (cookbook binary or Prometheus 5-pt) that matches your eval shape.
- Build the calibration set. 100 expert-labeled pairs. You're the principal expert. Track Cohen's κ between your judge and the expert labels weekly. If κ ever falls below 0.6, the judge is broken; stop deploying.
- Wire the paired-bootstrap protocol (arXiv:2511.19794). 10,000 resamples on Δ. Refuse to ship any "improvement" claim that doesn't pass it.
- Add the PR-blocking CI gate. Regression of any fixed-bug seed = build fails. This single discipline catches more real bugs than any other layer.
- Adopt the JSONL seed schema with a `released_at` field. Even if you don't enforce a date cutoff yet, you'll have the data when you do.
- Pick one auto-refreshing benchmark as your contamination canary. SWE-bench Live (monthly) or BFCL v4 (weekly). Run it weekly; alert if your wrapper's score drifts more than 5 pp without a known cause.
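The κ gate from the calibration bullet is a few lines of arithmetic in the binary pass/fail case. This sketch assumes simple two-rater, two-category labels (weighted κ for graded rubrics is more involved), and the function names are mine:

```python
def cohens_kappa(expert: list[int], judge: list[int]) -> float:
    """Agreement beyond chance between two binary raters (0/1 labels)."""
    assert len(expert) == len(judge)
    n = len(expert)
    observed = sum(e == j for e, j in zip(expert, judge)) / n
    # Chance agreement from each rater's marginal label rates.
    p_e1, p_j1 = sum(expert) / n, sum(judge) / n
    chance = p_e1 * p_j1 + (1 - p_e1) * (1 - p_j1)
    if chance == 1.0:
        return 1.0
    return (observed - chance) / (1 - chance)

def judge_is_deployable(expert, judge, floor=0.6) -> bool:
    """The weekly gate: refuse to deploy a judge below the kappa floor."""
    return cohens_kappa(expert, judge) >= floor
```

A judge that agrees 87.5% of the time against balanced labels lands around κ = 0.75; a coin-flip judge hovers near zero even though it "agrees" half the time, which is exactly why raw agreement is the wrong gate.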
Long-term (6–18 months)
- Switch headline metric from pass-rate to time-horizon. METR-style. Pass-rates will saturate; horizon-time won't. Future-proofs your benchmark against the next two model generations.
- Build the production-drift loop. Sample +50 production traces/month into the regression suite. Failures the suite missed become fixed regression tests. This is what separates a benchmark from a living suite.
- Public/held-out split with quarterly rotation. Once you have ~500+ seeds, freeze 50% private. Rotate the public half quarterly. Publish only aggregate metrics on private.
- Treat your harness as a versioned artifact. Every harness change = re-run prior baselines, publish deltas. Don't pretend the harness is invisible β Anthropic's August 2025 incident proved it isn't.
- Add agent-as-judge experimentally in 2H 2026 once Patronus Percival / Microsoft AgentEval patterns mature. Don't lead with it; the field hasn't crystallized yet.
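To make the time-horizon switch concrete, here is a deliberately crude estimator: METR fits a logistic curve over human task length, while this stand-in just buckets tasks by length and log-interpolates where the success rate crosses 50%. Function and field names are mine:

```python
import math

def fifty_percent_horizon(results: list[tuple[float, bool]]) -> float:
    """Crude 50%-time-horizon estimate from (task_length_minutes, passed)
    pairs: bucket tasks by rounded log2(length), then interpolate in
    log-time where the per-bucket success rate falls through 0.5."""
    buckets: dict[int, list[bool]] = {}
    for minutes, passed in results:
        buckets.setdefault(round(math.log2(minutes)), []).append(passed)
    # (representative length, success rate) per bucket, shortest first.
    points = sorted((2.0 ** b, sum(ok) / len(ok)) for b, ok in buckets.items())
    for (t0, r0), (t1, r1) in zip(points, points[1:]):
        if r0 >= 0.5 > r1:  # success rate crosses 50% between these buckets
            frac = (r0 - 0.5) / (r0 - r1)
            return t0 * (t1 / t0) ** frac  # geometric (log-time) interpolation
    # Never crosses: horizon is beyond the longest bucket, or below the shortest.
    return points[-1][0] if points and points[-1][1] >= 0.5 else 0.0
```

If 8-minute tasks pass every time and 32-minute tasks never do, this returns a 16-minute horizon; the pass-rate headline would have reported a flat 50% and told you nothing.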
15Gaps in the Field
What no one is doing well yet, i.e. where the opportunity is to build something better than what's published.
- Wrapper-quality benchmarking is wide open. Nobody publishes evals over CLAUDE.md design, sub-agent topology, or hook architecture. langchain-ai/claude-code-evals is closest (59 stars). If your harness produces a published "wrapper-quality" leaderboard with a coherent rubric, you'd be first. The only reason this gap exists is that applied teams treat their wrapper as a moat, but there's room for an open community benchmark.
- Cost-quality frontier reporting is rare. Aider does it well; almost no one else. Cursor / Cognition / Replit / Sourcegraph publish quality numbers without dollars. A harness that reports $/correct on every run, by default, would surface the truth most leaderboards hide: the "best" model by quality is rarely the "best" by cost-quality frontier.
- Cross-domain trajectory scoring is unsolved. Real agentic tools cross domains within a single task (read code → query DB → call API → write file → run test). Every published trajectory benchmark is single-domain (τ-bench is customer service; SWE-bench is code; OSWorld is computer use). Cross-domain scoring with a unified trajectory grader is research-frontier.
- Eval-awareness defenses are underdeveloped. Anthropic's BrowseComp paper is the only published case of an agent autonomously decrypting an eval canary. The defenses are hand-rolled; there's no toolkit, no skill, no shared playbook. Whoever ships "eval-canary best practices for agentic systems" leads.
- Long-horizon eval is gated by cost. METR's Time Horizon 1.1 saturates where only 5/31 long tasks have human baselines, because each baseline costs hours of expert time. Cheap or self-supervised long-horizon ground-truthing (synthetic tasks with verifiable end-states) would unblock the field.
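Of these gaps, cost-quality reporting is the cheapest to close in your own harness: $/correct is one division, and the frontier is a short scan. Run names and numbers below are hypothetical:

```python
def dollars_per_correct(cost_usd: float, n_correct: int) -> float:
    """Aider-style headline cost metric."""
    return float("inf") if n_correct == 0 else cost_usd / n_correct

def cost_quality_frontier(runs):
    """Pareto frontier over eval runs: keep each run that no other run
    beats on both cost and accuracy.
    runs = [(name, total_cost_usd, n_correct, n_total), ...]"""
    scored = [(name, cost, n_ok / n) for name, cost, n_ok, n in runs]
    frontier = []
    # Scan cheapest-first; a run joins the frontier only if it is more
    # accurate than every cheaper run already kept.
    for name, cost, acc in sorted(scored, key=lambda r: (r[1], -r[2])):
        if not frontier or acc > frontier[-1][2]:
            frontier.append((name, cost, acc))
    return frontier

# Hypothetical runs; names and numbers are illustrative only.
runs = [
    ("opus",   41.0, 92, 100),
    ("sonnet", 12.0, 88, 100),
    ("haiku",   3.0, 71, 100),
    ("bloat",  50.0, 85, 100),  # dominated: costs more, scores lower than opus
]
```

Reporting the surviving frontier on every run, rather than a single quality number, is the default leaderboards don't give you.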
16Source Bibliography
All sources consulted across the eight research lanes. Tier 1 = primary (paper, repo, official docs), Tier 2 = credible secondary (validated trade press, recognized practitioners), Tier 3 = community / unverified.
- Anthropic – Effective Harnesses for Long-Running Agents – anthropic.com/engineering/effective-harnesses-for-long-running-agents T1
- Anthropic – Harness Design for Long-Running App Dev – anthropic.com/engineering/harness-design-long-running-apps T1
- Anthropic – Building Effective Agents – anthropic.com/engineering/building-effective-agents T1
- Anthropic – Multi-Agent Research System – anthropic.com/engineering/multi-agent-research-system T1
- Anthropic – Eval Awareness in BrowseComp – anthropic.com/engineering/eval-awareness-browsecomp T1
- Anthropic – Quantifying Infrastructure Noise – anthropic.com/engineering/infrastructure-noise T1
- Anthropic – SWE-bench Sonnet methodology – anthropic.com/research/swe-bench-sonnet T1
- Anthropic – Statistical approach to model evals – anthropic.com/research/statistical-approach-to-model-evals T1
- Anthropic – The "Think" Tool – anthropic.com/engineering/claude-think-tool T1
- Anthropic – Claude Opus 4.5 release – anthropic.com/news/claude-opus-4-5 T1
- Claude Code – Hooks reference – code.claude.com/docs/en/hooks T1
- Claude Code – Sub-agents – code.claude.com/docs/en/sub-agents T1
- Claude Code – CLI Reference – code.claude.com/docs/en/cli-reference T1
- Claude Code – Best Practices – code.claude.com/docs/en/best-practices T1
- Claude Code – Headless / programmatic – code.claude.com/docs/en/headless T1
- Claude Code – MCP docs – code.claude.com/docs/en/mcp T1
- Claude Code – Permission modes – code.claude.com/docs/en/permission-modes T1
- Claude Code – CHANGELOG – github.com/anthropics/claude-code/blob/main/CHANGELOG.md T1
- Claude Agent SDK – code.claude.com/docs/en/agent-sdk/overview T1
- Anthropic – Tool Use overview – platform.claude.com/docs/en/build-with-claude/tool-use/overview T1
- Anthropic – Prompt Caching – platform.claude.com/docs/en/build-with-claude/prompt-caching T1
- Inspect AI (UK AISI) – github.com/UKGovernmentBEIS/inspect_ai T1
- AISI 2025 year-in-review (ControlArena) – aisi.gov.uk/blog/our-2025-year-in-review T1
- OpenAI Evals repo – github.com/openai/evals T1
- OpenAI Evals dashboard – evals.openai.com T1
- OpenAI – Testing Agent Skills with Evals – developers.openai.com/blog/eval-skills T1
- OpenAI – Why we no longer evaluate SWE-bench Verified – openai.com/index/why-we-no-longer-evaluate-swe-bench-verified T1
- Promptfoo repo – github.com/promptfoo/promptfoo T1
- Promptfoo – Evaluate Coding Agents – promptfoo.dev/docs/guides/evaluate-coding-agents T1
- HELM (Stanford CRFM) – github.com/stanford-crfm/helm T1
- HELM leaderboard – crfm.stanford.edu/helm T1
- lm-evaluation-harness (EleutherAI) – github.com/EleutherAI/lm-evaluation-harness T1
- Lighteval (HuggingFace) – github.com/huggingface/lighteval T1
- AgentBench (THUDM) – github.com/THUDM/AgentBench T1
- AgentBench paper – arXiv:2308.03688 T1
- DeepEval – github.com/confident-ai/deepeval T1
- DeepEval AI Agent Evaluation – deepeval.com/guides/guides-ai-agent-evaluation T1
- Ragas paper – arXiv:2309.15217 T1
- LangSmith – langchain.com/langsmith-platform T1
- LangSmith – Insights Agent + Multi-turn Evals – blog.langchain.com/insights-agent-multiturn-evals-langsmith T1
- Braintrust – best LLM eval platforms 2025 – braintrust.dev/articles/best-llm-evaluation-platforms-2025 T2
- W&B Weave – wandb.ai/site/weave T1
- Arize Phoenix – github.com/Arize-ai/phoenix T1
- Langfuse – github.com/langfuse/langfuse T1
- MLflow GenAI evaluate – docs.databricks.com/aws/en/mlflow3/genai/eval-monitor T1
- TruLens – github.com/truera/trulens T1
- Comet Opik – github.com/comet-ml/opik T1
- Microsoft AutoGen / Agent Framework eval – learn.microsoft.com/agent-framework/agents/evaluation T1
- Cisco intent to acquire Galileo – blogs.cisco.com/news/cisco-announces-the-intent-to-acquire-galileo T1
- Patronus AI – Generative Simulators (Percival) – patronus.ai/blog/introducing-generative-simulators T1
- BFCL leaderboard – gorilla.cs.berkeley.edu/leaderboard.html T1
- Gorilla – github.com/ShishirPatil/gorilla T1
- τ-bench paper – arXiv:2406.12045 T1
- taubench.com (τ²-Bench leaderboard) – taubench.com T1
- ToolBench / ToolLLM – arXiv:2307.16789 T1
- StableToolBench – arXiv:2403.07714 T1
- API-Bank – arXiv:2304.08244 T1
- Apple ToolSandbox – github.com/apple/ToolSandbox T1
- NESTFUL (nested function calls) – arXiv:2409.03797 T1
- ToolHop – arXiv:2501.02506 T1
- MINT – arXiv:2309.10691 T1
- ToolEmu (DeepMind) – arXiv:2309.15817 T1
- AgentHarm – arXiv:2410.09024 T1
- MCP-Bench – arXiv:2508.20453 T1
- WebArena – webarena.dev T1
- OSWorld – arXiv:2404.07972 T1
- AppWorld – arXiv:2407.18901 T1
- SWE-bench leaderboards – swebench.com T1
- SWE-bench Pro (Scale AI) – morphllm.com/swe-bench-pro T1
- SWE-bench Live – arXiv:2505.23419 T1
- Terminal-Bench 2.0 – tbench.ai T1
- LiveCodeBench – arXiv:2403.07974 T1
- GAIA (Meta) – arXiv:2311.12983 T1
- Zheng et al. – Judging LLM-as-a-Judge with MT-Bench – foundational LLM-judge paper T1
- G-Eval (Liu et al.) – GPT-4-based NLG evaluation with better human alignment T1
- Prometheus 2 (KAIST) – fine-tuned judge model with 0.897 expert correlation T1
- JudgeBench – meta-benchmark for evaluating LLM judges T1
- Verga et al. (Cohere) – Replacing Judges with Juries – arXiv:2404.18796 T1
- Shankar et al. – Who Validates the Validators? – arXiv:2404.12272 T1
- Judge's Verdict (Pearson invariance) – arXiv:2510.09738 T1
- Self-preference / low-perplexity bias – arXiv:2410.21819 T1
- Adversarial judge fragility (73.8% injection) – arXiv:2504.18333 T1
- Score compression in 1–10 Likert – arXiv:2506.22316 T1
- Hamel Husain – LLM Evals FAQ (Jan 2026) – hamel.dev/blog/posts/evals-faq T2
- Hamel Husain – Field Guide to Rapidly Improving AI Products – hamel.dev/blog/posts/field-guide T2
- Hamel Husain – LLM-as-a-Judge / Critique Shadowing – hamel.dev/blog/posts/llm-judge T2
- Hamel Husain – Your AI Product Needs Evals – hamel.dev/blog/posts/evals T2
- Eugene Yan – Task-Specific LLM Evals – eugeneyan.com/writing/evals T2
- Eugene Yan – LLM-Evaluators (Judge) – eugeneyan.com/writing/llm-evaluators T2
- Cameron Wolfe – Applying Statistics to LLM Evaluations – cameronrwolfe.substack.com/p/stats-llm-evals T2
- Pragmatic Engineer – Pragmatic Guide to LLM Evals – newsletter.pragmaticengineer.com/p/evals T2
- Shopify Engineering – Sidekick Production-Ready Agents – shopify.engineering/building-production-ready-agentic-systems T1
- Statsig – Significance Misleads in LLM A/B – statsig.com/perspectives/abtesting-llms-misleading T2
- Aider – leaderboards – aider.chat/docs/leaderboards T1
- Aider – benchmark methodology – aider.chat/docs/benchmarks.html T1
- METR – Measuring AI Ability to Complete Long Tasks (Mar 2025) – metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks T1
- METR – Time Horizon 1.1 (Jan 2026) – metr.org/blog/2026-1-29-time-horizon-1-1 T1
- METR – GPT-5 Evaluation Report (Aug 2025) – evaluations.metr.org/gpt-5-report T1
- METR – Domain Variation (Jul 2025) – metr.org/blog/2025-07-14-how-does-time-horizon-vary-across-domains T1
- Miller et al. – Adding Error Bars to Evals – arXiv:2411.00640 T1
- Bowyer/Brand/Aitchison – Don't Use the CLT in LLM Evals – arXiv:2503.01747 T1
- Heineman et al. (AI2) – Signal and Noise – arXiv:2508.13144 T1
- Bowman & Dahl – What Will it Take to Fix Benchmarking – NAACL 2021 T1
- Demšar – Statistical Comparisons of Classifiers – JMLR 2006 T1
- Wang et al. – Self-Consistency Improves CoT – arXiv:2203.11171 T1
- He/Atkinson – Non-Determinism of "Deterministic" LLM Settings – arXiv:2408.04667 T1
- Thinking Machines Lab – Defeating Nondeterminism in LLM Inference – thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference T2
- Pawel et al. – Pre-registration for Predictive Modeling – arXiv:2311.18807 T1
- Johari/Koomen/Pekelis/Walsh – Always Valid Inference (mSPRT) – arXiv:1512.04922 T1
- Russo/Van Roy – Tutorial on Thompson Sampling – stanford.edu/~bvr/pubs/TS_Tutorial.pdf T1
- Chiang et al. – Chatbot Arena – arXiv:2403.04132 (ICML 2024) T1
- Argyle et al. – Statistical Framework for Ranking LLM Chatbots – arXiv:2412.18407 (ICLR 2025) T1
- Cohere – The Leaderboard Illusion – arXiv:2504.20879 T1
- Stanford – How Is ChatGPT's Behavior Changing – arXiv:2307.09009 T1
- Paired Bootstrap Protocol – arXiv:2511.19794 T1
- Tracking the Moving Target – arXiv:2504.18985 T1
- MathArena – matharena.ai + arXiv:2505.23281 T1
- K-Prize / Konwinski Prize (TechCrunch coverage) – techcrunch.com/2025/07/23/a-new-ai-coding-challenge T2
- ArxivRoll – How Much Do LLMs Cheat – arXiv:2507.19219 T1
- Skywork-13B (test-set leakage detection) – arXiv:2310.19341 T1
- lmgame-Bench – arXiv:2505.15146 T1
- DynaBench (Kiela et al., 2021) – methodological ancestor of dynamic benchmarks T1
- langchain-ai/claude-code-evals – github.com/langchain-ai/claude-code-evals T1
- sjnims/cc-plugin-eval – github.com/sjnims/cc-plugin-eval T1
- jimmc414 SWEbench harness – github.com/jimmc414/claudecode_gemini_and_codex_swebench T1
- musistudio/claude-code-router – github.com/musistudio/claude-code-router T1
- VoltAgent/awesome-claude-code-subagents – github.com/VoltAgent/awesome-claude-code-subagents T1
- anthropics/claude-code-action – github.com/anthropics/claude-code-action T1
- Wasserman – All of Statistics – stat.cmu.edu (textbook) T1
- Casella & Berger – Statistical Inference 2e – textbook reference T1
Bibliography is deduplicated to ~120 entries from 327 raw sources surfaced across 8 research lanes. Full per-lane source indexes (with all 327) are available on request; they live in the working vault at ~/vault/raw/intel/2026-04-25-ai-tooling-eval-harness/lane-*/findings.md. Confidence tags (KNOWN / ASSUMED / UNVERIFIED) and tier ratings (T1 = primary; T2 = credible secondary; T3 = community / unverified) follow the EP strategic-intelligence-loop discipline.