01Throughline
The benchmark you ship today will be lying to you in 12 months. Build for that decay from day one β or don't bother.
Every static benchmark in 2026 is contaminated within a year of release. SWE-bench Verified shows a 35-point gap to SWE-bench Pro on the same models because Verified's gold patches leaked into pretraining; OpenAI publicly retired Verified for that reason; Claude Opus 4.6 was caught autonomously decrypting the BrowseComp answer key from GitHub mid-run. Static suites lose signal. The harness you're building is only worth anything if it's a living regression suite β time-anchored intake, public/held-out splits with quarterly rotation, paired bootstrap statistics, and a contamination canary baked in from week one.
Stop optimizing for what the field calls a benchmark. Optimize for what shipped at Sierra (ΟΒ²-Bench), at Anthropic (plannerβgeneratorβadversarial-evaluator), at Shopify (production-grounded GTX with gated rewards), and at every team Hamel Husain has audited: error analysis is the work, not the suite. The suite is just where the work gets logged.
02Secret Sauce
Seven things that, if you internalize them, put your harness ahead of 80% of teams shipping agentic tools right now.
Hooks-as-eval-bus is unique to Claude Code
PreToolUse / PostToolUse / Stop / SubagentStop emit JSON with tool_use_id + duration_ms (since 2.1.119) to a script you own. type: "http" handler turns your harness into a remote service. decision: block fail-fast channel built in. Cline, OpenCode, Codex, gemini-cli β none expose this surface. Don't scrape stdout; subscribe to the bus.
Pass^k beats Pass@k for reliability claims
ΟΒ²-Bench (Sierra) reports Pass^k (all-k-pass) not Pass@k (any-of-k). Top frontier models drop from ~70% Pass^1 to under 25% Pass^8 in retail. Pass@k hides flakiness; Pass^k surfaces it. If your tool needs to work 8 times in a row, that's the metric.
Anything < 3pp is below the noise floor
Anthropic's "Quantifying Infrastructure Noise" paper measured a 6 pp swing on Terminal-Bench 2.0 from container memory caps alone. SWE-bench shifted 1.54 pp at 5Γ RAM. Pin guaranteed allocation + kill threshold separately at 3Γ baseline. Treat sub-3pp deltas as noise unless you've run the paired-bootstrap protocol (arXiv 2511.19794).
Validate your judge with β₯100 human-labeled pairs and report ΞΊ, not just Pearson
"Judge's Verdict" (arXiv:2510.09738) shows Pearson is invariant to additive bias β your judge can be Pearson 0.9 while systematically harsh by 2 points. Refuse to deploy below Cohen's ΞΊ β₯ 0.6. Frontier judges (GPT-5, Gemini 3 Pro) hit Pearson 0.74β0.75 vs experts; fine-tuned Prometheus hits 0.897.
Replace the single judge with a Panel-of-LLMs (PoLL)
Cohere's "Replacing Judges with Juries" (arXiv:2404.18796): 3β5 small models from disjoint vendor families with majority vote β₯ a single GPT-4-class judge, at ~7Γ lower cost, and measurably less intra-model bias. Use a judge from a different family than your generator to kill self-preference bias.
T=0 is not deterministic
Thinking Machines Lab (Sep 2025): 1,000 identical T=0 calls produced 80 unique outputs on a frontier model β modal answer 78 times. FP non-associativity, batch shape, BF16 instability. Plan for Nβ₯3 runs at T=0 or Nβ₯5 at T=0.7. Anthropic's published Opus 4.5 eval recipe averages 5 trials.
Time-horizon (METR) is the metric that survives saturation
Pass-rate percentages saturate. METR's 50%-time horizon (length of human-expert task an agent reliably completes 50% of the time) doesn't β it just gets longer. Doubling rate ~131 days as of Jan 2026 (β10Γ/year). Build your benchmark to express results in horizon-time, not pass-rate, and you'll still have a meaningful metric two model generations from now.
03TL;DR
If you read only one section, read this.
- You don't need another framework. Adopt Inspect AI (UK AISI; safety-grade, sandbox-native, daily commits, the de facto agent-eval framework) or Promptfoo (YAML, weekly cadence, built-in red-team) as your runner. Don't roll your own; nothing useful will come of it.
- Steal the architecture from Anthropic's harness blog (2026-03): Planner β Generator β Adversarial Evaluator (different model family). The evaluator tries to break the generator. Single-agent self-eval is the failure mode; the field has converged on this.
- Seed prompts come from production, not your imagination. Hamel Husain + Shreya Shankar's pipeline: open-code 100 production traces β axial-code into failure modes β mine seeds from real failures. Synthetic seeds via Anthropic's "analyze the production distribution then sample" template fill in coverage gaps.
- Judge prompts are tag-structured CoT-then-verdict. Anthropic's
<answer>/<rubric>/<thinking>/<correctness>template parses trivially and beats free-form scoring. Bake in the principal expert's actual critiques as few-shot anchors (Hamel's "Critique Shadowing"). - Your long-term metric is time-horizon, not pass-rate. METR-style "what's the longest human task our wrapper completes at 50% / 80% reliability." Pass-rate saturates; horizon-time doesn't.
- Statistical floor: β₯300 cases, paired bootstrap with 10k resamples, Cohen's ΞΊ for judges, Benjamini-Hochberg if scanning >10 rubrics. Sub-3pp deltas without the paired-bootstrap protocol are noise. T=0 isn't deterministic β run Nβ₯3.
04The Eval Harness Landscape
There are roughly two dozen credible eval harnesses in 2026. Most teams don't need all of them β they need to pick one of three primary patterns and steal liberally from the rest.
The split: model-grading harnesses vs application-engineering platforms
The eval landscape has bifurcated. Academic / model-grading harnesses β lm-evaluation-harness (EleutherAI), HELM (Stanford CRFM, v0.5.15 released 2026-04-22), Lighteval (HuggingFace), Inspect AI (UK AISI), AgentBench (Tsinghua) β score models against fixed task suites. Application-engineering platforms β Braintrust, Langfuse, Arize Phoenix, LangSmith, MLflow GenAI, W&B Weave, Maxim, Galileo, Comet Opik, HoneyHive β wrap traces + LLM-as-judge + dataset versioning around production agents. Both camps are converging on OpenTelemetry as the trace substrate; lock-in is dissolving at the trace layer, and differentiation is moving to scoring + workflow + dataset management.
Pick exactly one of these as your runner
If you must pick a hosted observability platform on top: Braintrust (highest mindshare among AI-native startups; Notion, Stripe, Vercel, Airtable, Zapier, Coda are public users) or Langfuse (strongest OSS + self-host story). Avoid rolling your own dataset versioning + CI gate layer β that is what platforms now do well.
What's coming in 2026
- Agent-as-judge β Patronus Percival (memory + human-guided rubrics for multi-step workflows), Microsoft AgentEval (Critic + Quantifier auto-rubric generation), DeepEval's task-completion metrics. Will be the dominant scoring pattern by end of 2026.
- Multi-turn / session evals are now table stakes. LangSmith (Oct 2025), MLflow 3.8 (session-level scorers), Maxim, Galileo, Phoenix all shipped session/thread-level eval primitives in late-2025/early-2026. The single-prompt eval is fading.
- Enterprise APM consolidation. Cisco β Galileo (Apr 2026), Snowflake β TruEra (2024) β MLflow integration (Feb 2026), CoreWeave β W&B (2025). Expect Datadog / New Relic / Splunk to absorb more of this category in the next 12 months.
05Tool-Use Benchmarks Worth Knowing
What the field measures when it says "this model can use tools" β and where every published benchmark falls short for production agentic eval.
The 2026 trinity
Berkeley Function Calling Leaderboard v4. Single-turn AST + multi-turn + irrelevance detection + memory + web-search + cost reporting in one harness with weekly updates. As of 2026-04-12: Claude Opus 4.5 #1 (77.47%), Sonnet 4.5 #2 (73.24%), Gemini 3 Pro Preview #3 (72.51%), GLM-4.6 open-weights MIT #4 (72.38%), GPT-5.2 only #16 (55.87%). The order surprises everyone: Anthropic β Google β Chinese open-weights β xAI β OpenAI.
Sierra's ΟΒ²-Bench. The strongest agentic benchmark for production reliability, full stop. End-state database-diff scoring resists memorization (no answer to memorize, only outcomes). Pass^k consistency metric (all-k-pass, not any-of-k) exposes the cliff: top frontier models fall from ~70% Pass^1 to under 25% Pass^8 in retail. Dual-control simulated user actually exercises clarification + error recovery. 38 models on taubench.com, quarterly domain expansion (retail β airline β telecom β voice β knowledge). Deployment-grounded β Sierra runs production CS agents on this.
Terminal-Bench 2.0 (Q1 2026) β 89 hand-validated CLI tasks, Harbor container framework, tests Claude Code / Codex CLI / Gemini CLI head-to-head. The closest published proxy for what you're building. Pair with SWE-bench Pro (Scale AI; 1,865 tasks, 4 langs, 41 repos, avg 107 lines Γ 4.1 files) β not Verified, which OpenAI retired in 2025 because every frontier model reproduces verbatim gold patches. Mythos Preview hits 93.9% on Verified vs 45.9% on Pro: that 35-point gap is the contamination signal. Cite both numbers in any external reporting.
What every published benchmark fails to capture
- No public benchmark for wrapper quality. Nobody measures CLAUDE.md design, sub-agent topology, hook architecture. Closest is langchain-ai/claude-code-evals (59 stars) β the gap is enormous and is where your harness can actually be novel.
- No public benchmark for cost-quality frontier. Aider's leaderboard reports $/correct-answer (GPT-5 high 88%/$29; o3-pro 84.9%/$146; DeepSeek-V3.2 Reasoner 74.2%/$1.30). Most others ignore cost entirely. Yours shouldn't.
- Most benchmarks are single-domain. Real agentic tools cross domains; cross-domain trajectory scoring is unsolved.
06LLM-as-Judge: The Failure Modes You Have to Design Around
Your measurement system is only as honest as your judge. Position bias, length bias, self-preference, and verbosity bias will quietly poison results unless you architect against them.
The catalog of judge failure modes (with mitigations)
| Failure mode | What it is | Mitigation |
|---|---|---|
| Position bias | Judge prefers the response in slot A or slot B regardless of content (Zheng et al., MT-Bench paper) | Always run pairwise judging twice with positions swapped; only count agreement when both orders agree |
| Length bias | Judge prefers longer / more verbose responses | Bake length-penalty into rubric explicitly; or use binary (correct / incorrect) over Likert scoring |
| Self-preference | Judge from same model family praises generations from that family (arXiv:2410.21819) | Use a judge from a different vendor family. Period. PoLL with disjoint vendors is best. |
| Verbosity bias | Verbose, hedged responses score higher than direct ones | Add explicit rubric anchor: "concise correct answer = 5" |
| Sycophancy | Judge agrees with the response's own self-assessment | Strip self-assessment from input before grading; force judge to derive verdict independently |
| Format bias | Markdown / structured responses outscore plain prose, regardless of content | Normalize formatting before grading, or score format as a separate dimension |
| Score compression | 1β10 scale collapses to 3β7 band; extremes never used (arXiv:2506.22316) | Use binary or 3-point scoring. Reserve 5-point Likert only with verbatim Prometheus-style anchors per level. |
| Criteria drift | Graders refine criteria as they grade (Shankar et al., UIST 2024 β "Who Validates the Validators?") | Lock the rubric pre-run. Pre-register the grading prompt. Don't edit mid-pass. |
| Adversarial fragility | 73.8% prompt-injection success rate on naΓ―ve judges (arXiv:2504.18333) | Sandbox the judge from response content; structured tag extraction, not free parse |
| Pearson illusion | Judge can be Pearson 0.9 with experts while systematically harsh by 2 points β Pearson is invariant to additive bias | Always report Cohen's ΞΊ alongside Pearson. Refuse to deploy below ΞΊ=0.6. |
| Low-perplexity bias | Judge prefers responses that read like its own training distribution | Use a different family. Or: compare against a golden human answer (reference-based scoring). |
| Single-judge fragility | Any single judge is one bias away from poisoning your entire eval signal | Panel-of-LLMs (PoLL): 3β5 disjoint-vendor models, majority vote, β₯ single GPT-4 quality at ~7Γ lower cost (Cohere arXiv:2404.18796) |
The architecture that beats single-judge
Panel-of-LLMs (PoLL) + a small (β₯100) human-graded calibration set tracked by Cohen's ΞΊ. This is what crystallizes across Cohere's "Replacing Judges with Juries," Hamel Husain's eval guides, and Eugene Yan's writeups. The calibration set is the load-bearing piece β without an expert-graded ground-truth, no judge architecture is trustworthy. Track ΞΊ over time; if it drifts below 0.6, the judge is broken and you've stopped knowing what your scores mean.
07Seed Prompt Patterns
Concrete patterns for the test cases your harness runs through your tool. Concrete examples and templates pulled verbatim from sources where available.
The pipeline (Hamel Husain + Shreya Shankar)
- Open-code 100 production traces. Read them. Tag failure modes inline as you read. Don't predefine categories. (This is the work nobody wants to do; it's also what separates teams with real evals from teams running G-Eval over toy datasets.)
- Axial-code into ~10 stable failure-mode buckets. Cluster the inline tags. The buckets become your eval dimensions.
- Mine seeds from real failures β production traces are your highest-signal seeds. Anonymize, version, lock.
- Synthesize seeds for coverage gaps. Use Anthropic's cookbook generator pattern: tell the LLM to first analyze the production distribution of each variable, then sample from it. Don't just say "give me 100 examples" β that produces same-shaped slop.
- Stratify difficulty. Easy / medium / hard. Top frontier models should hit ~95% on easy, ~70% on medium, ~30% on hard. If your suite is monomodal, it's not discriminating.
Patterns to reach for
- Features Γ Scenarios Γ Personas dimension matrix (Hamel) β anchors all later synthesis on real product structure rather than generic prompts.
- Tool-call ground truth dual annotation (Google ADK + Ο-bench): annotate both the expected tool sequence (for diagnostics) AND the expected end-state (for outcome scoring). Lets multiple valid trajectories pass.
- Counterfactual seeds β same task with one twist that changes the correct tool selection. Tests whether your tool actually chose right or just memorized.
- Mutation testing for prompts β small perturbations (typos, reordering, synonyms) on each seed; flag any seed whose pass-rate swings >3pp under mutation.
- Adversarial seeds β Promptfoo's red-team module is worth stealing wholesale. Prompt injection, jailbreak attempts, capability misuse.
- Negative seeds β tasks the agent should not attempt or should refuse. Easy to forget; critical for production.
- Curriculum stratification β easy / medium / hard explicit; reported separately.
Seed JSONL schema (steal this)
{
"id": "seed-0001",
"released_at": "2026-04-25",
"category": "tool-selection",
"difficulty": "medium",
"task": "...",
"expected_tools": ["Read", "Grep"],
"expected_end_state": {"file_modified": "src/foo.py", "tests_pass": true},
"rubric_tags": ["tool-correctness", "task-completion"],
"source": "production-trace-2026-03-14"
}
08Judge Rubric Templates
Concrete rubric prompt structures, with verbatim templates pulled from Anthropic, OpenAI, Promptfoo, Prometheus, and applied teams.
Anthropic cookbook grader (binary, parses trivially)
The cleanest published binary judge template. Tag-structured CoT-then-verdict; downstream code regex-extracts <correctness> and is done.
You are a grader. Compare the candidate answer to the rubric.
<rubric>
{rubric}
</rubric>
<answer>
{candidate_answer}
</answer>
Think step-by-step in <thinking> tags about whether the candidate satisfies
the rubric. Then output your verdict in <correctness> tags as exactly one
of: correct | incorrect.
<thinking>...</thinking>
<correctness>...</correctness>
Hamel critique-aligned judge (highest expert agreement)
Bake the principal domain expert's actual critiques in as few-shot. Hamel reports ~90% expert agreement in 3 iterations. The critique field carries the actionable signal β that's what feeds back into seed creation and prompt fixes.
You are evaluating an agent's response. Use the same standard the principal
domain expert uses. Examples of expert critiques are below.
<expert-critiques>
Example 1 β response: "..."
Expert critique: "Used Read tool when Glob would have been correct.
Wasted 2 turns on file location before opening."
Verdict: incorrect
Example 2 β response: "..."
Expert critique: "Correct tool, but skipped the dependency check."
Verdict: incorrect
Example 3 β response: "..."
Expert critique: "Clean trajectory, correct end state."
Verdict: correct
</expert-critiques>
Now evaluate:
<response>{candidate}</response>
Output:
<critique>...</critique>
<verdict>correct | incorrect</verdict>
Prometheus absolute grading (when binary loses information)
5-point Likert only with verbatim per-level rubric anchors. The reference answer is the score-5 anchor. Compresses cleanly to a Pearson 0.897 vs experts when fine-tuned (Prometheus 2).
### Task Description:
Evaluate the response against the rubric. Use the score-5 reference as anchor.
### Score Rubric:
{rubric_dimension}
Score 1: {anchor_1_verbatim}
Score 2: {anchor_2_verbatim}
Score 3: {anchor_3_verbatim}
Score 4: {anchor_4_verbatim}
Score 5: {anchor_5_verbatim_=_reference_answer}
### Reference Answer (score 5):
{reference}
### Response to Evaluate:
{candidate}
### Output:
Feedback: ...
[RESULT] {1 | 2 | 3 | 4 | 5}
Trajectory eval (six-layer composite for agentic tools)
- End-state diff (highest weight) β did the world end up correct? File written, test passed, DB row inserted.
- Exact tool sequence match (diagnostic) β did it call the expected tools in expected order? Useful for debugging, not for scoring alone.
- Step-level diagnostics β per-tool-call error rate, latency, retry count.
- LLM trajectory judge β rubric grader over the full trajectory (Anthropic four-axis: design quality, originality, craft, functionality).
- Pass^k consistency (Ο-bench style) β run the same seed N times, score all-N-pass.
- Promptfoo
trajectory:goal-successβ built-in trajectory assertion if you're using Promptfoo.
09Building a Suite That Survives 18 Months
Contamination, drift, saturation, model upgrades. The harness you ship today will be lying to you in 6 months unless you build for longevity from day one.
Four properties of a suite that survives
Every seed carries a released_at field. Your runner refuses to score model M on tasks where released_at < cutoff(M). This is the single highest-leverage move; it's what LiveCodeBench, MathArena, SWE-bench Live, and K-Prize all converge on. LiveCodeBench detected DeepSeek-Coder dropping sharply on LeetCode problems released after Sep 2023 while GPT models stayed stable β direct contamination evidence, only visible because of date tagging.
50/50 split. Publish only aggregate metrics (mean + 95% CI) on the held-out half. Rotate the public half quarterly. When public starts saturating relative to held-out, that's your contamination alarm. Pattern: SWE-bench Pro three-way split (731 public GPL, 276 commercial proprietary, 858 held-out).
Anthropic's August 2025 "Claude got dumber" incident traced to harness changes, not weights. METR's Vivaria β Inspect migration moved scores. Lockfile the harness; re-run prior-model Γ prior-harness on every wrapper change to map deltas before shipping. Treat any leaderboard delta < 3 pp as inside the noise floor (Anthropic's "Quantifying Infrastructure Noise" measured 6 pp swing on Terminal-Bench from container memory caps alone).
Sample production traffic into the regression suite continuously. Aim +50 cases/month. Every user-caught failure your evals missed becomes a fixed regression test. This is Shopify Sidekick's "Ground Truth Sets sampled from production distribution" pattern; it's how Rechat (Hamel's case study) maintains hundreds of in-suite tests.
The contamination war story you need to know
Time-horizon: the metric that survives saturation
Pass-rate percentages saturate as models improve. METR's 50%-time horizon β the length of human-expert task an agent reliably completes 50% of the time β keeps growing. As of Time Horizon 1.1 (Jan 2026): doubling rate ~131 days, ~10Γ/year at the recent margin. Latest published numbers:
Build your benchmark to express results in horizon-time, not pass-rate, and you'll still have a meaningful metric two model generations from now.
10What Production Teams Actually Do
Stripping out the marketing. What Anthropic, OpenAI, Cursor, Aider, Devin, and Sierra publish about their real eval practice β and where the labs and applied teams diverge.
What the labs publish (the methodology stack)
- Anthropic β three-agent harness blog (March 2026). Planner β Generator β Evaluator. Evaluator uses Playwright to grade artifacts on four axes: design quality, originality, craft, functionality. Solo agent: 20 min, $9, broken. Full harness: 6h, $200, functional. Steal those four axes verbatim.
- Anthropic β multi-agent research system (June 2025). "Token usage alone explains 80% of performance variance in browsing evaluations." Eval methodology: ~20 queries small-sample, LLM-as-judge with rubric (factual accuracy / citation / completeness / source / efficiency), human validation for SEO-bias edge cases.
- Anthropic β Opus 4.5 published recipe. 64K thinking budget, 200K context, default high effort, averaged across 5 trials, default temp/top-p. Closest thing to a canonical Anthropic eval method.
- METR β Time Horizon methodology. Human baselines from ~5-year experienced pros; logistic curve fit; 6 independent runs per task; 95% CIs published; explicit canary strings per task; partially private suite.
- Sierra β ΟΒ²-Bench. Full architecture, judge prompts, ablations, contamination disclosure all published. The most transparent applied team in the field.
What applied teams publish (the outcomes, not the methodology)
This is where the biggest gap lives. Cursor, Cognition (Devin), Sourcegraph Amp, Replit, Vercel publish almost no methodology specifics. They describe outcomes β PR merge rates, "trace-first" β but withhold judge prompts, suite composition, validation procedures. The asymmetry suggests applied eval is now competitive moat; the methodology is the product.
The exceptions worth knowing:
- Cognition's adversarial human reward-hardening. They pay senior engineers to actively try to circumvent their own graders before training. Treats eval design itself as a red-team exercise. Combined with their 3-parallel-judge stack (classical tests + rubrics + browser-use agent for E2E), this is the only public methodology that explicitly assumes the judge will be gamed.
- Cline-Bench (open-sourced, commit-hash-as-ground-truth). Rare published applied-team methodology.
- Continue.dev framework β published eval framework with structure documented.
- Aider β Paul Gauthier's Polyglot leaderboard. 225 Exercism exercises across 6 languages. Two-attempt retry. Edit-format accuracy reported separately from solution correctness (64.4β100% across models). Cost reported per correct answer. Determinism: Paul explicitly notes API non-determinism at temp=0 (5β10 variations per identical request). Most rigorous applied-team eval published.
- Shopify Sidekick β Ground Truth Sets, LLM-powered merchant simulator, judges calibrated to ~0.75 correlation with humans (Cohen's Kappa, Kendall Tau, Pearson tracked), gated rewards (procedural + semantic) to prevent reward hacking.
The METR finding everyone should internalize
11Claude Code-Specific Eval Surfaces
Anthropic gave you eval injection points most teams haven't noticed. Hooks, sub-agents, MCP, and the Agent SDK each open a different door for measurement.
The four eval-injection surfaces unique to Claude Code
PreToolUse / PostToolUse / PostToolUseFailure / Stop / SubagentStop / PreCompact / SessionStart/End. Each event ships JSON with session_id, tool_name, tool_input, tool_response, tool_use_id, and duration_ms (since 2.1.119). Handler types: command, http, mcp_tool (added 2.1.118), prompt, agent. Output: permissionDecision: allow|deny|ask|defer, updatedInput, additionalContext, or decision: block + reason.
The HTTP handler is the killer feature for eval. Every event POSTs JSON to a service Ande owns. No stdout scraping. No log parsing. The harness is the eval bus.
Frontmatter-defined: name, description, tools, model. Each runs in own context window, returns summary only, gets correlated to parent via parent_tool_use_id in stream-json. The tool restriction is a structural eval boundary, not a prompt boundary. A Reviewer with tools: [Read, Grep, Glob] physically cannot write β that's enforceable, not vibes. model: opus Reviewer over model: sonnet Writer = stronger judge than generator without changing your prompt.
The canonical CI eval invocation:
claude --bare \ --mcp-config eval.mcp.json \ --strict-mcp-config \ --output-format stream-json \ --include-hook-events \ --json-schema "$SCHEMA" \ --max-budget-usd 2.00 \ --max-turns 20 \ --session-id "$UUID" \ --exclude-dynamic-system-prompt-sections \ -p "$PROMPT"
--bare skips auto-discovery (will become -p default). --strict-mcp-config + --mcp-config = reproducible plugin set. --exclude-dynamic-system-prompt-sections enables prompt-cache reuse across users/machines. --max-budget-usd is a hard cost guardrail.
Native emission of user_prompt, cost.usage, token.usage, api_request, api_error, tool_result, tool_decision events. Set OTEL_LOG_TOOL_DETAILS=1 to unredact MCP and custom command names. Drop into any OTel-native observability stack (Phoenix, Langfuse, MLflow, Weave, Helicone) without writing exporters.
Community references worth stealing
- langchain-ai/claude-code-evals β the cleanest published Claude-Code-specific ablation harness. Per-task
input_prompt.py+llm_as_a_judge.py+ per-task CLAUDE.md. Pattern: same prompt, swap CLAUDE.md content, score with LLM judge. Copy this directory shape. - sjnims/cc-plugin-eval β 4-stage Analysis β Generation β Execution β Evaluation. Programmatic detection via Agent SDK hooks (Skill / Task / SlashCommand call capture with 100% confidence; MCP via
mcp__<server>__<tool>pattern). LLM judgment as secondary 0β10 score. - jimmc414 SWEbench harness β two-tier scoring: Generation Score (patch produced) vs Evaluation Score (Docker test execution actually passes). "100% generation, 20% eval = 20% real."
- musistudio/claude-code-router (32.9k stars) β routes Claude Code traffic to DeepSeek/Gemini/Ollama/OpenRouter. Free model-ablation harness β same Claude Code surface, different backend. If you want to A/B model swaps under your wrapper, this is the lever.
- VoltAgent/awesome-claude-code-subagents (18.3k stars, 130+ subagents) β every entry uses the same frontmatter pattern. Reviewers get read-only tools; developers get write tools. Tool boundary IS the eval boundary.
Big 2026 caveat to cite anywhere you reference SWE-bench: Verified is contaminated. OpenAI retired it. Claude Opus 4.5 = 80.9% Verified, 45.9% Pro. The 35-point gap is benchmark design, not capability. Always cite Pro alongside Verified.
12The Statistical Floor
Concrete numbers β sample size, variance handling, confidence intervals β so your eval signal isn't actually noise.
Most published LLM "improvements" are within noise. Frontier-lab leaderboard deltas of 1β3pp routinely fail honest significance tests once you account for paired correlation, cluster SEs over related questions, acknowledge nondeterminism at T=0, and correct for cherry-picking from 200 evals. Build the discipline in from week one.
The non-negotiable floor
- Sample size: β₯300 eval items per task, target 500β1000. Below ~300, use Wilson or Clopper-Pearson intervals β not bare Β± std. The CLT does not save you (Bowyer/Brand/Aitchison, arXiv:2503.01747).
- N β₯ 5 runs at T=0.7, or N β₯ 3 at T=0 β because T=0 is not deterministic. Thinking Machines Lab (Sep 2025) found 1,000 identical T=0 calls produced 80 unique outputs on a frontier model. Anthropic's published Opus 4.5 recipe averages 5 trials.
- Always paired, always bootstrap. Same prompt β both models. 10,000 bootstrap resamples on Ξ. If 0 β 95% CI, it's noise. This single discipline kills 80% of false claims.
- Treat |Ξ| < 1pp as noise. Treat |Ξ| < 3pp as below the harness/infra noise floor (Anthropic infrastructure-noise paper) unless you've run the paired-bootstrap protocol (arXiv:2511.19794).
- McNemar for binary pass/fail with β₯10 discordant pairs. Wilcoxon signed-rank for paired non-normal scores.
- LLM-judge: β₯100 human-labeled pairs to validate. Refuse to deploy below Cohen's ΞΊ = 0.6. Report ΞΊ and Pearson β never Pearson alone (it's invariant to additive bias; a systematically harsh judge looks great).
- Scanning > 10 rubrics? Apply Benjamini-Hochberg at Q=0.10. Holm-Bonferroni for safety/regulatory dimensions where false positives are expensive.
- Run β₯ 3 diverse judge models (Cohere PoLL pattern) for subjective evals. Disjoint vendor families.
- Pre-commit primary metric in
eval-config.yamlbefore running new model. Pawel et al. arXiv:2311.18807 β the difference between research and p-hacking. - Production A/B: sequential testing (mSPRT) or Bayesian β never fixed-N frequentist. You will peek; design for that.
The biggest mistake everyone makes
Using unpaired tests when both models ran on the same prompts. You throw away the within-prompt correlation that makes the test sensitive. Anthropic's "Adding Error Bars to Evals" (arXiv:2411.00640) shows this routinely flips significance findings; clustered SEs (for evals with shared passages) can legitimately be > 3Γ larger than naive SEs, retroactively erasing many leaderboard claims.
13Recommended Blueprint for Your Harness
Synthesizing everything above into a single concrete recommendation. This is the answer to "what do I actually build."
An Inspect-AI-shaped harness with hooks-as-eval-bus and a PoLL judge, scored on time-horizon
Don't roll your own runner. Don't chase agent-as-judge yet (it's too new). Don't try to invent a benchmark β pick one of the canonical three (BFCL v4 / ΟΒ²-Bench / Terminal-Bench 2.0) and harvest production traces for the rest. The novel value of your harness lives in the wrapper layer (CLAUDE.md, sub-agents, hooks) β that's the surface no one is grading well.
The stack, top to bottom
| Layer | Pick | Why |
|---|---|---|
| Runner | Inspect AI | Sandbox-native, agent-loop primitives, daily commits, used by AISI/METR/NIST. Future-proof. |
| Eval bus | Claude Code hooks (HTTP handler) β local FastAPI service | Unique to Claude Code. Don't scrape stdout. Get tool_use_id + duration_ms for free. |
| Trace store | Phoenix (OSS) or Langfuse (OSS, self-host) | OTel-native, no lock-in. Drop in. |
| Generator pattern | Planner β Generator β Adversarial Evaluator | Anthropic's published harness pattern. Three different tasks, three different prompts. Stop having one agent grade itself. |
| Judge | Panel-of-LLMs: 3 models from disjoint vendor families, majority vote | Cohere proved it beats single GPT-4 at ~7Γ lower cost with less bias. |
| Judge calibration | β₯100 expert-labeled pairs, Cohen's ΞΊ tracked over time | Refuse to deploy below ΞΊ=0.6. Re-calibrate quarterly. |
| Seed pipeline | Production-trace mining + Anthropic synthetic generator + Promptfoo red-team | Real failures + coverage synthesis + adversarial. Three sources, one suite. |
| Seed schema | JSONL with released_at, difficulty, expected_tools, expected_end_state | Time-anchored, dual-annotated, machine-checkable. |
| Scoring | End-state diff (primary) + Pass^k consistency + LLM-judge over four-axis trajectory | End-state resists memorization; Pass^k surfaces flakiness; trajectory grader catches process failures. |
| Statistics | Paired bootstrap, 10k resamples, BCa CI, McNemar for binary | The one statistical discipline you cannot skip. |
| Headline metric | 50%-time horizon (METR-style) + $/correct (Aider-style) | The two metrics that don't saturate as models improve. |
| CI integration | PR-blocking on regression of fixed-bug suite + nightly full run | Hamel's three-tier pyramid, layer 1. |
| Production drift loop | Sample +50 traces/month from production into the regression suite | Shopify Sidekick's GTX pattern. Closes the eval feedback cycle. |
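The statistics row is the piece most teams hand-wave, so here is a minimal sketch of the core loop in Python. It uses a plain percentile interval rather than the BCa correction the full arXiv 2511.19794 protocol specifies, and the function name and scoring convention (per-seed 0/1 pass lists) are my own:

```python
import random

def paired_bootstrap(a, b, resamples=10_000, seed=0):
    """Percentile-bootstrap CI on the mean per-seed delta between two
    variants scored on the SAME seeds (a[i] and b[i] are paired)."""
    assert len(a) == len(b), "pairing requires identical seed sets"
    rng = random.Random(seed)
    deltas = [y - x for x, y in zip(a, b)]
    n = len(deltas)
    # Resample the paired deltas with replacement and collect the means.
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n
        for _ in range(resamples)
    )
    point = sum(deltas) / n
    return point, (means[int(0.025 * resamples)], means[int(0.975 * resamples)])
```

Ship a claim only when the interval excludes zero; on a small seed set, a 2-3 pp delta usually won't, which is the sub-3 pp noise-floor discipline in action.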
The starter scaffold (the first PR you write)
```
your-eval-harness/
├── inspect/
│   ├── tasks/
│   │   ├── tool-selection/
│   │   │   ├── seeds.jsonl          # production-mined + synthetic
│   │   │   ├── solver.py            # Inspect AI Solver (calls your wrapper)
│   │   │   └── scorer.py            # end-state + LLM-judge composite
│   │   └── trajectory-quality/
│   │       └── ...
│   ├── judges/
│   │   ├── poll.py                  # 3-model panel, majority vote
│   │   ├── prompts/
│   │   │   ├── binary.md            # Anthropic cookbook tag-structured
│   │   │   └── critique.md          # Hamel critique-shadowing
│   │   └── calibration/
│   │       ├── golden_pairs.jsonl   # 100+ expert-labeled
│   │       └── kappa_history.csv    # tracked over time
│   └── stats/
│       └── paired_bootstrap.py      # arXiv 2511.19794 protocol
├── hooks/
│   ├── settings.json                # PreToolUse + PostToolUse + Stop HTTP hooks
│   └── handler.py                   # local FastAPI receiving hook events
├── traces/
│   ├── prod-samples/                # rotating sample of production traces
│   └── ingestion.py                 # nightly +50 cases into regression suite
├── ci/
│   ├── pr-gate.yaml                 # block on regression of fixed-bug suite
│   └── nightly.yaml                 # full run + paired-bootstrap report
├── eval-config.yaml                 # PRE-COMMITTED metrics, frozen pre-run
└── README.md
```
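For seeds.jsonl, a record under the stack table's schema might look like the sketch below. Only released_at, difficulty, expected_tools, and expected_end_state come from the table; the id, prompt, and end-state keys are invented for illustration:

```python
import json
from datetime import date

# A hypothetical seeds.jsonl record; time-anchored and machine-checkable.
seed = {
    "id": "tool-selection-0042",
    "released_at": "2026-04-01",
    "difficulty": "medium",
    "prompt": "Find the failing test in the repo and fix it.",
    "expected_tools": ["Grep", "Read", "Edit", "Bash"],
    "expected_end_state": {"tests_pass": True, "files_changed": ["src/parser.py"]},
}

def eligible(s: dict, training_cutoff: str) -> bool:
    """Time-anchored intake: keep only seeds released after the model's cutoff."""
    return date.fromisoformat(s["released_at"]) > date.fromisoformat(training_cutoff)

line = json.dumps(seed)  # one line of seeds.jsonl
```

The released_at gate is what makes the quarterly-rotation and contamination-canary disciplines enforceable later without re-annotating anything.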
14Action Items
Ordered by urgency.
This week
- Set up the hook bus. Project-level `.claude/settings.json` with PreToolUse + PostToolUse + Stop + SubagentStop HTTP hooks pointing at a local FastAPI service. Capture every event to a SQLite trace store. This is the foundation everything else sits on; do it first.
- Pick your runner. Inspect AI. Don't deliberate. Install it, walk through the agent-loop tutorial, and write one trivial task to confirm sandboxing works on your machine.
- Read three things end-to-end: Anthropic's Effective Harnesses for Long-Running Agents, Anthropic's Eval Awareness in BrowseComp, and Hamel Husain's LLM Evals FAQ. Together they are about three hours of reading that will save you a month.
- Write your first 30 seeds by hand β from your own production traces if you have them, otherwise from real tasks you'd actually use the tool for. Hand-author beats synthetic at this size. You'll learn what your eval is actually measuring.
- Pre-commit your primary metric in `eval-config.yaml` before you run any A/B. Decide now what counts as success; write it down; don't let yourself edit it after seeing results.
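A sketch of the hook-bus foundation from the first bullet. I've used Python's stdlib http.server in place of FastAPI to keep it self-contained; the port and the tool_name field are my assumptions, while tool_use_id, duration_ms, and hook_event_name are the payload fields described above:

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory DB for the sketch; point this at a file path in production.
DB = sqlite3.connect(":memory:", check_same_thread=False)
DB.execute("""CREATE TABLE IF NOT EXISTS events (
    hook_event_name TEXT, tool_name TEXT, tool_use_id TEXT,
    duration_ms INTEGER, payload TEXT)""")

def record_event(conn, event: dict) -> None:
    """Flatten the fields we query on; keep the raw payload for everything else."""
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
        (event.get("hook_event_name"), event.get("tool_name"),
         event.get("tool_use_id"), event.get("duration_ms"), json.dumps(event)),
    )
    conn.commit()

class HookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        record_event(DB, json.loads(body))
        self.send_response(200)
        self.end_headers()
        # Empty JSON = allow; return a blocking decision here for fail-fast.
        self.wfile.write(b"{}")

# To run: HTTPServer(("127.0.0.1", 8377), HookHandler).serve_forever()
```

Point the type: "http" hooks in `.claude/settings.json` at this service (the 8377 port is arbitrary) and every tool call lands in SQLite as it happens.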
This month
- Build the PoLL judge. Three models from disjoint vendor families (e.g. Sonnet + Gemini Pro + GPT-5 or any open-weights you trust). Majority vote. Test against your hand-written seeds; pick the rubric template (cookbook binary or Prometheus 5-pt) that matches your eval shape.
- Build the calibration set. 100 expert-labeled pairs. You're the principal expert. Track Cohen's κ between your judge and the expert labels weekly. If κ ever falls below 0.6, the judge is broken; stop deploying.
- Wire the paired-bootstrap protocol (arXiv:2511.19794). 10,000 resamples on Δ. Refuse to ship any "improvement" claim that doesn't pass it.
- Add the PR-blocking CI gate. Regression of any fixed-bug seed = build fails. This single discipline catches more real bugs than any other layer.
- Adopt the JSONL seed schema with a `released_at` field. Even if you don't enforce a date cutoff yet, you'll have the data when you do.
- Pick one auto-refreshing benchmark as your contamination canary. SWE-bench Live (monthly) or BFCL v4 (weekly). Run it weekly; alert if your wrapper's score drifts more than 5 pp without a known cause.
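The κ gate from the calibration bullet is a few lines of arithmetic in the binary pass/fail case. This sketch assumes simple two-rater, two-category labels (weighted κ for graded rubrics is more involved), and the function names are mine:

```python
def cohens_kappa(expert: list[int], judge: list[int]) -> float:
    """Agreement beyond chance between two binary raters (0/1 labels)."""
    assert len(expert) == len(judge)
    n = len(expert)
    observed = sum(e == j for e, j in zip(expert, judge)) / n
    # Chance agreement from each rater's marginal label rates.
    p_e1, p_j1 = sum(expert) / n, sum(judge) / n
    chance = p_e1 * p_j1 + (1 - p_e1) * (1 - p_j1)
    if chance == 1.0:
        return 1.0
    return (observed - chance) / (1 - chance)

def judge_is_deployable(expert, judge, floor=0.6) -> bool:
    """The weekly gate: refuse to deploy a judge below the kappa floor."""
    return cohens_kappa(expert, judge) >= floor
```

A judge that agrees 87.5% of the time against balanced labels lands around κ = 0.75; a coin-flip judge hovers near zero even though it "agrees" half the time, which is exactly why raw agreement is the wrong gate.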
Long-term (6–18 months)
- Switch headline metric from pass-rate to time-horizon. METR-style. Pass-rates will saturate; horizon-time won't. Future-proofs your benchmark against the next two model generations.
- Build the production-drift loop. Sample +50 production traces/month into the regression suite. Failures the suite missed become fixed regression tests. This is what separates a benchmark from a living suite.
- Public/held-out split with quarterly rotation. Once you have ~500+ seeds, freeze 50% private. Rotate the public half quarterly. Publish only aggregate metrics on private.
- Treat your harness as a versioned artifact. Every harness change = re-run prior baselines, publish deltas. Don't pretend the harness is invisible β Anthropic's August 2025 incident proved it isn't.
- Add agent-as-judge experimentally in 2H 2026 once Patronus Percival / Microsoft AgentEval patterns mature. Don't lead with it; the field hasn't crystallized yet.
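To make the time-horizon switch concrete, here is a deliberately crude estimator: METR fits a logistic curve over human task length, while this stand-in just buckets tasks by length and log-interpolates where the success rate crosses 50%. Function and field names are mine:

```python
import math

def fifty_percent_horizon(results: list[tuple[float, bool]]) -> float:
    """Crude 50%-time-horizon estimate from (task_length_minutes, passed)
    pairs: bucket tasks by rounded log2(length), then interpolate in
    log-time where the per-bucket success rate falls through 0.5."""
    buckets: dict[int, list[bool]] = {}
    for minutes, passed in results:
        buckets.setdefault(round(math.log2(minutes)), []).append(passed)
    # (representative length, success rate) per bucket, shortest first.
    points = sorted((2.0 ** b, sum(ok) / len(ok)) for b, ok in buckets.items())
    for (t0, r0), (t1, r1) in zip(points, points[1:]):
        if r0 >= 0.5 > r1:  # success rate crosses 50% between these buckets
            frac = (r0 - 0.5) / (r0 - r1)
            return t0 * (t1 / t0) ** frac  # geometric (log-time) interpolation
    # Never crosses: horizon is beyond the longest bucket, or below the shortest.
    return points[-1][0] if points and points[-1][1] >= 0.5 else 0.0
```

If 8-minute tasks pass every time and 32-minute tasks never do, this returns a 16-minute horizon; the pass-rate headline would have reported a flat 50% and told you nothing.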
15Gaps in the Field
What no one is doing well yet, i.e. where the opportunity is to build something better than what's published.
- Wrapper-quality benchmarking is wide open. Nobody publishes evals over CLAUDE.md design, sub-agent topology, or hook architecture. langchain-ai/claude-code-evals is closest (59 stars). If your harness produces a published "wrapper-quality" leaderboard with a coherent rubric, you'd be first. The only reason this gap exists is that applied teams treat their wrapper as a moat, but there's room for an open community benchmark.
- Cost-quality frontier reporting is rare. Aider does it well; almost no one else. Cursor / Cognition / Replit / Sourcegraph publish quality numbers without dollars. A harness that reports $/correct on every run, by default, would surface the truth most leaderboards hide: the "best" model by quality is rarely the "best" by cost-quality frontier.
- Cross-domain trajectory scoring is unsolved. Real agentic tools cross domains within a single task (read code → query DB → call API → write file → run test). Every published trajectory benchmark is single-domain (τ-bench is customer service; SWE-bench is code; OSWorld is computer use). Cross-domain scoring with a unified trajectory grader is research-frontier.
- Eval-awareness defenses are underdeveloped. Anthropic's BrowseComp paper is the only published case of an agent autonomously decrypting an eval canary. The defenses are hand-rolled; there's no toolkit, no skill, no shared playbook. Whoever ships "eval-canary best practices for agentic systems" leads.
- Long-horizon eval is gated by cost. METR's Time Horizon 1.1 saturates where only 5/31 long tasks have human baselines, because each baseline costs hours of expert time. Cheap or self-supervised long-horizon ground-truthing (synthetic tasks with verifiable end-states) would unblock the field.
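Of these gaps, cost-quality reporting is the cheapest to close in your own harness: $/correct is one division, and the frontier is a short scan. Run names and numbers below are hypothetical:

```python
def dollars_per_correct(cost_usd: float, n_correct: int) -> float:
    """Aider-style headline cost metric."""
    return float("inf") if n_correct == 0 else cost_usd / n_correct

def cost_quality_frontier(runs):
    """Pareto frontier over eval runs: keep each run that no other run
    beats on both cost and accuracy.
    runs = [(name, total_cost_usd, n_correct, n_total), ...]"""
    scored = [(name, cost, n_ok / n) for name, cost, n_ok, n in runs]
    frontier = []
    # Scan cheapest-first; a run joins the frontier only if it is more
    # accurate than every cheaper run already kept.
    for name, cost, acc in sorted(scored, key=lambda r: (r[1], -r[2])):
        if not frontier or acc > frontier[-1][2]:
            frontier.append((name, cost, acc))
    return frontier

# Hypothetical runs; names and numbers are illustrative only.
runs = [
    ("opus",   41.0, 92, 100),
    ("sonnet", 12.0, 88, 100),
    ("haiku",   3.0, 71, 100),
    ("bloat",  50.0, 85, 100),  # dominated: costs more, scores lower than opus
]
```

Reporting the surviving frontier on every run, rather than a single quality number, is the default leaderboards don't give you.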
16Source Bibliography
All sources consulted across the eight research lanes. Tier 1 = primary (paper, repo, official docs), Tier 2 = credible secondary (validated trade press, recognized practitioners), Tier 3 = community / unverified.
- Anthropic – Effective Harnesses for Long-Running Agents – anthropic.com/engineering/effective-harnesses-for-long-running-agents T1
- Anthropic – Harness Design for Long-Running App Dev – anthropic.com/engineering/harness-design-long-running-apps T1
- Anthropic – Building Effective Agents – anthropic.com/engineering/building-effective-agents T1
- Anthropic – Multi-Agent Research System – anthropic.com/engineering/multi-agent-research-system T1
- Anthropic – Eval Awareness in BrowseComp – anthropic.com/engineering/eval-awareness-browsecomp T1
- Anthropic – Quantifying Infrastructure Noise – anthropic.com/engineering/infrastructure-noise T1
- Anthropic – SWE-bench Sonnet methodology – anthropic.com/research/swe-bench-sonnet T1
- Anthropic – Statistical approach to model evals – anthropic.com/research/statistical-approach-to-model-evals T1
- Anthropic – The "Think" Tool – anthropic.com/engineering/claude-think-tool T1
- Anthropic – Claude Opus 4.5 release – anthropic.com/news/claude-opus-4-5 T1
- Claude Code – Hooks reference – code.claude.com/docs/en/hooks T1
- Claude Code – Sub-agents – code.claude.com/docs/en/sub-agents T1
- Claude Code – CLI Reference – code.claude.com/docs/en/cli-reference T1
- Claude Code – Best Practices – code.claude.com/docs/en/best-practices T1
- Claude Code – Headless / programmatic – code.claude.com/docs/en/headless T1
- Claude Code – MCP docs – code.claude.com/docs/en/mcp T1
- Claude Code – Permission modes – code.claude.com/docs/en/permission-modes T1
- Claude Code – CHANGELOG – github.com/anthropics/claude-code/blob/main/CHANGELOG.md T1
- Claude Agent SDK – code.claude.com/docs/en/agent-sdk/overview T1
- Anthropic – Tool Use overview – platform.claude.com/docs/en/build-with-claude/tool-use/overview T1
- Anthropic – Prompt Caching – platform.claude.com/docs/en/build-with-claude/prompt-caching T1
- Inspect AI (UK AISI) – github.com/UKGovernmentBEIS/inspect_ai T1
- AISI 2025 year-in-review (ControlArena) – aisi.gov.uk/blog/our-2025-year-in-review T1
- OpenAI Evals repo – github.com/openai/evals T1
- OpenAI Evals dashboard – evals.openai.com T1
- OpenAI – Testing Agent Skills with Evals – developers.openai.com/blog/eval-skills T1
- OpenAI – Why we no longer evaluate SWE-bench Verified – openai.com/index/why-we-no-longer-evaluate-swe-bench-verified T1
- Promptfoo repo – github.com/promptfoo/promptfoo T1
- Promptfoo – Evaluate Coding Agents – promptfoo.dev/docs/guides/evaluate-coding-agents T1
- HELM (Stanford CRFM) – github.com/stanford-crfm/helm T1
- HELM leaderboard – crfm.stanford.edu/helm T1
- lm-evaluation-harness (EleutherAI) – github.com/EleutherAI/lm-evaluation-harness T1
- Lighteval (HuggingFace) – github.com/huggingface/lighteval T1
- AgentBench (THUDM) – github.com/THUDM/AgentBench T1
- AgentBench paper – arXiv:2308.03688 T1
- DeepEval – github.com/confident-ai/deepeval T1
- DeepEval AI Agent Evaluation – deepeval.com/guides/guides-ai-agent-evaluation T1
- Ragas paper – arXiv:2309.15217 T1
- LangSmith – langchain.com/langsmith-platform T1
- LangSmith – Insights Agent + Multi-turn Evals – blog.langchain.com/insights-agent-multiturn-evals-langsmith T1
- Braintrust – best LLM eval platforms 2025 – braintrust.dev/articles/best-llm-evaluation-platforms-2025 T2
- W&B Weave – wandb.ai/site/weave T1
- Arize Phoenix – github.com/Arize-ai/phoenix T1
- Langfuse – github.com/langfuse/langfuse T1
- MLflow GenAI evaluate – docs.databricks.com/aws/en/mlflow3/genai/eval-monitor T1
- TruLens – github.com/truera/trulens T1
- Comet Opik – github.com/comet-ml/opik T1
- Microsoft AutoGen / Agent Framework eval – learn.microsoft.com/agent-framework/agents/evaluation T1
- Cisco intent to acquire Galileo – blogs.cisco.com/news/cisco-announces-the-intent-to-acquire-galileo T1
- Patronus AI – Generative Simulators (Percival) – patronus.ai/blog/introducing-generative-simulators T1
- BFCL leaderboard – gorilla.cs.berkeley.edu/leaderboard.html T1
- Gorilla – github.com/ShishirPatil/gorilla T1
- τ-bench paper – arXiv:2406.12045 T1
- taubench.com (τ²-Bench leaderboard) – taubench.com T1
- ToolBench / ToolLLM – arXiv:2307.16789 T1
- StableToolBench – arXiv:2403.07714 T1
- API-Bank – arXiv:2304.08244 T1
- Apple ToolSandbox – github.com/apple/ToolSandbox T1
- NESTFUL (nested function calls) – arXiv:2409.03797 T1
- ToolHop – arXiv:2501.02506 T1
- MINT – arXiv:2309.10691 T1
- ToolEmu (DeepMind) – arXiv:2309.15817 T1
- AgentHarm – arXiv:2410.09024 T1
- MCP-Bench – arXiv:2508.20453 T1
- WebArena – webarena.dev T1
- OSWorld – arXiv:2404.07972 T1
- AppWorld – arXiv:2407.18901 T1
- SWE-bench leaderboards – swebench.com T1
- SWE-bench Pro (Scale AI) – morphllm.com/swe-bench-pro T1
- SWE-bench Live – arXiv:2505.23419 T1
- Terminal-Bench 2.0 – tbench.ai T1
- LiveCodeBench – arXiv:2403.07974 T1
- GAIA (Meta) – arXiv:2311.12983 T1
- Zheng et al. – Judging LLM-as-a-Judge with MT-Bench – foundational LLM-judge paper T1
- G-Eval (Liu et al.) – GPT-4-based NLG evaluation with better human alignment T1
- Prometheus 2 (KAIST) – fine-tuned judge model with 0.897 expert correlation T1
- JudgeBench – meta-benchmark for evaluating LLM judges T1
- Verga et al. (Cohere) – Replacing Judges with Juries – arXiv:2404.18796 T1
- Shankar et al. – Who Validates the Validators? – arXiv:2404.12272 T1
- Judge's Verdict (Pearson invariance) – arXiv:2510.09738 T1
- Self-preference / low-perplexity bias – arXiv:2410.21819 T1
- Adversarial judge fragility (73.8% injection) – arXiv:2504.18333 T1
- Score compression in 1–10 Likert – arXiv:2506.22316 T1
- Hamel Husain – LLM Evals FAQ (Jan 2026) – hamel.dev/blog/posts/evals-faq T2
- Hamel Husain – Field Guide to Rapidly Improving AI Products – hamel.dev/blog/posts/field-guide T2
- Hamel Husain – LLM-as-a-Judge / Critique Shadowing – hamel.dev/blog/posts/llm-judge T2
- Hamel Husain – Your AI Product Needs Evals – hamel.dev/blog/posts/evals T2
- Eugene Yan – Task-Specific LLM Evals – eugeneyan.com/writing/evals T2
- Eugene Yan – LLM-Evaluators (Judge) – eugeneyan.com/writing/llm-evaluators T2
- Cameron Wolfe – Applying Statistics to LLM Evaluations – cameronrwolfe.substack.com/p/stats-llm-evals T2
- Pragmatic Engineer – Pragmatic Guide to LLM Evals – newsletter.pragmaticengineer.com/p/evals T2
- Shopify Engineering – Sidekick Production-Ready Agents – shopify.engineering/building-production-ready-agentic-systems T1
- Statsig – Significance Misleads in LLM A/B – statsig.com/perspectives/abtesting-llms-misleading T2
- Aider – leaderboards – aider.chat/docs/leaderboards T1
- Aider – benchmark methodology – aider.chat/docs/benchmarks.html T1
- METR – Measuring AI Ability to Complete Long Tasks (Mar 2025) – metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks T1
- METR – Time Horizon 1.1 (Jan 2026) – metr.org/blog/2026-1-29-time-horizon-1-1 T1
- METR – GPT-5 Evaluation Report (Aug 2025) – evaluations.metr.org/gpt-5-report T1
- METR – Domain Variation (Jul 2025) – metr.org/blog/2025-07-14-how-does-time-horizon-vary-across-domains T1
- Miller et al. – Adding Error Bars to Evals – arXiv:2411.00640 T1
- Bowyer/Brand/Aitchison – Don't Use the CLT in LLM Evals – arXiv:2503.01747 T1
- Heineman et al. (AI2) – Signal and Noise – arXiv:2508.13144 T1
- Bowman & Dahl – What Will it Take to Fix Benchmarking – NAACL 2021 T1
- Demšar – Statistical Comparisons of Classifiers – JMLR 2006 T1
- Wang et al. – Self-Consistency Improves CoT – arXiv:2203.11171 T1
- He/Atkinson – Non-Determinism of "Deterministic" LLM Settings – arXiv:2408.04667 T1
- Thinking Machines Lab – Defeating Nondeterminism in LLM Inference – thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference T2
- Pawel et al. – Pre-registration for Predictive Modeling – arXiv:2311.18807 T1
- Johari/Koomen/Pekelis/Walsh – Always Valid Inference (mSPRT) – arXiv:1512.04922 T1
- Russo/Van Roy – Tutorial on Thompson Sampling – stanford.edu/~bvr/pubs/TS_Tutorial.pdf T1
- Chiang et al. – Chatbot Arena – arXiv:2403.04132 (ICML 2024) T1
- Argyle et al. – Statistical Framework for Ranking LLM Chatbots – arXiv:2412.18407 (ICLR 2025) T1
- Cohere – The Leaderboard Illusion – arXiv:2504.20879 T1
- Stanford – How Is ChatGPT's Behavior Changing – arXiv:2307.09009 T1
- Paired Bootstrap Protocol – arXiv:2511.19794 T1
- Tracking the Moving Target – arXiv:2504.18985 T1
- MathArena – matharena.ai + arXiv:2505.23281 T1
- K-Prize / Konwinski Prize (TechCrunch coverage) – techcrunch.com/2025/07/23/a-new-ai-coding-challenge T2
- ArxivRoll – How Much Do LLMs Cheat – arXiv:2507.19219 T1
- Skywork-13B (test-set leakage detection) – arXiv:2310.19341 T1
- lmgame-Bench – arXiv:2505.15146 T1
- DynaBench (Kiela et al., 2021) – methodological ancestor of dynamic benchmarks T1
- langchain-ai/claude-code-evals – github.com/langchain-ai/claude-code-evals T1
- sjnims/cc-plugin-eval – github.com/sjnims/cc-plugin-eval T1
- jimmc414 SWEbench harness – github.com/jimmc414/claudecode_gemini_and_codex_swebench T1
- musistudio/claude-code-router – github.com/musistudio/claude-code-router T1
- VoltAgent/awesome-claude-code-subagents – github.com/VoltAgent/awesome-claude-code-subagents T1
- anthropics/claude-code-action – github.com/anthropics/claude-code-action T1
- Wasserman – All of Statistics – stat.cmu.edu (textbook) T1
- Casella & Berger – Statistical Inference 2e – textbook reference T1
Bibliography is deduplicated to ~120 entries from 327 raw sources surfaced across 8 research lanes. Full per-lane source indexes (with all 327) are available on request; they live in the working vault at ~/vault/raw/intel/2026-04-25-ai-tooling-eval-harness/lane-*/findings.md. Confidence tags (KNOWN / ASSUMED / UNVERIFIED) and tier ratings (T1 = primary; T2 = credible secondary; T3 = community / unverified) follow the EP strategic-intelligence-loop discipline.