telexed ~ c / 6841e1a4-51aradar:50 · agent_toolLIVE
← back
NO.
#6841e1a4
Topic
AGENTS & TOOLS
Source
r/ClaudeAI
Published
2026-05-09 18:21:03
Importance
★ 5/10 — radar 50
`Autoharness` lets `Claude Code` tune its own agent harness
FIG-6841:1

`Autoharness` lets `Claude Code` tune its own agent harness

A layer above agent setup is emerging: tools that rewrite prompts, scoring, and runtime context, then keep only eval-proven wins. Concrete benchmark lifts make this more than workflow bragging, and the repo is worth skimming if you run custom agents.

[ KEY POINTS ]
  1. On tau2-airline, it reports +40.7% from best-of-N skillbook scoring with an LLM judge; evaluator design is moving into the product surface.
  2. Another +24.1% came from reflector hyperparameter changes like temperature and max subagent calls, so simple harness knobs still have large headroom.
  3. Injecting runtime context on every step added +22.2%; step budget, recent tool calls, and recent results materially changed outcomes.
  4. Flow is minimal: install, point Claude Code at GUIDE.md, propose harness edits, run evals, and keep only score-improving changes.
  5. Because gains are benchmark-specific, the useful takeaway is the loop: mutate harness, eval immediately, and ship only measured improvements.
Originalwww.reddit.com/r/ClaudeAI/comments/1t8cn9y/claude_improved_my_agent_harness_by_407_overnight/Read original →

// related