`SmallCode` hits 87/100 coding-agent tasks with an active 4B model
Reliability comes from the harness, not raw model size. The benchmark is self-reported, but the agent patterns are immediately reusable for local-first coding tools.
- Compound tools collapse search-read-edit-verify into one call, cutting the multi-step drift that breaks small models after 3+ tool calls.
- The fix loop runs compile/lint immediately after edits and feeds errors back, so the model only needs to repair concrete failures.
- On repeated failure, tasks shrink from broad file edits to line-level fixes; that is a practical recipe for weaker local models.
- Cloud escalation is scoped to the stuck task when an OpenAI or Claude key exists, keeping most work local without hard failure.
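
The fix loop, scope shrinking, and scoped escalation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not SmallCode's actual code: the names `run_task`, `check_python`, `Outcome`, and `SCOPES`, the intermediate `function` scope, the retry count, and the environment variable names are all hypothetical, and Python's built-in `compile()` stands in for whatever compile/lint step the harness really runs.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping, Optional

# Hypothetical edit scopes, broadest first. The source only says tasks shrink
# from broad file edits to line-level fixes; "function" is an assumed middle step.
SCOPES = ("file", "function", "line")

# Assumed environment variable names for the cloud keys.
CLOUD_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")

# An editor takes (task, scope, feedback) and returns candidate source code.
Editor = Callable[[str, str, str], str]


@dataclass
class Outcome:
    solved: bool
    backend: str   # "local" or "cloud"
    scope: str     # scope of the last attempt
    attempts: int


def check_python(source: str) -> Optional[str]:
    """Stand-in for compile/lint: return an error message, or None if clean."""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def has_cloud_key(env: Mapping[str, str]) -> bool:
    """True if any of the assumed key variables is set to a non-empty value."""
    return any(env.get(name) for name in CLOUD_KEY_VARS)


def run_task(
    task: str,
    local_edit: Editor,
    cloud_edit: Optional[Editor] = None,
    env: Optional[Mapping[str, str]] = None,
    tries_per_scope: int = 2,
) -> Outcome:
    env = os.environ if env is None else env
    feedback = ""   # last concrete error, fed back into the next attempt
    attempts = 0

    # Fix loop: verify right after each edit; narrow the scope on repeated failure.
    for scope in SCOPES:
        for _ in range(tries_per_scope):
            attempts += 1
            error = check_python(local_edit(task, scope, feedback))
            if error is None:
                return Outcome(True, "local", scope, attempts)
            feedback = error

    # Escalate only this stuck task, and only if a cloud key is present.
    if cloud_edit is not None and has_cloud_key(env):
        attempts += 1
        solved = check_python(cloud_edit(task, SCOPES[-1], feedback)) is None
        return Outcome(solved, "cloud", SCOPES[-1], attempts)

    return Outcome(False, "local", SCOPES[-1], attempts)


def flaky_local(task: str, scope: str, feedback: str) -> str:
    """Toy "model": emits broken code until it has seen an error message."""
    return "x = 1\n" if feedback else "x = (1\n"


if __name__ == "__main__":
    print(run_task("set x to 1", flaky_local, env={}))
```

With the toy `flaky_local` editor, the first attempt fails the check, the error text is fed back, and the second attempt passes at `file` scope without any cloud call. Passing `env` explicitly keeps the escalation decision testable; leaving it unset falls back to the real process environment.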