`DeepSWE` Flags Leakage and False Negatives in AI Coding Benchmarks

Coding leaderboards can reward memorized answers and reject valid patches. Use SWE-bench-style rankings as a filter, not procurement truth.

[ KEY POINTS ]

- DeepSWE ranks GPT-5.5 first, but the sharper signal is benchmark reliability, not the winner.
- Answer leakage can inflate agent scores. A high leaderboard rank does not guarantee better repo-level work.
- Valid solutions may be rejected, so benchmarks can undervalue agents that solve tasks in non-canonical ways.
- Agent behavior suppression matters: tests built for narrow outputs can punish useful exploration and tool use.

Source: www.theneuron.ai/explainer-articles/datacurves-deepswe-exposes-a-weird-new-problem-with-ai-coding-leaderboards/
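
To see how leakage and false negatives pull a score in opposite directions, here is a minimal sketch in Python. It is not DeepSWE's method. The `leaked` and `valid_but_rejected` flags are hypothetical audit labels, and the four task results are made-up illustration data, not benchmark output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    passed: bool                      # verdict from the benchmark's own tests
    leaked: bool = False              # hypothetical audit flag: answer was visible to the agent
    valid_but_rejected: bool = False  # hypothetical audit flag: correct patch failed the tests


def raw_pass_rate(results):
    """Leaderboard-style score: share of tasks the tests accepted."""
    return sum(r.passed for r in results) / len(results)


def audited_pass_rate(results):
    """Drop leaked tasks, then count audited-valid rejections as solved."""
    clean = [r for r in results if not r.leaked]
    if not clean:
        raise ValueError("no non-leaked tasks left to score")
    solved = sum(r.passed or r.valid_but_rejected for r in clean)
    return solved / len(clean)


# Made-up results for illustration only.
results = [
    TaskResult("t1", passed=True, leaked=True),
    TaskResult("t2", passed=True),
    TaskResult("t3", passed=False, valid_but_rejected=True),
    TaskResult("t4", passed=False),
]

print(raw_pass_rate(results))      # 0.5
print(audited_pass_rate(results))  # 2 of 3 clean tasks, about 0.667
```

In this toy case the raw score counts a leaked pass and misses a valid patch, so the audited number differs from the leaderboard number even though the agent's work is unchanged.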

// related

#0001 | Agents & tools | Google AI Forum | 3 hours ago
`Gemini Code Assist` Individual IDE/CLI Access May End June 18
Radar: 80
Gemini Code Assist: AI coding assistant (Gemini in IDE extensions and CLI)
Individual Pro, Ultra, and free access stops on June 18, 2026. VS Code users should line up alternate auth or tools before agent workflows break.
- The notice names Gemini CLI and Gemini Code Assist IDE extensions. Individual Pro, Ultra, and free accounts are the affected paths.
- The cutoff date is June 18, 2026. Any VS Code workflow wired to personal Gemini access needs a fallback before then.
- The forum post does not confirm whether an Antigravity-specific VS Code extension continues. Treat it as unresolved until Google posts a clearer migration path.
Source: discuss.ai.google.dev/t/will-there-no-longer-be-a-gemini
#0002 | Agents & tools | Hacker News · Show HN AI | 3 hours ago
`AISlop`, a CLI that catches AI-generated code smells
Radar: 60
AISlop: AI code-quality CLI (local code-smell scanning)
Scans for patterns tests often miss: empty catches, dead code, duplicate helpers, and useless comments. Local-only hook integration makes it worth trying in agent-heavy repos.
- `npx aislop scan` runs locally, so source code is not uploaded. That lowers the barrier for private app repos.
- Targets AI-agent residue rather than syntax errors: empty catch, dead code, duplicated helpers, and low-value comments.
- Hook-based use after each tool call fits Claude Code, Codex, and opencode workflows where small slop accumulates fast.
Source: github.com/scanaislop/aislop
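
For a feel of what an "empty catch" check involves, here is a small Python analog built on the standard `ast` module. It is an illustration only and not AISlop's implementation. The function names `find_empty_excepts` and `_is_noop` are hypothetical, and the sketch only covers Python `except` blocks whose body is `pass` or `...`.

```python
import ast


def _is_noop(stmt: ast.stmt) -> bool:
    """True for a bare `pass` or a lone `...` expression."""
    if isinstance(stmt, ast.Pass):
        return True
    return (
        isinstance(stmt, ast.Expr)
        and isinstance(stmt.value, ast.Constant)
        and stmt.value.value is Ellipsis
    )


def find_empty_excepts(source: str) -> list[int]:
    """Return line numbers of except handlers that silently swallow errors."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler) and all(_is_noop(s) for s in node.body):
            hits.append(node.lineno)
    return sorted(hits)


# The sample is only parsed, never executed, so undefined names are fine.
SAMPLE = """\
try:
    risky()
except ValueError:
    pass
try:
    risky()
except KeyError as exc:
    log(exc)
"""

print(find_empty_excepts(SAMPLE))  # [3]
```

A check like this passes every unit test by construction, which is why it has to run as a separate scan rather than rely on the test suite.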
#0003 | Agents & tools | Google AI Forum | 5 hours ago
`Antigravity Ultra` users hit persistent `MODEL_CAPACITY_EXHAUSTED` on Claude Opus thinking
Radar: 60
Antigravity: Coding agent IDE (uses Google's backend model routing)
Paid agent sessions can fail before real work starts. Treat Antigravity Ultra as capacity-risky for Claude-heavy workflows until routing improves.
- The backend returns 503 `MODEL_CAPACITY_EXHAUSTED` from cloudcode-pa.googleapis.com, so this is server capacity, not local quota.
- Failures recur after 1-29 seconds on multi-step agent tasks like file edits, bash runs, and code generation. Long tasks are unreliable.
- Server-Timing shows about 13s spent waiting for model capacity before failure. Retry loops burn time without changing the result.
- Multiple similar forum threads over nearly a month point to account or routing-level fragility, not a one-off client setup problem.
Source: discuss.ai.google.dev/t/persistent-model-capacity-exhaus
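
Since retrying does not change the result here, one client-side option is to cap retries and switch to another path. The sketch below assumes a simplified `Response` type with `status` and `error_code` fields. That shape is hypothetical and not the actual payload from cloudcode-pa.googleapis.com. The `primary` and `fallback` callables are likewise stand-ins for whatever model calls a workflow uses.

```python
import time
from dataclasses import dataclass


@dataclass
class Response:
    # Hypothetical, simplified response shape for illustration.
    status: int
    error_code: str = ""


def is_capacity_error(resp: Response) -> bool:
    """Server-side capacity failure, as opposed to a local quota problem."""
    return resp.status == 503 and resp.error_code == "MODEL_CAPACITY_EXHAUSTED"


def call_with_fallback(primary, fallback, max_attempts=2, base_delay=1.0, sleep=time.sleep):
    """Try `primary` a bounded number of times, then hand off to `fallback`."""
    for attempt in range(max_attempts):
        resp = primary()
        if not is_capacity_error(resp):
            return resp
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff between attempts
    return fallback()


# Demo with a primary that is always out of capacity; sleep is stubbed out.
calls = []


def always_exhausted():
    calls.append("primary")
    return Response(503, "MODEL_CAPACITY_EXHAUSTED")


delays = []
result = call_with_fallback(always_exhausted, lambda: Response(200), sleep=delays.append)
print(result.status, len(calls), delays)  # 200 2 [1.0]
```

Keeping `max_attempts` small matters because each failed attempt may already include a long server-side wait before the 503 comes back.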
