Self-Training With Verifiable Rewards Pushes `Qwen 2.5` 7B to 112/164 on HumanEval

A self-generated code-and-tests loop produced a large jump without human-written training pairs. The run is cheap enough to replicate, but it remains a one-off experiment rather than a product-ready recipe.

[ KEY POINTS ]

The loop is simple: generate problems, sample multiple solutions, keep (failed attempt, fixed attempt) pairs, and let a Python interpreter score them.
After fixing a grading bug, Qwen 2.5 7B moved from 25/164 to 112/164 on HumanEval (roughly 15% to 68%); that is a big enough jump to treat as a real benchmark signal.
A Qwen 2.5 14B run used 100 mined pairs, took 95 minutes on an H100, and cost $3.50; the barrier here is much lower than typical RL folklore suggests.
A control run trained on fake pairs scored 25/164, identical to the base model, which suggests the lift came from correction data rather than format imitation.
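
The loop in the first key point can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the original poster's code: `run_tests`, `mine_pairs`, `pass_rate`, `TESTS`, and `CANDIDATES` are hypothetical names, the candidate solutions are hard-coded strings standing in for model samples, and the "verifiable reward" is assumed to be nothing more than whether a candidate plus its assert-based tests exits cleanly in a fresh Python subprocess.

```python
import subprocess
import sys

# Hypothetical stand-ins for a generated problem's tests and sampled solutions.
TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
CANDIDATES = [
    "def add(a, b):\n    return a - b",  # buggy attempt
    "def add(a, b):\n    return a + b",  # corrected attempt
]


def run_tests(solution: str, tests: str, timeout: float = 10.0) -> bool:
    """Return True if solution followed by tests exits with code 0."""
    code = solution + "\n\n" + tests + "\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Treat hangs (e.g. infinite loops) as failures.
        return False
    return proc.returncode == 0


def mine_pairs(candidates: list[str], tests: str) -> list[tuple[str, str]]:
    """Pair each failing candidate with the first passing one, if any."""
    results = [(c, run_tests(c, tests)) for c in candidates]
    passing = [c for c, ok in results if ok]
    failing = [c for c, ok in results if not ok]
    if not passing:
        return []  # nothing verified, so no correction pair can be mined
    return [(bad, passing[0]) for bad in failing]


def pass_rate(solved: int, total: int = 164) -> float:
    """Percentage of HumanEval's 164 problems solved."""
    return 100.0 * solved / total


if __name__ == "__main__":
    for failed, fixed in mine_pairs(CANDIDATES, TESTS):
        print("FAILED:\n" + failed)
        print("FIXED:\n" + fixed)
    print(f"base:  {pass_rate(25):.1f}%")
    print(f"tuned: {pass_rate(112):.1f}%")
```

Running untrusted model output in a subprocess with a timeout is the bare minimum; a real replication would presumably want a sandbox or container around `run_tests`.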

Original: www.reddit.com/r/LocalLLaMA/comments/1tde3m1/i_let_a_small_model_train_on_its_own_mistakes_it/

// related

#0001 Other r/MachineLearning yesterday
Hugging Face revives `PapersWithCode` with AI-parsed leaderboards
50 radar
PapersWithCode: AI paper tracker that links code and benchmarks
The rebuilt site tracks trending papers, methods, citations, repos, artifacts, and benchmark results. Useful for model scouting, but still manually verified and early-stage.
- Default ranking uses GitHub star velocity, so it surfaces research projects gaining developer attention, not just citation-heavy papers.
- Coverage starts with high-impact items like Qwen 3.5, RF-DETR, DINOv3, MTEB, Open ASR Leaderboard, and coding-agent benchmarks.
- Paper pages auto-link GitHub repos, project URLs, artifacts, PDFs, and external non-Arxiv papers; multiple repos per paper are supported.
- Leaderboards exist by benchmark and domain, including MMTEB, COCO val 2017, and Terminal Bench; handy for fast model/vendor filtering.
- Result extraction uses AI agents, but verification is still manual. Treat it as a shortlist generator, not a source of record yet.
Source: www.reddit.com/r/MachineLearning/comments/1tgmwqr/revivi
#0002 Other GeekNews yesterday
`rkdebian` turns an $80 RK3562 Android tablet into a Debian workstation
40 radar
rkdebian: Debian image build system, built for Doogee U10
A cheap locked-down device can become a bootable Debian 12 machine. Useful for low-cost Linux experiments, but the device scope is narrow and prerelease status keeps it niche.
- Targets the Rockchip RK3562-based Doogee U10; reuse value depends almost entirely on owning that exact hardware.
- Builds bootable Debian 12 Bookworm images, so the value is hardware repurposing more than a general dev-tool upgrade.
- Public prerelease build is dated May 14, 2026; treat it as an experiment box, not a dependable main workstation.
Source: news.hada.io/topic?id=29622
#0003 Other GeekNews 2 days ago
Stay Native Until Text Forces Your Hand
40 radar
SwiftUI can handle Markdown chat UI until document-wide selection enters scope. Jumping to NSTextView brings TextKit 2 complexity and streaming CPU spikes, so delay it.
- SwiftUI gives acceptable baseline performance for Markdown chat, but full-document text selection is hard to support cleanly.
- Moving to NSTextView and TextKit 2 trades native UI simplicity for lower-level text control and more performance work.
- Streaming input can trigger CPU spikes in the text stack. Chat apps should benchmark incremental rendering before committing.
Source: news.hada.io/topic?id=29602