NO. #76422314
Topic: OTHER
Source: r/LocalLLaMA
Published: 2026-05-14 22:55:05
Importance: ★ 5/10, radar 50

Self-Training With Verifiable Rewards Pushes `Qwen 2.5` 7B to **112/164** on HumanEval

A self-generated code-and-tests loop produced a large HumanEval jump without human-written training pairs. It is cheap enough to replicate, but it remains a one-off experiment rather than a product-ready recipe.

[ KEY POINTS ]
  1. The loop is simple: generate problems, sample multiple solutions, let a Python interpreter score them, and keep (failed attempt, fixed attempt) pairs as training data.
  2. After a grading bug was fixed, Qwen 2.5 7B moved from 25/164 to 112/164 on HumanEval (roughly 15% to 68%); that is a big enough jump to treat as a real benchmark signal.
  3. A Qwen 2.5 14B run used 100 mined pairs, took 95 minutes on an H100, and cost $3.50; the barrier is much lower than RL folklore suggests.
  4. A control run trained on fake pairs scored 25/164, identical to the base model, which suggests the lift came from the correction data rather than from format imitation.
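
The mining step in point 1 can be sketched roughly as follows. This is a minimal illustration, not the original poster's code: the helper names `passes` and `mine_pairs` are hypothetical, the sampled solutions are assumed to arrive as plain strings, and the tests are assumed to be a string of `assert` statements. A bare subprocess is not a sandbox, so a real run would need isolation before executing model-written code.

```python
import subprocess
import sys
import tempfile
from itertools import product
from pathlib import Path


def passes(solution: str, tests: str, timeout: float = 5.0) -> bool:
    """Run solution + tests in a fresh interpreter; exit code 0 counts as a pass.

    Hypothetical helper: the interpreter is the only judge, no human labels.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # Treat hangs and infinite loops as failures.
            return False
        return result.returncode == 0


def mine_pairs(samples: list[str], tests: str) -> list[tuple[str, str]]:
    """Return every (failed attempt, fixed attempt) pair for one problem."""
    graded = [(s, passes(s, tests)) for s in samples]
    failed = [s for s, ok in graded if not ok]
    fixed = [s for s, ok in graded if ok]
    # A problem where all samples pass, or all fail, yields no pairs.
    return list(product(failed, fixed))
```

Under these assumptions, a problem contributes training pairs only when its samples disagree: if one sample of `add` returns `a - b` and another returns `a + b`, the first fails the asserts, the second passes, and `mine_pairs` returns that single (failed, fixed) pair.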
Original: www.reddit.com/r/LocalLLaMA/comments/1tde3m1/i_let_a_small_model_train_on_its_own_mistakes_it/
