No. #fa481518
Topic: OTHER
Source: r/LocalLLaMA
Published: 2026-05-21 11:09:47
Importance: ★ 4/10 (radar 40)

`ik_llama.cpp` pushes `Qwen3.6 35B A3B` near 110 tok/s on 12GB VRAM

Multi-token prediction (MTP) plus CPU offload can make a local MoE model feel interactive on consumer hardware. That is useful for private coding or batch jobs, but the result is still a single setup-specific benchmark.

[ KEY POINTS ]
  1. Same IQ4_XS quant averaged 89.76 tok/s on regular llama.cpp; ik_llama.cpp samples reached roughly 105-110 tok/s.
  2. Hardware was an RTX 4070 Super 12GB, a Ryzen 7 9700X, and 48GB of DDR5. Because part of the model runs on the CPU, CPU and system-memory performance matter as much as VRAM capacity.
  3. Benchmark used --ctx-size 131072, q8 KV cache, and draft-mtp; long-context local workflows remain memory-sensitive.
  4. Treat it as a tuning lead, not a buying guide. Kernel, quant, and fork versions can swing results hard.
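
For scale, the figures in key points 1 and 3 can be turned into rough numbers. The Python sketch below computes the relative throughput gain from the reported tok/s values and estimates KV cache size with the usual per-token formula. The layer count, KV head count, and head dimension are hypothetical placeholders, not the actual Qwen3.6 35B A3B configuration, and the cost of 34 bytes per 32 values assumes the "q8" cache uses llama.cpp's q8_0 block layout.

```python
def speedup_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput gain of new_tps over baseline_tps, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_tokens: int, bytes_per_elem: float) -> float:
    """Approximate KV cache size: one K and one V vector per layer per token."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem


# Throughput figures taken from key point 1.
BASELINE_TPS = 89.76                    # regular llama.cpp average
IK_LOW_TPS, IK_HIGH_TPS = 105.0, 110.0  # ik_llama.cpp sample range

low = speedup_percent(BASELINE_TPS, IK_LOW_TPS)
high = speedup_percent(BASELINE_TPS, IK_HIGH_TPS)
print(f"ik_llama.cpp gain: {low:.1f}% to {high:.1f}%")

# Assumption: q8_0 stores 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 values.
Q8_0_BYTES_PER_ELEM = 34 / 32

# HYPOTHETICAL architecture values for illustration only; read the real ones
# from the model's metadata before trusting the estimate.
HYPOTHETICAL_ATTN_LAYERS = 10
HYPOTHETICAL_KV_HEADS = 2
HYPOTHETICAL_HEAD_DIM = 256

CTX_TOKENS = 131072  # --ctx-size from key point 3

cache = kv_cache_bytes(HYPOTHETICAL_ATTN_LAYERS, HYPOTHETICAL_KV_HEADS,
                       HYPOTHETICAL_HEAD_DIM, CTX_TOKENS, Q8_0_BYTES_PER_ELEM)
print(f"KV cache at full context (placeholder dims): {cache / 2**30:.2f} GiB")
```

The throughput line works out to a gain of about 17% to 22.5% over regular llama.cpp. The cache estimate scales linearly with context length, so halving --ctx-size halves it, which is why long-context runs on a 12GB card leave little headroom.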
Original: www.reddit.com/r/LocalLLaMA/comments/1tjh7az/110_toks_with_12gb_vram_on_qwen36_35b_a3b_and_ik/
