NO. #c012abef
Topic: OTHER
Source: r/LocalLLaMA
Published: 2026-05-22 23:29:14
Importance: ★ 4/10 (radar 40)

`Qwen3.6 27B` pure `Q4_K_M` GGUF fits in **16GB VRAM**

Pure quantization trims enough size to keep the whole model on a 16GB consumer GPU. That is useful for local agent tests, but the quality loss is measurable and the benchmarks so far are thin.

[ KEY POINTS ]
  1. The Q4_K_M MTP build is 15.4GB and the non-MTP build is 15.1GB; comparable builds listed at 16.5-18GB often spill past 16GB cards.
  2. MTP reaches 40 tok/s token generation (tg) but only 195 tok/s prompt processing (pp); non-MTP flips the trade-off at 715 tok/s pp and 24 tok/s tg.
  3. The perplexity delta is larger than for Unsloth's quant (+0.1707 vs +0.0553 on MTP), so the size win buys speed and fit at some quality cost.
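
To make the pp/tg trade-off in point 2 concrete, here is a rough Python sketch that estimates end-to-end time for each build and the generated-to-prompt token ratio at which MTP starts to win. It assumes both rates stay constant regardless of context length and ignores load time and other overhead, which the post does not confirm; the names `Build`, `total_seconds`, and `break_even_ratio` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    name: str
    pp_tok_s: float  # prompt processing speed, tokens/second
    tg_tok_s: float  # token generation speed, tokens/second


# Speeds as quoted in the key points above.
MTP = Build("Q4_K_M MTP", pp_tok_s=195.0, tg_tok_s=40.0)
NON_MTP = Build("Q4_K_M non-MTP", pp_tok_s=715.0, tg_tok_s=24.0)


def total_seconds(build: Build, prompt_tokens: int, gen_tokens: int) -> float:
    """Estimated wall time, assuming constant rates (a simplification)."""
    return prompt_tokens / build.pp_tok_s + gen_tokens / build.tg_tok_s


def break_even_ratio(fast_gen: Build, fast_prompt: Build) -> float:
    """gen_tokens / prompt_tokens ratio above which `fast_gen` is quicker.

    Derived by setting total_seconds equal for both builds and solving
    for gen_tokens / prompt_tokens.
    """
    prompt_penalty = 1 / fast_gen.pp_tok_s - 1 / fast_prompt.pp_tok_s
    gen_gain = 1 / fast_prompt.tg_tok_s - 1 / fast_gen.tg_tok_s
    return prompt_penalty / gen_gain


if __name__ == "__main__":
    ratio = break_even_ratio(MTP, NON_MTP)
    print(f"MTP is faster once gen/prompt exceeds {ratio:.3f}")
    for prompt, gen in [(8000, 500), (1000, 1000)]:
        for build in (MTP, NON_MTP):
            t = total_seconds(build, prompt, gen)
            print(f"{build.name}: {prompt} prompt + {gen} gen = {t:.1f} s")
```

Under these assumptions the break-even sits at roughly one generated token per 4.5 prompt tokens: long-context, short-answer agent steps favor non-MTP, while chat-style turns with long replies favor MTP.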
Original: www.reddit.com/r/LocalLLaMA/comments/1tkzk9e/qwen36_27b_pure_quant_40_toks_on_16_gb_vram/
