No.: #815c79df
Topic: OTHER
Source: r/LocalLLaMA
Published: 2026-05-17 10:24:36
Importance: ★ 5/10 (radar 50)

`llama.cpp` fork enables quantized KV cache with tensor split

A `llama.cpp` fork makes tensor-split mode (-sm tensor) work with a quantized KV cache on dual-GPU setups. It is still a fork and MoE models break, so treat it as a test-only tweak for local inference.

[ KEY POINTS ]
  1. Generation benchmark on Qwen3.5 27B Q4_K_M: 30.05 tok/s with -sm tensor vs 21.22 tok/s without it (about 1.42x).
  2. The command uses -ctk q8_0 -ctv q8_0, removing the old tensor-split tradeoff of falling back to non-quantized KV cache.
  3. Author reports real use rising from about 25 tok/s to 40 tok/s on 3060 12GB + 4070 Super 12GB.
  4. MoE models currently break with -sm tensor; dense models like Qwen 27B/9B are the safer test target.
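
The numbers and flags in the key points can be checked with a short Python sketch. It only recomputes the speedups from the reported figures and assembles an argument list from the flags quoted in the post. The binary name `llama-server` and the model filename are assumptions for illustration (the post's exact command is not reproduced here), and -sm tensor is taken from the post as the fork's split mode, so it may not be accepted by other builds.

```python
import shlex

# Figures reported in the post (tokens per second, generation).
BENCH_TENSOR = 30.05      # with -sm tensor
BENCH_BASELINE = 21.22    # without it
REAL_USE_BEFORE = 25.0    # "about 25 tok/s"
REAL_USE_AFTER = 40.0     # "about 40 tok/s"


def speedup(after: float, before: float) -> float:
    """Return the throughput ratio after/before."""
    return after / before


def build_args(model_path: str, binary: str = "llama-server") -> list[str]:
    """Assemble an argument list from the flags quoted in the post.

    The binary name and model path are placeholders (assumptions).
    "-sm tensor" is the fork's split mode as described in the post.
    """
    return [
        binary,
        "-m", model_path,
        "-sm", "tensor",   # tensor split across both GPUs (fork only)
        "-ctk", "q8_0",    # quantized K cache
        "-ctv", "q8_0",    # quantized V cache
    ]


if __name__ == "__main__":
    print(f"benchmark speedup: {speedup(BENCH_TENSOR, BENCH_BASELINE):.2f}x")
    print(f"real-use speedup:  {speedup(REAL_USE_AFTER, REAL_USE_BEFORE):.2f}x")
    # Hypothetical model filename; substitute a dense model, since MoE breaks.
    print(shlex.join(build_args("qwen-27b-q4_k_m.gguf")))
```

The real-use gain (25 to 40 tok/s) works out higher than the benchmark gain, but both figures are the author's own on a 3060 12GB + 4070 Super 12GB pair and may not carry over to other hardware.
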
Original: www.reddit.com/r/LocalLLaMA/comments/1tflngz/dual_gpu_llamacpp_speedup/
