#0412
`llama.cpp` fork enables quantized KV cache with tensor split
50radar
llama.cpp: Local LLM inference engine; supports GGUF and CUDA backends
Tensor parallelism (`-sm tensor`) becomes usable together with a quantized KV cache on dual GPUs. This is still a fork, and MoE models break, so treat it as a test-only tweak for local inference.
- Benchmarked Qwen3.5 27B Q4_K_M at 30.05 tok/s with `-sm tensor` vs 21.22 tok/s without it for generation.
- The command uses `-ctk q8_0 -ctv q8_0`, removing the old tensor-split tradeoff of falling back to non-quantized KV cache.
- Author reports real use rising from about 25 tok/s to 40 tok/s on 3060 12GB + 4070 Super 12GB.
- MoE models currently break with `-sm tensor`; dense models like Qwen 27B/9B are the safer test target.
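
To put the reported numbers in perspective, here is a minimal Python sketch that computes the relative gains and assembles a launch command from the flags quoted above. Only `-sm tensor`, `-ctk q8_0`, and `-ctv q8_0` come from the post; the binary name `./llama-server`, the model path `model.gguf`, and `-ngl 99` (offload all layers) are assumptions for illustration, and `-sm tensor` is only expected to work on the fork, not necessarily on upstream llama.cpp.

```python
import shlex


def speedup(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput gain: 0.25 means 25% faster than baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return new_tps / baseline_tps - 1.0


def build_command(binary: str, model_path: str) -> list[str]:
    """Assemble an argument list using the flags quoted in the post.

    `binary`, `model_path`, and `-ngl 99` are hypothetical placeholders;
    adjust them to your own build and model file.
    """
    return [
        binary,
        "-m", model_path,
        "-ngl", "99",        # assumption: offload all layers to the GPUs
        "-sm", "tensor",     # split mode from the fork (dense models only)
        "-ctk", "q8_0",      # quantized K cache
        "-ctv", "q8_0",      # quantized V cache
    ]


# Figures as reported in the post (generation throughput, tok/s).
bench_gain = speedup(21.22, 30.05)
real_gain = speedup(25.0, 40.0)

print(f"benchmark gain: {bench_gain:.1%}")
print(f"reported real-use gain: {real_gain:.1%}")
print(shlex.join(build_command("./llama-server", "model.gguf")))
```

By these figures the benchmark gain is roughly 42% and the reported real-use gain is about 60%; the two were measured under different conditions, so they are not directly comparable.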
Source: www.reddit.com/r/LocalLLaMA/comments/1tflngz/dual_gpu_ll