NO.: #815c79df
Topic: OTHER
Source: r/LocalLLaMA
Published: 2026-05-17 10:24:36
Importance: ★ 5/10 — radar 50
Original: www.reddit.com/r/LocalLLaMA/comments/1tflngz/dual_gpu_llamacpp_speedup/

`llama.cpp` fork enables quantized KV cache with tensor split

Tensor parallelism becomes usable together with a quantized KV cache on dual GPUs. This is still a fork, and MoE models do not work yet, so treat it as a test-only tweak for local inference.

[ KEY POINTS ]

- In the generation benchmark, Qwen3.5 27B Q4_K_M reached 30.05 tok/s with `-sm tensor` versus 21.22 tok/s without it, roughly a 1.42x speedup.
- The command uses `-ctk q8_0 -ctv q8_0`, which removes the old tensor-split tradeoff of falling back to a non-quantized KV cache.
- The author reports real-world generation speed rising from about 25 tok/s to about 40 tok/s on a 3060 12GB + 4070 Super 12GB pair.
- MoE models currently break with `-sm tensor`; dense models such as Qwen 27B/9B are the safer test targets.
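
To make the key points concrete, here is a minimal Python sketch that computes the two reported speedups and assembles a benchmark command line from the flags quoted above. Only `-sm tensor`, `-ctk q8_0`, and `-ctv q8_0` come from the post; the `llama-bench` binary name, the `-ngl 99` offload setting, and the model filename are assumptions for illustration, and the fork may expect something different.

```python
import shlex


def speedup(after_tps: float, before_tps: float) -> float:
    """Return the throughput ratio (new tokens/s divided by old tokens/s)."""
    return after_tps / before_tps


def bench_argv(model_path: str, split_mode: str = "tensor",
               kv_type: str = "q8_0") -> list[str]:
    """Build an argv list for a benchmark run.

    Assumptions (not from the post): the binary is called `llama-bench`
    and all layers are offloaded with `-ngl 99`.
    """
    return [
        "llama-bench",
        "-m", model_path,
        "-ngl", "99",          # assumed: offload all layers to the GPUs
        "-sm", split_mode,     # from the post: tensor split mode (fork only)
        "-ctk", kv_type,       # from the post: quantized K cache
        "-ctv", kv_type,       # from the post: quantized V cache
    ]


if __name__ == "__main__":
    # Figures reported in the post.
    print(f"benchmark: {speedup(30.05, 21.22):.2f}x")
    print(f"real use:  {speedup(40, 25):.2f}x")
    # Hypothetical model filename, for illustration only.
    print(shlex.join(bench_argv("qwen3.5-27b-q4_k_m.gguf")))
```

With the reported numbers, the benchmark gain works out to about 1.42x and the real-use gain to 1.60x. Building the command as a list makes it easy to rerun the same benchmark with a different `split_mode` and compare results on the same hardware.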


// related

#0001 Other r/MachineLearning yesterday
Hugging Face revives `PapersWithCode` with AI-parsed leaderboards
radar 50
PapersWithCode: AI paper tracker that links code and benchmarks
The rebuilt site tracks trending papers, methods, citations, repos, artifacts, and benchmark results. Useful for model scouting, but still manually verified and early-stage.
- Default ranking uses GitHub star velocity, so it surfaces research projects gaining developer attention, not just citation-heavy papers.
- Coverage starts with high-impact items like Qwen 3.5, RF-DETR, DINOv3, MTEB, Open ASR Leaderboard, and coding-agent benchmarks.
- Paper pages auto-link GitHub repos, project URLs, artifacts, PDFs, and external non-Arxiv papers; multiple repos per paper are supported.
- Leaderboards exist by benchmark and domain, including MMTEB, COCO val 2017, and Terminal Bench; handy for fast model/vendor filtering.
- Result extraction uses AI agents, but verification is still manual. Treat it as a shortlist generator, not a source of record yet.
Source: www.reddit.com/r/MachineLearning/comments/1tgmwqr/revivi
#0002 Other GeekNews yesterday
`rkdebian` turns an $80 RK3562 Android tablet into a Debian workstation
radar 40
rkdebian: Debian image build system built for the Doogee U10
A cheap locked-down device can become a bootable Debian 12 machine. Useful for low-cost Linux experiments, but the device scope is narrow and prerelease status keeps it niche.
- Targets the Rockchip RK3562-based Doogee U10; reuse value depends almost entirely on owning that exact hardware.
- Builds bootable Debian 12 Bookworm images, so the value is hardware repurposing more than a general dev-tool upgrade.
- Public prerelease build is dated May 14, 2026; treat it as an experiment box, not a dependable main workstation.
Source: news.hada.io/topic?id=29622
#0003 Other GeekNews 2 days ago
Stay Native Until Text Forces Your Hand
radar 40
SwiftUI can handle Markdown chat UI until document-wide selection enters scope. Jumping to NSTextView brings TextKit 2 complexity and streaming CPU spikes, so delay it.
- SwiftUI gives acceptable baseline performance for Markdown chat, but full-document text selection is hard to support cleanly.
- Moving to NSTextView and TextKit 2 trades native UI simplicity for lower-level text control and more performance work.
- Streaming input can trigger CPU spikes in the text stack. Chat apps should benchmark incremental rendering before committing.
Source: news.hada.io/topic?id=29602
