No.: #dbdb4381
Topic: AGENTS & TOOLS
Source: r/LocalLLaMA
Published: 2026-05-22 17:34:59
Importance: ★ 4/10 (radar 40)

BeeLlama v0.2.0 boosts inference speed by up to 4.9x on an RTX 3090

BeeLlama is an inference engine that reports up to a 4.9x token-generation speedup over llama.cpp using DFlash. This makes high-throughput local LLMs more viable on consumer GPUs such as the RTX 3090.

[ KEY POINTS ]
  1. Achieves 164 tokens/sec with Qwen 3.6 27B on a single RTX 3090, a 4.4x speedup over llama.cpp's 37.2 tokens/sec.
  2. DFlash, a form of speculative decoding, accelerates inference by having a smaller draft model propose tokens that the main model then verifies. Prompt processing speed is similar, but token generation is significantly faster.
  3. The update adds full support for Gemma 4 31B and is compatible with the GGUF format, easing integration with the existing local LLM ecosystem.
  4. This makes fast prototyping and small-scale services more feasible on your own hardware, especially for tasks involving long text generation, and avoids cloud API costs.
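
To illustrate the speculative decoding idea in key point 2, here is a minimal sketch of the generic greedy variant. It is not BeeLlama's or DFlash's actual implementation, whose details the post summary does not give. The toy `target` and `draft` functions, and the names `speculative_generate` and `greedy_generate`, are hypothetical stand-ins for real models. The sketch shows why generation gets faster: each expensive target pass can confirm several drafted tokens at once, while the output stays identical to plain greedy decoding.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to its greedy next token


def greedy_generate(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_generate(
    target: NextToken, draft: NextToken, prompt: List[Token], max_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, number of target verify passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position. A real engine does this
        #    in ONE batched forward pass; the loop below only simulates that.
        passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch, drop the rest
                break
        else:
            # Every draft token matched, so the same pass also yields a bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + max_new], passes


# Toy models (assumptions for illustration only): the target counts modulo 10,
# and the draft agrees with it except right after a 7.
def target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, passes = speculative_generate(target, draft, [0], max_new=12)
    print(tokens, "target passes:", passes)
```

The more often the draft agrees with the target, the more tokens each verify pass yields, which is why the gain shows up in token generation rather than prompt processing.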
Original: www.reddit.com/r/LocalLLaMA/comments/1tkpz2y/beellama_v020_major_dflash_update_single_rtx_3090/
