Topic: MODELS & API
Source: Simon Willison
Published: 2026-04-27 23:46:56
Importance: ★ 6/10 (radar 60)

Microsoft VibeVoice: Open-Source ASR with Built-In Speaker Diarization

A practical look at VibeVoice, Microsoft's MIT-licensed speech-to-text model, focusing on its built-in speaker diarization and on running it locally on Apple Silicon via MLX. For indie developers, the main appeal is owning the transcription workflow without depending on an API. The trade-off is hardware: model downloads run to multiple gigabytes, and RAM usage for long-form audio is very high.

[ KEY POINTS ]
  1. Strong indie value: MIT license plus built-in diarization reduces reliance on paid transcription APIs and post-processing pipelines.
  2. Operational cost is the main constraint: the 4-bit MLX model is 5.71 GB, the original model is 17.3 GB, and observed RAM usage can exceed 30 GB and even spike above 60 GB.
  3. Long-form audio is viable but needs tuning: the default token limit only covers about 25 minutes, so longer recordings require increasing --max-tokens.
  4. Best fit appears to be privacy-sensitive or cost-sensitive transcription products for developers who already own high-end local hardware.
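
Key point 3 implies a simple proportional calculation when choosing a larger --max-tokens value. The sketch below is illustrative only: it assumes that transcript tokens grow roughly linearly with audio length, that the default limit covers about 25 minutes (as stated above), and that the caller supplies the actual default token count, which this summary does not give. The function name, the `headroom` parameter, and the example value 8192 are hypothetical.

```python
import math


def scaled_max_tokens(
    audio_minutes: float,
    default_tokens: int,
    default_minutes: float = 25.0,
    headroom: float = 1.1,
) -> int:
    """Estimate a --max-tokens value for a recording of a given length.

    Assumes token usage scales linearly with duration (an approximation:
    dense speech produces more tokens per minute than sparse speech).
    `default_tokens` is whatever the tool's default limit is; it is not
    stated in the summary, so it must be supplied by the caller.
    """
    if audio_minutes <= 0 or default_tokens <= 0 or default_minutes <= 0:
        raise ValueError("durations and token counts must be positive")
    if audio_minutes <= default_minutes:
        # The default limit already covers recordings this short.
        return default_tokens
    # Scale proportionally, then add headroom for talkative recordings.
    return math.ceil(default_tokens * (audio_minutes / default_minutes) * headroom)


if __name__ == "__main__":
    # 8192 is a placeholder default, not a value taken from the source.
    print(scaled_max_tokens(60, default_tokens=8192))
```

The headroom factor is a judgment call: a limit that is too low truncates the transcript, while a higher limit only matters if the recording actually needs it.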
Original: simonwillison.net/2026/Apr/27/vibevoice/#atom-everything
