VibeVoice - Microsoft's Open Source Multi-Speaker Text-to-Speech AI Model for Podcasts and Long-Form Audio
What is VibeVoice - Microsoft's Multi-Speaker TTS Model
VibeVoice is Microsoft's open-source text-to-speech (TTS) model purpose-built for multi-speaker, long-form, conversation-style audio. It can generate roughly 90 minutes of natural, turn-taking dialogue with up to four speakers in a single pass, which makes it well suited to podcasts, audiobooks, and e-learning narration.
Powered by continuous speech tokenizers that run at a low frame rate of about 7.5 Hz and by a next-token diffusion decoder, VibeVoice maintains strong speaker consistency and natural prosody over long sequences. For creators, it works as a podcast voice generator, supports long text-to-speech narration, and enables multi-speaker dialogue synthesis.
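To get a feel for why a low frame rate helps with long audio, here is a rough calculation. It assumes the approximate 7.5 Hz rate and 90-minute limit quoted above are exact, and it counts speech frames only (text tokens and any other context are ignored). The function name `speech_frames` is made up for this illustration and is not part of VibeVoice.

```python
# Back-of-the-envelope count of speech frames for a long generation.
# Both constants are the approximate figures quoted in the text, treated as exact.
FRAME_RATE_HZ = 7.5   # tokenizer frame rate (approximate)
MAX_MINUTES = 90      # longest single-pass generation (approximate)


def speech_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of speech frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    print(speech_frames(1))            # 450 frames per minute of audio
    print(speech_frames(MAX_MINUTES))  # 40500 frames for a 90-minute session
```

Under these assumptions, a full 90-minute session is on the order of 40,000 speech frames, which is what keeps such a long sequence manageable.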
The project is MIT-licensed. You can run it locally or try it through hosted demos.
VibeVoice Demos - AI Text-to-Speech in Action
Watch How VibeVoice Generates Natural Multi-Speaker Conversations
VibeVoice Setup Guide - Getting Started with Multi-Speaker TTS
VibeVoice Four-Speaker Conversation Demo
VibeVoice Advanced Features and Capabilities
Local VibeVoice Installation Tutorial
VibeVoice FAQ - Common Questions About Multi-Speaker TTS
How long and how many speakers per generation?
Up to ~90 minutes and up to 4 speakers in one pass, depending on the chosen variant, compute, and hosting limits.
Which languages are supported?
Primarily English and Chinese. Cross-lingual and singing abilities are emergent and may be unstable depending on script and prompts.
What are the typical use cases?
Podcast voice generator, interview/panel dialogues, audiobook conversations, long text-to-speech course narration, role-play, and customer-service simulations.
How is it different from traditional single-speaker TTS?
VibeVoice focuses on conversational TTS: multiple speakers, natural turn-taking, and stability over long durations. Traditional TTS often targets short, single-speaker text and handles dialogue and very long content less well.
How should I structure my script?
Label each line with a speaker (e.g., "Alice: …"), keep sentences short, follow natural turns, and prefer simple punctuation. Add pauses or stage directions only when necessary.
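A script in this shape is easy to check before you submit it. The sketch below is a hypothetical pre-flight helper, not part of VibeVoice: it assumes each non-empty line looks like "Name: text", splits on the first colon, and rejects scripts with more than four distinct speakers (the limit quoted in this FAQ). The names `parse_script` and `MAX_SPEAKERS` are illustrative.

```python
from typing import List, Tuple

MAX_SPEAKERS = 4  # upper bound quoted in this FAQ; adjust for your setup


def parse_script(script: str) -> List[Tuple[str, str]]:
    """Split a 'Name: text' script into (speaker, text) turns.

    Blank lines are skipped. A line without a speaker label, or a script
    with more than MAX_SPEAKERS distinct speakers, raises ValueError.
    """
    turns: List[Tuple[str, str]] = []
    for number, raw in enumerate(script.splitlines(), start=1):
        if not raw.strip():
            continue
        # Split on the first colon only, so colons inside the text survive.
        name, sep, text = raw.partition(":")
        if not sep or not name.strip() or not text.strip():
            raise ValueError(f"line {number} is not 'Speaker: text': {raw!r}")
        turns.append((name.strip(), text.strip()))

    speakers = {name for name, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers found, limit is {MAX_SPEAKERS}")
    return turns


if __name__ == "__main__":
    demo = "Alice: Welcome back to the show.\nBob: Thanks, glad to be here."
    for speaker, text in parse_script(demo):
        print(f"{speaker} -> {text}")
```

Running a check like this first catches unlabeled lines and extra speakers before you spend time on a long generation.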
How do I reduce artifacts like background music or odd prosody?
Try a different voice or prompt, split long sentences, soften emotional cues, or post-process with light denoising. For very long projects, generate one chapter at a time and stitch the audio files together.
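The per-chapter approach can be sketched with Python's standard `wave` module. This assumes each chapter was saved as an uncompressed PCM WAV file and that all chapters share the same channel count, sample width, and sample rate. The function name `stitch_wavs` and the file names in the usage line are hypothetical.

```python
import wave
from typing import Sequence, Tuple


def _wav_format(path: str) -> Tuple[int, int, int]:
    """Return (channels, sample width in bytes, sample rate) of a WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth(), wav.getframerate()


def stitch_wavs(chapter_paths: Sequence[str], output_path: str) -> int:
    """Concatenate WAV files in order and return the total frames written."""
    if not chapter_paths:
        raise ValueError("no chapter files given")

    # Check every file up front so a mismatch never leaves a partial output.
    formats = [_wav_format(path) for path in chapter_paths]
    if any(fmt != formats[0] for fmt in formats):
        raise ValueError("chapters differ in channels, sample width, or rate")

    channels, sample_width, sample_rate = formats[0]
    total_frames = 0
    with wave.open(output_path, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(sample_width)
        out.setframerate(sample_rate)
        for path in chapter_paths:
            with wave.open(path, "rb") as chapter:
                frame_count = chapter.getnframes()
                out.writeframes(chapter.readframes(frame_count))
                total_frames += frame_count
    return total_frames


if __name__ == "__main__":
    import sys
    # Usage: python stitch.py output.wav chapter1.wav chapter2.wav ...
    if len(sys.argv) >= 3:
        print(stitch_wavs(sys.argv[2:], sys.argv[1]))
```

This copies audio one chapter at a time, so only a single chapter is held in memory. MP3 or other compressed output would need a separate tool, which is outside this sketch.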
Does it support voice cloning or celebrity mimicry?
The public demos generally do not offer voice cloning. Do not mimic real people without consent; follow applicable laws and platform rules.
What export formats are available? Who owns the output?
You can download audio (commonly WAV/MP3, depending on the demo). You're responsible for ensuring copyright/compliance when using or publishing the output.