AI-GENERATED CONTENT: This article and author profile are created using artificial intelligence.

VibeVoice Tutorial: Multi-Speaker TTS Setup & Usage Guide

Deploy VibeVoice for multi-speaker TTS quickly. Step-by-step install, demo commands, and a checklist to generate 90-minute expressive audio.


Short answer

VibeVoice is Microsoft's multi-speaker text-to-speech framework that can synthesize up to 90 minutes of expressive audio with up to four distinct speakers. The official repository on GitHub and model releases on Hugging Face remain active. This guide shows how to set it up with Docker or manually, run demos, and create multi-speaker outputs.

What is VibeVoice?

VibeVoice is a TTS system built for long-form, multi-speaker conversational audio. It combines continuous speech tokenizers, a next-token diffusion framework, and an LLM for context management.

Key abilities:

  • Synthesize continuous audio up to 90 minutes with up to 4 distinct speakers.
  • Uses continuous speech tokenizers (7.5 Hz) and a next-token diffusion framework plus an LLM for context.
  • Supports English and Chinese and enables cross-lingual synthesis.

Official code and docs live at the Microsoft VibeVoice GitHub repository and the project site microsoft.github.io/VibeVoice.

Before you start: quick checklist

  • GPU with CUDA (NVIDIA recommended). Multi-speaker long-form audio requires VRAM; aim for a 24GB+ GPU for the largest models.
  • Docker with NVIDIA/container toolkit if using the Docker path.
  • Python 3.10+ and pip for manual install.
  • Enough disk space to store model weights (see the Hugging Face pages for model sizes).
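Before installing, a short preflight script can catch the most common blockers — an old Python or a nearly full disk. This is a minimal sketch; the 20 GB threshold is an illustrative assumption, so check the Hugging Face model pages for the actual weight sizes you plan to download.

```python
import shutil
import sys

def preflight(min_python=(3, 10), min_free_gb=20):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required, "
                        f"found {sys.version_info.major}.{sys.version_info.minor}")
    free_gb = shutil.disk_usage(".").free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free; model weights may not fit")
    return problems

if __name__ == "__main__":
    issues = preflight()
    print("ready" if not issues else "\n".join(issues))
```

Run it once before pulling model weights; if it prints anything other than "ready", fix the environment first.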

Install (two ways)

Docker (recommended)

Docker is the fastest, most reproducible path. Use an NVIDIA container that bundles CUDA and PyTorch.

sudo docker run --privileged --net=host --ipc=host --ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all --rm -it nvcr.io/nvidia/pytorch:24.07-py3

Inside the container, clone the repository and install its dependencies; the manual-install commands in the next section work inside the container as well.

Manual install

If you prefer a local Python environment, run these commands:

git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

Flash Attention can speed up training and inference; install it if your GPU supports it.

Run the demo interface

The repository includes a Gradio demo. Replace model paths with the Hugging Face model IDs or local paths.

python demo/gradio_demo.py --model_path microsoft/VibeVoice-1.5B --share
python demo/gradio_demo.py --model_path microsoft/VibeVoice-Large --share

The demo helps you test single- and multi-speaker flows quickly.

File-based inference: single and multi-speaker examples

Use the included inference script to synthesize from a text file.

python demo/inference_from_file.py --model_path microsoft/VibeVoice-Large --txt_path demo/text_examples/1p_abs.txt --speaker_names Alice

For two speakers, list both names in order:

python demo/inference_from_file.py --model_path microsoft/VibeVoice-Large --txt_path demo/text_examples/2p_music.txt --speaker_names Alice Frank

Think of speaker names like different actors reading parts of a script; each name tells VibeVoice who speaks next.
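As a concrete sketch, the helper below writes a two-speaker script as "Name: line" turns, one per line. The exact file format expected by the inference script is an assumption here — inspect the bundled files in demo/text_examples before relying on it.

```python
from pathlib import Path

def write_script(path, turns):
    """Write (speaker, line) pairs as 'Name: text' lines, one turn per line."""
    text = "\n".join(f"{speaker}: {line}" for speaker, line in turns)
    Path(path).write_text(text, encoding="utf-8")
    return text

script = write_script("two_speakers.txt", [
    ("Alice", "Welcome back to the show."),
    ("Frank", "Thanks, Alice. Today we cover multi-speaker TTS."),
    ("Alice", "Let's dive in."),
])
# Then synthesize, listing names in order of first appearance:
# python demo/inference_from_file.py --model_path microsoft/VibeVoice-Large \
#     --txt_path two_speakers.txt --speaker_names Alice Frank
```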

Tips for long-form & multi-speaker TTS

  • Segment long scripts into scenes or paragraphs to keep context manageable.
  • Match speaker names consistently; each distinct name creates a unique voice profile.
  • Use the official model releases, VibeVoice-1.5B and VibeVoice-Large; preview builds have been superseded by the final releases.
  • Cross-lingual lines: VibeVoice supports English and Chinese; expect variations in prosody and tokenization when switching languages mid-stream.
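The first tip above — segmenting long scripts — can be sketched as a simple paragraph-based chunker. The character budget is an illustrative assumption; tune it to your model and available VRAM.

```python
def segment_script(text, max_chars=2000):
    """Group blank-line-separated paragraphs into chunks under max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the audio concatenated, which keeps per-run context (and memory use) bounded.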

Troubleshooting common issues

  • Out of memory: Switch to a smaller model, reduce batch sizes, or segment the script into smaller parts.
  • Missing CUDA or GPU not found: Verify drivers and that Docker runs with --gpus all. Check CUDA version compatibility with your PyTorch build.
  • Flash-attn install errors: Ensure your compiler and CUDA toolkit versions match the extension's requirements or skip if not supported.
  • Slow or poor-quality audio: Confirm you're using the official weights and not deprecated preview models. Compare outputs with the Hugging Face model pages.
  • Want help from the community: Open an issue at the GitHub repo and include a short repro and log snippets.
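For the "GPU not found" case, a small diagnostic that shells out to nvidia-smi can confirm the driver is visible before you start debugging Docker flags. This is a sketch: it returns None when the tool is missing or fails, rather than raising.

```python
import subprocess

def gpu_info():
    """Return nvidia-smi output if the NVIDIA driver is visible, else None."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv"],
            capture_output=True, text=True, timeout=10,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None

info = gpu_info()
print(info if info else "No NVIDIA GPU visible; check drivers and --gpus all")
```

If this prints GPU names inside your container but PyTorch still reports no CUDA device, the mismatch is usually between the CUDA toolkit and the PyTorch build.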

Quick comparison: VibeVoice vs generic TTS

| Feature          | VibeVoice                  | Generic TTS                                      |
|------------------|----------------------------|--------------------------------------------------|
| Long-form length | Up to 90 minutes           | Usually much shorter                             |
| Multi-speaker    | Up to 4 speakers           | Often single voice or limited speaker switching  |
| Architecture     | Next-token diffusion + LLM | Varies (concatenative, neural, or TTS-only)      |

Deployment checklist

  • Pick model: 1.5B for lighter use or Large for higher quality.
  • Choose Docker for reproducibility or manual install for custom environments.
  • Test with the Gradio demo, then run file-based inference for batch scripts.
  • Segment long scripts and label speakers consistently.
  • Open issues with a short repro if you hit model bugs: GitHub.

FAQ

Are the official models still available?

Yes. Microsoft's official repository and the released models on Hugging Face remain accessible. Preview variants have been superseded by the final releases.

How many distinct voices can I get?

VibeVoice supports up to four distinct speakers in a single long-form generation session.

Can I use VibeVoice for podcasts?

Yes. It's well-suited for scripted podcasts and conversational audio. For best results, review the checklist and test segments first.

Where to learn more

Official repo: GitHub. Model pages: VibeVoice-1.5B, VibeVoice-Large. Project docs: microsoft.github.io/VibeVoice. A community write-up: MarkTechPost article.

You can think of VibeVoice as a small radio-play engine that follows a script and calls the right actor to speak. Quick check: have you run the Gradio demo and generated a short two-speaker clip?

Jamie, Developer Educator & Tutorial Creator

Jamie started as a self-taught developer and now helps others make the same journey. Known for breaking down complex concepts into digestible steps. (AI-generated persona)
