VibeVoice is a family of open-source voice AI models spanning long-form ASR, realtime streaming TTS, and research-era long-form multi-speaker TTS.

What can I try today?

The public ASR Playground and the realtime TTS Colab are the clearest public entry points highlighted by the VibeVoice release materials.

Is VibeVoice ready for production?

The public materials frame VibeVoice as research-first and recommend further testing before commercial or real-world deployment.

Why is the TTS model shown as archive status?

The VibeVoice materials note that the TTS code was removed from the repository after a responsible-use review, so it is presented here as part of the model history rather than a live public demo.

Open-Source Frontier Voice AI

VibeVoice ASR and realtime TTS in one open-source voice AI overview.

VibeVoice combines 60-minute single-pass ASR, streaming-friendly realtime TTS, and the research history of long-form multi-speaker voice generation in a single public release.

Independent editorial guide to the public VibeVoice release materials, demos, weights, and papers.

Single pass: 60 min ASR
Language reach: 50+ languages
First audio: ~300 ms
Research TTS: 90 min / 4 speakers
Model family: ASR / Realtime / TTS Archive

Structured transcription, streaming synthesis, and long-form voice research in one visual stack.

Who / When / What: Structured ASR output
Streaming text input: Realtime speech generation
Responsible release posture: Research-first positioning

Model Family

Three distinct release tracks, one consistent voice AI story.

The current public VibeVoice materials center on long-form recognition, realtime synthesis, and the archival context of the original long-form TTS release.

VibeVoice-ASR-7B (Live)

Long-form speech recognition built for one-pass context.

Accepts up to 60 minutes of audio in one run while preserving speaker continuity, timestamps, and domain-aware phrasing through customized hotwords.

Structured output across speaker, timestamp, and transcript content
Native multilingual coverage across 50+ languages
Public Playground plus Hugging Face access
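As a rough illustration of what the 60-minute single-pass figure means for planning, the sketch below computes how many runs a recording would need under that limit. The helper name `passes_needed` and the constant are hypothetical and not part of any VibeVoice API; only the 60-minute figure comes from the release materials.

```python
import math

# 60-minute single-pass figure quoted in the release materials.
SINGLE_PASS_LIMIT_S = 60 * 60


def passes_needed(duration_s: float, limit_s: float = SINGLE_PASS_LIMIT_S) -> int:
    """Return how many runs a recording needs if each run takes at most limit_s seconds.

    Hypothetical planning helper; it does not call any model.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return math.ceil(duration_s / limit_s)
```

A 45-minute meeting fits in one run, while a 150-minute recording would need three runs under this assumption, each of which loses the cross-run speaker continuity that a single pass preserves.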

VibeVoice-Realtime-0.5B (Live)

Streaming TTS tuned for fast first audio and long-form continuity.

A lighter realtime model that is easier to deploy: it supports streaming text input, roughly 300 ms latency to first audible speech, and longer-running speech sessions.

Streaming text input with low perceived delay
Long-form robustness for multi-minute sessions
Public Colab path for quick experimentation

VibeVoice-TTS-1.5B (Archive)

Long-form multi-speaker TTS remains part of the release history.

The original research release introduced up to 90 minutes of generated speech with support for as many as four distinct speakers, but the code was later removed from the repository after a responsible-use review.

Long-form conversational audio and podcast-style generation
Multi-speaker coherence across extended sessions
Presented here as research context rather than a live demo

Capability Matrix

Optimized around context depth, low-latency synthesis, and release clarity.

Instead of presenting voice AI as a single monolith, VibeVoice exposes separate model lanes for long-context ASR, realtime generation, and long-form research synthesis.

ASR Context Window: 60 min
Multilingual Reach: 50+ languages
Realtime Response: ~300 ms
Long-form TTS Range: 90 min

Who / When / What

ASR output is structured, not just raw text.

The public ASR release emphasizes speaker identity, timestamps, and transcript content as one unified result surface.
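One way to picture a "Who / When / What" result is as a list of segments, each carrying a speaker, a time range, and text. The sketch below is a minimal illustration under that assumption; the `Segment` class and its field names are hypothetical and do not reflect the actual VibeVoice-ASR output schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One "Who / When / What" unit; field names are illustrative assumptions."""
    speaker: str    # who
    start_s: float  # when: start time in seconds
    end_s: float    # when: end time in seconds
    text: str       # what


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Turn structured segments into a readable transcript, one line per segment."""
    return "\n".join(
        f"[{format_timestamp(s.start_s)}-{format_timestamp(s.end_s)}] {s.speaker}: {s.text}"
        for s in segments
    )
```

Keeping the three fields separate, rather than baking them into one string, is what lets downstream code filter by speaker or jump to a timestamp without re-parsing text.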

Streaming path

Realtime TTS is lighter by design.

VibeVoice-Realtime-0.5B shortens the path to quick experimentation with streaming text input and a faster audible response.
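Streaming text input generally means the synthesizer receives text in pieces rather than as one finished script. The sketch below shows one generic way to turn an incoming stream of text fragments into sentence-sized chunks as soon as each sentence completes. It is an assumption-laden illustration: `sentence_chunks` is a hypothetical helper, and nothing here is taken from the VibeVoice-Realtime interface.

```python
import re
from typing import Iterable, Iterator


def sentence_chunks(pieces: Iterable[str]) -> Iterator[str]:
    """Buffer incoming text fragments and yield each sentence once it is complete.

    A sentence counts as complete when ., ! or ? is followed by whitespace.
    Any trailing text is flushed when the input stream ends.
    """
    buffer = ""
    for piece in pieces:
        buffer += piece
        while True:
            match = re.search(r"[.!?]\s+", buffer)
            if match is None:
                break
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()
```

Each yielded chunk could be handed to a speech generator immediately, which is the general idea behind keeping the delay before first audio low: synthesis can begin before the full text exists.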

Research posture

Release notes matter as much as raw capability.

The model family is framed publicly as research-oriented, with explicit cautions around production use and synthetic audio misuse.

Release Timeline

From long-form TTS research to public ASR and realtime access points.

The current public story is best understood as a sequence of model releases, demos, and responsible-use adjustments rather than a single product drop.

2026-03-06

ASR enters the Transformers ecosystem.

The public materials note that VibeVoice-ASR support arrives through a Hugging Face Transformers release.

2026-01-21

VibeVoice-ASR becomes the headline public release.

Single-pass 60-minute transcription, multilingual support, and a public Playground become the clearest onboarding path.

2025-12-03

Realtime-0.5B opens a streaming TTS lane.

The lighter realtime model adds a public Colab route for quicker hands-on evaluation.

2025-09-05

TTS code is removed after a responsible-use review.

The release history explicitly calls out a change in repository scope tied to safer usage boundaries.

2025-08-25

The original TTS research release defines the long-form ambition.

Up to 90 minutes of generated speech with multiple distinct speakers sets the tone for the broader VibeVoice family.

Resources

Jump straight to docs, weights, demos, and source material.

This site packages the public entry points into one fast overview page, while the underlying model details remain with the official repository and papers.

Source: GitHub Repository

Browse release notes, docs, and model references from the public Microsoft repository.

Models: Hugging Face Collection

Move directly into the public weight pages for ASR, Realtime, and archived TTS artifacts.

Demo: ASR Playground

Use the clearest public try-now path for the current VibeVoice release stack.

Demo: Realtime Colab

Open the streaming TTS notebook to explore low-latency speech generation experiments.

Paper: ASR Technical Report

Read the structured transcription and long-context ASR framing behind the public release.

Paper: TTS Research Report

Review the original long-form multi-speaker synthesis research that seeded the family.

FAQ

Questions the public release naturally raises.

Short answers for what VibeVoice is, what you can try, and how the current scope should be interpreted.

What is VibeVoice?

VibeVoice is a family of voice AI models that spans long-form ASR, realtime streaming TTS, and the historical research release of long-form multi-speaker TTS.

What can I try right now?

The public ASR Playground and the Realtime Colab are the most direct, live entry points surfaced by the current release materials.

Is the original TTS model still a live public demo?

No live TTS demo is presented here. The TTS release remains part of the public research story, but the repository notes that the code was removed after a responsible-use review.

Can I treat this as production-ready voice infrastructure?

The public release guidance presents VibeVoice as research-first and recommends further testing before real-world or commercial deployment.