Open-Source Frontier Voice AI

An overview of VibeVoice ASR and realtime TTS, two open-source voice AI releases, in one place.

VibeVoice brings together 60-minute single-pass ASR, streaming-friendly realtime TTS, and the earlier research on long-form multi-speaker voice generation in one public release.

Independent editorial guide to the public VibeVoice release materials, demos, weights, and papers.

Single pass: 60 min ASR
Language reach: 50+ langs
First audio: ~300 ms
Research TTS: 90 min / 4 speakers
Model family: ASR / Realtime / TTS Archive

Structured transcription, streaming synthesis, and long-form voice research in one visual stack.

  • Who / When / What: Structured ASR output
  • Streaming text input: Realtime speech generation
  • Responsible release posture: Research-first positioning

Model Family

Three distinct release tracks, one consistent voice AI story.

The current public VibeVoice materials center on long-form recognition, realtime synthesis, and the archival context of the original long-form TTS release.

VibeVoice-ASR-7B (live)

Long-form speech recognition built for one-pass context.

Accepts up to 60 minutes of audio in one run while preserving speaker continuity, timestamps, and domain-aware phrasing through customized hotwords.

  • Structured output across speaker, timestamp, and transcript content
  • Native multilingual coverage across 50+ languages
  • Public Playground plus Hugging Face access

VibeVoice-Realtime-0.5B (live)

Streaming TTS tuned for fast first audio and long-form continuity.

A smaller realtime model that is easier to deploy. It supports streaming text input, produces first audible output in approximately 300 ms, and handles longer-running speech sessions.

  • Streaming text input with low perceived delay
  • Long-form robustness for multi-minute sessions
  • Public Colab path for quick experimentation

VibeVoice-TTS-1.5B (archive)

Long-form multi-speaker TTS remains part of the release history.

The original research release introduced up to 90 minutes of generated speech with support for as many as four distinct speakers, but the code was later removed from the repository after a responsible-use review.

  • Long-form conversational audio and podcast-style generation
  • Multi-speaker coherence across extended sessions
  • Presented here as research context rather than a live demo

Capability Matrix

Optimized around context depth, low-latency synthesis, and release clarity.

Instead of presenting voice AI as a single monolith, VibeVoice exposes separate model lanes for long-context ASR, realtime generation, and long-form research synthesis.

  • ASR context window: 60 min
  • Multilingual reach: 50+ langs
  • Realtime response: ~300 ms
  • Long-form TTS range: 90 min

Who / When / What

ASR output is structured, not just raw text.

The public ASR release emphasizes speaker identity, timestamps, and transcript content as one unified result surface.
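To make the "who / when / what" shape concrete, a single transcript segment could be modeled roughly as follows. This is a minimal sketch for illustration only: the `Segment` class, its field names, the helper functions, and the sample values are all hypothetical and are not taken from the actual VibeVoice output schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One hypothetical transcript segment: who spoke, when, and what."""
    speaker: str    # who
    start_s: float  # when: segment start, in seconds
    end_s: float    # when: segment end, in seconds
    text: str       # what


def format_segment(seg: Segment) -> str:
    """Render a segment as '[MM:SS-MM:SS] speaker: text'."""
    def mmss(t: float) -> str:
        minutes, seconds = divmod(int(t), 60)
        return f"{minutes:02d}:{seconds:02d}"
    return f"[{mmss(seg.start_s)}-{mmss(seg.end_s)}] {seg.speaker}: {seg.text}"


def speaking_time(segments: list[Segment]) -> dict[str, float]:
    """Sum the seconds attributed to each speaker."""
    totals: dict[str, float] = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end_s - seg.start_s)
    return totals


# Made-up sample data, not real model output.
segments = [
    Segment("Speaker 1", 0.0, 4.5, "Welcome back to the show."),
    Segment("Speaker 2", 4.5, 9.0, "Thanks for having me."),
    Segment("Speaker 1", 9.0, 12.0, "Let's get started."),
]
for seg in segments:
    print(format_segment(seg))
```

Keeping speaker, timing, and text in one record is what allows downstream steps such as per-speaker totals or subtitle export to work without re-aligning separate outputs.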

Streaming path

Realtime TTS is lighter by design.

VibeVoice-Realtime-0.5B shortens the path to quick experimentation by pairing streaming text input with a faster audible response.
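The streaming idea can be sketched without the model itself: text arrives in pieces, audio chunks are emitted as soon as each piece is processed, and time to first audio is measured at the first chunk rather than at the end. The code below is an illustrative sketch only. `fake_synthesize` is a placeholder that emits silence and does not call VibeVoice, and every function name here is an assumption made for the example.

```python
import time
from typing import Iterable, Iterator


def text_stream() -> Iterator[str]:
    """Simulate text arriving piece by piece (for example, from an LLM)."""
    for piece in ["Hello, ", "this is ", "a streaming ", "test."]:
        yield piece


def fake_synthesize(pieces: Iterable[str]) -> Iterator[bytes]:
    """Placeholder synthesizer: one silent 16-bit PCM chunk per text piece.

    A real model would return speech; this stub only mimics the streaming shape.
    """
    for piece in pieces:
        time.sleep(0.01)  # stand-in for model compute time
        yield b"\x00\x00" * (len(piece) * 100)


def time_to_first_audio(chunks: Iterator[bytes]) -> tuple[float, int]:
    """Return (seconds until the first chunk, total bytes received)."""
    start = time.perf_counter()
    first = None
    total = 0
    for chunk in chunks:
        if first is None:
            first = time.perf_counter() - start
        total += len(chunk)
    if first is None:
        raise ValueError("no audio chunks produced")
    return first, total


latency_s, total_bytes = time_to_first_audio(fake_synthesize(text_stream()))
print(f"first audio after {latency_s * 1000:.0f} ms, {total_bytes} bytes total")
```

Because both the text source and the synthesizer are generators, the first chunk is available before the full text has been consumed, which is the property a figure like "~300 ms to first audio" describes.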

Research posture

Release notes matter as much as raw capability.

The model family is framed publicly as research-oriented, with explicit cautions around production use and synthetic audio misuse.

Release Timeline

From long-form TTS research to public ASR and realtime access points.

The current public story is best understood as a sequence of model releases, demos, and responsible-use adjustments rather than a single product drop.

2026-03-06

ASR enters the Transformers ecosystem.

The public materials note a Hugging Face Transformers release path for VibeVoice-ASR integration.

2026-01-21

VibeVoice-ASR becomes the headline public release.

Single-pass 60-minute transcription, multilingual support, and a public Playground become the clearest onboarding path.

2025-12-03

Realtime-0.5B opens a streaming TTS lane.

The lighter realtime model adds a public Colab route for quicker hands-on evaluation.

2025-09-05

TTS code is removed after a responsible-use review.

The release history explicitly calls out a change in repository scope tied to safer usage boundaries.

2025-08-25

The original TTS research release defines the long-form ambition.

Up to 90 minutes of generated speech with as many as four distinct speakers sets the tone for the broader VibeVoice family.

Resources

Jump straight to docs, weights, demos, and source material.

This site packages the public entry points into one fast overview page, while the underlying model details remain with the official repository and papers.

FAQ

Questions the public release naturally raises.

Short answers for what VibeVoice is, what you can try, and how the current scope should be interpreted.

What is VibeVoice?

VibeVoice is a family of voice AI models that spans long-form ASR, realtime streaming TTS, and the historical research release of long-form multi-speaker TTS.

What can I try right now?

The public ASR Playground and the Realtime Colab are the most direct, live entry points surfaced by the current release materials.

Is the original TTS model still a live public demo?

No live TTS demo is presented here. The TTS release remains part of the public research story, but the repository notes that the code was removed after a responsible-use review.

Can I treat this as production-ready voice infrastructure?

The public release guidance presents VibeVoice as research-first and recommends further testing before real-world or commercial deployment.