


Voxtral Transcribe 2 is a next-generation speech-to-text model family from Mistral, delivering ultra-fast, highly accurate transcription with real-time capabilities and speaker diarization. It includes two models: Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime for live applications. Together, they support 13 languages, word-level timestamps, context biasing, and privacy-first deployment—all at industry-leading speed and cost.
Purpose-built for live transcription, Voxtral Realtime uses a novel streaming architecture that transcribes audio as it arrives. Its latency is configurable down to under 200 ms, enabling voice agents with near-offline accuracy. At a 480 ms delay, its word error rate stays within 1–2% of the batch model, bringing near-batch quality to real-time applications.
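Word error rate (WER) is the metric behind these accuracy figures. As a minimal sketch, assuming simple lowercase whitespace tokenization (published benchmarks typically apply their own text normalization, which this does not reproduce), WER can be computed as the word-level edit distance divided by the number of reference words. The function name is illustrative and not part of any Voxtral tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: lowercase + whitespace splitting stands in for the text
    normalization a real benchmark would apply.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sat on a mat" gives one substitution over six reference words, a WER of about 16.7%.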
Voxtral Mini Transcribe V2, the batch model, achieves state-of-the-art transcription quality at approximately 4% word error rate on the FLEURS benchmark and costs $0.003 per minute. It outperforms GPT-4o mini Transcribe, Gemini 2.5 Flash, Assembly Universal, and Deepgram Nova on accuracy, while processing audio about 3x faster than ElevenLabs’ Scribe v2 at one-fifth the cost.
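To make the $0.003 per minute figure concrete, here is a rough cost estimate. It assumes billing is prorated linearly by audio duration, which the text above does not confirm; actual rounding and minimum charges may differ. The function and constant names are illustrative.

```python
from decimal import Decimal

# Price quoted in the text above for the batch model.
PRICE_PER_MINUTE_USD = Decimal("0.003")


def estimate_cost_usd(duration_seconds, price_per_minute=PRICE_PER_MINUTE_USD):
    """Estimate transcription cost for a given audio duration in seconds.

    Assumption: cost scales linearly with duration (no rounding up to whole
    minutes and no minimum charge).
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    minutes = Decimal(str(duration_seconds)) / Decimal(60)
    return minutes * price_per_minute
```

Under that assumption, one hour of audio (3,600 seconds) comes to $0.18, and 1,000 hours to $180.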
Generate transcriptions with speaker labels and precise start/end times, ideal for meetings, interviews, and multi-party calls. Context biasing lets you provide up to 100 words or phrases to guide the model toward correct spellings of names, technical terms, or domain-specific vocabulary.
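Because context biasing accepts at most 100 words or phrases, it can help to clean a vocabulary list before sending it. The sketch below is a hypothetical client-side helper, not part of Mistral's SDK: it trims whitespace, drops empty entries and case-insensitive duplicates, and raises an error if more than 100 entries remain. How the list is actually passed to the API is not shown here, since the text above does not specify it.

```python
MAX_BIAS_TERMS = 100  # limit stated in the text above


def prepare_bias_terms(terms, limit=MAX_BIAS_TERMS):
    """Return a cleaned list of biasing terms, preserving first-seen order.

    Hypothetical helper: normalizes whitespace, removes blanks and
    case-insensitive duplicates, and enforces the entry limit.
    """
    seen = set()
    cleaned = []
    for term in terms:
        normalized = " ".join(term.split())  # collapse runs of whitespace
        key = normalized.casefold()
        if not normalized or key in seen:
            continue
        seen.add(key)
        cleaned.append(normalized)
    if len(cleaned) > limit:
        raise ValueError(
            f"{len(cleaned)} terms after cleaning; at most {limit} are allowed"
        )
    return cleaned
```

Raising an error instead of silently truncating is a deliberate choice in this sketch: it leaves the decision about which names or technical terms matter most to the caller.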
Voxtral Realtime ships under the Apache 2.0 license, deployable on edge for privacy-first applications. Both models natively support 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.
Voxtral Transcribe 2 delivers the lowest word error rate at the lowest price point, with real-time latency configurable to under 200 ms.
This combination of accuracy, speed, and cost efficiency is unmatched in the current market. Voxtral Mini Transcribe V2 achieves state-of-the-art transcription at $0.003 per minute, while Voxtral Realtime enables a new class of voice-first applications with a streaming architecture that doesn't compromise on quality. The open-weights release of Voxtral Realtime under Apache 2.0 further sets the family apart, allowing privacy-sensitive deployments on edge devices.
Voxtral Transcribe 2 is a good fit if you need a speech-to-text solution that balances ultra-low latency, high accuracy, and cost-effectiveness, especially for real-time voice agents, live transcription, or privacy-first applications. The open-weights model and multilingual support make it a strong choice for developers building across platforms and languages.
Maker: async_apple
Website: mistral.ai/news/voxtral-transcribe-2