Whisper-large-v3 is OpenAI’s most advanced multilingual automatic speech recognition (ASR) and speech translation model, with 1.54 billion parameters trained on roughly 5 million hours of audio (1 million hours weakly labeled, 4 million hours pseudo-labeled). Built on a Transformer encoder-decoder architecture, it supports 99 languages and delivers significant gains in transcription accuracy, robustness to noise, and handling of diverse accents. Compared with large-v2, v3 takes a 128 Mel-bin spectrogram input (up from 80), adds a language token for Cantonese, and achieves a 10–20% reduction in errors. It performs zero-shot transcription and translation, detects the input language automatically, and supports features such as word-level timestamps and long-form audio processing. The model integrates well with Hugging Face Transformers and supports optimizations such as batching, SDPA, and Flash Attention 2.
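A minimal sketch of zero-shot transcription via the Transformers pipeline API, assuming `transformers` and `torch` are installed; `sample.wav` is a hypothetical input file, and the model weights (several GB) download on first use:

```python
# Sketch: build a Whisper-large-v3 ASR pipeline with Hugging Face Transformers.
from transformers import pipeline


def build_asr(model_id="openai/whisper-large-v3"):
    """Create an ASR pipeline; chunk_length_s enables long-form audio."""
    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # Whisper's native 30-second window
    )


if __name__ == "__main__":
    asr = build_asr()
    # return_timestamps=True adds segment timestamps to the output dict.
    result = asr("sample.wav", return_timestamps=True)
    print(result["text"])
```

The pipeline handles resampling to 16 kHz and language detection automatically; passing `generate_kwargs={"task": "translate"}` at call time would request translation into English instead of transcription.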
Features
- Supports transcription in 99 languages and speech translation into English
- Trained on ~5M hours of audio (1M weakly labeled, 4M pseudo-labeled)
- High accuracy with improved robustness to noise and accents
- Enables word- and sentence-level timestamps
- Supports long-form audio via chunked or sequential processing
- Compatible with PyTorch and JAX, including the Transformers pipeline API
- Optimized with Flash Attention 2, SDPA, and torch.compile
- Apache 2.0 licensed for flexible commercial and research use
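The long-form support listed above works by splitting audio into overlapping windows that are transcribed independently and stitched together. The windowing arithmetic can be illustrated as follows (a simplified sketch, not the library's actual implementation; the 30 s chunk and 5 s stride are illustrative defaults):

```python
# Sketch: how chunked long-form processing covers audio longer than
# Whisper's 30-second receptive field with overlapping windows.

def chunk_windows(total_s, chunk_s=30.0, stride_s=5.0):
    """Return (start, end) times of overlapping chunks covering total_s seconds.

    Each step advances by chunk_s - 2 * stride_s, so neighbouring chunks
    overlap on both sides; the overlap lets decoded text be merged cleanly
    at chunk boundaries.
    """
    step = chunk_s - 2 * stride_s
    windows = []
    start = 0.0
    while start < total_s:
        windows.append((start, min(start + chunk_s, total_s)))
        if start + chunk_s >= total_s:
            break  # this chunk already reaches the end of the audio
        start += step
    return windows
```

For example, a 65-second clip yields three windows, (0, 30), (20, 50), and (40, 65), each pair overlapping by 10 seconds. The sequential alternative instead conditions each window on the previous window's decoded text, trading parallelism for consistency across boundaries.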