wav2vec2-large-xlsr-53-portuguese is an automatic speech recognition (ASR) model for Portuguese, fine-tuned on the Common Voice 6.1 dataset. It is based on Facebook’s wav2vec2-large-xlsr-53, a multilingual self-supervised model, and expects speech input sampled at 16 kHz. The model can be used without a language model (LM), though adding one lowers both word error rate (WER) and character error rate (CER): on the Common Voice test data it achieves a WER of 11.3% without an LM and 9.01% with one. Inference can be run with HuggingSound or with a custom PyTorch script using Hugging Face Transformers and Librosa. The training and evaluation scripts are open source and available on GitHub. The model is released under the Apache 2.0 license and is intended for ASR tasks in Brazilian Portuguese.
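The WER and CER figures quoted above are edit-distance ratios: the number of word-level (or character-level) substitutions, insertions, and deletions needed to turn the transcription into the reference, divided by the reference length. The following is a minimal pure-Python sketch of that calculation, not the project's own evaluation script; the function names `edit_distance`, `wer`, and `cer` are illustrative, and text normalization (casing, punctuation) is assumed to have been done beforehand.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or characters)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]  # distance to an empty hypothesis prefix
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free if tokens match)
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("o gato preto", "o gato branco")` counts one substituted word out of three reference words. Corpus-level scores are usually computed by summing errors and reference lengths over all utterances rather than averaging per-utterance rates.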
Features
- Fine-tuned on Common Voice 6.1 Portuguese dataset
- Based on Facebook’s XLSR-53 wav2vec2 large architecture
- Expects audio input sampled at 16 kHz for optimal accuracy
- Works with or without a language model (LM)
- Available via HuggingSound and Hugging Face Transformers
- Provides example code for evaluation and inference
- Achieves 9.01% WER and 3.21% CER when used with an LM
- Apache-2.0 licensed and freely usable for commercial ASR systems
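For the custom PyTorch route mentioned in the description, inference without a language model could look roughly like the sketch below. It is an illustration rather than the project's own script. The Hub identifier in `MODEL_ID` is an assumption (the description gives only the model name), and `transcribe` is a hypothetical helper. The heavy imports live inside the function so the module can be loaded without them installed.

```python
# Assumed Hugging Face Hub identifier; the description gives only the model name.
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese"
SAMPLE_RATE = 16_000  # the model expects 16 kHz input


def transcribe(paths):
    """Greedy (no-LM) transcription of a list of audio file paths."""
    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
    model.eval()

    # librosa resamples each file to SAMPLE_RATE while loading.
    speech = [librosa.load(path, sr=SAMPLE_RATE)[0] for path in paths]
    inputs = processor(
        speech, sampling_rate=SAMPLE_RATE, return_tensors="pt", padding=True
    )

    with torch.no_grad():
        logits = model(
            inputs.input_values, attention_mask=inputs.attention_mask
        ).logits

    # Pick the most likely token per frame; batch_decode collapses
    # repeated tokens and removes CTC blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)
```

Calling `transcribe(["/path/to/file.wav"])` downloads the model on first use and returns one string per input file. The placeholder path is illustrative.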