Omnilingual-ASR is a research codebase for automatic speech recognition that generalizes across a very large number of languages using shared modeling and training recipes. It combines self-supervised audio pretraining with scalable fine-tuning so that low-resource languages can benefit from high-resource data. The project provides data preparation pipelines, training scripts, decoding utilities, and evaluation tools, so researchers can reproduce results and extend them to new language sets.

The codebase emphasizes modularity: acoustic modeling, language modeling, tokenization, and decoding are separable pieces that can be swapped or ablated independently (see the sketch below). The goal is practical multilingual ASR that is robust to accents, code-switching, and domain shift, rather than a collection of per-language systems. For practitioners, it is a starting point for studying transfer, zero-shot behavior, and the trade-offs between model size, compute cost, and language coverage.
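The modular split might look like the following minimal sketch. The `AcousticModel`, `Decoder`, and `Tokenizer` interfaces and the `transcribe` helper are illustrative placeholders, not this repo's actual API; they only show how separable stages compose.

```python
from typing import Protocol, Sequence

# Hypothetical interfaces for the modular split described above; these names
# are illustrative placeholders, not the actual API of this repository.

class AcousticModel(Protocol):
    def forward(self, audio: Sequence[float]) -> list[list[float]]:
        """Map raw audio to per-frame token log-probabilities."""

class Decoder(Protocol):
    def decode(self, logprobs: list[list[float]]) -> list[int]:
        """Turn per-frame log-probabilities into a token-id hypothesis."""

class Tokenizer(Protocol):
    def detokenize(self, ids: list[int]) -> str:
        """Map token ids back to text."""

def transcribe(audio: Sequence[float], am: AcousticModel,
               decoder: Decoder, tok: Tokenizer) -> str:
    # Because each stage only sees its neighbor's output, any one of them
    # can be swapped or ablated without touching the others.
    return tok.detokenize(decoder.decode(am.forward(audio)))
```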
Features
- End-to-end training recipes with self-supervised pretraining and multilingual fine-tuning
- Data prep scripts for large, heterogeneous corpora and shared multilingual tokenization (see the vocabulary sketch after this list)
- Decoding pipelines with configurable beam search and language-model fusion (see the shallow-fusion sketch after this list)
- Evaluation utilities covering WER/CER with per-language breakdowns (see the WER sketch after this list)
- Modular components to swap acoustic models, tokenizers, or decoders
- Support for distributed training to scale experiments on modern accelerators
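As a rough illustration of shared multilingual tokenization (the data prep bullet above), here is a toy character-level vocabulary builder that pools symbols across languages. It is a stand-in for the subword tokenizers typically used in practice, not this repo's pipeline; `build_shared_vocab` and its arguments are hypothetical.

```python
from collections import Counter

def build_shared_vocab(corpora: dict[str, list[str]],
                       vocab_size: int = 200) -> dict[str, int]:
    # Pool character counts across every language so all languages share one
    # symbol inventory; symbols outside the budget fall back to <unk>.
    counts = Counter(ch for sents in corpora.values() for s in sents for ch in s)
    specials = ["<pad>", "<unk>", "<s>", "</s>"]
    chars = [c for c, _ in counts.most_common(vocab_size - len(specials))]
    return {tok: i for i, tok in enumerate(specials + chars)}

vocab = build_shared_vocab({
    "eng": ["hello world"],
    "deu": ["hallo welt"],
})
print(len(vocab), vocab["<unk>"])  # one shared inventory; '<unk>' at index 1
```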
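For the decoding bullet, the following is a minimal sketch of shallow fusion, a common way to combine an acoustic model with an external language model during beam search: the LM's log-probability is added to the acoustic score with a tunable weight. The `lm_logprob` callable and the per-frame `dict` format are simplifying assumptions, not this repo's decoder interface.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    tokens: list = field(default_factory=list)
    score: float = 0.0  # fused log-probability

def beam_search(step_logprobs, lm_logprob, beam_size=4, lm_weight=0.3):
    # step_logprobs: one dict per frame mapping token -> acoustic log-prob.
    beams = [Hypothesis()]
    for frame in step_logprobs:
        candidates = []
        for hyp in beams:
            for token, am_lp in frame.items():
                # Shallow fusion: acoustic score plus weighted LM score.
                fused = hyp.score + am_lp + lm_weight * lm_logprob(hyp.tokens, token)
                candidates.append(Hypothesis(hyp.tokens + [token], fused))
        beams = sorted(candidates, key=lambda h: h.score, reverse=True)[:beam_size]
    return beams[0].tokens

# Toy usage: a uniform LM over a 3-token vocabulary and two frames.
def uniform_lm(context, token):
    return math.log(1 / 3)

frames = [{"a": -0.1, "b": -2.0, "c": -3.0},
          {"a": -1.5, "b": -0.2, "c": -2.5}]
print(beam_search(frames, uniform_lm))  # -> ['a', 'b']
```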
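For the evaluation bullet, per-language WER reduces to word-level edit distance normalized by reference length, aggregated per language. This sketch is self-contained and illustrative; the repo's own evaluation utilities may differ.

```python
from collections import defaultdict

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    # Standard Levenshtein distance via dynamic programming, one row at a time.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def wer_by_language(samples):
    # samples: iterable of (language, reference, hypothesis) strings.
    errs, refs = defaultdict(int), defaultdict(int)
    for lang, ref, hyp in samples:
        r, h = ref.split(), hyp.split()
        errs[lang] += edit_distance(r, h)
        refs[lang] += len(r)
    return {lang: errs[lang] / max(refs[lang], 1) for lang in refs}

print(wer_by_language([
    ("eng", "hello world", "hello word"),
    ("deu", "guten morgen", "guten morgen"),
]))  # -> {'eng': 0.5, 'deu': 0.0}
```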