mms-300m-1130-forced-aligner is a multilingual forced alignment model based on Meta’s MMS-300M wav2vec2 checkpoint, adapted for Hugging Face’s Transformers library. It aligns audio with its corresponding text across 158 languages, producing accurate word- or phoneme-level timestamps from Connectionist Temporal Classification (CTC) emissions. Compared with the TorchAudio forced alignment API, it is significantly more memory-efficient. The model integrates easily through the ctc-forced-aligner Python package and supports GPU acceleration via PyTorch. Its alignment pipeline covers audio processing, emission generation, tokenization, and span detection, making it suitable for speech analysis, transcript synchronization, and dataset creation. It is especially useful for researchers and developers working with low-resource languages or building multilingual speech systems.
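The span-detection step rests on a standard CTC trellis search: given per-frame log-probabilities, a Viterbi-style pass finds the best monotonic assignment of tokens to frames. The following is a minimal illustrative sketch in pure NumPy on synthetic emissions (simplified so each token occupies a single frame, with blanks elsewhere); it is not the library’s actual implementation, and all names in it are hypothetical:

```python
import numpy as np

def forced_align(emissions, tokens, blank=0):
    """Viterbi-style forced alignment of `tokens` to CTC emissions.

    emissions: (T, V) array of per-frame log-probabilities.
    tokens:    target token ids, in order, without blanks.
    Returns the frame index at which each token is emitted.
    Simplification: each token is emitted in exactly one frame;
    all other frames emit the blank symbol.
    """
    T, _ = emissions.shape
    N = len(tokens)
    NEG = -1e30
    # trellis[t, n]: best log-score of emitting the first n tokens in t frames
    trellis = np.full((T + 1, N + 1), NEG)
    trellis[0, 0] = 0.0
    # backpointer: True if cell (t, n) was reached by emitting a token at frame t-1
    advanced = np.zeros((T + 1, N + 1), dtype=bool)
    for t in range(T):
        for n in range(N + 1):
            if trellis[t, n] <= NEG:
                continue  # unreachable state
            stay = trellis[t, n] + emissions[t, blank]  # emit blank, no token consumed
            if stay > trellis[t + 1, n]:
                trellis[t + 1, n] = stay
                advanced[t + 1, n] = False
            if n < N:  # emit the next target token at frame t
                adv = trellis[t, n] + emissions[t, tokens[n]]
                if adv > trellis[t + 1, n + 1]:
                    trellis[t + 1, n + 1] = adv
                    advanced[t + 1, n + 1] = True
    # backtrack to recover the frame where each token was emitted
    frames = []
    t, n = T, N
    while t > 0:
        if advanced[t, n]:
            frames.append(t - 1)
            n -= 1
        t -= 1
    frames.reverse()
    return frames
```

For example, with six frames where frames 1 and 4 strongly favor tokens 2 and 3, `forced_align(emissions, [2, 3])` recovers the frame indices `[1, 4]`, which downstream code would convert into timestamps.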
## Features
- Forced alignment using CTC-based wav2vec2 emissions
- Covers 158 ISO-639-3 languages
- Compatible with Hugging Face Transformers
- Memory-efficient compared to TorchAudio’s API
- Supports GPU acceleration via PyTorch
- Outputs word-level timestamps from audio
- Easy integration via the ctc-forced-aligner Python package
- Adapted from Meta’s MMS-300M checkpoint
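The word-level timestamps above come from scaling emission-frame indices by the encoder’s frame stride. A small sketch, assuming the standard wav2vec2 setup (16 kHz input downsampled by a factor of 320, i.e. roughly 20 ms per CTC emission frame; these values are assumptions, not taken from this model card):

```python
def frames_to_seconds(frame_index, stride_samples=320, sample_rate=16_000):
    """Convert a CTC emission-frame index to seconds.

    Assumes a wav2vec2-style encoder: 16 kHz audio downsampled by 320,
    so each emission frame covers about 20 ms.
    """
    return frame_index * stride_samples / sample_rate

# A word whose aligned span covers emission frames 50-75:
start, end = frames_to_seconds(50), frames_to_seconds(75)  # → 1.0 s to 1.5 s
```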