Open Speech Corpora is a curated catalog of speech datasets for research and development in automatic speech recognition (ASR), text-to-speech (TTS), and other speech technologies. The repository is organized as a set of tables that list each corpus with its language, total hours, number of speakers, download link, and license, so practitioners can quickly find data that matches their needs. It emphasizes free and truly “open” datasets, favoring those released under Creative Commons or community-friendly data licenses, though it also lists corpora that are accessible for research and many commercial uses. The catalog covers well-known resources such as Mozilla Common Voice, Yesno, LJ Speech, and numerous Nordic and parliamentary speech corpora, and notes the license each is released under, such as CC-0 or CC-BY. It is actively maintained as a community resource: users are encouraged to propose new corpora via issues, and there is a backlog of datasets waiting to be integrated.
Features
- Centralized catalog of speech corpora for ASR, TTS and related tasks
- Detailed metadata including language, duration, speakers, download links and licenses
- Emphasis on free and open datasets suitable for research and many commercial uses
- Coverage of popular corpora like Common Voice, LJ Speech and multiple Nordic resources
- Community-driven updates via issues and pull requests to keep the list evolving
- License-based grouping (CC-0, CC-BY and more) to simplify compliance and dataset selection
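
As a rough illustration of how the metadata fields listed above (language, duration, speakers, download link, license) could be used for dataset selection, here is a minimal Python sketch. It assumes the catalog rows have already been loaded into memory. The `Corpus` class, its field names, the `select` and `group_by_license` helpers, and the three placeholder rows are all hypothetical and are not taken from the repository.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    """One row of a hypothetical catalog table (field names are assumptions)."""
    name: str
    language: str
    hours: float
    speakers: int
    url: str
    license: str


# Placeholder rows for illustration only; these are not real catalog entries.
CATALOG = [
    Corpus("Example Corpus A", "en", 10.0, 5, "https://example.org/a", "CC-0"),
    Corpus("Example Corpus B", "sv", 25.5, 40, "https://example.org/b", "CC-BY"),
    Corpus("Example Corpus C", "en", 3.0, 1, "https://example.org/c", "CC-BY"),
]


def group_by_license(corpora):
    """Return a dict mapping each license string to the corpora released under it."""
    groups = defaultdict(list)
    for corpus in corpora:
        groups[corpus.license].append(corpus)
    return dict(groups)


def select(corpora, language=None, licenses=None, min_hours=0.0):
    """Filter corpora by language, allowed licenses, and minimum duration."""
    return [
        c for c in corpora
        if (language is None or c.language == language)
        and (licenses is None or c.license in licenses)
        and c.hours >= min_hours
    ]


if __name__ == "__main__":
    # English corpora of at least 5 hours under either permissive license.
    for corpus in select(CATALOG, language="en",
                         licenses={"CC-0", "CC-BY"}, min_hours=5):
        print(corpus.name, corpus.license, corpus.url)
```

Keeping the license as an explicit field on every row lets the same data drive both views: a per-license grouping for compliance review and an ad hoc filter for picking training data.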