Training data for acoustic models in English

Speech Recognition Toolkit

Forum: Speech Recognition Theory

Creator: Yannick Estève

Created: 2013-01-17

Updated: 2013-01-18

Yannick Estève - 2013-01-17

Hi all,
I have seen that new Sphinx LMs for English are now being distributed.
Since they were evaluated in terms of perplexity on the TED talks from IWSLT 2012, this reminded me that we (at LIUM) have made available the corpus we used to train the acoustic models for our participation in IWSLT 2011.

This corpus (we called it TED-LIUM) consists of audio from TED talks together with their transcriptions: the transcriptions were filtered, time-coded, and put into STM format files.
Contents:
- about 118 hours of speech
- 774 audio talks in NIST SPHERE format (SPH)
- 774 transcripts in STM format
- a pronunciation dictionary (157,617 words)
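
For readers who have not worked with STM files before, here is a minimal sketch of how the time-coded transcripts could be read and the amount of speech summed up. It is not part of the corpus distribution. It assumes the usual NIST STM layout (file name, channel, speaker, start time, end time, an optional <label> field, then the transcript) and that comment lines start with ";;". The names StmSegment, parse_stm_line and total_hours are hypothetical, and a transcript whose first word is itself written in angle brackets would be mistaken for a label.

```python
from typing import Iterable, NamedTuple, Optional


class StmSegment(NamedTuple):
    """One time-coded segment from an STM file (hypothetical helper type)."""
    filename: str
    channel: str
    speaker: str
    start: float          # segment start, in seconds
    end: float            # segment end, in seconds
    label: Optional[str]  # optional "<...>" field, kept verbatim
    transcript: str


def parse_stm_line(line: str) -> Optional[StmSegment]:
    """Parse one STM line; return None for blank lines and ';;' comments."""
    line = line.strip()
    if not line or line.startswith(";;"):
        return None
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"not an STM segment line: {line!r}")
    filename, channel, speaker = fields[:3]
    start, end = float(fields[3]), float(fields[4])
    rest = fields[5:]
    label = None
    # Assumption: a first token wrapped in angle brackets is the label field.
    if rest and rest[0].startswith("<") and rest[0].endswith(">"):
        label = rest[0]
        rest = rest[1:]
    return StmSegment(filename, channel, speaker, start, end, label, " ".join(rest))


def total_hours(lines: Iterable[str]) -> float:
    """Sum the segment durations over an iterable of STM lines, in hours."""
    seconds = 0.0
    for line in lines:
        segment = parse_stm_line(line)
        if segment is not None:
            seconds += segment.end - segment.start
    return seconds / 3600.0
```

Calling total_hours on an open STM file (iterating over its lines) gives the time-coded duration of that talk; summing over all transcripts would be one way to check the amount of speech against the figure quoted above.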

You can get it here (licensed under Creative Commons BY-NC-ND 3.0):
http://www-lium.univ-lemans.fr/en/content/corpus

If you use it, please share your feedback here.

Best,
Yannick

Nickolay V. Shmyrev - 2013-01-18

Hi Yannick

This is a great piece of data. The license is pretty restrictive, but I'm sure it will be useful. I was not aware of it.

It would be nice to run a few experiments using this corpus.
