Hi all,
I have seen that new Sphinx LMs for the English language are now being distributed.
Since they were evaluated in terms of perplexity on the TED talks from IWSLT 2012, this reminds me that we (at LIUM) have made available the corpus we used to train the acoustic models for our participation in IWSLT 2011.
This corpus (we called it TED-LIUM) consists of audio from TED talks with the corresponding transcriptions: the transcriptions were filtered, time-coded, and stored as STM format files.
Contents:
- about 118h of speech
- 774 audio talks in NIST sphere format (SPH)
- 774 transcripts in STM format
- Dictionary with pronunciation (157617 words)
You can get it here (licensed under Creative Commons BY-NC-ND 3.0):
http://www-lium.univ-lemans.fr/en/content/corpus
If you use it, please give some feedback here.
Best,
Yannick
Hi Yannick
This is a great piece of data. The license is pretty restrictive, but I'm sure it will be useful. I was not aware of it.
It would be nice to run a few experiments using this corpus.
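As a possible starting point for such experiments, here is a minimal Python sketch for reading STM transcripts and summing the segment durations, for example to check the "about 118h" figure against the files. It assumes the usual NIST STM layout (waveform name, channel, speaker, start time, end time, optional <label>, transcript) and that lines starting with ";;" are comments; I have not checked this against the TED-LIUM files themselves. The names StmSegment, parse_stm_line and total_hours are made up for this example.

```python
from typing import Iterable, NamedTuple, Optional


class StmSegment(NamedTuple):
    """One time-coded segment from an STM file (assumed NIST STM layout)."""
    waveform: str
    channel: str
    speaker: str
    start: float          # segment start, in seconds
    end: float            # segment end, in seconds
    label: Optional[str]  # optional "<...>" field, kept verbatim
    text: str             # transcript of the segment


def parse_stm_line(line: str) -> Optional[StmSegment]:
    """Parse one STM line; return None for blank lines and ';;' comments."""
    line = line.strip()
    if not line or line.startswith(";;"):
        return None
    # The first five fields are whitespace-separated; the rest is label + text.
    parts = line.split(None, 5)
    if len(parts) < 5:
        raise ValueError(f"not enough fields in STM line: {line!r}")
    waveform, channel, speaker, start, end = parts[:5]
    rest = parts[5] if len(parts) > 5 else ""
    label = None
    # Assumption: an optional label, if present, is a single "<...>" token.
    if rest.startswith("<"):
        close = rest.find(">")
        if close != -1:
            label = rest[:close + 1]
            rest = rest[close + 1:].strip()
    return StmSegment(waveform, channel, speaker,
                      float(start), float(end), label, rest)


def total_hours(lines: Iterable[str]) -> float:
    """Sum the durations of all segments, in hours."""
    seconds = 0.0
    for line in lines:
        seg = parse_stm_line(line)
        if seg is not None:
            seconds += seg.end - seg.start
    return seconds / 3600.0
```

To use it on one transcript, something like `with open(path, encoding="utf-8") as f: print(total_hours(f))` should work, where `path` points to an STM file from the corpus. The file encoding is also an assumption.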