
Training data for acoustic models in English

2013-01-17
2013-01-18
  • Yannick Estève

    Yannick Estève - 2013-01-17

    Hi all,
    I have seen that new Sphinx LMs for English are now being distributed.
    Since they were evaluated in terms of perplexity on the TED talks from IWSLT 2012, this reminded me that we (at LIUM) have made available the corpus we used to train acoustic models for our participation in IWSLT 2011.

    This corpus (we called it TED-LIUM) is composed of audio from TED talks together with their transcriptions: the transcriptions were filtered, time-coded, and written to STM format files.
    Contents:
    - about 118h of speech
    - 774 audio talks in NIST sphere format (SPH)
    - 774 transcripts in STM format
    - Dictionary with pronunciation (157617 words)
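
For anyone who wants to inspect the transcripts quickly, here is a minimal Python sketch for reading STM segments. It assumes the usual NIST STM layout (recording, channel, speaker, start time, end time, an optional `<...>` label, then the text) and that comment lines start with `;;`. The names `StmSegment`, `parse_stm`, and `total_hours` are illustrative only and are not part of any corpus tooling.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class StmSegment:
    recording: str
    channel: str
    speaker: str
    start: float          # segment start time in seconds
    end: float            # segment end time in seconds
    label: Optional[str]  # contents of the optional <...> field, if present
    text: str


def parse_stm(lines: Iterable[str]) -> Iterator[StmSegment]:
    """Yield one StmSegment per STM line (assumed layout, see lead-in)."""
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and ';;' comment lines.
        if not line or line.startswith(";;"):
            continue
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # too few fields to be a segment line
        recording, channel, speaker, start, end = fields[:5]
        rest = fields[5] if len(fields) > 5 else ""
        label = None
        # The label field is optional; detect it by its angle brackets.
        if rest.startswith("<") and ">" in rest:
            head, _, tail = rest.partition(">")
            label = head[1:]
            rest = tail.strip()
        yield StmSegment(recording, channel, speaker,
                         float(start), float(end), label, rest)


def total_hours(segments: Iterable[StmSegment]) -> float:
    """Sum segment durations and convert seconds to hours."""
    return sum(s.end - s.start for s in segments) / 3600.0
```

To use it on a real file, pass an open file object, for example `list(parse_stm(open(path, encoding="utf-8")))`, where `path` points to one of the STM transcripts. Summing the segment durations this way gives only the transcribed speech time, so it may differ from the raw audio length.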

    You can get it here (licensed under Creative Commons BY-NC-ND 3.0):
    http://www-lium.univ-lemans.fr/en/content/corpus

    If you use it, please share your feedback here.

    Best,
    Yannick

     
  • Nickolay V. Shmyrev

    Hi Yannick

    This is a great piece of data. The license is pretty restrictive, but I'm sure it will be useful. I was not aware of it.

    It would be nice to run a few experiments using this corpus.

     
