
Large Norwegian database freely available. Possible to modify for CMUSphinx?

2016-03-13
  • Henrik Lied

    Henrik Lied - 2016-03-13

    Hi guys!

    In the early 2000s there was a Norwegian startup (now bankrupt) doing speech recognition.

    They managed to make manuscript-based recordings of over 1000 people before they went under, and all of this is now freely available for download through the Norwegian National Library.

    The available data includes tens if not hundreds of thousands of audio files from over a thousand different people, with accompanying manuscripts.

    Another data file contains what is supposed to be a sort of phonetic dictionary, with 784240 lines in this format:
    -modigste;JJ;PLU||NOM||SUV;-modigste;JJ;INFL;NOR;;;;;"mu:$dIg$st@;2;Standard;NOR;;;;;;;;;;;;;;506694;inflector_no;Neutral;INFLECTED;-modig|38866;a2a-viktig;108;;;;;;;;;;;;;-modigste;CE030204;;762343

    The phonetic dictionary is documented in this whitepaper, but I cannot work out how it compares to the phonetic dictionary used by CMUSphinx: http://www.nb.no/sbfil/dok/nst_leksdat_no.pdf

    Here is a whitepaper describing the data (in Norwegian, but Google Translate should do the trick): http://www.nb.no/sbfil/dok/nst_taledat_no.pdf

    I have limited knowledge of speech recognition, but I imagine a lot of the groundwork must have been laid here.

    Is it plausible to modify these files in order to train them in CMUSphinx, you think? The audio files are about 100GB.

     
    • Nickolay V. Shmyrev

      Is it plausible to modify these files in order to train them in CMUSphinx, you think?

      Sure, you can use a simple Python script for that.
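
A minimal sketch of what such a script might look like for the lexicon file, turning one semicolon-separated line into a CMUSphinx-style dictionary entry (a word followed by space-separated phones). The field positions are read off the single sample line quoted in the question, and the phone splitting is a deliberate simplification. Both are assumptions that should be checked against the lexicon whitepaper, and the function names are hypothetical.

```python
import re

# The sample lexicon line quoted in the original post.
SAMPLE = ('-modigste;JJ;PLU||NOM||SUV;-modigste;JJ;INFL;NOR;;;;;"mu:$dIg$st@;2;'
          'Standard;NOR;;;;;;;;;;;;;;506694;inflector_no;Neutral;INFLECTED;'
          '-modig|38866;a2a-viktig;108;;;;;;;;;;;;;-modigste;CE030204;;762343')

# Assumption: splitting on ';', field 0 is the word form and field 11 is the
# transcription, as in the sample line. Verify against the lexicon whitepaper.
WORD_FIELD = 0
TRANSCRIPTION_FIELD = 11


def sampa_to_phones(transcription):
    """Split a SAMPA-like transcription into phone symbols (simplified)."""
    # Assumption: '"' and '%' are stress/tone marks and '$' is a syllable
    # boundary; none of them are phones, so they are dropped with whitespace.
    cleaned = re.sub(r'["%$\s]', "", transcription)
    # Assumption: each phone is one character, optionally followed by the
    # length mark ':'. Multi-character symbols would need an explicit table.
    return re.findall(r".:?", cleaned)


def lexicon_line_to_dict_entry(line):
    """Return 'word phone phone ...' for one lexicon line, or None to skip it."""
    fields = line.rstrip("\n").split(";")
    if len(fields) <= TRANSCRIPTION_FIELD:
        return None  # too few fields to contain a transcription
    word = fields[WORD_FIELD].strip().lower()
    phones = sampa_to_phones(fields[TRANSCRIPTION_FIELD])
    if not word or not phones:
        return None
    return word + " " + " ".join(phones)


if __name__ == "__main__":
    print(lexicon_line_to_dict_entry(SAMPLE))
```

For the sample line this prints `-modigste m u: d I g s t @`. The resulting symbols would likely still need to be mapped to whatever phone set is used for training, and the audio transcripts need a separate conversion step that is not covered here.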

       
