CMU Sphinx / Forums / Help: Large Norwegian database freely available. Possible to modify for CMUSphinx?

Speech Recognition Toolkit


Forum: Help

Creator: Henrik Lied

Created: 2016-03-13

Updated: 2016-03-13

Henrik Lied - 2016-03-13

Hi guys!

In the early 2000s there was a Norwegian startup (now bankrupt) doing speech recognition.

They managed to do manuscript-based recordings with over 1000 people before they went under, and all of this is now freely available for download through the Norwegian National Library.

The data available includes tens if not hundreds of thousands of audio files from over a thousand different people, with accompanying manuscripts.

Another data file contains what is supposed to be a sort of phonetic dictionary: 784240 lines in this format:
-modigste;JJ;PLU||NOM||SUV;-modigste;JJ;INFL;NOR;;;;;"mu:$dIg$st@;2;Standard;NOR;;;;;;;;;;;;;;506694;inflector_no;Neutral;INFLECTED;-modig|38866;a2a-viktig;108;;;;;;;;;;;;;-modigste;CE030204;;762343

The phonetic dictionary is documented in this whitepaper, but I cannot understand how that compares to the phonetic dict used by CMUSphinx: http://www.nb.no/sbfil/dok/nst_leksdat_no.pdf

Here is a whitepaper describing the data (in Norwegian, but Google Translate should do the trick): http://www.nb.no/sbfil/dok/nst_taledat_no.pdf

I have limited knowledge of speech recognition, but I imagine a lot of the groundwork must have been laid down here.

Is it plausible to modify these files in order to train them in CMUSphinx, you think? The audio files are about 100GB.

- Nickolay V. Shmyrev - 2016-03-13
  
  Is it plausible to modify these files in order to train them in CMUSphinx, you think?
  
  Sure, you can use a simple Python script for that.
  
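For the lexicon, such a script might look roughly like the sketch below. It turns one line of the semicolon-separated file into a CMUSphinx-style dictionary entry (the word followed by space-separated phones). Several things here are assumptions, not taken from the NST documentation: that the spelling is field 0 and the transcription is field 11 (counted from the sample line above), that `"` marks stress and `$` marks syllable boundaries and can both be dropped, that a leading hyphen on the word can be stripped, and the phone names in `SAMPA_TO_PHONE`, which are made up for illustration and cover only the symbols in the sample line.

```python
# Sketch: convert one lexicon line into a CMUSphinx-style dictionary entry.
# Field positions and the phone table are assumptions based on the single
# sample line in this thread; check them against the lexicon documentation.

ORTH_FIELD = 0    # assumed position of the spelling
TRANS_FIELD = 11  # assumed position of the phonetic transcription

# Hypothetical mapping from transcription symbols to phone names.
# Only the symbols occurring in the sample line are listed.
SAMPA_TO_PHONE = {
    "u:": "UU",
    "m": "M",
    "d": "D",
    "I": "IH",
    "g": "G",
    "s": "S",
    "t": "T",
    "@": "AX",
}


def sampa_to_phones(sampa):
    """Split a transcription into phones by greedy longest match."""
    # Drop the stress mark and syllable boundaries (assumed meanings).
    cleaned = sampa.replace('"', "").replace("$", "")
    # Try longer symbols first so that "u:" wins over a shorter match.
    symbols = sorted(SAMPA_TO_PHONE, key=len, reverse=True)
    phones = []
    i = 0
    while i < len(cleaned):
        for sym in symbols:
            if cleaned.startswith(sym, i):
                phones.append(SAMPA_TO_PHONE[sym])
                i += len(sym)
                break
        else:
            # Fail loudly on symbols missing from the table.
            raise ValueError("unknown symbol at %r" % cleaned[i:])
    return phones


def lexicon_line_to_dict_entry(line):
    """Return 'word PHONE PHONE ...' for one lexicon line."""
    fields = line.rstrip("\n").split(";")
    # Stripping the leading hyphen is an assumption about how such
    # entries should appear in the dictionary.
    word = fields[ORTH_FIELD].lstrip("-").lower()
    phones = sampa_to_phones(fields[TRANS_FIELD])
    return word + " " + " ".join(phones)
```

Run over the whole file, the `ValueError` would point out every symbol the table does not yet cover, which is one way to build up the full phone set step by step. The audio and transcripts would need a separate conversion step that is not covered here.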
