Hi guys!
In the early 2000s there was a Norwegian startup (now bankrupt) doing speech recognition.
They managed to do manuscript-based recordings with over 1000 people before they went under, and all of this is now freely available for download through the Norwegian National Library.
The available data includes tens if not hundreds of thousands of audio files from over a thousand different people, with accompanying manuscripts.
Another data file contains what is supposed to be a sort of phonetic dictionary, with 784240 lines in this format:
-modigste;JJ;PLU||NOM||SUV;-modigste;JJ;INFL;NOR;;;;;"mu:$dIg$st@;2;Standard;NOR;;;;;;;;;;;;;;506694;inflector_no;Neutral;INFLECTED;-modig|38866;a2a-viktig;108;;;;;;;;;;;;;-modigste;CE030204;;762343
The phonetic dictionary is documented in this whitepaper, but I can't work out how its format compares to the phonetic dictionary used by CMUSphinx: http://www.nb.no/sbfil/dok/nst_leksdat_no.pdf
Here is a whitepaper describing the data (in Norwegian, but Google Translate should do the trick): http://www.nb.no/sbfil/dok/nst_taledat_no.pdf
I have limited knowledge of speech recognition, but I imagine a lot of the groundwork must have been laid here.
Do you think it is plausible to convert these files so they can be used to train CMUSphinx? The audio files are about 100GB.
Sure, you can use a simple Python script for that.
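To give an idea of what such a script could look like, here is a rough sketch that turns one lexicon line into a CMUSphinx-style dictionary entry (the word followed by space-separated phones). Several things here are guesses based only on the sample line above, not on the whitepaper: that the word is in the first field, that the transcription is in the twelfth field (index 11), that `"` marks stress and `$` marks a syllable boundary, and that every phone is a single character optionally followed by `:` for length. The function names are made up for illustration.

```python
# Sample line copied from the question.
SAMPLE = ('-modigste;JJ;PLU||NOM||SUV;-modigste;JJ;INFL;NOR;;;;;"mu:$dIg$st@;2;'
          'Standard;NOR;;;;;;;;;;;;;;506694;inflector_no;Neutral;INFLECTED;'
          '-modig|38866;a2a-viktig;108;;;;;;;;;;;;;-modigste;CE030204;;762343')

# Assumption: field positions are inferred from the sample line only.
WORD_FIELD = 0
TRANSCRIPTION_FIELD = 11


def parse_lexicon_line(line):
    """Return (word, transcription), or None if the line looks incomplete."""
    fields = line.rstrip("\n").split(";")
    if len(fields) <= TRANSCRIPTION_FIELD:
        return None
    word = fields[WORD_FIELD]
    transcription = fields[TRANSCRIPTION_FIELD]
    if not word or not transcription:
        return None
    return word, transcription


def transcription_to_phones(transcription):
    """Split a transcription into phones.

    Assumption: '"' is a stress mark and '$' a syllable boundary, both dropped;
    ':' lengthens the preceding phone; everything else is a one-character phone.
    """
    cleaned = transcription.replace('"', "").replace("$", "")
    phones = []
    for ch in cleaned:
        if ch == ":" and phones:
            phones[-1] += ":"  # attach the length mark to the previous phone
        else:
            phones.append(ch)
    return phones


def to_sphinx_entry(line):
    """Format one lexicon line as 'word PH1 PH2 ...', or None if unparsable."""
    parsed = parse_lexicon_line(line)
    if parsed is None:
        return None
    word, transcription = parsed
    return f"{word} {' '.join(transcription_to_phones(transcription))}"


print(to_sphinx_entry(SAMPLE))  # -modigste m u: d I g s t @
```

You would still need to check the real symbol inventory in the whitepaper (some phones may be written with more than one character), and you will probably want to map symbols like `@` and `u:` to plain alphanumeric phone names before training.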