While browsing through the Language and Acoustic Models folders, I found pronunciation dictionaries for different languages. Where did these come from? I'm pretty sure that the English dictionary is human edited. The Dutch dictionary, however, has 1.4 million words from multiple languages. Are the other dictionaries human edited, or are they autogenerated somehow? Can anyone comment on the accuracy of the pronunciation data?
Most dictionaries, even hand-reviewed ones, are not accurate at all. cmudict, for example, is half phonetic and half phonemic and does not really reflect the way people speak. The Dutch dictionary is probably from http://www.fon.hum.uva.nl/rob/Publications/IFAcorpusEurospeech2001.pdf or from CELEX ( https://catalog.ldc.upenn.edu/LDC96L14 ), but now that Google actively forgets things it is very hard to tell. There is also the Utwente-Kaldi project; they have another Dutch dictionary, which is probably more consistent.
Thanks for the info.
For the record though, I would attest that, for the English CMU dictionary, about 92%-98% of the words are correct. That is, they are correct given the constraints imposed by the phoneme set.
No, even the most common words are not really correct. See:
Speaking in shorthand – A syllable-centric perspective for understanding pronunciation variation
Steven Greenberg
http://www1.icsi.berkeley.edu/~steveng/PDF/SpeakingInShorthandMIME.pdf
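For anyone who wants to check a dictionary themselves, here is a minimal Python sketch of the narrower claim above: whether entries stay within the phoneme set. It assumes the classic cmudict text layout (one `WORD  PH1 PH2 ...` entry per line, `WORD(2)` for alternate pronunciations, `;;;` for comment lines) and the 39-phone ARPAbet inventory with stress digits 0-2. The function names `parse_dict` and `unknown_phones` are made up for illustration and are not part of any toolkit. It only tests consistency with the phone set; it says nothing about whether a pronunciation matches how people actually speak, which is the point of the Greenberg paper.

```python
import re
from collections import defaultdict

# The 39 ARPAbet phones used by cmudict (stress digits are stripped before lookup).
CMU_PHONES = frozenset(
    "AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
    "OW OY P R S SH T TH UH UW V W Y Z ZH".split()
)

# Alternate pronunciations are written as WORD(2), WORD(3), ...
_VARIANT = re.compile(r"\(\d+\)$")


def parse_dict(lines):
    """Map each word to a list of pronunciations (each a list of phones)."""
    entries = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # assumed comment marker
            continue
        parts = line.split()
        word = _VARIANT.sub("", parts[0])
        entries[word].append(parts[1:])
    return dict(entries)


def unknown_phones(entries, phone_set=CMU_PHONES):
    """Return {word: [phones not in phone_set]} for entries with stray symbols."""
    bad = {}
    for word, prons in entries.items():
        for pron in prons:
            unk = [p for p in pron if p.rstrip("012") not in phone_set]
            if unk:
                bad.setdefault(word, []).extend(unk)
    return bad


# Tiny made-up sample; the last line imitates a foreign entry with non-ARPAbet symbols.
sample = [
    ";;; comment line",
    "HELLO  HH AH0 L OW1",
    "HELLO(2)  HH EH0 L OW1",
    "HUIS  H UI S",
]
entries = parse_dict(sample)
print(len(entries["HELLO"]))    # number of pronunciation variants for HELLO
print(unknown_phones(entries))  # words using symbols outside the phone set
```

To run it on a real file, pass an open file handle to `parse_dict`, for example `parse_dict(open(path, encoding="latin-1"))`, where `path` and the encoding are assumptions that depend on the dictionary you downloaded. For a non-English dictionary such as the Dutch one, pass its own phone inventory as `phone_set`.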