I understand that Arpabet is well suited to English. I have some non-English dictionaries with what appear to be X-SAMPA pronunciations. Can these be used out of the box, or do I need to map the characters to Arpabet-friendly tokens, i.e. space-separated, alpha-only tokens?
E.g. I have this for the words "expert" and "institut":
ek$"spEt`
In$stI$"t}:t
Please advise!
Yes, you need to map them.
Can I create my own set of tokens or do they need to match CMUDict? E.g.:
That is, where there is an equivalent tag I must use it? And then, when there is no match, I can invent my own?
You can create your own set of tokens.
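For what it's worth, here is a minimal sketch of such a mapping in Python, using the two pronunciations from the first post. The token names on the right-hand side are invented for illustration, and the table only covers the symbols in those two words. It assumes that `$` marks a syllable boundary and that stress marks can simply be dropped; adjust if your dictionary uses them differently.

```python
# Hypothetical X-SAMPA -> alpha-only token table. The token names are
# made up for this example; they are not an official phone set.
XSAMPA_TO_TOKEN = {
    "}:": "UU",  # long close central rounded vowel
    "t`": "RT",  # retroflex t
    "e": "E",
    "E": "EH",
    "I": "IH",
    "k": "K",
    "n": "N",
    "p": "P",
    "s": "S",
    "t": "T",
}

# Symbols dropped from the output (assumption): "$" as syllable boundary,
# '"' as primary stress, "%" as secondary stress.
IGNORED = {"$", '"', "%"}


def xsampa_to_tokens(pron):
    """Convert one X-SAMPA string into space-separated alpha-only tokens."""
    # Try longer symbols first so that "t`" wins over plain "t".
    symbols = sorted(XSAMPA_TO_TOKEN, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(pron):
        if pron[i] in IGNORED:
            i += 1
            continue
        for sym in symbols:
            if pron.startswith(sym, i):
                tokens.append(XSAMPA_TO_TOKEN[sym])
                i += len(sym)
                break
        else:
            # Fail loudly on symbols missing from the table.
            raise ValueError(f"unmapped symbol at position {i}: {pron[i:]!r}")
    return " ".join(tokens)


if __name__ == "__main__":
    print("expert", xsampa_to_tokens('ek$"spEt`'))
    print("institut", xsampa_to_tokens('In$stI$"t}:t'))
```

Raising on an unmapped symbol, rather than skipping it, makes gaps in the table show up immediately when the whole dictionary is converted.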
You can also check http://www.openslr.org/29/ and https://github.com/kaldi-asr/kaldi/tree/master/egs/sprakbanken_swe
Thanks. Really appreciate your help :)
Given the resources mentioned here, what tools do I need to create something like en-us-phone.lm.bin, but for Swedish, for example? I understand that the phone LM file is a lot smaller than the original file, but I'm not sure how it was created. Is there a tutorial somewhere to get me started?
How much work do you think would be required in terms of hours/days?
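As a possible starting point, and assuming the phone LM is an ordinary n-gram model trained on phone sequences rather than words, one preparation step would be to rewrite a text corpus as phone strings using the converted dictionary. The sketch below shows only that step; the function names and the dictionary layout (word followed by space-separated phones) are assumptions, and the resulting lines would still have to be passed to an n-gram language model toolkit of your choice.

```python
def load_dictionary(lines):
    """Parse 'word PH1 PH2 ...' lines into a word -> phone list mapping."""
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, phones = parts[0].lower(), parts[1:]
        # Keep only the first pronunciation of each word (a simplification).
        lexicon.setdefault(word, phones)
    return lexicon


def sentence_to_phones(sentence, lexicon):
    """Return the sentence as one phone string, or None if a word is unknown."""
    phones = []
    for word in sentence.lower().split():
        if word not in lexicon:
            return None  # drop sentences with out-of-vocabulary words
        phones.extend(lexicon[word])
    return " ".join(phones)


if __name__ == "__main__":
    lexicon = load_dictionary([
        "expert E K S P EH RT",
        "institut IH N S T IH T UU T",
    ])
    print(sentence_to_phones("Expert institut", lexicon))
```

Since the vocabulary of such a model is just the phone set, a few dozen tokens instead of many thousands of words, that would also be consistent with the phone LM file being much smaller.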