Hello, I am trying to develop a Turkish language model for CMUSphinx, but there's a problem.
Turkish is an agglutinative language, so there are billions of possible words thanks to the suffixes. We can't add all those words to our dictionary one by one. I think the suffixes should be included in the language model as well as the roots, so that when a word is derived from a root and suffixes, the language model allows the recognizer to detect the word(?)
But that seems quite complex to me, as I don't actually know what to do with all these suffixes. So my question is: is it better to isolate the words that my program will use and include them in my dictionary and my acoustic model, whether they are derived or simple words?
My program won't be a 10-20 word speech recognition program. I will probably need about 300-5000 words to start with, and will then add more. So what should I do now? Should I just include the words that I will use (this way I unfortunately ignore the tenses and the suffixes that change the root's meaning), or should I go for the language model and try to include all possible words that can be created in Turkish?
For one last example let me show you how Turkish works:
"gel-mek": to come
"r": the suffix for present simple
"di": the suffix for past simple
"ecek": the suffix for future tense
"gelir": (he/she/it) comes
"geldi": (he/she/it) came
"gelecek": (he/she/it) will come
Unfortunately it's not always that simple. You can spot the "i" before the "r" in the present simple example ("gelir"), for instance. When deriving words from roots and suffixes, some letters change, some disappear, and some just appear for no obvious reason, as in that example. There are also more than just tense suffixes: consider the dative, accusative, and other case suffixes.
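To make the example above concrete, here is a minimal Python sketch that builds the three forms of "gel" by plain concatenation. The table `SUFFIXES_FOR_GEL` and the function `expand` are hypothetical names introduced only for illustration. Each suffix is hard-coded in the surface form it takes after this one root (so "ir" rather than "r"); nothing here models the letter changes described above, which is exactly the part that makes a general solution hard.

```python
# Hypothetical, hand-written table: each suffix is given in the surface form
# it takes after the root "gel". No vowel harmony or letter-change rules are
# modeled here.
SUFFIXES_FOR_GEL = {
    "present simple": "ir",
    "past simple": "di",
    "future": "ecek",
}


def expand(root, suffixes):
    """Return {tense: word} by plain concatenation of root and surface suffix."""
    return {tense: root + suffix for tense, suffix in suffixes.items()}


forms = expand("gel", SUFFIXES_FOR_GEL)
for tense, word in forms.items():
    print(f"{word}\t{tense}")
```

A table like this would have to be written per root (or replaced by a real morphological generator) before it could feed a dictionary.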
You actually can. It is not a big problem to include a couple of million words in the dictionary.
It is not different from other languages.
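One way to arrive at such a word list, sketched here under assumptions, is to count the inflected word forms that actually occur in a text corpus and keep the most frequent ones. The function `build_vocabulary` and the sample string are hypothetical. The tokenizer is deliberately rough, and this is not a CMUSphinx tool: the resulting list would still need pronunciations before it could serve as a dictionary.

```python
import re
from collections import Counter


def build_vocabulary(text, max_words=5000):
    """Return the max_words most frequent word forms in text, most frequent first."""
    # \w+ with str.lower() is a rough tokenizer. Turkish dotted/dotless i
    # needs locale-aware lowercasing, which is not handled here.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return [word for word, _ in counts.most_common(max_words)]


# Tiny stand-in for a real corpus, using the word forms from this thread.
sample = "geldi gelir geldi gelecek geldi gelir"
vocabulary = build_vocabulary(sample, max_words=2)
print(vocabulary)  # ['geldi', 'gelir']
```

Because every inflected form is counted as its own word, the list grows with the corpus rather than with the number of theoretically possible derivations.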
But the thing is, there are billions of words in Turkish. You can create thousands of words using the root "gel-mek" (to come).
That's OK, the number of words is limited anyway.