CMU Sphinx / Forums / Speech Recognition Theory: A question about agglutinative languages

A question about agglutinative languages

Burak Kaan Bilgehan - 2018-06-07

Hello, I am trying to develop a Turkish language model for CMUSphinx, but there's a problem.

Turkish is an agglutinative language, so there are billions of possible words thanks to the suffixes. So we can't include all those words in our dictionary one by one. I think the suffixes should be included in the language model as well as the roots, so that when a word is derived from a root and suffixes, the language model allows the recognizer to detect the word(?)
But that seems quite complex to me, as I don't actually know what to do with all these suffixes. So my question is: is it better to isolate the words that my program will use and include them in my dictionary and my acoustic model, whether they are derived or simple words?
My program won't be a 10-20 word speech recognition program. I will probably need about 300-5000 words to start, and then I will add more. So what should I do now? Should I just include the words that I will use (this way I unfortunately ignore the tenses and the suffixes that change the root's meaning), or should I go for the language model and try to include all possible words that can be created in Turkish?
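
For reference, a CMUSphinx pronunciation dictionary is a plain text file with one word per line, followed by that word's phones separated by spaces. The sketch below shows how such lines could be produced for a fixed word list. It assumes a simple one-letter-to-one-phone mapping, and the phone names in `PHONES` are made up for illustration; a real dictionary has to use the phone set of the acoustic model, and letters such as "ğ" would likely need special handling.

```python
# Hypothetical phone names, one per Turkish letter. These are an assumption
# for illustration; a real setup must use the acoustic model's phone set.
PHONES = {
    "a": "A", "b": "B", "c": "C", "ç": "CH", "d": "D", "e": "E", "f": "F",
    "g": "G", "ğ": "GH", "h": "H", "ı": "IH", "i": "I", "j": "J", "k": "K",
    "l": "L", "m": "M", "n": "N", "o": "O", "ö": "OE", "p": "P", "r": "R",
    "s": "S", "ş": "SH", "t": "T", "u": "U", "ü": "UE", "v": "V", "y": "Y",
    "z": "Z",
}


def dictionary_line(word):
    """Return one dictionary entry: the word followed by its phones."""
    return word + " " + " ".join(PHONES[ch] for ch in word)


# Derived words are listed like any other word, one entry each.
for word in ["gelir", "geldi", "gelecek"]:
    print(dictionary_line(word))
```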

As one last example, let me show you how Turkish works:
"gel-mek": to come
"r": the suffix for present simple
"di": the suffix for past simple
"ecek": the suffix for future tense
"gelir": (he/she/it) comes
"geldi": (he/she/it) came
"gelecek": (he/she/it) will come
Unfortunately, it's not always that simple: notice the "i" before the "r" in the present simple example. While deriving words from roots and suffixes, some letters change, some disappear, and some just appear for no apparent reason, as in that example. There are also more than just tense suffixes: consider the dative, accusative, and other case suffixes.
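
The letter changes in the past and future suffixes mostly follow vowel harmony, so derived forms can be generated mechanically when building a word list. Below is a minimal sketch covering only those two suffixes. The function names are hypothetical, irregular stems (for example "git" becoming "gidecek") are not handled, and the present simple suffix is left out because it is less regular.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
ROUNDED_VOWELS = set("oöuü")
VOWELS = BACK_VOWELS | FRONT_VOWELS
VOICELESS = set("çfhkpsşt")


def last_vowel(stem):
    """Return the last vowel of the stem, which drives vowel harmony."""
    for ch in reversed(stem):
        if ch in VOWELS:
            return ch
    raise ValueError(f"no vowel in stem: {stem!r}")


def past_simple(stem):
    """Attach the past suffix: d or t, plus a high vowel (ı, i, u, ü)."""
    v = last_vowel(stem)
    if v in BACK_VOWELS:
        high = "u" if v in ROUNDED_VOWELS else "ı"
    else:
        high = "ü" if v in ROUNDED_VOWELS else "i"
    # The suffix starts with "t" after a voiceless consonant, else "d".
    consonant = "t" if stem[-1] in VOICELESS else "d"
    return stem + consonant + high


def future(stem):
    """Attach the future suffix: -ecek or -acak, with "y" after a vowel."""
    v = last_vowel(stem)
    low = "a" if v in BACK_VOWELS else "e"
    buffer = "y" if stem[-1] in VOWELS else ""
    return stem + buffer + low + "c" + low + "k"


for stem in ["gel", "bak", "gör", "oku"]:
    print(stem, past_simple(stem), future(stem))
```

For the stem "gel" this yields "geldi" and "gelecek", matching the examples above.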

- Nickolay V. Shmyrev - 2018-06-08
  
  So we can't include all those words to our dictionary one by one.
  
  You actually can; it is not a big problem to include a couple of million words in the dictionary.
  
  or should I go for the language model and try to include all possible words that can be created in Turkish?
  
  It is not different from other languages.
  
  - Burak Kaan Bilgehan - 2018-06-12
    
    But the thing is, there are billions of words in Turkish. You can create thousands of words using the root "gel-mek" (to come).
    
    - Nickolay V. Shmyrev - 2019-03-20
      
      That's OK, the number of words is limited anyway.
      