
Turkish Speech Recognition

Anonymous
2012-02-13
2012-09-21
  • Anonymous

    Anonymous - 2012-02-13

    We (me and some of my friends) have decided to create a mobile application to
    do Turkish speech recognition. We read the tutorials and information about
    pocketsphinx and tested pocketsphinx for English.

    For Turkish, we started by creating a simple language model. Then we tried
    to train the acoustic model according to this link. Now I have some
    questions:

    1. Turkish has some letters like ş, ç, Ö, Ü. Would these letters be a problem for the acoustic model and language model?
    2. We couldn't really come up with a phoneset for Turkish. Turkish is read the same way it is written, but I think this feature has nothing to do with the phoneset. The link above says this about phones:

    If you don't have a phonetic book, you can just use the word's spelling and
    it gives very good results:
    ONE O N E
    TWO T W O

    Would this work for us? For example, we have both c and ç. c can be
    represented with C, and ç with CC.

    3. Last but not least, are we on the right track to recognize Turkish speech with pocketsphinx?
     
  • Nickolay V. Shmyrev

    We (me and some of my friends) have decided to create a mobile application
    to do Turkish speech recognition. We read the tutorials and information about
    pocketsphinx and tested pocketsphinx for English.

    That's great

    1. Turkish has some letters like ş, ç, Ö, Ü. Would these letters be a problem
      for the acoustic model and language model?

    No

    Would this work for us? For example, we have both c and ç. c can be
    represented with C, and ç with CC.

    Yes

    3. Last but not least, are we on the right track to recognize Turkish speech
      with pocketsphinx?

    Yes
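
The spelling-based dictionary discussed above can be sketched in a few lines of Python. This is only an illustration, not part of pocketsphinx: the doubled names for the special letters (CC, SS and so on) extend the "c is C, ç is CC" idea from the question and are assumptions, as are the helper names `turkish_lower`, `spell_out` and `dictionary_lines`. It also assumes the dictionary uses the same lowercase word forms as the transcripts.

```python
# Hypothetical phone names for the letters that have no plain ASCII form.
# Every other letter is mapped to its own uppercase ASCII name.
SPECIAL = {"ç": "CC", "ğ": "GG", "ı": "II", "ö": "OO", "ş": "SS", "ü": "UU"}


def turkish_lower(word):
    """Lowercase with Turkish rules: 'İ' becomes 'i' and 'I' becomes 'ı'."""
    return word.replace("İ", "i").replace("I", "ı").lower()


def spell_out(word):
    """Return one phone symbol per letter of the word, skipping non-letters."""
    return [SPECIAL.get(ch, ch.upper()) for ch in turkish_lower(word) if ch.isalpha()]


def dictionary_lines(words):
    """Build 'word PHONE PHONE ...' lines, one per distinct word."""
    unique = sorted({turkish_lower(w) for w in words})
    return [f"{w} {' '.join(spell_out(w))}" for w in unique]


if __name__ == "__main__":
    for line in dictionary_lines(["cam", "çam", "IŞIK", "İyi"]):
        print(line)
```

The Turkish-aware lowercasing matters because the default `str.lower()` turns `I` into `i`, which would merge the two distinct letters ı and i into one phone.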

     
  • Anonymous

    Anonymous - 2012-02-19

    Well, we completed our very first tests and everything went well. We created a
    small vocabulary of around 50 words covering every letter of the alphabet.
    As nshmyrev said, special characters didn't create a problem. We also
    realized how important training data is, and how much parameters such as
    the number of senones matter.

    Now I have some other things to ask. We are planning to create a vocabulary
    with 400-500 words (for a mobile application) to cover daily conversations.
    Then we will try to record as much data as possible to train the acoustic
    model (I am guessing around 100 people with 7-8 hours of recording).

    1. How many different sentences should we use for recordings (of course they should cover all the vocabulary)? Does the number of sentences really matter, or the number of different combinations?

    2. Should we record all these sentences in a quiet environment with good quality, or should we record some of them in noisy environments?

    3. What can you recommend different pronunciations due to accents?

    4. Finally, what should we use for CFG_FINAL_NUM_DENSITIES (4 or 8) and CFG_N_TIED_STATES (2000 - 4000)?
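
The requirement that the recording sentences cover the whole vocabulary can be checked mechanically before any recording starts. The sketch below is a minimal illustration: the function name `coverage` and the sample words are made up, and it assumes the prompts are already normalised to the same spelling and case as the vocabulary list.

```python
from collections import Counter


def coverage(vocabulary, prompts):
    """Count how often each word occurs in the prompts and list unused words."""
    counts = Counter(word for prompt in prompts for word in prompt.split())
    missing = sorted(w for w in vocabulary if counts[w] == 0)
    return counts, missing


if __name__ == "__main__":
    # Illustrative data only; a real run would load the 400-500 word list
    # and the full set of recording prompts from files.
    vocabulary = ["merhaba", "nasılsın", "evet"]
    prompts = ["merhaba nasılsın", "merhaba"]
    counts, missing = coverage(vocabulary, prompts)
    print("words never prompted:", missing)
    print("distinct prompts:", len(set(prompts)))
```

Words that show up in `missing`, or only once or twice in `counts`, are the ones that need additional sentences.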

     
  • Nickolay V. Shmyrev

    First of all, I recommend that you read the tutorial:

    http://cmusphinx.sourceforge.net/wiki/tutorialam

    It will answer some of your questions beforehand.

    1. How many different sentences should we use for recordings (of course they
      should cover all the vocabulary)?

    Ideally they should all be different; that would help to increase diversity.

    Does the number really matter or the number of different combinations?

    There is no strict dependency; however, more diversity is better than less
    diversity.

    2. Should we record all these sentences in a quiet environment with good
      quality, or should we record some of them in noisy environments?

    Noisy recordings are better

    3. What can you recommend different pronunciations due to accents?

    Sorry, it's hard to understand this question

    4. Finally, what should we use for CFG_FINAL_NUM_DENSITIES (4 or 8) and
      CFG_N_TIED_STATES (2000 - 4000)?

    You need to try all combinations and see which works better
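
Trying all combinations amounts to a small grid search over the two settings. The sketch below only enumerates the grid and formats each pair as configuration lines; running the training and measuring accuracy for each pair is left out. The helper names are hypothetical, the middle value 3000 is an assumed sample from the 2000 - 4000 range in the question, and the `$NAME = value;` line format is an assumption about how these variables appear in the training configuration file.

```python
from itertools import product


def config_grid(densities, tied_states):
    """Return one settings dict per (densities, tied states) combination."""
    return [
        {"CFG_FINAL_NUM_DENSITIES": d, "CFG_N_TIED_STATES": s}
        for d, s in product(densities, tied_states)
    ]


def as_cfg_lines(settings):
    """Format a settings dict as Perl-style assignment lines (assumed format)."""
    return [f"${name} = {value};" for name, value in settings.items()]


if __name__ == "__main__":
    # 4 and 8 come from the question; 2000/3000/4000 sample the stated range.
    for settings in config_grid([4, 8], [2000, 3000, 4000]):
        print(" ".join(as_cfg_lines(settings)))
```

Each printed pair would be trained and decoded against the same held-out test set, so that the resulting accuracy figures are comparable.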

     
  • Anonymous

    Anonymous - 2012-02-28

    Thanks for the help. Everything is going well, but I have one more question
    about recording. Should we record the speech as if we were talking normally
    in a daily conversation, or should we emphasize each word and pause a little
    between words?

     
  • Nickolay V. Shmyrev

    You should speak normally, as you do in everyday conversation.

     
  • Anonymous

    Anonymous - 2012-04-11

    We have created an acoustic model for Turkish according to this link.
    Currently we have around 500 words, 30 phones, and about 2.5 hours of
    recordings. After preparing the files, running the acoustic model script
    took about 4 minutes.

    As suggested in the "Using the Model" section, we looked at the folder
    named <your_db_name>.cd_semi_<number_of senones>. This folder is only about
    20 KB, and the whole model_parameters folder is about 2 MB. This worries me,
    because the accuracy is not as good as we expected. Would you suggest
    something for this issue? We have checked logdir, but nothing really pops out.

     
  • Nickolay V. Shmyrev

    Because the accuracy is not as good as we expected. Would you suggest
    something for this issue? We have checked logdir, but nothing really pops out.

    The tutorial has a troubleshooting section; please read it.

    The tutorial also has a recommendation for the amount of audio required to
    train the system. Please read it.

     
