We (me and some of my friends) have decided to create a mobile application to
do Turkish speech recognition. We read the tutorials and information about
pocketsphinx and tested pocketsphinx for English.
For Turkish, we started by creating a simple language model. Then we tried
to train the acoustic model according to this link. Now I have some
questions:
Turkish has some letters like ş, ç, Ö, Ü. Would these letters be a problem for the acoustic model and the language model?
We couldn't really come up with a phoneset for Turkish. In Turkish, words are read the same way they are written, but I am not sure this property helps when choosing a phoneset. The link above has something about phones, saying:
If you don't have a phonetic book, you can just use the word's spelling and
it gives very good results:
ONE O N E
TWO T W O
Would this work for us? For example, we have both c and ç. c can be
represented with C, and the latter with CC.
Last but not least, are we on the right track to recognize Turkish speech with pocketsphinx?
We (me and some of my friends) have decided to create a mobile application
to do Turkish speech recognition. We read the tutorials and information about
pocketsphinx and tested pocketsphinx for English.
That's great
Turkish has some letters like ş, ç, Ö, Ü. Would these letters be a problem
for the acoustic model and the language model?
No
Would this work for us? For example, we have both c and ç. c can be
represented with C, and the latter with CC.
Yes
Last but not least, are we on the right track to recognize Turkish speech
with pocketsphinx?
Yes
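The spelling-based dictionary confirmed above could be generated with a short script. This is only a sketch: apart from C and CC, which come from the question, every phone name in the table (GG, II, OO, SS, UU and so on) is a made-up placeholder, and the helper names are hypothetical.

```python
# Hypothetical letter-to-phone table for a spelling-based dictionary.
# Only "c" -> C and "ç" -> CC come from the thread; the other names for
# the special letters are placeholders chosen to stay ASCII-only.
PHONES = {
    "a": "A", "b": "B", "c": "C", "ç": "CC", "d": "D", "e": "E", "f": "F",
    "g": "G", "ğ": "GG", "h": "H", "ı": "II", "i": "I", "j": "J", "k": "K",
    "l": "L", "m": "M", "n": "N", "o": "O", "ö": "OO", "p": "P", "r": "R",
    "s": "S", "ş": "SS", "t": "T", "u": "U", "ü": "UU", "v": "V", "y": "Y",
    "z": "Z",
}


def turkish_lower(word):
    """Lowercase with Turkish rules: I -> ı and İ -> i."""
    return word.replace("I", "ı").replace("İ", "i").lower()


def spell_out(word):
    """Return the phone string for a word, one phone per letter."""
    return " ".join(PHONES[ch] for ch in turkish_lower(word))


def dictionary_lines(words):
    """Build 'WORD PHONE PHONE ...' lines, one per distinct word."""
    return [f"{word} {spell_out(word)}" for word in sorted(set(words))]


if __name__ == "__main__":
    for line in dictionary_lines(["çocuk", "cam", "ışık"]):
        print(line)
```

The phone list file would then contain the distinct values of the table, usually together with a silence phone such as SIL.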
Anonymous
-
2012-02-19
Well, we completed our very first tests and everything went well. We created a
small vocabulary of around 50 words covering all characters of the
alphabet. As nshmyrev said, special characters didn't create a problem. We also
realized how important training data is, and how much parameters such as
the number of senones affect the result.
Now I have some other things to ask. We are planning to create a vocabulary
of 400-500 words (for a mobile application) to cover daily conversations.
Then we will try to record as much data as possible to train the acoustic model
(I am guessing around 100 people with 7-8 hours of recording).
How many different sentences should we use for recordings (of course they should cover all the vocabulary)? Does the number of sentences really matter, or the number of different combinations?
Should we record all these sentences in a quiet environment with good quality, or should we record some of them in noisy environments?
What would you recommend for handling different pronunciations due to accents?
Finally, what should we use for CFG_FINAL_NUM_DENSITIES (4 or 8) and CFG_N_TIED_STATES (2000 - 4000)?
How many different sentences should we use for recordings (of course they
should cover all the vocabulary)?
Ideally they should all be different; that would help to increase diversity.
Does the number of sentences really matter, or the number of different combinations?
There is no strict dependency; however, more diversity is better than less
diversity.
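Both points (covering the whole vocabulary and keeping the prompts diverse) can be checked before recording. A minimal sketch, assuming the prompts are plain whitespace-separated text in the same letter case as the vocabulary; the function names are hypothetical:

```python
def uncovered_words(vocabulary, sentences):
    """Return vocabulary words that appear in none of the sentences."""
    seen = set()
    for sentence in sentences:
        seen.update(sentence.split())
    return sorted(set(vocabulary) - seen)


def distinct_sentence_count(sentences):
    """Count how many of the prompts are actually different."""
    return len(set(sentences))


if __name__ == "__main__":
    vocabulary = ["bir", "iki", "üç"]
    prompts = ["bir iki", "iki bir", "bir iki"]
    print("missing:", uncovered_words(vocabulary, prompts))
    print("distinct prompts:", distinct_sentence_count(prompts))
```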
Should we record all these sentences in a quiet environment with good
quality or should we record some of them in such noisy environments?
Noisy recordings are better
What would you recommend for handling different pronunciations due to accents?
Sorry, it's hard to understand this question
Finally, what should we use for CFG_FINAL_NUM_DENSITIES (4 or 8) and
CFG_N_TIED_STATES (2000 - 4000)?
You need to try all combinations and see which works better
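Trying all combinations can be organized as a small grid. The sketch below only enumerates the settings and picks the best one from error rates you measured yourself; the intermediate value 3000 and the helper names are assumptions, and the training and decoding runs are not shown.

```python
from itertools import product

# Candidate values from the question; 3000 is an assumed intermediate step.
DENSITIES = [4, 8]
TIED_STATES = [2000, 3000, 4000]


def config_grid(densities=DENSITIES, tied_states=TIED_STATES):
    """List every (densities, tied states) pair to train and test."""
    return [
        {"CFG_FINAL_NUM_DENSITIES": d, "CFG_N_TIED_STATES": s}
        for d, s in product(densities, tied_states)
    ]


def best_setting(error_by_setting):
    """Return the (densities, tied states) key with the lowest measured error."""
    return min(error_by_setting, key=error_by_setting.get)


if __name__ == "__main__":
    for setting in config_grid():
        print(setting)
```

After training and decoding a held-out test set with each setting, pass a dict such as `{(densities, tied_states): word_error_rate}` to `best_setting`.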
Anonymous
-
2012-02-28
Thanks for the help. Everything is going well, but I have one more question about
recording. Should we record the speech as if we were talking normally in a
daily conversation, or should we emphasize each word and wait a little bit
between words?
You should speak normally, as you do in usual conversations
Anonymous
-
2012-04-11
We have created an acoustic model for Turkish according to this
link. Currently we have
around 500 words, 30 phones, and ~2.5 hours of recording. After preparing
the files etc., running the acoustic model script took about 4 minutes.
As suggested in the "Using the Model" section, we looked at the folder
named <your_db_name>.cd_semi_<number_of senones>. This folder is about 20 KB. The
whole model_parameters folder is ~2 MB. This makes me uneasy, because the
accuracy is not as good as we expected. Would you suggest something for this
issue? We have checked logdir, but nothing really stands out.
First of all, I recommend that you read the tutorial:
http://cmusphinx.sourceforge.net/wiki/tutorialam
It will answer some of your questions beforehand
The tutorial has a troubleshooting section; please read it.
The tutorial also has recommendations for the amount of audio required to train the
system. Please read them.
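To compare your own data against the tutorial's recommendation, it helps to measure how much audio you actually have. A minimal sketch using Python's standard `wave` module, assuming the recordings are uncompressed WAV files; the `wav/` directory in the usage part is an assumption about your layout:

```python
import glob
import wave


def wav_seconds(path):
    """Duration of one WAV file in seconds (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_hours(paths):
    """Total duration of all given WAV files, in hours."""
    return sum(wav_seconds(p) for p in paths) / 3600.0


if __name__ == "__main__":
    # Assumed layout: all training recordings somewhere under wav/.
    files = glob.glob("wav/**/*.wav", recursive=True)
    print(f"{len(files)} files, {total_hours(files):.2f} hours")
```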