
About training Acoustic Model

Help
2016-07-05
2016-07-14
  • stevenyslin

    stevenyslin - 2016-07-05

    Hello,

    I have two questions about training on voice data:

    Q1:
    I want to train an HMM model on my own voice,
    but I found that each of my wav files is only one or two seconds long,

    for example:

    <s> apple </s> (arctic_0001)
    <s> banana </s> (arctic_0002)
    <s> cat </s> (arctic_0003)
    <s> dog </s> (arctic_0004)
    <s> egg </s> (arctic_0005)
    <s> fish </s> (arctic_0006)
    

    But this tutorial,
    http://cmusphinx.sourceforge.net/wiki/tutorialam
    says that the audio length should be not less than 5 seconds and not more than 30 seconds.

    My question:
    Will this have a big effect on training?
    If so, is there a solution to this problem?

    Q2:
    I asked two people (person_A and person_B) to each record data for their own HMM model, in a quiet room.
    My dictionary has eight words; each word is spoken five times, so one group has forty wav files.
    In total I have 400 training files and 200 testing files.

    My question:
    (1) For person_A, training seems to converge quickly with good accuracy: from 280 to 400 training files, recognition accuracy on the training set does not drop. For person_B, however, training-set accuracy drops between 320 and 400 files, and accuracy on the test data is only about 80%.
    What causes the accuracy drop for person_B, and is there anything we should pay special attention to?

    (2) Is there over-training when training an HMM?
    When I use more than 240 files, the accuracy drops.

    person_A:
    http://imgur.com/egvqxo0
    person_B:
    http://imgur.com/6Ldt6YW

    Thanks for your help.

     

    Last edit: stevenyslin 2016-07-05
    • Nickolay V. Shmyrev

      Will this have a big effect on training?

      I updated the tutorial with a clearer description:

      Audio recordings should contain training audio, and that training audio should match the audio you want to recognize. In case of mismatch there can be a drop in accuracy, sometimes a significant one. For example, if you want to recognize continuous speech, your training database should contain continuous speech. For continuous speech the audio files should be neither very long nor very short; the optimal length is not less than 5 seconds and not more than 30 seconds. Very long files make training much harder. If you are going to recognize short isolated commands, your training database should contain files with short isolated commands. It is better, though, to design the database for continuous speech from the beginning and not spend your time on commands; in the end you speak continuously anyway. The amount of silence at the beginning and at the end of each utterance should not exceed 0.2 seconds.
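A quick way to check a set of recordings against the length guideline above is to compute each file's duration with Python's standard wave module. This is only a sketch: the wav/ directory name is a placeholder for wherever the training audio actually lives.

```python
import glob
import wave

def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# "wav/" is just a placeholder for the training-audio directory.
for path in sorted(glob.glob("wav/*.wav")):
    d = wav_duration(path)
    flag = "" if 5.0 <= d <= 30.0 else "  <-- outside the 5-30 s guideline"
    print(f"{path}: {d:.2f} s{flag}")
```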

      (1) For person_A, training seems to converge quickly with good accuracy: from 280 to 400 training files, recognition accuracy on the training set does not drop. For person_B, however, training-set accuracy drops between 320 and 400 files, and accuracy on the test data is only about 80%.
      What causes the accuracy drop for person_B, and is there anything we should pay special attention to?

      You do not have enough training data for the number of parameters you train, so the accuracy is unstable. You need to increase the data size or reduce the number of parameters.

      (2) Is there over-training when training an HMM? When I use more than 240 files, the accuracy drops.

      Yes.

       
  • stevenyslin

    stevenyslin - 2016-07-05

    Dear sir,

    Thanks a lot. I have new questions about Q1 and Q2:

    Q1:

    Amount of silence in the beginning of the utterance and in the end of the utterance should not exceed 0.2 second.

    (1) How can I keep the silence under 0.2 seconds?
    I record with the Linux rec utility for a fixed length of time.
    (2) Do both the training and testing data need to keep silence under 0.2 seconds, or only the training data?

    Q2:

      You do not have enough training data for the number of parameters you train, so the accuracy is unstable. You need to increase the data size or reduce the number of parameters.

    (1) How much training data should I have?
    Since I only want to recognize eight words spoken by myself, do I need more than one hour of data for training?
    (2) Sorry, I don't understand "reduce the number of parameters"; do you mean modifying "sphinx_train.cfg"?

    Thanks for your help again.

     

    Last edit: stevenyslin 2016-07-05
    • Nickolay V. Shmyrev

      (1) How can I keep the silence under 0.2 seconds?
      I record with the Linux rec utility for a fixed length of time.

      Cut the files in an audio editor, or use sox's trim/silence effects to trim the silence.
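If sox is not at hand, the same idea can be sketched with Python's standard library alone. This is only an illustration, not the project's tooling: it assumes 16-bit mono WAV input, and the 500-sample amplitude threshold is an arbitrary assumption that would need tuning per recording setup.

```python
import struct
import wave

def trim_silence(in_path, out_path, thresh=500, keep=0.2):
    """Trim leading/trailing silence from a 16-bit mono WAV,
    keeping about `keep` seconds of silence on each side.
    `thresh` is an assumed amplitude threshold for "speech"."""
    with wave.open(in_path, "rb") as w:
        params = w.getparams()
        rate = w.getframerate()
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    # Indices of samples loud enough to count as speech.
    above = [i for i, s in enumerate(samples) if abs(s) > thresh]
    if not above:
        start, end = 0, len(samples)  # nothing detected; keep everything
    else:
        pad = int(keep * rate)
        start = max(0, above[0] - pad)
        end = min(len(samples), above[-1] + 1 + pad)
    with wave.open(out_path, "wb") as w:
        w.setparams(params)
        w.writeframes(struct.pack("<%dh" % (end - start), *samples[start:end]))
```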

      (2) Do both the training and testing data need to keep silence under 0.2 seconds, or only the training data?

      Both

      (1) How much training data should I have?
      Since I only want to recognize eight words spoken by myself, do I need more than one hour of data for training?

      Yes; you need to follow the tutorial.

      (2) Sorry, I don't understand "reduce the number of parameters"; do you mean modifying "sphinx_train.cfg"?

      Yes, this is also covered in the tutorial.
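For orientation, the model's parameter count in sphinx_train.cfg is governed mainly by the number of tied states (senones) and Gaussian densities. The values below only illustrate which settings to look at; they are not recommendations for this task:

```perl
# etc/sphinx_train.cfg -- illustrative values only
$CFG_N_TIED_STATES = 200;        # fewer senones = fewer parameters
$CFG_INITIAL_NUM_DENSITIES = 1;  # Gaussians per state at the start of training
$CFG_FINAL_NUM_DENSITIES = 8;    # lower this to shrink the model
```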

       
  • stevenyslin

    stevenyslin - 2016-07-05

    Dear sir,

    for Q1:
    Does pocketsphinx_continuous also apply the same processing (silence not exceeding 0.2 seconds)?
    If so, could you tell me where in the code this is done?

    Thanks for your help again

     
    • Nickolay V. Shmyrev

      Yes, pocketsphinx also leaves some silence before decoding internally; you can find the code in fe_prespch_buf.c in sphinxbase.

       
  • stevenyslin

    stevenyslin - 2016-07-12

    Dear sir,

    So when we speak in a noisy environment, will fe_prespch_buf.c also cut out the noise fragments where we are not speaking?

    Thanks for your help again.

     

    Last edit: stevenyslin 2016-07-12
    • Nickolay V. Shmyrev

      Yes

       
  • stevenyslin

    stevenyslin - 2016-07-14

    Sorry, I have one more question:
    Does pocketsphinx_batch also use fe_prespch_buf.c to cut the noise fragments?

    Really thanks a lot.

     
    • Nickolay V. Shmyrev

      Does pocketsphinx_batch also use fe_prespch_buf.c to cut the noise fragments?

      Yes

       
  • stevenyslin

    stevenyslin - 2016-07-14

    Dear Sir,

    OK, I got it. So both pocketsphinx_continuous and pocketsphinx_batch use fe_prespch_buf.c to cut noise fragments and silence.

    My question:
    Why do we need to keep silence or noise under 0.2 seconds if both tools will process it anyway?

    Thanks for your help again.

     
    • Nickolay V. Shmyrev

      0.2 seconds is needed to train the silence phone and to accurately estimate the transition from silence to the first speech phone.

       

