
About training Acoustic Model

Help
2016-07-05
2016-07-14
  • stevenyslin

    stevenyslin - 2016-07-05

    Hello,

    I have two questions about training on voice data:

    Q1:
    I want to train an HMM model on my own voice,
    but I found that each of my wav files is only one or two seconds long,

    for example:

    <s> apple </s> (arctic_0001)
    <s> banana </s> (arctic_0002)
    <s> cat </s> (arctic_0003)
    <s> dog </s> (arctic_0004)
    <s> egg </s> (arctic_0005)
    <s> fish </s> (arctic_0006)
    

    But this tutorial,
    http://cmusphinx.sourceforge.net/wiki/tutorialam
    says that the audio length should be not less than 5 seconds and not more than 30 seconds.

    My question:
    Will this have a big effect on training?
    If so, is there a solution to this problem?

    Q2:
    I asked two people (person_A and person_B) to each record data for their own HMM model, in a quiet room.
    My dictionary has eight words; each word is spoken five times, so one group has forty wav files.
    In total I have 400 training files and 200 testing files.

    My question:
    (1) For person_A, training seems to converge quickly with good accuracy: from 280 to 400 training files, recognition accuracy on the training set does not drop. For person_B, however, training-set accuracy drops between 320 and 400 files, and accuracy on the test data is only about 80%.
    What causes the accuracy drop for person_B, and is there anything we should pay special attention to?

    (2) Is there over-training when training an HMM?
    When I use more than 240 files, the accuracy drops.

    person_A:
    http://imgur.com/egvqxo0
    person_B:
    http://imgur.com/6Ldt6YW

    Thanks for your help.

     

    Last edit: stevenyslin 2016-07-05
    • Nickolay V. Shmyrev

      Will this have a big effect on training?

      I updated the tutorial with a clearer description:

      Audio recordings should contain training audio, and that training audio should match the audio you want to recognize. In case of mismatch there can be a drop in accuracy, sometimes a significant one. For example, if you want to recognize continuous speech, your training database should contain continuous speech. For continuous speech the audio files should be neither very long nor very short; the optimal length is not less than 5 seconds and not more than 30 seconds. Very long files make training much harder. If you are going to recognize short isolated commands, your training database should contain files with short isolated commands. It is better, though, to design the database for continuous speech from the beginning and not spend your time on commands; in the end you speak continuously anyway. The amount of silence at the beginning and at the end of each utterance should not exceed 0.2 seconds.
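A quick way to check a set of recordings against the length guideline above is to compute each file's duration with Python's standard wave module. This is only a sketch: the wav/ directory name is a placeholder for wherever the training audio actually lives.

```python
import glob
import wave

def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# "wav/" is just a placeholder for the training-audio directory.
for path in sorted(glob.glob("wav/*.wav")):
    d = wav_duration(path)
    flag = "" if 5.0 <= d <= 30.0 else "  <-- outside the 5-30 s guideline"
    print(f"{path}: {d:.2f} s{flag}")
```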

      (1) For person_A, training seems to converge quickly with good accuracy: from 280 to 400 training files, recognition accuracy on the training set does not drop. For person_B, however, training-set accuracy drops between 320 and 400 files, and accuracy on the test data is only about 80%.
      What causes the accuracy drop for person_B, and is there anything we should pay special attention to?

      You do not have enough training data for the number of parameters you train, so the accuracy is unstable. You need to increase the data size or reduce the number of parameters.

      (2) Is there over-training when training an HMM? When I use more than 240 files, the accuracy drops.

      Yes.

       
  • stevenyslin

    stevenyslin - 2016-07-05

    Dear sir,

    Thanks a lot. I have new questions about Q1 and Q2:

    Q1:

    Amount of silence in the beginning of the utterance and in the end of the utterance should not exceed 0.2 second.

    (1) How can I keep the silence under 0.2 seconds?
    I record with the Linux rec utility for a fixed length of time.
    (2) Do both the training and testing data need to keep silence under 0.2 seconds, or only the training data?

    Q2:

      You do not have enough training data for the number of parameters you train, so the accuracy is unstable. You need to increase the data size or reduce the number of parameters.

    (1) How much training data should I have?
    Since I only want to recognize eight words spoken by myself, do I need more than one hour of data for training?
    (2) Sorry, I don't understand "reduce the number of parameters"; do you mean modifying "sphinx_train.cfg"?

    Thanks for your help again.

     

    Last edit: stevenyslin 2016-07-05
    • Nickolay V. Shmyrev

      (1) How can I keep the silence under 0.2 seconds?
      I record with the Linux rec utility for a fixed length of time.

      Cut the files in an audio editor, or use sox's trim/silence effects to trim the silence.
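If sox is not at hand, the same idea can be sketched with Python's standard library alone. This is only an illustration, not the project's tooling: it assumes 16-bit mono WAV input, and the 500-sample amplitude threshold is an arbitrary assumption that would need tuning per recording setup.

```python
import struct
import wave

def trim_silence(in_path, out_path, thresh=500, keep=0.2):
    """Trim leading/trailing silence from a 16-bit mono WAV,
    keeping about `keep` seconds of silence on each side.
    `thresh` is an assumed amplitude threshold for "speech"."""
    with wave.open(in_path, "rb") as w:
        params = w.getparams()
        rate = w.getframerate()
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    # Indices of samples loud enough to count as speech.
    above = [i for i, s in enumerate(samples) if abs(s) > thresh]
    if not above:
        start, end = 0, len(samples)  # nothing detected; keep everything
    else:
        pad = int(keep * rate)
        start = max(0, above[0] - pad)
        end = min(len(samples), above[-1] + 1 + pad)
    with wave.open(out_path, "wb") as w:
        w.setparams(params)
        w.writeframes(struct.pack("<%dh" % (end - start), *samples[start:end]))
```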

      (2) Do both the training and testing data need to keep silence under 0.2 seconds, or only the training data?

      Both

      (1) How much training data should I have?
      Since I only want to recognize eight words spoken by myself, do I need more than one hour of data for training?

      Yes; you need to follow the tutorial.

      (2) Sorry, I don't understand "reduce the number of parameters"; do you mean modifying "sphinx_train.cfg"?

      Yes, this is also covered in the tutorial.
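For orientation, the model's parameter count in sphinx_train.cfg is governed mainly by the number of tied states (senones) and Gaussian densities. The values below only illustrate which settings to look at; they are not recommendations for this task:

```perl
# etc/sphinx_train.cfg -- illustrative values only
$CFG_N_TIED_STATES = 200;        # fewer senones = fewer parameters
$CFG_INITIAL_NUM_DENSITIES = 1;  # Gaussians per state at the start of training
$CFG_FINAL_NUM_DENSITIES = 8;    # lower this to shrink the model
```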

       
  • stevenyslin

    stevenyslin - 2016-07-05

    Dear sir,

    for Q1:
    Does pocketsphinx_continuous also apply the same processing (silence not exceeding 0.2 seconds)?
    If so, could you tell me where in the code this is done?

    Thanks for your help again

     
    • Nickolay V. Shmyrev

      Yes, pocketsphinx also leaves some silence before decoding internally; you can find the code in fe_prespch_buf.c in sphinxbase.

       
  • stevenyslin

    stevenyslin - 2016-07-12

    Dear sir,

    So when we speak in a noisy environment, will fe_prespch_buf.c also cut out the noise fragments where we are not speaking?

    Thanks for your help again.

     

    Last edit: stevenyslin 2016-07-12
    • Nickolay V. Shmyrev

      Yes

       
  • stevenyslin

    stevenyslin - 2016-07-14

    Sorry, I have one more question:
    Does pocketsphinx_batch also use fe_prespch_buf.c to cut the noise fragments?

    Really thanks a lot.

     
    • Nickolay V. Shmyrev

      Does pocketsphinx_batch also use fe_prespch_buf.c to cut the noise fragments?

      Yes

       
  • stevenyslin

    stevenyslin - 2016-07-14

    Dear Sir,

    OK, I got it. So both pocketsphinx_continuous and pocketsphinx_batch use fe_prespch_buf.c to cut noise fragments and silence.

    My question:
    Why do we need to keep silence or noise under 0.2 seconds if both tools will process it anyway?

    Thanks for your help again.

     
    • Nickolay V. Shmyrev

      0.2 seconds is needed to train the silence phone and to accurately estimate the transition from silence to the first speech phone.

       

