
Acoustic model adaptation

2016-05-19
2016-05-27
  • Ananthapadmanabhan

    Hi,

    I'm trying to adapt the acoustic model for a different accent, using multiple utterances for training. The problem is that the adaptation process accepts 2 utterances from the same speaker but fails to align the other 2.

    I have followed the adaptation tutorial:
    http://cmusphinx.sourceforge.net/wiki/tutorialadapt

    and the troubleshooting guide:
    http://cmusphinx.sourceforge.net/wiki/tutorialam#troubleshooting

    but it still didn't help.

    The bw command being used is:

    ..\sphinxtrain\bin\Release\Win32\bw -hmmdir ..\pocketsphinx\model\en-us\en-us -moddeffn ..\pocketsphinx\model\en-us\en-us\mdef.txt -ts2cbfn .ptm. -feat 1s_c_d_dd -svspec 0-12/13-25/26-38 -cmn current -agc none -dictfn ..\custom-dict\h2g2.dict -ctlfn ..\h2g2\fileids.fileids -lsnfn ..\h2g2\convert_text.transcription -accumdir .
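    One frequent cause of bw alignment failures is a mismatch between the fileids list and the transcription file. A quick sanity check (a hypothetical standalone sketch, not part of the sphinxtrain tools; `check_transcription` is an invented helper name) is that each transcription line must end with an `(utterance_id)` matching the corresponding fileids line, in the same order:

```python
import re

def check_transcription(fileids_lines, transcription_lines):
    """Return a list of mismatches between fileids and transcription lines.

    sphinxtrain expects each transcription line to end with the
    utterance id in parentheses, in the same order as the fileids file.
    """
    problems = []
    if len(fileids_lines) != len(transcription_lines):
        problems.append("line count differs: %d fileids vs %d transcriptions"
                        % (len(fileids_lines), len(transcription_lines)))
    for fid, trans in zip(fileids_lines, transcription_lines):
        m = re.search(r"\(([^)]+)\)\s*$", trans)
        if m is None:
            problems.append("no (utt_id) at end of: " + trans)
        elif m.group(1) != fid.split("/")[-1]:
            problems.append("id mismatch: %s vs %s" % (fid, m.group(1)))
    return problems
```

    Call it with the stripped lines of fileids.fileids and convert_text.transcription; an empty result means the two files at least agree on utterance ids.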

    Here is the relevant part of the log:

    utt> 28 mvoice9denoisedpart1 333
    INFO: cmn.c(183): CMN: 69.17 5.41 6.44 8.07 -5.05 -11.88 16.33 5.91 -5.77 -16.24 -12.09 -5.56 7.25
    0 96 40 9 10 1.112933e-102 -1.788552e+002 -5.955879e+004 utt 0.023x 1.101e upd 0.023x 1.075e fwd 0.014x 1.109e bwd 0.009x 0.992e gau 0.028x 1.355e rsts 0.000x 0.000e rstf 0.000x 0.000e rstu 0.000x 0.000e

    utt> 29 mvoice9denoisedpart2 680
    INFO: cmn.c(183): CMN: 67.55 12.40 6.01 11.76 -8.36 -15.87 16.77 -6.53 -8.55 -12.22 -9.91 -7.97 5.76
    0 256 47
    ERROR: "backward.c", line 421: Failed to align audio to trancript: final state of the search is not reached
    ERROR: "baumwelch.c", line 324: mvoice9denoisedpart2 ignored
    utt 0.021x 0.917e upd 0.018x 1.016e fwd 0.018x 1.016e bwd 0.000x 0.000e gau 0.014x 1.152e rsts 0.000x 0.000e rstf 0.000x 0.000e rstu 0.000x 0.000e

    utt> 30 mvoice9denoisedpart3 815
    INFO: cmn.c(183): CMN: 71.05 10.72 5.71 16.02 -8.00 -19.48 22.63 -0.06 -0.81 -13.28 -9.29 -9.30 3.70
    0 292 61
    ERROR: "backward.c", line 421: Failed to align audio to trancript: final state of the search is not reached
    ERROR: "baumwelch.c", line 324: mvoice9denoisedpart3 ignored
    utt 0.017x 1.038e upd 0.017x 1.031e fwd 0.017x 1.024e bwd 0.000x 0.000e gau 0.013x 1.179e rsts 0.000x 0.000e rstf 0.000x 0.000e rstu 0.000x 0.000e

    utt> 31 mvoice9denoisedpart4 573
    INFO: cmn.c(183): CMN: 71.86 11.21 5.49 10.48 -3.98 -17.57 16.87 0.62 -5.87 -14.06 -12.95 -6.38 6.31
    0 184 43 17 13 5.013159e-102 -1.810491e+002 -1.037411e+005 utt 0.033x 1.019e upd 0.033x 1.008e fwd 0.016x 1.024e bwd 0.016x 0.981e gau 0.030x 2.397e rsts 0.003x 0.640e rstf 0.000x 0.000e rstu 0.000x 0.000e

    Here are the 4 files mentioned in the log above:

    https://drive.google.com/open?id=0BzWUiNE5OXlub0dlQzZVS2FzdFk

    Can anyone please help with this?

     
    • Nickolay V. Shmyrev

      Overall your pronunciation is far from US English, so it is perfectly fine to have alignment mistakes. You need to train a model from scratch here.

      To get more detailed help on this issue you need to provide the transcription file you were using; you did not provide that.

       
      • Ananthapadmanabhan

        "Overall your pronunciation is far from US English, so it is perfectly fine to have alignment mistakes. You need to train a model from scratch here."
        

        To overcome the discrepancy in pronunciation, I have made a dictionary file with revised pronunciations, using the same phoneme set, to train with. I have also ensured that all the pronunciation variants heard in the audio files are present in the dictionary. Is that not sufficient?
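        A related check worth running (a hypothetical sketch, not one of the official tools; `missing_words` is an invented helper) is that every word spoken in the transcription actually has a dictionary entry, treating `word(2)`-style variant markers as entries for the base word:

```python
import re

def missing_words(dict_lines, transcription_lines):
    """Words used in the transcription that have no dictionary entry.

    Dictionary lines look like 'WORD  PH ON EMES', with alternate
    pronunciations written as 'WORD(2)  PH ...'; the variant marker is
    stripped so WORD(2) counts as an entry for WORD.
    """
    vocab = set()
    for line in dict_lines:
        if line.strip():
            head = line.split()[0]
            vocab.add(re.sub(r"\(\d+\)$", "", head).lower())
    missing = set()
    for line in transcription_lines:
        text = re.sub(r"\([^)]*\)\s*$", "", line)   # drop trailing (utt_id)
        for word in text.split():
            w = re.sub(r"\(\d+\)$", "", word).lower()
            if w not in vocab:
                missing.add(w)
    return sorted(missing)
```

        Any word it reports would make bw fail on every utterance containing it, which is a different failure mode from the accent mismatch discussed here.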

        "To get more detailed help on this issue you need to provide the transcription file you were using"
        

        I have uploaded the revised dictionary file and the transcription text to the same link as above:
        https://drive.google.com/folderview?id=0BzWUiNE5OXlub0dlQzZVS2FzdFk&usp=drive_web

         
        • Nickolay V. Shmyrev

          Alternative pronunciations are not used in adaptation; they are only used in decoding. To use alternative pronunciations in adaptation you need to force-align the transcript first with sphinx3_align, or you need to manually indicate which alternative pronunciation is actually used in the transcription file.
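          For the manual route, the transcription can mark the chosen variant explicitly with the same `(2)`-style suffix used in the dictionary. A small hypothetical helper for applying such substitutions (the word/variant pairs are your own choices, nothing is detected automatically):

```python
import re

def mark_variants(transcription_line, variant_map):
    """Replace words with their chosen pronunciation variant.

    variant_map maps a word to the exact dictionary entry to use,
    e.g. {"tomato": "tomato(2)"}; the trailing (utt_id) is preserved.
    """
    m = re.search(r"(\([^)]*\)\s*)$", transcription_line)
    tail = m.group(1) if m else ""
    body = transcription_line[: len(transcription_line) - len(tail)]
    words = [variant_map.get(w, w) for w in body.split()]
    return " ".join(words) + (" " + tail.strip() if tail else "")
```

          Run over each transcription line before calling bw, this tells the trainer which dictionary entry to align against instead of always taking the first one.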

          Dictionary variants are not very effective overall; you need an Indian English model or US English speech.

           
  • Ananthapadmanabhan

    Thanks for the forced-alignment tip.

    I was successful in adapting the model, although I was expecting accuracy around 40-50%. Then I saw this:

    "Dictionary variants are not very effective overall, you need Indian English model or US English speech."
    

    Essentially, in the absence of an Indian English model, I would like to build one using call center data. To get funding for that, I am trying to build a small model of 100 words which can identify all of these words in an Indian accent. As shared before, I have made the dictionary and LM file for it.

    Now my questions are:

    1. Do you recommend training or adapting the model? (In the tutorial, adapting was suggested, which seems to go against your advice.) Although I would want to train for the final 4-5k-word model, my question is what works best for the 100-word model.

    2. For either option (adapting/training), how much data should I gather to reach 70-80% accuracy?

    If you have answered these questions in another thread, can you please give me the link, because I wasn't able to find it.

     
    • Nickolay V. Shmyrev

      To successfully recognize call center data you need to train a new model. Existing models will not be helpful for you; they simply will not work. You need at least 100 hours of data, and ideally more than 300 hours of transcribed speech.

      You need to train the model; adaptation is not very effective. This does not sound easy, but that is the state of things.

      If you still want to demo adaptation, I suggest you build the model exclusively for your own voice, with higher-quality audio. Do not use recordings from other random people. You need to record about 20-30 minutes of speech yourself.

       
      • Nickolay V. Shmyrev

        And, besides acoustic model adaptation, you need to build a specialized call center language model. It is also critical for accuracy. For that you again need transcribed call center logs.
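        As a toy illustration of why the domain transcripts matter (this is not the CMUCLMTK/SRILM tooling you would actually use to produce an ARPA model): an n-gram language model is estimated from raw counts over the training text, so call-center phrasing has to appear in that text to ever get probability mass:

```python
from collections import Counter

def bigram_counts(sentences):
    """Count bigrams with sentence-boundary markers: the raw
    statistics an n-gram language model is estimated from."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            counts[(a, b)] += 1
    return counts
```

        Feeding this a general-news corpus instead of call-center logs would leave phrases like "thank you for calling" with near-zero counts, which is exactly why the decoder would then misrecognize them.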

         
