
Acoustic model adaptation

2016-05-19
2016-05-27
  • Ananthapadmanabhan

    Hi,

    I'm trying to adapt the acoustic model for a different accent, using multiple utterances for training. The problem is that the adaptation process accepts 2 utterances from the same speaker but fails to align the other 2.

    I have followed the adaptation tutorial:
    http://cmusphinx.sourceforge.net/wiki/tutorialadapt

    and the troubleshooting guide:
    http://cmusphinx.sourceforge.net/wiki/tutorialam#troubleshooting

    but it still didn't help.

    The bw command being used is:

    ..\sphinxtrain\bin\Release\Win32\bw -hmmdir ..\pocketsphinx\model\en-us\en-us -moddeffn ..\pocketsphinx\model\en-us\en-us\mdef.txt -ts2cbfn .ptm. -feat 1s_c_d_dd -svspec 0-12/13-25/26-38 -cmn current -agc none -dictfn ..\custom-dict\h2g2.dict -ctlfn ..\h2g2\fileids.fileids -lsnfn ..\h2g2\convert_text.transcription -accumdir .
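    One frequent cause of bw alignment failures is a mismatch between the fileids list and the transcription file. A quick sanity check (a hypothetical standalone sketch, not part of the sphinxtrain tools; `check_transcription` is an invented helper name) is that each transcription line must end with an `(utterance_id)` matching the corresponding fileids line, in the same order:

```python
import re

def check_transcription(fileids_lines, transcription_lines):
    """Return a list of mismatches between fileids and transcription lines.

    sphinxtrain expects each transcription line to end with the
    utterance id in parentheses, in the same order as the fileids file.
    """
    problems = []
    if len(fileids_lines) != len(transcription_lines):
        problems.append("line count differs: %d fileids vs %d transcriptions"
                        % (len(fileids_lines), len(transcription_lines)))
    for fid, trans in zip(fileids_lines, transcription_lines):
        m = re.search(r"\(([^)]+)\)\s*$", trans)
        if m is None:
            problems.append("no (utt_id) at end of: " + trans)
        elif m.group(1) != fid.split("/")[-1]:
            problems.append("id mismatch: %s vs %s" % (fid, m.group(1)))
    return problems
```

    Call it with the stripped lines of fileids.fileids and convert_text.transcription; an empty result means the two files at least agree on utterance ids.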

    Here is the relevant part of the log:

    utt> 28 mvoice9denoisedpart1 333
    INFO: cmn.c(183): CMN: 69.17 5.41 6.44 8.07 -5.05 -11.88 16.33 5.91 -5.77 -16.24 -12.09 -5.56 7.25
    0 96 40 9 10 1.112933e-102 -1.788552e+002 -5.955879e+004 utt 0.023x 1.101e upd 0.023x 1.075e fwd 0.014x 1.109e bwd 0.009x 0.992e gau 0.028x 1.355e rsts 0.000x 0.000e rstf 0.000x 0.000e rstu 0.000x 0.000e

    utt> 29 mvoice9denoisedpart2 680
    INFO: cmn.c(183): CMN: 67.55 12.40 6.01 11.76 -8.36 -15.87 16.77 -6.53 -8.55 -12.22 -9.91 -7.97 5.76
    0 256 47
    ERROR: "backward.c", line 421: Failed to align audio to trancript: final state of the search is not reached
    ERROR: "baumwelch.c", line 324: mvoice9denoisedpart2 ignored
    utt 0.021x 0.917e upd 0.018x 1.016e fwd 0.018x 1.016e bwd 0.000x 0.000e gau 0.014x 1.152e rsts 0.000x 0.000e rstf 0.000x 0.000e rstu 0.000x 0.000e

    utt> 30 mvoice9denoisedpart3 815
    INFO: cmn.c(183): CMN: 71.05 10.72 5.71 16.02 -8.00 -19.48 22.63 -0.06 -0.81 -13.28 -9.29 -9.30 3.70
    0 292 61
    ERROR: "backward.c", line 421: Failed to align audio to trancript: final state of the search is not reached
    ERROR: "baumwelch.c", line 324: mvoice9denoisedpart3 ignored
    utt 0.017x 1.038e upd 0.017x 1.031e fwd 0.017x 1.024e bwd 0.000x 0.000e gau 0.013x 1.179e rsts 0.000x 0.000e rstf 0.000x 0.000e rstu 0.000x 0.000e

    utt> 31 mvoice9denoisedpart4 573
    INFO: cmn.c(183): CMN: 71.86 11.21 5.49 10.48 -3.98 -17.57 16.87 0.62 -5.87 -14.06 -12.95 -6.38 6.31
    0 184 43 17 13 5.013159e-102 -1.810491e+002 -1.037411e+005 utt 0.033x 1.019e upd 0.033x 1.008e fwd 0.016x 1.024e bwd 0.016x 0.981e gau 0.030x 2.397e rsts 0.003x 0.640e rstf 0.000x 0.000e rstu 0.000x 0.000e

    Here are the 4 files mentioned in the log above:

    https://drive.google.com/open?id=0BzWUiNE5OXlub0dlQzZVS2FzdFk

    Can anyone please help with this?

     
    • Nickolay V. Shmyrev

      Overall your pronunciation is far from US English, so it is perfectly fine to have alignment mistakes. You need to train a model from scratch here.

      To get more detailed help on this issue you need to provide the transcription file you were using; you did not provide that.

       
      • Ananthapadmanabhan

        "Overall your pronunciation is far from US English, so it is perfectly fine to have alignment mistakes. You need to train a model from scratch here."
        

        To overcome the discrepancy in pronunciation, I have made a dictionary file with revised pronunciations, using the same phoneme set, to train with. I have also ensured that all the pronunciation variants heard in the audio files are present in the dictionary. Is that not sufficient?
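        A related check worth running (a hypothetical sketch, not one of the official tools; `missing_words` is an invented helper) is that every word spoken in the transcription actually has a dictionary entry, treating `word(2)`-style variant markers as entries for the base word:

```python
import re

def missing_words(dict_lines, transcription_lines):
    """Words used in the transcription that have no dictionary entry.

    Dictionary lines look like 'WORD  PH ON EMES', with alternate
    pronunciations written as 'WORD(2)  PH ...'; the variant marker is
    stripped so WORD(2) counts as an entry for WORD.
    """
    vocab = set()
    for line in dict_lines:
        if line.strip():
            head = line.split()[0]
            vocab.add(re.sub(r"\(\d+\)$", "", head).lower())
    missing = set()
    for line in transcription_lines:
        text = re.sub(r"\([^)]*\)\s*$", "", line)   # drop trailing (utt_id)
        for word in text.split():
            w = re.sub(r"\(\d+\)$", "", word).lower()
            if w not in vocab:
                missing.add(w)
    return sorted(missing)
```

        Any word it reports would make bw fail on every utterance containing it, which is a different failure mode from the accent mismatch discussed here.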

        "To get more detailed help on this issue you need to provide the transcription file you were using"
        

        I have uploaded the revised dictionary file and the transcription text to the same link as above:
        https://drive.google.com/folderview?id=0BzWUiNE5OXlub0dlQzZVS2FzdFk&usp=drive_web

         
        • Nickolay V. Shmyrev

          Alternative pronunciations are not used in adaptation; they are only used in decoding. To use alternative pronunciations in adaptation you need to force-align the transcript first with sphinx3_align, or you need to manually indicate which alternative pronunciation is actually used in the transcription file.
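          For the manual route, the transcription can mark the chosen variant explicitly with the same `(2)`-style suffix used in the dictionary. A small hypothetical helper for applying such substitutions (the word/variant pairs are your own choices, nothing is detected automatically):

```python
import re

def mark_variants(transcription_line, variant_map):
    """Replace words with their chosen pronunciation variant.

    variant_map maps a word to the exact dictionary entry to use,
    e.g. {"tomato": "tomato(2)"}; the trailing (utt_id) is preserved.
    """
    m = re.search(r"(\([^)]*\)\s*)$", transcription_line)
    tail = m.group(1) if m else ""
    body = transcription_line[: len(transcription_line) - len(tail)]
    words = [variant_map.get(w, w) for w in body.split()]
    return " ".join(words) + (" " + tail.strip() if tail else "")
```

          Run over each transcription line before calling bw, this tells the trainer which dictionary entry to align against instead of always taking the first one.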

          Dictionary variants are not very effective overall; you need an Indian English model or US English speech.

           
  • Ananthapadmanabhan

    Thanks for the forced-alignment tip.

    I was successful in adapting the model, although I was expecting accuracy around 40-50%. Then I saw this:

    "Dictionary variants are not very effective overall, you need Indian English model or US English speech."
    

    Essentially, in the absence of an Indian English model, I would like to build one using call center data. To get funding for that, I am trying to build a small model of 100 words which can identify all of these words in an Indian accent. As shared before, I have made the dictionary and LM file for it.

    Now my questions are:

    1. Do you recommend training or adapting the model? (In the tutorial, adapting was suggested, which seems to go against your advice.) Although I would want to train for the final 4-5k-word model, my question is what works best for the 100-word model.

    2. For either option (adapting/training), how much data should I gather to reach 70-80% accuracy?

    If you have answered these questions in another thread, can you please give me the link, because I wasn't able to find it.

     
    • Nickolay V. Shmyrev

      To successfully recognize call center data you need to train a new model. Existing models will not be helpful for you; they simply will not work. You need at least 100 hours of data, and ideally more than 300 hours of transcribed speech.

      You need to train the model; adaptation is not very effective. This does not sound easy, but that is the state of things.

      If you still want to demo adaptation, I suggest you build the model exclusively for your own voice, with higher-quality audio. Do not use recordings from other random people. You need to record about 20-30 minutes of speech yourself.

       
      • Nickolay V. Shmyrev

        And, besides acoustic model adaptation, you need to build a specialized call center language model. It is also critical for accuracy. For that you again need transcribed call center logs.
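        As a toy illustration of why the domain transcripts matter (this is not the CMUCLMTK/SRILM tooling you would actually use to produce an ARPA model): an n-gram language model is estimated from raw counts over the training text, so call-center phrasing has to appear in that text to ever get probability mass:

```python
from collections import Counter

def bigram_counts(sentences):
    """Count bigrams with sentence-boundary markers: the raw
    statistics an n-gram language model is estimated from."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            counts[(a, b)] += 1
    return counts
```

        Feeding this a general-news corpus instead of call-center logs would leave phrases like "thank you for calling" with near-zero counts, which is exactly why the decoder would then misrecognize them.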

         
