Hi,
I am trying to recognize some 8 kHz call center data using the 8 kHz acoustic model in pocketsphinx. So far the accuracy is about 35%. I should mention that I have not yet trained a language model and am still using the default en-us.lm.bin that comes with pocketsphinx. The 8 kHz acoustic model was downloaded from the models directory (cmusphinx-en-us-8khz-5.1.tar.gz). I have some follow-up questions:
I believe MAP adaptation should help. My audio files are a few minutes long on average. Should I segment the audio into smaller chunks, or should the chunks be reasonably long, say 30 seconds or so? For such a scenario, how much improvement in accuracy can be expected?
Does it make sense to train a new AM or to adapt the existing 8 kHz model, given that I have only around 20 hours of data? The keyword spotting accuracy is also low, which makes me believe that the acoustics are not well matched to the model.
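For reference, a keyword spotting run with pocketsphinx_continuous looks roughly like the sketch below; the keyphrase and threshold are placeholders, not my actual settings:
pocketsphinx_continuous.exe -hmm ../Sphinx/cmusphinx-en-us-8khz-5.1/en-us-8khz/ -dict cmudict-en-us.dict -samprate 8000 -keyphrase "account balance" -kws_threshold 1e-20 -infile pp.wav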
Any feedback is appreciated.
My command line for pocketsphinx_continuous is as below:
pocketsphinx_continuous.exe -hmm ../Sphinx/cmusphinx-en-us-8khz-5.1/en-us-8khz/ -dict cmudict-en-us.dict -samprate 8000 -featparams ../Sphinx/cmusphinx-en-us-8khz-5.1/en-us-8khz/feat.params -infile pp.wav -dither yes -verbose yes -lm ../sphinx-cygwin/pocketsphinx/model/en-us/en-us.lm.bin -logfn qq.txt
The result is as below for pp.wav:
good i'll gate yes
they sure it does look good grub doc doctor are you do it doing right mr slot badger wednesday garlic
go go captain i'd help today
out
i had a charge that it did not belong to meet them that we are palate
but failing that
when it varies
i had disputed right
I have uploaded the audio at:
https://drive.google.com/folderview?id=0B10fku0xDUT5flZneWhmNklINGlDWGJ6OEJQZ1pJS1kzMHd1dlJULXhZbEJSbC1IbW5BRk0&usp=sharing
MAP adaptation should not really help here. Adaptation makes sense when you have something specific to adapt to, like a particular speaker or environment. If you just want to build a speaker-independent transcription system, you are better off training from scratch on a big database; you need a big database covering many speakers anyway.
20 hours of data is not enough to train a new model. If you want to increase accuracy, you could try more accurate models, for example the Fisher DNN model with the Kaldi toolkit, which is going to be much more accurate than our en-us model.
If you want to train your own model, you can take broadcast data and emulate the channel you have by applying the same compression.
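For example, assuming the broadcast audio is plain 16 kHz WAV (broadcast.wav is a placeholder name), something like the following sox commands would downsample it to 8 kHz mono and pass it through a telephone-style mu-law codec, as a rough stand-in for whatever codec your channel actually uses:
sox broadcast.wav -r 8000 -c 1 -e mu-law telephone.wav
sox telephone.wav -e signed-integer -b 16 train.wav
The second command simply converts back to 16-bit PCM at 8 kHz so the degraded file can be fed to the trainer.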
It is also much more productive to train a domain-specific language model; it should help a lot.
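As a rough sketch, assuming your call center transcripts are collected in a plain text file (corpus.txt is a placeholder name), a language model can be built with the CMUCLMTK tools and converted to the binary format pocketsphinx uses:
text2wfreq < corpus.txt | wfreq2vocab > corpus.vocab
text2idngram -vocab corpus.vocab -idngram corpus.idngram < corpus.txt
idngram2lm -vocab_type 0 -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.lm
sphinx_lm_convert -i corpus.lm -o corpus.lm.bin
You would then pass corpus.lm.bin with the -lm option instead of en-us.lm.bin.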
You can also try to get better sound quality; it is critical for accuracy.