I'm trying to adapt PocketSphinx (pocketsphinx-5prealpha.tar.gz) to transcribe call center calls (one agent for now), as mentioned here. It needs to recognize different English accents (various US accents, Philippines, etc.). The audio calls were originally 8 kHz mono - I was unable to get any accuracy using the 8 kHz model, so I upsampled the calls to 16 kHz and used the 16 kHz model. Unfortunately, I'm still getting very low accuracy on the test data.
Acoustic Model - cmusphinx-en-us-ptm-5.2.tar
LM - cmusphinx-5.0-en-us.lm
Dictionary - cmudict-en-us.dict
Model + Wave files: Zipped Separately here
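For reference, this is roughly how I'm invoking the decoder (the wav file name below is just a placeholder for one of my upsampled calls):

    pocketsphinx_continuous \
        -hmm cmusphinx-en-us-ptm-5.2 \
        -lm cmusphinx-5.0-en-us.lm \
        -dict cmudict-en-us.dict \
        -samprate 16000 \
        -infile call_16k.wav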
What else I tried:
1. I removed some irrelevant words from the dictionary (words it was incorrectly transcribing). Didn't help much.
2. Added alternate pronunciations for the incorrectly recognized words, trying to match the agent's accent (see the dictionary snippet after this list). Didn't help much.
3. Tried a different agent, but this agent used slang and often swallowed his words. He was barely understandable even to me.
4. Used lmtool to create a completely new dictionary and language model. Maybe I did something wrong, but I still didn't get good output.
5. Some articles suggested using MLLR instead of MAP (I don't know the difference between them), so I adapted using MLLR only. Wasn't helpful.
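For point 2, the alternate pronunciations follow the cmudict variant convention, where a (2) suffix marks a second pronunciation of the same word; something along these lines (the entries below are a made-up illustration, not my actual edits):

    schedule     S K EH JH UW L
    schedule(2)  SH EH JH UW L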
From all of the above, you can probably see that I'm just throwing stones randomly in the dark. Can you please tell me what I'm doing wrong and how to increase the accuracy?
PS: I don't know anything about speech-to-text conversion; I just need to integrate PocketSphinx into our web application so we can perform some text analytics. So any advice would be very welcome, including recommendations for links/books on the basic theory :)
Thanks a lot!
You need to use the 8 kHz model to recognize 8 kHz audio, as explained here:
http://cmusphinx.sourceforge.net/wiki/faq#qwhat_is_sample_rate_and_how_does_it_affect_accuracy
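For example, decoding the original 8 kHz mono recordings directly would look roughly like this (the model directory and wav file name are placeholders; point -hmm at the 8 kHz English model you download):

    pocketsphinx_continuous \
        -hmm en-us-8khz \
        -lm cmusphinx-5.0-en-us.lm \
        -dict cmudict-en-us.dict \
        -samprate 8000 \
        -infile call_8k.wav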
It is also better to use a continuous model, not a PTM model. Continuous models are slower but more accurate.
MLLR adaptation of the continuous model followed by MAP adaptation should give you enough accuracy, provided your adaptation transcription is correct.
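A rough sketch of that pipeline with the sphinxtrain tools, assuming a copy of the continuous model in en-us and adaptation audio listed in adapt.fileids with a matching adapt.transcription (both placeholder names); see the adaptation tutorial on the wiki for the full walkthrough:

    # If the model ships a binary mdef, convert it to text first
    pocketsphinx_mdef_convert -text en-us/mdef en-us/mdef.txt

    # Extract features that match the model's feat.params
    sphinx_fe -argfile en-us/feat.params -samprate 8000 \
        -c adapt.fileids -di . -do . -ei wav -eo mfc -mswav yes

    # Accumulate statistics from the adaptation data
    bw -hmmdir en-us -moddeffn en-us/mdef.txt -ts2cbfn .cont. \
        -feat 1s_c_d_dd -cmn current -agc none \
        -dictfn cmudict-en-us.dict -ctlfn adapt.fileids \
        -lsnfn adapt.transcription -accumdir .

    # MLLR transform first ...
    mllr_solve -meanfn en-us/means -varfn en-us/variances \
        -outmllrfn mllr_matrix -accumdir .

    # ... then MAP-update a copy of the model
    cp -a en-us en-us-adapt
    map_adapt -moddeffn en-us/mdef.txt -ts2cbfn .cont. \
        -meanfn en-us/means -varfn en-us/variances \
        -mapmeanfn en-us-adapt/means -mapvarfn en-us-adapt/variances \
        -accumdir .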
You need to train a custom language model from transcripts as described in http://cmusphinx.sourceforge.net/wiki/tutoriallmadvanced
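A minimal sketch of that process with the CMUCLMTK tools, assuming transcripts.txt (placeholder name) holds one sentence per line wrapped in <s> ... </s> markers:

    # Build a vocabulary from the call transcripts
    text2wfreq < transcripts.txt | wfreq2vocab > calls.vocab

    # Count n-grams and produce an ARPA-format trigram LM
    text2idngram -vocab calls.vocab -idngram calls.idngram < transcripts.txt
    idngram2lm -vocab_type 0 -idngram calls.idngram \
        -vocab calls.vocab -arpa calls.lm

    # Optionally convert to binary format for faster loading
    sphinx_lm_convert -i calls.lm -o calls.lm.bin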
If you go through everything else step by step, it should work fine too.
Designing a call center transcription system can be quite complicated. You might want to use agent-adapted models, or a quite accurate generic model like a DNN model, which has less adaptation capability; it depends on the amount of transcribed speech you have per agent. You might also need to adapt to a particular accent: our models work best for US English, worse for other accents.