
Need help with Call Center Calls Transcription

Help
Najla
2016-04-14
  • Najla


    Hi,

    I'm trying to adapt PocketSphinx (pocketsphinx-5prealpha.tar.gz) to transcribe call center calls (one agent for now), as mentioned here. It needs to recognize different English accents (various US accents, the Philippines, etc.). The audio calls were originally 8 kHz mono. I was unable to get any accuracy using the 8 kHz model, so I upsampled the calls to 16 kHz and used the 16 kHz model. Unfortunately, I'm still getting very low accuracy on the test data.

    Acoustic Model - cmusphinx-en-us-ptm-5.2.tar
    LM - cmusphinx-5.0-en-us.lm
    Dictionary - cmudict-en-us.dict
    Model + Wave files: Zipped Separately here

    What else I tried:
    1. I removed some irrelevant words from the dictionary (the words that it was incorrectly transcribing). Didn't help much.
    2. Added more phoneme variants for the words that were incorrectly recognized, trying to match the agent's accent; didn't help much.
    3. Tried a different agent, but that agent used slang and often swallowed his words; he was barely understandable even to me.
    4. Used lmtool to create a completely new dictionary and language model. Maybe I did something wrong, but I still didn't get good output.
    5. Some articles suggested using MLLR instead of MAP (I don't know the difference between them), so I adapted using MLLR only. Wasn't helpful.

    From all the above, you can probably see that I'm just throwing stones randomly in the dark. Can you please tell me what I'm doing wrong and how to increase the accuracy?
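    [Editor's note: it helps to quantify "low accuracy". The standard metric is word error rate (WER): the word-level edit distance between a reference transcript and the decoder output, divided by the reference length. A minimal self-contained sketch — this `wer` helper is illustrative, not part of PocketSphinx:]

    ```python
    def wer(reference, hypothesis):
        """Word error rate: Levenshtein distance over words / reference length."""
        ref = reference.split()
        hyp = hypothesis.split()
        # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            cur = [i]
            for j, h in enumerate(hyp, 1):
                cur.append(min(prev[j] + 1,               # deletion
                               cur[j - 1] + 1,            # insertion
                               prev[j - 1] + (r != h)))   # substitution
            prev = cur
        return prev[-1] / len(ref)

    # One substitution out of four reference words:
    print(wer("please hold the line", "please all the line"))  # 0.25
    ```

    Tracking WER on a fixed test set makes it clear whether each of the changes above actually helped.
    
    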

    PS: I don't know anything about speech-to-text conversion; I just need to integrate PocketSphinx into our web application so we can perform some text analytics. So any advice would be very welcome, including recommendations for links/books on the basic theory :)

    Thanks a lot!

     

    Last edit: Najla 2016-04-14
    • Nickolay V. Shmyrev

      You need to use the 8 kHz model to recognize 8 kHz audio, as explained here:

      http://cmusphinx.sourceforge.net/wiki/faq#qwhat_is_sample_rate_and_how_does_it_affect_accuracy
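      For example, decoding the original narrowband recording directly with a matching 8 kHz acoustic model looks roughly like this (a sketch; the model directory name `en-us-8khz` and the file names are placeholders for whatever you actually downloaded):

      ```shell
      # Decode the original 8 kHz mono recording with a matching 8 kHz acoustic model.
      # All paths below are placeholders for your actual model and audio files.
      # -samprate must match the real sample rate of the audio; upsampling to
      # 16 kHz does not add the high-frequency content the 16 kHz model expects.
      pocketsphinx_continuous \
          -hmm en-us-8khz \
          -lm cmusphinx-5.0-en-us.lm \
          -dict cmudict-en-us.dict \
          -samprate 8000 \
          -infile call.wav > call.hyp
      ```
      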

      It is also better to use a continuous model rather than a PTM model. Continuous models are slower but more accurate.

      MLLR adaptation of the continuous model, followed by MAP adaptation, should give you enough accuracy, provided your adaptation transcription is correct.
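      The MLLR-then-MAP pipeline follows the CMUSphinx adaptation tutorial and looks roughly like this (a sketch; `adapt.fileids`, `adapt.transcription`, and the `en-us` / `en-us-adapt` model directories are placeholders, and exact flags depend on your SphinxTrain version):

      ```shell
      # 1. Extract acoustic features for the adaptation recordings.
      sphinx_fe -argfile en-us/feat.params -samprate 16000 \
          -c adapt.fileids -di . -do . -ei wav -eo mfc -mswav yes

      # 2. Collect observation counts against the adaptation transcription.
      bw -hmmdir en-us -moddeffn en-us/mdef.txt -ts2cbfn .cont. \
          -feat 1s_c_d_dd -cmn current -agc none \
          -dictfn cmudict-en-us.dict -ctlfn adapt.fileids \
          -lsnfn adapt.transcription -accumdir .

      # 3. Estimate an MLLR transform from the counts...
      mllr_solve -meanfn en-us/means -varfn en-us/variances \
          -outmllrfn mllr_matrix -accumdir .

      # 4. ...then run MAP adaptation on a copy of the model.
      cp -a en-us en-us-adapt
      map_adapt -moddeffn en-us/mdef.txt -ts2cbfn .cont. \
          -meanfn en-us/means -varfn en-us/variances \
          -mixwfn en-us/mixture_weights -tmatfn en-us/transition_matrices \
          -accumdir . \
          -mapmeanfn en-us-adapt/means -mapvarfn en-us-adapt/variances \
          -mapmixwfn en-us-adapt/mixture_weights \
          -maptmatfn en-us-adapt/transition_matrices
      ```

      The key point for your case is step 2: `bw` aligns the audio against the transcription, so a wrong transcript corrupts the adaptation.
      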

      You need to train a custom language model from your transcripts, as described in http://cmusphinx.sourceforge.net/wiki/tutoriallmadvanced
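      With the CMUCLMTK tools, building such a model from a plain-text transcript corpus is a short pipeline (a sketch; `corpus.txt` is a placeholder for your own transcripts, one utterance per line):

      ```shell
      # Count word frequencies and derive a vocabulary from the transcripts.
      text2wfreq < corpus.txt | wfreq2vocab > corpus.vocab

      # Convert the corpus to id n-grams, then estimate an ARPA-format LM
      # that pocketsphinx can load with -lm corpus.lm.
      text2idngram -vocab corpus.vocab -idngram corpus.idngram < corpus.txt
      idngram2lm -vocab_type 0 -idngram corpus.idngram \
          -vocab corpus.vocab -arpa corpus.lm
      ```
      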

      If you go through everything else step by step, it should work fine too.

      Designing a call center transcription system can be quite complicated. You might want to use agent-adapted models, or a fairly accurate generic model such as a DNN model, which has less adaptation capability; it depends on the amount of transcribed speech available per agent. You might also need to adapt to a particular accent — our models work best for US English and worse for other accents.

       
