Dear Team,
I have built an Urdu speech recognition model that gives very good accuracy when the decoder is fed the training recordings or microphone input: 50% SER and 21% WER.
The real call center recordings are 16-bit, 8 kHz, mono, while the model was trained on 16-bit, 16 kHz, mono audio. I thought the sampling-rate mismatch was why it could not decode the recordings, so I built a new model at 8 kHz, but the accuracy is still very poor.
I have 44 utterances and 15 speakers.
I have been stuck on this issue for a long time. Please help.
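One quick way to catch this kind of mismatch is to inspect the headers of both the training files and the call center files before decoding. This is a minimal sketch for uncompressed PCM WAV files, using only Python's standard `wave` module; the file paths in the comments are placeholders:

```python
import wave

def wav_format(path):
    """Return (sample_rate_hz, bit_depth, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as w:
        return (w.getframerate(), w.getsampwidth() * 8, w.getnchannels())

# Example (placeholder paths): compare a training file against a call
# center file. The acoustic model must match the audio it is fed.
# print(wav_format("train/utt01.wav"))
# print(wav_format("callcenter/call01.wav"))
```

If the two tuples differ (e.g. 16000 Hz vs. 8000 Hz), the model and the test audio are mismatched, and accuracy will collapse no matter how small the vocabulary is.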
A call center model requires a lot of data; I don't think you have enough.
Your early response will be highly appreciated.
Dear,
Please have a look at my trained model. I understand that I have little data. Could that be the only reason it cannot decode speaker-independent recordings? What data size do you suggest in this case?
When I record myself saying words from the trained vocabulary, it decodes them correctly, but the same words in the call center calls are not recognized. I don't know why.
Really appreciate your help.
Thanks.
You need around 1000 hours of call center training recordings. If you don't have that, your model will not work accurately.
Dear, I only need to recognize a specific vocabulary of 44 sentences. Does that also require 1000 hours to work?
Model attached.
Yes
Dear Nickolay,
I am in the process of adding more data for training. However, I just found that the recordings I am using for training are 16 kHz, 16-bit signed-integer PCM, while the recordings used for testing are 8 kHz, GSM-encoded. See attached. I have converted them to PCM in order to decode them with Sphinx, but the accuracy is still very bad. If I provide recordings in the same format as the training data (16 kHz, 16-bit signed-integer PCM), the results are still good.
May I know what the issues with GSM encoding are, and how I can tackle them?
Thank you.
Regards.
You have to convert both the training and the test data to 8 kHz, 16-bit PCM before training and testing, and retrain the model on that format.
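To make that concrete: in practice a tool such as sox or ffmpeg handles the GSM-to-PCM transcoding and resampling, but the 16 kHz → 8 kHz step itself can be sketched in plain Python as naive 2:1 decimation. This is a sketch only, with placeholder file names, assuming mono 16-bit PCM input; a real pipeline should low-pass filter before decimating (or just use sox/ffmpeg) to avoid aliasing:

```python
import wave

def downsample_16k_to_8k(src_path, dst_path):
    """Naively decimate a 16 kHz, mono, 16-bit PCM WAV to 8 kHz.

    Sketch only: keeps every other sample with no anti-alias filter.
    """
    with wave.open(src_path, "rb") as src:
        assert src.getframerate() == 16000
        assert src.getnchannels() == 1 and src.getsampwidth() == 2
        frames = src.readframes(src.getnframes())
    # Each frame is one 16-bit sample (2 bytes); keep every other frame.
    out = b"".join(frames[i:i + 2] for i in range(0, len(frames), 4))
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(8000)
        dst.writeframes(out)
```

Once both the training and the test sets are in the same 8 kHz, 16-bit PCM format, the acoustic model sees consistent features on both sides, which is the point of the advice above.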