Dear Team,
I have built an Urdu speech recognition model that gives very good accuracy when the decoder is fed the training recordings or microphone input: 50% SER and 21% WER.
The real call center recordings are 16-bit, 8 kHz, mono, while the model was trained on 16-bit, 16 kHz, mono audio. I thought the sampling-rate mismatch was why it could not decode the recordings, so I built a new model at 8 kHz, but the accuracy is still very poor.
I have 44 utterances and 15 speakers.
I have been stuck on this issue for a long time. Please help.
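One quick way to catch this kind of mismatch is to inspect the headers of both the training files and the call center files before decoding. This is a minimal sketch for uncompressed PCM WAV files, using only Python's standard `wave` module; the file paths in the comments are placeholders:

```python
import wave

def wav_format(path):
    """Return (sample_rate_hz, bit_depth, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as w:
        return (w.getframerate(), w.getsampwidth() * 8, w.getnchannels())

# Example (placeholder paths): compare a training file against a call
# center file. The acoustic model must match the audio it is fed.
# print(wav_format("train/utt01.wav"))
# print(wav_format("callcenter/call01.wav"))
```

If the two tuples differ (e.g. 16000 Hz vs. 8000 Hz), the model and the test audio are mismatched, and accuracy will collapse no matter how small the vocabulary is.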
A call center model requires a lot of data; I don't think you have enough.
Your early response will be highly appreciated.
Dear,
Please have a look at my trained model. I understand that I have little data. Could that be the only reason it cannot decode speaker-independent recordings? What data size do you suggest in this case?
When I record myself saying words from the trained vocabulary, it decodes them correctly, but the same words in the call center calls are not recognized. I don't know why.
Really appreciate your help.
Thanks.
You need around 1000 hours of call center training recordings. If you don't have that, your model will not work accurately.
Dear, I only need to recognize a specific vocabulary of 44 sentences. Does that also require 1000 hours to work?
Model attached.
Yes
Dear Nickolay,
I am in the process of adding more data for training. However, I just found that the recordings I am using for training are 16 kHz, 16-bit signed-integer PCM, while the recordings used for testing are 8 kHz, GSM-encoded. See attached. I have converted them to PCM in order to decode them with Sphinx, but the accuracy is still very bad. If I provide recordings in the same format as the training data (16 kHz, 16-bit signed-integer PCM), the results are still good.
May I know what the issues with GSM encoding are, and how I can tackle them?
Thank you.
Regards.
You have to convert both the training and the test data to 8 kHz, 16-bit PCM before training and testing, and retrain the model on that format.
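To make that concrete: in practice a tool such as sox or ffmpeg handles the GSM-to-PCM transcoding and resampling, but the 16 kHz → 8 kHz step itself can be sketched in plain Python as naive 2:1 decimation. This is a sketch only, with placeholder file names, assuming mono 16-bit PCM input; a real pipeline should low-pass filter before decimating (or just use sox/ffmpeg) to avoid aliasing:

```python
import wave

def downsample_16k_to_8k(src_path, dst_path):
    """Naively decimate a 16 kHz, mono, 16-bit PCM WAV to 8 kHz.

    Sketch only: keeps every other sample with no anti-alias filter.
    """
    with wave.open(src_path, "rb") as src:
        assert src.getframerate() == 16000
        assert src.getnchannels() == 1 and src.getsampwidth() == 2
        frames = src.readframes(src.getnframes())
    # Each frame is one 16-bit sample (2 bytes); keep every other frame.
    out = b"".join(frames[i:i + 2] for i in range(0, len(frames), 4))
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(8000)
        dst.writeframes(out)
```

Once both the training and the test sets are in the same 8 kHz, 16-bit PCM format, the acoustic model sees consistent features on both sides, which is the point of the advice above.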