I'm using Sphinx for a mid-sized-vocabulary (around 2000 words) dictation application. The speech is 8 kHz telephone audio, and I've adapted the default Communicator model to my data using around 6000 utterances (each ~5 seconds long).
I'm getting reasonably good results at the moment (around 40% WER) on an independent test set of 500 utterances. However, I noticed that in almost every test sample the decoder decodes the very first word incorrectly. Most of the recordings (both train and test) have no silence at all before the first word is spoken. Could this be the reason for the problem, or could it be something else?
Any suggestions for fixing this problem and/or further increasing the accuracy of my system would be highly appreciated!
Thanks,
Nishan
It could be that, or it could be something else. In general, the decoder needs a region of about 0.2 seconds of silence before the first word.
It is hard to help you without the data in hand; there are too many possible issues.
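If it helps, here is a minimal sketch of how you could test this, using only the Python standard library. It assumes your recordings are linear PCM WAV files; the file names and the prepend_silence helper are placeholders, not anything from Sphinx itself.

```python
import wave

# Hypothetical helper (not part of Sphinx): prepend ~0.2 s of digital silence
# to a linear PCM WAV file so the decoder sees a silence region before the
# first word. Assumption: samples are linear PCM; for mu-law/A-law telephone
# recordings, all-zero bytes are NOT digital silence.
def prepend_silence(in_path, out_path, silence_sec=0.2):
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    # Number of silent frames = sample rate * duration; each frame is
    # sampwidth * nchannels bytes of zeros.
    n_pad = int(params.framerate * silence_sec)
    silence = b"\x00" * (params.sampwidth * params.nchannels * n_pad)

    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(silence + frames)

# Example: pad one 8 kHz test utterance before decoding (file names are
# placeholders).
prepend_silence("utt001.wav", "utt001_padded.wav", silence_sec=0.2)
```

sox can do the same from the command line with its pad effect. Re-running your test set on padded audio should show fairly quickly whether the missing leading silence is really what trips up the first word.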