We are tuning our CMU Sphinx model for a new language. Our model is getting a word error rate of a little over 20%. I'm wondering whether it is common for Sphinx models to be especially put off by clicks and pops in the audio file and, if so, whether anyone has experience of how best to fix this.
I just ran our model over about 200k short sentences that we have collected, using pocketsphinx in batch mode. About 7% of these cases ended up with a totally blank transcription.
Anecdotally, it looks like the blank transcriptions tend to be the ones where there is a click or pop in the audio recording, even though the rest of the audio sounds just fine to the human ear (and was in many cases 'approved' by a manual approval process).
Is this a problem that anyone has encountered before and if so what is a good way to deal with it? Can we tweak the audio parameters or do we want to pre-process the audio before giving it to sphinx?
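To move beyond the anecdotal impression, here is a minimal sketch of how we might flag click-like recordings before decoding, so the flags can be compared against the list of blank transcriptions. It assumes 16-bit mono WAV input. The function names and the `ratio` and `floor` thresholds are hypothetical and untuned, and a large sample-to-sample jump is only a rough proxy for a click, since plosives in normal speech can also trip it.

```python
import statistics
import struct
import wave


def read_mono_16bit(path):
    """Read a 16-bit mono WAV file into a list of integer samples."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono audio")
        raw = wav.readframes(wav.getnframes())
    return list(struct.unpack("<%dh" % (len(raw) // 2), raw))


def find_clicks(samples, ratio=20.0, floor=1000):
    """Return indices where the jump from the previous sample is unusually large.

    ratio and floor are hypothetical thresholds, not tuned values: a jump is
    flagged when it exceeds both `floor` and `ratio` times the median jump.
    """
    diffs = [abs(b - a) for a, b in zip(samples, samples[1:])]
    if not diffs:
        return []
    typical = statistics.median(diffs)
    threshold = max(floor, ratio * typical)
    return [i + 1 for i, d in enumerate(diffs) if d > threshold]
```

Running `find_clicks(read_mono_16bit(path))` over each file and cross-tabulating "has flagged samples" against "blank transcription" would show whether the two really go together.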
--
Also (possibly related): Nickolay, forgive me if I am wrong, but I think you once suggested that the default tmit parameters are not great defaults for good-quality recordings of human speech. I can't find the page just now, but would you say that holds true? We're keen to tweak parameters a little if it might give us a percent or two of improvement.
Thanks again for all the help!
If you want accuracy, you'd better use the Kaldi toolkit rather than tweak pocketsphinx.