Is there a recommended energy threshold required for sphinx speech recognition to work effectively (specifically for pocketsphinx_continuous, sphinx3_align and sphinx_pitch)?
I'm not sure what energy threshold you are asking about; there is no such thing in pocketsphinx.
Thanks for the response. I mean: is there a recommended threshold (e.g., an SNR value or a signal amplitude value) for an input speech file, below which pocketsphinx or sphinx3_align is known to struggle to pick up the words? We have noticed that pocketsphinx does not "recognize" words whose energy values in the speech are "low". Any recommendation in this regard would be very helpful.
@Serotonergic
Amplitude is effectively normalized during CMN (cepstral mean
normalization), so it shouldn't be a problem as long as the SNR is good.
Assuming your training data is clean speech, the higher the SNR, the better.
Relevant threads:
http://sourceforge.net/p/cmusphinx/mailman/message/31705208/
http://sourceforge.net/p/cmusphinx/discussion/speech-recognition/thread/d3d10c7e/
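To illustrate the point about CMN, here is a minimal sketch of why a constant gain drops out once the per-utterance mean of each cepstral coefficient is subtracted. It is not the Sphinx front end: there is no mel filterbank, pre-emphasis or windowing, and the function names, the frame length and the 0.1 gain are all assumptions made for the example. Scaling the signal by a gain g adds 2*ln(g) to every log-spectrum bin, which only shifts c0 by a constant, and the mean subtraction removes that constant.

```python
import cmath
import math
import random

def log_power_spectrum(frame):
    """Log power spectrum of one frame via a naive DFT (bins 0..N/2)."""
    n = len(frame)
    bins = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))
        # The tiny floor only guards against log(0).
        bins.append(math.log(max(abs(s) ** 2, 1e-300)))
    return bins

def cepstrum(frame, n_ceps=8):
    """Simplified cepstrum: DCT-II of the log power spectrum."""
    lps = log_power_spectrum(frame)
    m = len(lps)
    return [sum(v * math.cos(math.pi * q * (i + 0.5) / m)
                for i, v in enumerate(lps))
            for q in range(n_ceps)]

def cmn(cepstra):
    """Subtract the per-utterance mean from every cepstral dimension."""
    n = len(cepstra)
    dims = len(cepstra[0])
    means = [sum(f[d] for f in cepstra) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in cepstra]

def features(signal, frame_len=32):
    """Cut the signal into non-overlapping frames and apply CMN."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return cmn([cepstrum(f) for f in frames])

random.seed(0)
loud = [random.uniform(-0.5, 0.5) for _ in range(320)]
quiet = [0.1 * x for x in loud]  # same signal, 20 dB lower

# Before CMN, c0 of the first frame differs by a gain-dependent constant.
raw_shift = cepstrum(quiet[:32])[0] - cepstrum(loud[:32])[0]

# After CMN, the two feature streams agree up to rounding error.
max_diff = max(abs(a - b)
               for fa, fb in zip(features(loud), features(quiet))
               for a, b in zip(fa, fb))
print("c0 shift before CMN:", raw_shift)
print("largest difference after CMN:", max_diff)
```

Note that this only covers a pure gain change on an otherwise identical signal. If the quiet recording also has a worse SNR (for example because of quantization or background noise), the features will differ, which is why SNR rather than amplitude is what matters.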
Last edit: Pranav Jawale 2014-07-04
Thanks, Pranav. From the second post, I would be interested to know what the recommended "high" and "low" values are. Yes, my data is clean.
Nothing is recommended (AFAIK), apart from avoiding clipping. As you can
see, even in WSJ, which is a standard database, recordings were made at
amplitudes as low as 0.05 (and there is no noise there).
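If it helps, a quick way to check a file for clipping and peak level is sketched below. It assumes 16-bit little-endian PCM and that an amplitude such as 0.05 is measured on a scale where full scale is 1.0. The helper names and the 0.999 clip level are made up for the example and are not part of any Sphinx tool.

```python
import math
import struct

def pcm16_to_float(raw):
    """Convert 16-bit little-endian PCM bytes to floats in [-1.0, 1.0)."""
    n = len(raw) // 2
    return [v / 32768.0 for v in struct.unpack("<%dh" % n, raw[:2 * n])]

def level_report(samples, clip_level=0.999):
    """Return the peak amplitude and the fraction of samples at or above clip_level."""
    peak = max(abs(s) for s in samples)
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    return {"peak": peak, "clipped_fraction": clipped / len(samples)}

# Synthetic one-second 440 Hz tones at 16 kHz stand in for real recordings.
rate = 16000
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]

quiet = [0.05 * s for s in tone]                    # low level, not clipped
hot = [max(-1.0, min(1.0, 1.5 * s)) for s in tone]  # overdriven, hard-clipped

quiet_report = level_report(quiet)
hot_report = level_report(hot)
print("quiet:", quiet_report)
print("hot:  ", hot_report)
```

For a real file, the bytes returned by `readframes()` from Python's standard `wave` module can be passed to `pcm16_to_float` for mono 16-bit audio. A non-zero clipped fraction is the thing to worry about. A low peak on its own is not, provided the noise floor is low as well.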