I am a real newbie when it comes to ASR. To give you as much data as possible, here is the long story: I was given a topic for my Bachelor thesis. I have to provide an acoustic method to detect "turmoil-like" situations. Part of the topic is to detect a few keywords like "police", "help", or "fire". The keyword spotting takes place inside and near local public transportation (a subway). Due to the noisy environment, I am not sure whether it is possible to detect keywords without clear gaps between the words, even if I use a method to reduce the background noise. And it has to run on a Raspberry Pi.
I played around a bit with PocketSphinx: I created my dictionary and a simple JSGF grammar, and used the acoustic model from VoxForge. The WER was OK. My biggest problem was that different words, or plain noise, were often detected as "fire", "police", or "help", even with a filler dictionary.
With everything I have read so far, I am not sure whether this is feasible for me, meaning in an adequate amount of time and with a satisfactory result.
Now to my questions:
How do I implement noise reduction?
Would a specific acoustic model reduce the number of cases where noise is detected as "fire"/"police"/"help"?
Would a specific acoustic model reduce my WER?
Does keyword spotting require gaps between the words?
Is there a golden road to my goal?
Kind regards
Last edit: ArcadeBit 2014-05-26
Due to the noisy environment, I am not sure whether it is possible to detect keywords without clear gaps between the words.
It is possible.
Part of the topic is to detect a few keywords like "police", "help", or "fire".
For reliable detection a keyword should have at least three syllables; "fire" is too short to serve as a keyword.
I played around a bit with PocketSphinx: I created my dictionary and a simple JSGF grammar, and used the acoustic model from VoxForge.
For keyword spotting there is a dedicated keyword-spotting search mode, enabled with the "-kws" option. There is also an option, "-kws_threshold", to tune the trade-off between detection rate and false-alarm rate.
The VoxForge model is too inaccurate. Our most accurate model is the generic en-US acoustic model.
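As a sketch of how the kws search mode is driven (the file name, phrases, thresholds, and model path below are illustrative choices, not fixed values): a keyword list file holds one phrase per line with an optional per-keyword threshold between slashes, and is passed to the decoder with "-kws". Multi-word phrases also help with the syllable-count advice above, since "fire" alone is too short.

```shell
# Write an example keyword list (phrases/thresholds are illustrative;
# tune each threshold on a test set to balance misses vs. false alarms)
cat > keywords.list <<'EOF'
help me please /1e-20/
call the police /1e-30/
EOF

# Decoding would then use the kws search instead of a JSGF grammar, e.g.:
#   pocketsphinx_continuous -inmic yes -kws keywords.list \
#       -hmm /path/to/en-us    # point -hmm at the en-US acoustic model
cat keywords.list
```

A lower threshold (e.g. 1e-40) makes a keyword easier to trigger but raises false alarms; sweep it per keyword on held-out data.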
How do I implement noise reduction?
Noise reduction is already implemented in the development version in the Subversion trunk.
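The built-in noise suppression runs automatically in the feature front end, so there is normally nothing to implement yourself. Purely for intuition, here is a minimal spectral-subtraction sketch in NumPy; this is a generic textbook technique, not sphinxbase's actual algorithm, and the frame length, noise-estimation window, and oversubtraction factor are arbitrary choices:

```python
import numpy as np

def spectral_subtract(x, frame_len=256, noise_frames=10, alpha=2.0):
    """Generic spectral subtraction: estimate the noise magnitude spectrum
    from the first `noise_frames` frames (assumed speech-free), subtract it
    from every frame, and keep a small spectral floor to avoid negatives."""
    n = len(x) // frame_len * frame_len
    frames = x[:n].reshape(-1, frame_len)
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:noise_frames].mean(axis=0)          # noise estimate
    clean_mag = np.maximum(mag - alpha * noise_mag, 0.05 * mag)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    return clean.ravel()

# Toy signal: 10 frames of noise only, then a 440 Hz tone plus noise
rng = np.random.default_rng(0)
noisy = 0.3 * rng.standard_normal(16000)
t = np.arange(16000 - 2560) / 16000.0
noisy[2560:] += np.sin(2 * np.pi * 440.0 * t)
denoised = spectral_subtract(noisy)
# Energy in the noise-only region should drop substantially
print(np.sum(denoised[:2560] ** 2) < np.sum(noisy[:2560] ** 2))  # True
```

Real front ends add overlapping windows and smoother noise tracking; this only shows the core subtract-and-floor idea.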
Does keyword spotting require gaps between the words?
No, it works on a continuous stream too.
Is there a golden road to my goal?
Create a test set and evaluate on it to find the best operating point.
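Concretely: with reference keyword occurrences (from your transcriptions) and the spotter's detections as (keyword, time) pairs, a small scorer can count hits, misses, and false alarms at each "-kws_threshold" setting. The helper below and its one-second matching tolerance are my own illustrative choices:

```python
def score_kws(reference, detected, tol=1.0):
    """Score a keyword spotter: a detection counts as a hit if the same
    keyword occurs in the reference within `tol` seconds; each reference
    occurrence can be matched at most once."""
    matched = set()
    hits = 0
    for kw, t in detected:
        for i, (rkw, rt) in enumerate(reference):
            if i not in matched and rkw == kw and abs(rt - t) <= tol:
                matched.add(i)
                hits += 1
                break
    misses = len(reference) - hits
    false_alarms = len(detected) - hits
    return hits, misses, false_alarms

ref = [("police", 3.2), ("help", 10.5), ("police", 42.0)]
det = [("police", 3.4), ("police", 20.0), ("help", 10.9)]
print(score_kws(ref, det))  # (2, 1, 1)
```

Sweeping the threshold and plotting misses against false alarms per hour then lets you pick the operating point that suits the alarm application.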
Hello,
About three years after the original post... In another post about noise robustness, you recommend adding the following to sphinx_train.cfg:
~~~~~~~~~~~~
$CFG_WAVFILE_SRATE = 16000.0;
$CFG_NUM_FILT = 25; # For wideband speech it is 25; for 8 kHz telephone speech a reasonable value is 15
$CFG_LO_FILT = 130; # For 8 kHz telephone speech a reasonable value is 200
$CFG_HI_FILT = 6800; # For 8 kHz telephone speech a reasonable value is 3500
$CFG_TRANSFORM = "dct"; # Previously the legacy transform was used, but dct is more accurate
$CFG_LIFTER = "22"; # Cepstral liftering smooths the cepstrum to improve recognition
$CFG_VECTOR_LENGTH = 13; # 13 is usually enough
~~~~~~~~~~~~~~
Is this a general recommendation for noisy speech?
The default feature set is 1s_c_d_dd. Would you recommend a different feature set for noisy input? Where can I read about the naming of feature sets?
Thanks,
Yuval
Is this a general recommendation for noisy speech?
It is simply the default.
Would you recommend a different feature set for noisy input?
No.
Where can I read about the naming of feature sets?
In the source code.
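For what it is worth, the name itself encodes the layout: "1s" means a single feature stream, and "c", "d", "dd" stand for the cepstral coefficients, their deltas (first differences over time), and their delta-deltas, concatenated into one 39-dimensional vector when $CFG_VECTOR_LENGTH is 13. A sketch of that concatenation follows; the +/-2 frame difference window is a common choice, not necessarily sphinxbase's exact formula:

```python
import numpy as np

def add_deltas(cepstra, w=2):
    """Append delta and delta-delta features to a (frames x 13) cepstrum
    matrix, giving the 39-dim 1s_c_d_dd layout. Deltas are simple +/-w
    frame differences; edges are handled by repeating the boundary frame."""
    padded = np.pad(cepstra, ((w, w), (0, 0)), mode="edge")
    delta = (padded[2 * w:] - padded[:-2 * w]) / (2.0 * w)
    pd = np.pad(delta, ((w, w), (0, 0)), mode="edge")
    delta2 = (pd[2 * w:] - pd[:-2 * w]) / (2.0 * w)
    return np.hstack([cepstra, delta, delta2])

feats = add_deltas(np.random.randn(100, 13))
print(feats.shape)  # (100, 39)
```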