Hi All,
I'm trying to write a 'kind of' speech recognizer, and it works great if I'm
right up on the microphone during the recognition phase, but if I move away
the recognition gets gradually worse.
I've tried CMN, but because the utterances are short (sometimes a single
vowel) it seems to absolutely kill the recognition, and when I view a visual
representation of the data it destroys the relationship between the MFCC
values within a single frame.
I have been reading about RASTA processing, but I can't find any easy-to-read
documentation; it's all pretty heavy. Does RASTA work on the spectral
information from the FFT of the original sample, or is it done on the MFCCs?
Could this be a better approach? Does anyone have any source code or a
layman's explanation of how to implement this?
Thanks
Ray
Model adaptation alone could work here. There are also channel compensation
schemes for long-distance recognition.
I've tried CMN, but because the utterances are short (sometimes a single
vowel) it seems to absolutely kill the recognition, and when I view a visual
representation of the data
You can try to share CMN values across utterances.
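To make that concrete, here is a minimal sketch (Python/NumPy, just an
illustration; the class name and the smoothing factor are my own choices) of a
cepstral mean that is carried over from utterance to utterance instead of
being re-estimated from each short utterance alone:

import numpy as np

class SharedCMN:
    """CMN with a mean shared across utterances.

    A single short utterance (e.g. one vowel) gives a useless mean
    estimate; an exponentially weighted running mean lets many
    utterances contribute, so each new one only nudges it.
    """

    def __init__(self, num_ceps, alpha=0.995):
        self.mean = np.zeros(num_ceps)  # one running mean per MFCC coefficient
        self.alpha = alpha              # closer to 1.0 = slower adaptation
        self.seen_any = False

    def normalize(self, mfccs):
        """mfccs: (num_frames, num_ceps) array; returns a normalized copy."""
        if not self.seen_any:
            # bootstrap from the first utterance we see
            self.mean = mfccs.mean(axis=0)
            self.seen_any = True
        else:
            # fold each frame into the running mean
            for frame in mfccs:
                self.mean = self.alpha * self.mean + (1.0 - self.alpha) * frame
        return mfccs - self.mean

With alpha near 1.0 the mean tracks the channel slowly, so a single vowel no
longer wipes out its own evidence the way a per-utterance mean does.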
or is it done on the MFCCs?
There are different types; the core idea is to apply a band-pass filter to the
feature trajectories, which can be done in various domains (log-spectral or
cepstral).
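To make the filtering idea concrete, here is a rough sketch (Python/SciPy;
the coefficients are the classic Hermansky-Morgan RASTA filter, though
implementations vary, e.g. some use a pole of 0.94 instead of 0.98, and real
implementations also handle the filter's startup transient on the first few
frames) that band-pass filters each log filterbank channel along time, i.e.
before the DCT that produces the MFCCs:

import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spec):
    """Apply the classic RASTA band-pass filter along the time axis.

    log_spec: (num_frames, num_bands) array of log filterbank energies.
    Returns an array of the same shape.
    """
    # Numerator: a 5-frame regression (slope) filter; denominator: a
    # leaky integrator. Together they suppress the near-constant
    # channel component and very fast frame-to-frame changes, keeping
    # the speech-rate modulations in between.
    numer = np.array([0.2, 0.1, 0.0, -0.1, -0.2])
    denom = np.array([1.0, -0.98])
    return lfilter(numer, denom, log_spec, axis=0)

Since the DCT that turns log spectra into MFCCs is linear, the same filter can
equivalently be run on the cepstral trajectories. So the answer to the "FFT or
MFCC" question is that it is the frame-to-frame trajectories that get
filtered, and that can be done in either domain.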
Does anyone have any source code or a layman's explanation of how to implement
this?
RASTA sources can be found here:
http://www.icsi.berkeley.edu/~dpwe/projects/sprach/sprachcore.html
Thank you for the quick reply. I'll take a look at the RASTA source code...
code is always easier for me to understand than formulas. ;-)