Hi All,
Does anyone know if there is an alternative to CMS (or CMN) for short
utterances? I have been doing some testing, and it works great for
sections 2+ seconds in length, but for shorter 0.5-1 second utterances it
really distorts the signal.
Thanks
Ray
Actually, I'm starting to wonder now if I'm doing my CMN correctly. I have the
mean calculated individually for each coefficient over time; is that correct?
So, my (pseudo) code reads...
loop through each mfcc (1-12)
    mean = 0
    loop through each frame
        mean = mean + {mfcc value for frame}
    end loop
    mean = mean / {frame count}
    loop through each frame
        {mfcc value for frame} = {mfcc value for frame} - mean
    end loop
end loop
should it really be...
mean = 0
loop through each mfcc (1-12)
    loop through each frame
        mean = mean + {mfcc value for frame}
    end loop
end loop
mean = mean / ({frame count} * {mfcc count})
loop through each mfcc (1-12)
    loop through each frame
        {mfcc value for frame} = {mfcc value for frame} - mean
    end loop
end loop
Otherwise it doesn't seem to make sense that I'm losing the relative strength
of each MFCC value compared to the others within the same frame.
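In NumPy terms, the first variant can be sketched like this (just a sketch; the layout of frames as rows and coefficients as columns is an assumption for illustration):

```python
import numpy as np

def cmn(mfcc):
    """Cepstral mean normalization: subtract each coefficient's
    mean over time. Rows are frames, columns are coefficients."""
    mfcc = np.asarray(mfcc, dtype=float)
    return mfcc - mfcc.mean(axis=0)   # one mean per coefficient

# Two frames, three coefficients: every column ends up zero-mean.
frames = np.array([[28.0, -10.0, 2.0],
                   [30.0,  -8.0, 0.0]])
normalized = cmn(frames)
```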
Thanks
Ray
This is correct.
No.
You'll be interested to read the following papers:
Reducing the Effects of Linear Channel Distortion on Continuous Speech
Recognition (1996)
by Rebecca Anne Bates, Mari Ostendorf, J. Robin Rohlicek, and William C. Karl
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.9450
and
The Use Of Cepstral Means In Conversational Speech Recognition (1997)
by Martin Westphal
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.2342
Thank you very much for your reply! So here's my problem. If I am working on a
simple utterance of the word 'ah!', it's very heavy in one MFCC coefficient (a
value of around 28 in all frames), but quite low (between -10 and +2) in all
the other coefficients. If I use this technique on each MFCC coefficient
individually, I lose the information about the relative strength of the
coefficients to each other.
Is this a common problem, or am I going crazy? Is there another technique to
use on this kind of short, single-vowel phrase?
Thank you!
This is not a problem. The problem is to estimate the cepstral mean for a
speaker/channel using just a short sample. Please read the papers first.
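One way to make that estimate more robust for a short sample (a hypothetical sketch, not taken verbatim from the papers; the `tau` parameter and the weighting scheme are illustrative) is to interpolate between a global mean and the utterance's own mean according to how many frames were observed:

```python
import numpy as np

def smoothed_mean(utterance, global_mean, tau=100.0):
    """Blend the utterance's cepstral mean with a global mean.
    With few frames the global mean dominates; as the frame
    count n grows past tau, the utterance mean takes over."""
    utterance = np.asarray(utterance, dtype=float)
    n = utterance.shape[0]                 # number of frames
    w = n / (n + tau)                      # weight on the utterance mean
    return w * utterance.mean(axis=0) + (1.0 - w) * np.asarray(global_mean)

# A 2-frame utterance barely moves the estimate off the global mean.
utt = np.array([[28.0, -10.0],
                [30.0,  -8.0]])
m = smoothed_mean(utt, global_mean=np.zeros(2))
```

Subtracting `m` instead of the raw 2-frame mean avoids the heavy distortion on very short utterances.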
Aaahh... now I see the quote in the second paper. So by that design, couldn't
I keep a running average of the cepstral means for the last 'n' utterances
until 'n' reaches a point of diminishing returns?
This is just one of the methods; it's better than the default, though.
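A running average like the one proposed above could be sketched as an exponential moving average of per-utterance means (the function name and the `alpha` value are illustrative, not tuned):

```python
import numpy as np

def update_running_mean(running_mean, utterance, alpha=0.9):
    """Exponentially weighted running cepstral mean: keep most of
    the previous estimate and blend in this utterance's mean."""
    utt_mean = np.asarray(utterance, dtype=float).mean(axis=0)
    if running_mean is None:               # first utterance seen
        return utt_mean
    return alpha * np.asarray(running_mean) + (1.0 - alpha) * utt_mean

# Feed two single-frame utterances; the second only nudges the mean.
m = None
for utt in (np.array([[28.0, -10.0]]), np.array([[30.0, -8.0]])):
    m = update_running_mean(m, utt)
```

Each new utterance would then be normalized by subtracting the current running estimate rather than its own short-sample mean.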