Hello,
(I'm French, so don't judge me on my English...)
I'm using Sphinx 3.5 on Win32. When I run sphinx3-test.bat (or something like that) with a control file containing these two lines:
pittsburgh.littlendian
pittsburgh.littlendian
The first result is correct: P I T T S B U R G H
but the second result is a little bit different!
I don't understand that!
Is this a bug, or is it normal?
Thanks
Anonymous, 2005-04-12:
This is probably normal, and due to the state of the cepstral mean normalizer. There are two ways that cepstral mean normalization is done:
(1) the entire utterance is processed, the mean cepstrum is computed, and then the cepstral features are normalized by subtracting the mean.
(2) as each frame (or block of frames) is processed, it is normalized by subtracting the current estimated cepstral mean, which is updated periodically. The cepstral mean is usually preserved between utterances.
If the recognizer is operating in batch mode, either method can be used, but if the recognizer is operating on "live" utterances, the first is impractical, and only the second is employed.
sphinx3-test.bat uses the livepretend.exe decoder, so I suspect that it is operating in live-mode and using method (2). If this is true, then the state of the cepstral mean normalizer (CMN) is different at the beginning of processing the two identical utterances:
- when processing utterance #1, the CMN contains a zero mean.
- when processing utterance #2, the CMN contains the estimated mean from the first utterance.
Therefore the normalized features will be different in the second utterance compared to the first, and we may expect that the recognition results may be slightly different (the words may be the same, but the scores and word and phone boundaries will be different).
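For concreteness, here is a toy Python sketch of the two methods; it is not Sphinx's actual code. The per-frame update in `cmn_live` stands in for Sphinx's periodic block-wise update, and the carried-over `mean`/`count` state is exactly what makes a second pass over the same file come out differently:

```python
def cmn_batch(cepstra):
    """Method (1): normalize a whole utterance by its own mean cepstrum."""
    dim = len(cepstra[0])
    mean = [sum(f[d] for f in cepstra) / len(cepstra) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in cepstra]

def cmn_live(cepstra, mean, count):
    """Method (2): subtract the current running mean from each frame,
    then update the mean. The (mean, count) state is returned so it can
    be carried over into the next utterance, as in live decoding."""
    out = []
    for frame in cepstra:
        out.append([x - m for x, m in zip(frame, mean)])
        count += 1
        # Incremental update of the running mean.
        mean = [m + (x - m) / count for x, m in zip(frame, mean)]
    return out, mean, count
```

Feeding the same utterance through `cmn_live` twice, threading the state through, yields different features the second time, which is the behavior described above.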
cheers,
jerry
First of all, thanks for your quick answer!
The .bat uses livepretend.exe, yes.
So if I want to get the same result for the two identical files, do I have to reset the CMN to zero?
In fact, I made a little program:
- record and dump a raw file (5 s)
- decode the raw file
- loop
I saw that the first time, my file is decoded well and I get a very good result, but the second time the performance is horrible!
I have the impression that I can only use the decoder once. That's a little disturbing, because I want to build a voice-command system...
Florent
Anonymous, 2005-04-12:
Florent -- I am surprised to learn that you saw such a dramatic degradation when decoding 5-sec utterances twice in succession. For shorter utterances (less than, say, 2 sec), I think this is a degenerate case that the current CMN algorithms are not prepared to handle. I'll be interested to see what the Sphinx3 folks at CMU have to say.
The motivation for CMN is to compensate for differences in the overall frequency response of the channel, for which the long-term mean of the cepstrum is only an approximation. The problems occur when the recognizer does not have a good estimate of the long-term cepstral mean, which happens (in Sphinx3 and many others) in the very first utterance after initialization, and to some extent with short, isolated utterances.
Consider a different (and more degenerate) experiment, where the utterance is an isolated vowel "aa". During the first recognition, the estimated long-term cepstrum is zero*, which is subtracted from the cepstrum of the signal, leaving it unchanged. If you then attempt to recognize the same utterance again, the estimated cepstral mean is about the same as the cepstrum of the vowel, yielding zero after normalization! In this case, the estimated cepstral mean is a very poor estimate of the long-term mean, and the resulting features won't fit the model at all.
*Actually, the long-term cepstrum is initialized not to all-zeros but to (12.0, 0.0, 0.0, ...), so the C0 term is initialized to an approximation of the log energy.
With longer and more varied utterances, and in the usual case where each utterance is different from the last, this problem should be less severe: after several seconds, the estimated cepstral mean comes closer and closer to the "true" long-term cepstrum, but those first few seconds of speech are being normalized by a poor estimate. This problem is generally ignored, since it vanishes after a few seconds of speech, but it should be possible to start the recognizer with a better or more typical initial cepstral mean vector. In addition, care should be taken with how the cepstral mean is updated during those first few seconds, when there is only a little bit of speech in the system.
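The isolated-vowel thought experiment can be reproduced numerically. This is a toy sketch, not Sphinx's implementation: a stationary "utterance" where every frame has the same cepstrum, a running mean initialized to the Sphinx-style vector (C0 near log energy, the rest zero), and a simple incremental mean update that treats the init vector as one prior observation:

```python
# A stationary "vowel" utterance: every frame has the same 3-dim cepstrum.
vowel = [5.0, -1.0, 0.5]
n_frames = 100

# Running mean starts at the Sphinx-style init; count the init as one
# prior observation so it is gradually washed out by real frames.
mean = [12.0, 0.0, 0.0]
count = 1
for _ in range(n_frames):
    count += 1
    mean = [m + (x - m) / count for x, m in zip(vowel, mean)]

# First pass: frames were normalized against the init vector, so the
# vowel's shape survives largely intact.
first_pass_frame = [x - m for x, m in zip(vowel, [12.0, 0.0, 0.0])]

# Second pass: the carried-over mean has converged to the vowel itself,
# so the same frames now normalize to almost nothing.
second_pass_frame = [x - m for x, m in zip(vowel, mean)]
```

After 100 identical frames the running mean is within a few percent of the vowel's cepstrum, so the second-pass features are nearly all zeros and no longer resemble any model.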
cheers,
jerry
Yeah, I have noticed that the beginning of the speech file is not well decoded:
for my example from the first message (pittsburgh),
the second time the file is decoded I get something like:
C AE O T T S B U R G
The end is totally correct!
Do you think the result will be "correct" if I only start speaking into the microphone 2 or 3 s after the beginning of the recording?
If not, what do you suggest?
Thanks,
Florent
Anonymous, 2005-04-13:
No, waiting a few seconds to speak will not help. The CMN needs several seconds of speech in order to form an effective estimate. (In my previous posting, I should have said "The motivation for CMN is to compensate for differences in the overall frequency response of the channel, for which the long-term mean of the cepstrum OF THE SPEECH SIGNAL is only an approximation.")
The solution, IMHO, lies in making sure that the CMN in the recognizer has a useful mean vector to use at all times, including those first few seconds of input speech immediately after the recognizer is initialized. I do not use Sphinx 3, so I don't know the details of the code, but it could be modified to initialize the cepstral mean vector with different values than (12.0, 0.0, 0.0, ...), perhaps with values extracted from the CMN code after a "long enough" utterance (say > 10 seconds) had been processed.
Note that the long term cepstrum contains information not only about the particular channel, but also about the particular speaker's voice. Therefore you might wonder whether it would be just as bad to initialize the CMN algorithm using a mean cepstrum from Speaker X using channel Y? In my experience, no. It is better to use a cepstral mean vector from some arbitrary speaker/channel than to use all-zeros initialization.
cheers,
jerry
OK!
Thank you very much for all your explanations, I really appreciated it!
I will try to fix my problem, but I'm just a beginner, and it will not be easy even with your expert indications!
At least now I know somebody who knows how it works!
But what system do you use?
Florent