Hi!
I am having the following problem:
If I increase the number of MFCCs, the error rate increases too. For example,
if I use only 3 MFCCs, the sentence error rate is 2%, but if I use 14 MFCCs it
rises to 20%!
Can anybody help me fix this problem? Thanks in advance!
I don't think it's a problem. It's expected behavior.
Why is it expected behavior? As far as I know, the more coefficients you use,
the better the results should be, right?
Why would it be? Given the details you provided, I would expect such a result.
No. It depends on the amount of data, for example, and on many other factors
you didn't take into account.
But I used the same amount of data and the same configuration.
The only setting I changed was the number of coefficients:
s3decode.pl : -ceplen => 3
sphinx_train.cfg : $CFG_VECTOR_LENGTH = 3
make_feats.pl : -ncep 3
So, please, tell me which other factors I should take into account.
That's not relevant here. If you use more coefficients, you need more data to
train and more data to test. The test data also needs to be independent.
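The data-requirements point can be made concrete by counting free parameters in the acoustic model. In a diagonal-covariance GMM, each mixture component needs one mean and one variance per feature dimension, plus a mixture weight, so the number of parameters to estimate per state grows linearly with dimensionality. This is only a rough sketch; the 8-mixture count and the "static + delta + delta-delta" feature layout below are common conventions, not values taken from this thread.

```python
def gmm_params(dim, n_mix):
    # Diagonal-covariance GMM: each component has `dim` means and
    # `dim` variances, plus one mixture weight.
    return n_mix * (2 * dim + 1)

# Compare 3 static MFCCs vs 13 static MFCCs, each expanded with
# delta and delta-delta features (a common front-end layout).
for static in (3, 13):
    dim = static * 3
    print(static, gmm_params(dim, n_mix=8))  # 3 -> 152, 13 -> 632
```

With roughly four times as many parameters per state, the 13-coefficient model needs correspondingly more training data to be estimated reliably.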
Ok, thanks for your explanation. But I have two questions:
1 - If fewer coefficients are better, why does most research use around 13
coefficients?
2 - Could you explain why recognition with 3 coefficients is so slow? With 14
coefficients it takes around 4 minutes, while with 3 it takes around 20 minutes.
Once again, thank you very much!
It's not better to use 3 coefficients. You got better results only because you
didn't run sufficient tests: your test setup is not generic enough, or it is
too biased.
The choice of the number of MFCCs to include in an ASR system is largely
empirical. Historically people tried increasing the number of coefficients
until a law of diminishing returns kicked in. In practice, the optimal number
of coefficients depends on the quantity of training data, the details of the
training algorithm (in particular how well the PDFs can be modelled as the
dimensionality of the feature space increases), the number of Gaussian
mixtures in the HMMs, the speaker and background noise characteristics, and
sometimes the available computing resources.
To understand why any specific number of cepstral coefficients is used, you
could do worse than look at very early (pre-HMM) papers. When using DTW with
Euclidean or even Mahalanobis distances, it quickly became apparent that the
very high cepstral coefficients were not helpful for recognition, and to a
lesser extent, neither were the very low ones. The most common solution was to
"lifter" the MFCCs - i.e. apply a weighting function to them to emphasise the
mid-range coefficients. These liftering functions were "optimised" by a number
of researchers, but they almost always ended up being close to zero by the
time you got to the 12th coefficient.
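One widely used family of such weighting functions is the raised-sine lifter found in HTK-style front ends; the sketch below implements it (the formula and the L=22 default are conventions from that family, not values given in this thread):

```python
import math

def sin_lifter(n_ceps, L=22):
    # Raised-sine lifter: w[n] = 1 + (L/2) * sin(pi * n / L).
    # It emphasises mid-range cepstral coefficients (peak at n = L/2)
    # relative to the very low-order ones.
    return [1 + (L / 2) * math.sin(math.pi * n / L) for n in range(n_ceps)]

w = sin_lifter(13)
# Applying it to one frame of MFCCs would look like:
# liftered = [c * wn for c, wn in zip(mfcc_frame, w)]
```

Here w[0] is 1.0 and the weight peaks at 12.0 for the 11th coefficient, illustrating how the mid-range is boosted.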
Decoding with 13 coefficients is faster because the 3-coefficient model
doesn't discriminate sounds well enough. The recognizer therefore has to
explore all possible decoding hypotheses, and pruning of bad paths doesn't
work. With 13 coefficients, all wrong paths are quickly pruned and only the
valid path survives.
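The pruning effect can be illustrated with a toy beam search: hypotheses whose log score falls too far below the best one are discarded. The scores and beam width below are made-up numbers chosen only to show the mechanism, not output from any real decoder.

```python
def beam_prune(hyp_scores, beam=10.0):
    # Keep only hypotheses whose log score is within `beam` of the best.
    best = max(hyp_scores)
    return [s for s in hyp_scores if s >= best - beam]

# Toy log scores for competing paths at one decoding step.
sharp_model = [-5, -40, -55, -70]  # discriminative model: big score gaps
weak_model = [-5, -8, -9, -12]     # weak model: scores bunched together

print(len(beam_prune(sharp_model)))  # 1 survivor -> fast search
print(len(beam_prune(weak_model)))   # 4 survivors -> slow search
```

With a discriminative model the wrong paths score far below the best one and fall outside the beam, so the search stays narrow; with a weak model almost every path survives and the search space, and hence the decoding time, balloons.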