Hello everybody,
I'm new to Sphinx, but I'm trying to develop a small application with it. I would like it to recognize some commands in my language, so I am using a database that I recorded myself to train the acoustic models with SphinxTrain. I do have some questions, if you don't mind.
I've read that even if the vocabulary is small, it's not advisable to train the models on entire words, although in such cases one could train them like the digits example that comes with Sphinx. I've got fewer than 100 words; should I do that, or should I train with phonemes?
Actually, I've already trained with phonemes and the training completed, but I got some errors like:
ERROR: "c:\tutorial\sphinxtrain\src\libs\libmodinv\gauden.c", line 1700: var
(mgau= 131, feat= 0, density=3, component=38) < 0
I saw in another topic here that this error is due to insufficient training data. I think not all phonemes produced this error, and I was still able to modify the HelloWorld application to use my models; many words were recognized pretty well, but some words were not recognized at all.
I recorded each word 20 times, 15 recordings for training and 5 for testing. Is that an acceptable amount of data? Should I record more samples of all the words, or only of the ones I had problems with?
Thanks in advance, and sorry for the long post.
Your "fewer than 100 words" gives no real information to answer that: as the tutorial says, everything below 20 words is a small vocabulary, and everything above 30 is not.
The gauden.c error can also be caused by an overestimated number of tied states (senones); see the tutorial for details.
20 recordings per word is enough.
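In case it helps, the number of tied states is set in sphinx_train.cfg. The variable name below is the one used in a typical SphinxTrain configuration, and the value is only a placeholder for a small vocabulary, so take it as a sketch and pick a value following the tutorial:
# in sphinx_train.cfg (illustrative value, not a prescription)
$CFG_N_TIED_STATES = 200;   # overestimating this is one cause of the "var ... < 0" errors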
Hey Nickolay, thanks for the answer!
I improved my list of phonemes and transcribed the words again, and that seemed to improve the performance of the system. When I ran the test I got about 5% word error, and I could see that it was just some 5 words that couldn't be recognized (4-5 errors out of the 5 test recordings of each). I guess some phonemes weren't completely trained or something. By the way, I've got 80 different words to train.
I would like to ask another question this time: SphinxTrain extracted 13 MFCCs; is there any way it can also obtain the MFCC derivatives (delta and delta-delta)?
Oh, I would also like to ask: even with the 5% error on the test set, when I run the application Sphinx says it "couldn't hear me", so I speak louder until it recognizes something. Does that message mean that the microphone was too low, or that it wasn't actually able to recognize the word? I ask because my microphone was already at maximum volume.
> I guess some phonemes weren't completely trained or something. By the way, I've got 80 different words to train.
That's a lot. You just need more audio to train on; there is no point in experimenting with the phoneme set.
> Does that message mean that the microphone was too low, or that it wasn't actually able to recognize the word?
You need to compare your training audio with the audio you are using for the test. Maybe your training database had audio that was too loud, so the model is trained to recognize only loud speech. Volume doesn't actually matter unless the recording is clipped.
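In case it is useful, here is a rough way to check a recording for clipping. This is just a sketch of my own, not a SphinxTrain tool; it assumes 16-bit mono PCM WAV files with a plain 44-byte header, so adjust it to your actual recording format:
#!/usr/bin/perl
# Rough clipping check for one recording (16-bit mono PCM WAV assumed).
use strict;
use warnings;

my $file = shift or die "usage: $0 file.wav\n";
open my $fh, '<:raw', $file or die "cannot open $file: $!\n";
read $fh, my $header, 44;                # skip the canonical RIFF/fmt/data header
my ($clipped, $total) = (0, 0);
while (read $fh, my $buf, 4096) {
    for my $s (unpack 's<*', $buf) {     # little-endian signed 16-bit samples
        $total++;
        $clipped++ if $s >= 32767 || $s <= -32768;
    }
}
close $fh;
printf "%d of %d samples at full scale (%.3f%%)\n",
    $clipped, $total, $total ? 100 * $clipped / $total : 0;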
Thanks for the answer! I'm working on increasing the training set. I was thinking of making 10 more recordings and using 5 for training and 5 for testing, or would it be better to use all 10 for training?
Sorry to ask again, but is there any way to configure SphinxTrain to also obtain the MFCC derivatives?
> I was thinking of making 10 more recordings and using 5 for training and 5 for testing, or would it be better to use all 10 for training?
Better to use all of them for testing; the training set can be smaller.
> Sorry to ask again, but is there any way to configure SphinxTrain to also obtain the MFCC derivatives?
What do you mean by "obtain"? Print them on the screen? Save them in a file? Something else? SphinxTrain computes the derivatives on the fly; they aren't stored in the feature file, for example.
I mean, when we use make_feats it produces a sequence of 13-dimensional vectors with the MFCCs, and that's what we use for training, right? I would like to ask how to train using 13 MFCCs, 13 first-order MFCC derivatives and 13 second-order MFCC derivatives, for example.
No, with the "1s_c_d_dd" setting in the sphinx_decode.cfg configuration it trains with derivatives; that is what the "d" and "dd" mean. The derivatives aren't stored in the feature files; they are computed on the fly.
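For reference, the relevant lines in a typical sphinx_train.cfg look roughly like the following; the exact defaults differ between SphinxTrain versions, so treat this as an illustration rather than the authoritative settings:
$CFG_FEATURE = "1s_c_d_dd";    # cepstra + deltas + delta-deltas in one stream (13 + 13 + 13 = 39 dimensions)
$CFG_VECTOR_LENGTH = 13;       # size of the base cepstral vector that make_feats writes to the feature files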
Oh, now I understand. So, for example, if I didn't want to use the derivatives for training, I should change that to:
$CFG_FEATURE = "1s_c";
And if I wanted to use 11 MFCCs instead, I should change:
$CFG_VECTOR_LENGTH = 11;
Is that right?
By the way, I guess the "c" in "1s_c_d_dd" means cepstral, but what does the "1s" mean?
> So, for example, if I didn't want to use the derivatives for training, I should change that to:
> $CFG_FEATURE = "1s_c";
> And if I wanted to use 11 MFCCs instead, I should change:
> $CFG_VECTOR_LENGTH = 11;
> Is that right?
Yes.
> By the way, I guess the "c" in "1s_c_d_dd" means cepstral, but what does the "1s" mean?
"1s" means one stream. The distribution can be modelled with a number of streams, and that affects quantization: either you quantize each stream separately or everything together. If the ranges of the variables are different, it's better to have multiple streams. In semi-continuous models, where quantization is used, 3-4 streams are usually employed. In continuous models, with no quantization, one stream is enough.
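To make that concrete, the pairing in sphinx_train.cfg would look something like the lines below. The feature-type names are the ones listed in a stock configuration file, but take the exact combination as a sketch to check against your own setup:
# continuous model: no quantization, a single 39-dimensional stream
$CFG_HMM_TYPE = ".cont.";
$CFG_FEATURE = "1s_c_d_dd";
# semi-continuous model: quantized, features split across several streams
$CFG_HMM_TYPE = ".semi.";
$CFG_FEATURE = "s2_4x";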
Your "less than 100" provides zero information to answer. As tutorial says
everything below 20 is a small vocabulary everything above 30 is not.
it's also caused by overestimated number of tied states (senones). See the
tutorial for details
It's enough
Thanks for all the help, Nickolay!
I'm going to work more on the system. Sorry for bothering you so much!