Have a few basic questions about some commonly seen terms in speech recognition/Sphinx. A couple of lines of explanation, or a pointer to a link describing these, will be enough.
Are MMIE and MLE basically different ways to arrive at HMM parameters? Does the decoder need to know whether the acoustic model was trained using MMIE or MLE algorithms?
What exactly is the LDA/MLLT feature transformation? Does the decoder need to know whether the acoustic model was trained using LDA/MLLT? Do these go together, or is there such a thing as an LDA-only or MLLT-only transformation?
Can all of these techniques be applied to semicontinuous acoustic models?
Which of the above techniques were used to train the hub4wsj_sc_8k model?
What does the 8k in hub4wsj_sc_8k signify? Is it the number of senones or the sampling frequency used to create the model?
Thanks and regards,
Are MMIE and MLE basically different ways to arrive at HMM parameters?
They are different ways to estimate the parameters.
Does the decoder need to know whether the acoustic model was trained using MMIE or MLE algorithms?
No.
What exactly is the LDA/MLLT feature transformation?
The feature vector is multiplied by a matrix. In the LDA case this matrix is chosen to reduce the dimension of the feature vector by selecting its principal components. In the MLLT case an additional property, diagonal covariance, is improved.
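As a sketch, the transform is just a matrix-vector product that shortens the feature vector. The matrix below is a random stand-in, not a real trained LDA transform, and the dimensions (39 in, 32 out) are illustrative assumptions:

```python
import numpy as np

# Illustrative only: random stand-ins, not a real trained LDA matrix.
rng = np.random.default_rng(0)
frame = rng.standard_normal(39)        # one frame of 39-dim features
lda = rng.standard_normal((32, 39))    # projection estimated during training

reduced = lda @ frame                  # the decoder applies the same matrix
print(reduced.shape)                   # (32,)
```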
Does the decoder need to know whether the acoustic model was trained using LDA/MLLT?
Yes.
Do these go together, or is there such a thing as an LDA-only or MLLT-only transformation?
There can be LDA only or MLLT only, but in practice a single matrix is used, which is the product of the MLLT matrix and the LDA matrix.
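A minimal sketch of that composition (the dimensions and random values are illustrative stand-ins): applying the single combined matrix gives the same result as applying LDA and then MLLT separately, so only the product needs to be stored with the model.

```python
import numpy as np

rng = np.random.default_rng(1)
lda = rng.standard_normal((32, 39))    # stand-in LDA: reduces 39 -> 32 dims
mllt = rng.standard_normal((32, 32))   # stand-in MLLT: square, same dimension

combined = mllt @ lda                  # the one matrix shipped with a model
x = rng.standard_normal(39)

# Same result either way: combined = MLLT applied after LDA.
assert np.allclose(combined @ x, mllt @ (lda @ x))
print(combined.shape)                  # (32, 39)
```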
Can all of these techniques be applied to semicontinuous acoustic models?
No.
Which of the above techniques were used to train the hub4wsj_sc_8k model?
MMIE probably; I'm not sure. As a semicontinuous model it doesn't use a feature-space transformation.
What does the 8k in hub4wsj_sc_8k signify? Is it the number of senones or the sampling frequency used to create the model?
The sample rate. See -upperf 4000 in feat.params, which means that 8 kHz audio can be decoded with this model.
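In other words, the top mel-filter edge has to fit below the Nyquist frequency (half the sample rate). A quick sanity check:

```python
# The top filter edge must not exceed half the sample rate (Nyquist).
def upperf_fits(upperf_hz, sample_rate_hz):
    return upperf_hz <= sample_rate_hz / 2

print(upperf_fits(4000, 8000))    # True: -upperf 4000 works for 8 kHz audio
print(upperf_fits(6855, 8000))    # False: such a model needs 16 kHz audio
```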
What is the bit width of the data (8-bit or 16-bit) used to create the hub4wsj_sc_8k model? For optimum decoding accuracy, the bit width of the data to be decoded should match the width of the training data, right? Is this value specified somewhere in the model definition files?
For training an acoustic model using SphinxTrain, where exactly do I need to specify the parameters (say, in order to get a feat.params exactly like that of hub4wsj_sc_8k)? I have 16-bit audio recorded at 16 kHz.
Thanks and regards,
What is the bit width of the data (8-bit or 16-bit) used to create the hub4wsj_sc_8k model? For optimum decoding accuracy, the bit width of the data to be decoded should match the width of the training data, right? Is this value specified somewhere in the model definition files?
The bit width must always be 16. There is no such configuration because 8-bit is simply not supported.
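A small way to check this before training, using Python's standard wave module. The in-memory file here is synthesized only so the example is self-contained; with real data you would open the training WAV instead:

```python
import io
import wave

# Build a tiny 16-bit, 16 kHz mono WAV in memory, then inspect its format
# the same way you would check a real training file.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                  # 2 bytes per sample = 16-bit
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 160)

buf.seek(0)
with wave.open(buf, "rb") as w:
    bits = w.getsampwidth() * 8
    rate = w.getframerate()

assert bits == 16, "training audio must be 16-bit"
print(bits, "bit,", rate, "Hz")
```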
For training an acoustic model using SphinxTrain, where exactly do I need to specify the parameters (say, in order to get a feat.params exactly like that of hub4wsj_sc_8k)? I have 16-bit audio recorded at 16 kHz.
Right now you need to edit ./scripts_pl/make_feats.pl.
I'm trying to train a semicontinuous acoustic model for PocketSphinx with exactly the same parameters as hub4wsj_sc_8k. Looking at ./scripts_pl/make_feats.pl and ./etc/sphinx_train.cfg, I have the following questions:
Where do I specify
-transform dct
-round_filters no
-remove_dc yes
-svspec 0-12/13-25/26-38
-cmninit 56,-3,1
What do -svspec and -cmninit specify?
An acoustic model generated from the default an4 settings has parameters like
-alpha 0.97
-dither yes
-doublebw no
-ncep 13
which are not seen in the hub4wsj_sc_8k parameters. Why?
What is the meaning of streams? Does 1s_c_d_dd specify one stream?
Is 1s_c_d_dd interpreted as 1 stream, cepstral coefficients, delta and double delta?
The feature vector length in this case will be 39, right?
Thanks and Regards,
PS:
FYI...
hub4wsj_sc_8k parameters:
-nfilt 20
-lowerf 1
-upperf 4000
-wlen 0.025
-transform dct
-round_filters no
-remove_dc yes
-svspec 0-12/13-25/26-38
-feat 1s_c_d_dd
-agc none
-cmn current
-cmninit 56,-3,1
-varnorm no
Default an4.cd_semi_1000 parameters:
-alpha 0.97
-dither yes
-doublebw no
-nfilt 40
-ncep 13
-lowerf 133.33334
-upperf 6855.4976
-nfft 512
-wlen 0.0256
-transform legacy
-feat s2_4x
-agc none
-cmn current
-varnorm no
Where do I specify -transform dct, -round_filters no, -remove_dc yes?
In make_feats.pl.
-svspec 0-12/13-25/26-38?
In sphinx_train.cfg, in the configuration variable $CFG_SVSPEC.
-cmninit 56,-3,1?
In feat.params, after training.
What do -svspec and -cmninit specify?
svspec is the specification for subvector quantization; it specifies which features to put in each stream.
cmninit is the initial value for live CMN. CMN values are printed for each utterance; for CMN to converge on a good value quickly, the initial value should be close to the average cepstral mean.
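For illustration, here is a hypothetical parser for the -svspec syntax (not Sphinx's own code), splitting one 39-dim frame into the three 13-dim streams, plus a cmninit-style estimate computed from toy data standing in for real cepstra:

```python
import numpy as np

def parse_svspec(spec):
    """Turn e.g. '0-12/13-25/26-38' into per-stream index lists."""
    streams = []
    for part in spec.split("/"):
        lo, hi = (int(x) for x in part.split("-"))
        streams.append(list(range(lo, hi + 1)))
    return streams

streams = parse_svspec("0-12/13-25/26-38")
frame = np.arange(39.0)                  # one 39-dim feature frame
subvectors = [frame[idx] for idx in streams]
print([len(sv) for sv in subvectors])    # [13, 13, 13]

# cmninit sketch: start live CMN near the average cepstral mean of some
# representative frames (random toy data, mean of c0 shifted to ~50).
cepstra = np.random.default_rng(2).standard_normal((100, 13)) + 50.0
cmn_init = cepstra.mean(axis=0)
```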
An acoustic model generated from the default an4 settings has parameters like -alpha 0.97 -dither yes -doublebw no -ncep 13 which are not seen in the hub4wsj_sc_8k parameters. Why?
They are defaults; there is no need to specify them.
What is the meaning of streams?
Each stream is modelled with its own Gaussian distribution, so if parts of the feature vector are theoretically independent, it makes sense to use streams. You can find more information in a textbook.
Does 1s_c_d_dd specify one stream?
Yes.
Is 1s_c_d_dd interpreted as 1 stream, cepstral coefficients, delta and double delta?
Yes, but it's only an abbreviation; 1s_c_dd, for example, has no meaning.
The feature vector length in this case will be 39, right?
Yes.
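A rough sketch of that layout: 13 cepstra plus their deltas and double deltas, concatenated into one 39-dim stream per frame. np.gradient is a simplistic stand-in for the actual delta window Sphinx uses; the point is only the dimensionality.

```python
import numpy as np

rng = np.random.default_rng(3)
cepstra = rng.standard_normal((50, 13))    # 50 frames x 13 cepstra

delta = np.gradient(cepstra, axis=0)       # crude delta estimate
delta_delta = np.gradient(delta, axis=0)   # crude double-delta estimate

features = np.hstack([cepstra, delta, delta_delta])
print(features.shape)                      # (50, 39): 13 + 13 + 13
```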
How to interpret s2_4x?
4 streams, 51 coefficients:
Cepstrum without c0 (12 coefficients)
Deltas with step 2 + deltas with step 4 (24 coefficients)
c0, delta c0, delta-delta c0 (3 coefficients)
Delta-delta without c0 (12 coefficients)
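Summing those widths as a quick check of the layout described above:

```python
# The four s2_4x streams and their widths, as listed above.
s2_4x = [
    ("cepstrum without c0", 12),
    ("deltas, step 2 and step 4", 24),
    ("c0, delta c0, delta-delta c0", 3),
    ("delta-delta without c0", 12),
]
total = sum(width for _, width in s2_4x)
print(len(s2_4x), "streams,", total, "coefficients")   # 4 streams, 51
```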
I have some more questions and have clubbed them together:
The parameters remove_dc, transform, round_filters and cmninit seem to be applicable only in the decoding phase (I'm not able to find them anywhere in the SphinxTrain directory hierarchy). Is this understanding correct?
If yes, can they all be put in feat.params after training is done (you already mentioned that cmninit needs to be put there after training)?
Do the following settings look OK in sphinx_train.cfg (for a hub4wsj_sc_8k-like semicontinuous model for PocketSphinx)?
$CFG_VECTOR_LENGTH = 39;
$CFG_FEATURE = "1s_c_d_dd";
$CFG_NUM_STREAMS = 4; # or should it be 1?
$CFG_INITIAL_NUM_DENSITIES = 256;
$CFG_FINAL_NUM_DENSITIES = 256;
Thanks and regards,
The parameters remove_dc, transform, round_filters and cmninit seem to be applicable only in the decoding phase (I'm not able to find them anywhere in the SphinxTrain directory hierarchy). Is this understanding correct?
Please don't ask me questions you can answer yourself.
If yes, can they all be put in feat.params after training is done (you already mentioned that cmninit needs to be put there after training)?
Yes, I already mentioned that.
Do the following settings look OK in sphinx_train.cfg (for a hub4wsj_sc_8k-like semicontinuous model for PocketSphinx)? $CFG_VECTOR_LENGTH = 39; $CFG_FEATURE = "1s_c_d_dd"; $CFG_NUM_STREAMS = 4; (or should it be 1?) $CFG_INITIAL_NUM_DENSITIES = 256; $CFG_FINAL_NUM_DENSITIES = 256;
No idea; why don't you just try and see if it works.
Wanted to be doubly sure (grepping in Win7 is behaving weirdly)... will be careful from next time. Sorry about this.
Then I set:
$CFG_VECTOR_LENGTH = 13;
$CFG_FEATURE = "1s_c_d_dd";
$CFG_NUM_STREAMS = 1;
$CFG_INITIAL_NUM_DENSITIES = 256;
$CFG_FINAL_NUM_DENSITIES = 256;
My training goes fine; however, when I set $CFG_VECTOR_LENGTH = 39 (which I think should be the case because $CFG_FEATURE = "1s_c_d_dd") I get the message "Expected vector length of 39, got 26" and the training aborts.
In either of the above cases, when I set $CFG_SVSPEC = 0-12/13-25/26-38 the training aborts and I get the following message in the logfile:
ERROR: "........\src\libs\libcommon\cmd_ln.c", line 525: Expecting 'C:\Users\Amit\my_data\amit\Technology\Speech\Sphinx\PocketSphinx\hub4wsj_type_local_model\an4\bin\bw.exe -switch_1 <arg_1> -switch_2 <arg_2> ...'
I also see a value of "-svspec -39.8846153846154" in this logfile, as if 0-12/13-25/26-38 had been evaluated as a mathematical expression and the result put in for svspec.
Any hints, comments or suggestions (note: as I mentioned earlier, my aim is to have a model with parameter compatibility with hub4wsj_sc_8k)?
The vector length is the length of the cepstrum vector, not the feature vector. It should be 13, not 39.
When I set $CFG_SVSPEC = 0-12/13-25/26-38
You need quotes, don't you:
$CFG_SVSPEC = "0-12/13-25/26-38";
Thanks NS. Due to my fixation with the vector length and my limited knowledge of Perl, I totally missed it.
With all the changes discussed above I'm able to train the model now. One last hitch remains, it seems.
When I try to run this model (the one with the same parameters as hub4wsj_sc_8k), the decoder crashes when I run a decoding session.
If I just change "-transform dct" to "-transform legacy" in feat.params, decoding works perfectly fine with excellent accuracy.
PS: PocketSphinx 0.6, SphinxTrain nightly build (dated 12 July 2010). I have data for 4 speakers (100 short utterances each), totalling about 0.27 hours: 16 kHz, mono, 16-bit. With this data I had earlier trained an AN4-like model (a model with parameters similar to those of the default AN4). That also works perfectly fine, but if I change the transform to dct in this model (not sure whether it is permissible to do this, however...), this model too starts crashing like the other one, as if there were an issue with the dct transform. hub4wsj_sc_8k decodes perfectly fine without any issues.
Any clues?
Thanks and regards,
Amit.
When I try to run this model (the one with the same parameters as hub4wsj_sc_8k), the decoder crashes when I run a decoding session.
You need to provide details of the crash if you want help with this; at the very least, provide a backtrace.
If I just change "-transform dct" to "-transform legacy" in feat.params, decoding works perfectly fine with excellent accuracy.
-transform must be in scripts_pl/make_feats.pl from the early feature-extraction stage. It seems you forgot to put it there and trained your model with the legacy transform instead of the dct one.
Do round_filters and remove_dc also need to match between training and decoding?
You need to provide details of the crash
I had tried doing some tracing. The failure seems to happen in fsg_search.c (lines 1062 to 1069), in the while loop "while (frm == last_frm)": in "fl = fsg_hist_entry_fsglink(hist_entry)", fl becomes a NULL pointer, and the line "if ((!final) || fsg_link_to_state(fl) == fsg_model_final_state(fsg))" results in an error because fl is NULL. This happens when "bpidx" goes to 0.
But I suspect the fundamental problem is somewhere in the make_feats.pl file.
Another observation (though it might be irrelevant here): if I run this decoding session with the hub4wsj_sc_8k model, decoding works perfectly. When I change the transform to "htk" in feat.params (of hub4wsj_sc_8k), the decoding still works perfectly, but when I change it to "legacy", it gives me more than 100% WER (it doesn't crash, though).
Regards,
Amit.
I've put my scripts_pl/make_feats.pl at the following link; I'm putting -transform dct there. Is there something else wrong with the script? http://www.mediafire.com/file/8go1e96680df646/make_feats.pl
The transform option should go together with the upperf, lowerf and nfilt options. You placed it incorrectly.
Do round_filters and remove_dc also need to match for training and decoding?
Yes.
results in error due to fl becoming a NULL pointer
OK, this issue is now fixed in trunk.
When I change the transform to "htk" in feat.params (of hub4wsj_sc_8k), the decoding still works perfectly, but when I change it to "legacy", it gives me more than 100% WER (it doesn't crash, though).
dct and htk are actually identical, so this behaves as it should.
The transform option should go together with the upperf, lowerf and nfilt options. You placed it incorrectly.
When I put the transform option together with the options you listed above, I get errors: ERROR: "........\src\libs\libcommon\cmd_ln.c", line 551: Unknown switch -transform seen" and "ERROR: "........\src\libs\libcommon\cmd_ln.c", line 525: Expecting 'bin/wave2feat -switch_1 <arg_1> -switch_2 <arg_2> ...'"
This had puzzled me earlier as well: parameters like -transform, -remove_dc and -round_filters don't seem to be valid arguments for wave2feat, whereas they are valid for sphinx_fe (from SphinxBase). make_feats.pl tries to run wave2feat.
What might be missing in my setup?
Regards,
PS: Is sphinxbase also needed for training? Currently I have only an4 and SphinxTrain in my training setup.
This had puzzled me earlier as well: parameters like -transform, -remove_dc and -round_filters don't seem to be valid arguments for wave2feat, whereas they are valid for sphinx_fe (from SphinxBase). make_feats.pl tries to run wave2feat.
Yes, you need to run sphinx_fe from sphinxbase instead of wave2feat.
Yes, you need to run sphinx_fe from sphinxbase instead of wave2feat
Thanks NS. This was the missing link that caused all the doubts... Everything is working fine now.
I'm observing that when I train the model with dct, the "current overall likelihood per frame" comes out on the order of -58; however, when I train with legacy it comes out on the order of +15. WER is good (on the training set) in both cases. Just curious what this value actually signifies and whether it is related to the quality of the model.
Regards,