Hello,
I am currently running some tests for training an acoustic model, and I have some questions about decoding to obtain the test results.
I am using semi-continuous models with a very small vocabulary (10 words) and a simple grammar (a JSGF grammar that supports only the 10 words, no sentences).
When decoding I would like to get the best results possible; I don't really care about memory and time at this point, I just want to find the best parameters for training.
So I thought about changing the beam width and word beam width inside the decoder settings.
The default values are something like:
$DEC_CFG_BEAMWIDTH = "1e-80";
$DEC_CFG_WORDBEAM = "1e-29";
If I am right, setting very small values like 1e-200 for both should give a wider beam and thus better accuracy.
But I actually get very bad accuracy when I change them.
To evaluate my accuracy I use 10-fold cross-validation.
With the default values I get about a 15% error rate (which I know is bad, but this is just a test result), and with the smaller beam values I get a 96% error rate.
Since I am doing cross-validation, I don't think it can be a random effect (I know that with narrower beams, randomness could lead to better results than with a wider beam, but not with such a difference).
Do you have any idea where I could be wrong?
I repeat that I only changed the two beam values in the sphinx_decode.cfg file.
I thought that maybe a wider beam would force the decoder to explore "wrong" states with bad probability estimations, or something like that.
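Just to illustrate what I understand the beam to do (a toy Python sketch with made-up scores, not the actual decoder code): hypotheses whose probability falls below `beam` times the best one are pruned, so a smaller beam value keeps more hypotheses.

```python
# Toy illustration of beam pruning (not PocketSphinx internals):
# at each frame, hypotheses whose probability falls below
# `beam` times the best hypothesis are discarded.

def beam_prune(scores, beam):
    """Keep indices of hypotheses within `beam` of the best score."""
    best = max(scores)
    return [i for i, s in enumerate(scores) if s >= best * beam]

scores = [1e-5, 1e-20, 1e-60, 1e-100]  # hypothetical hypothesis probabilities
print(beam_prune(scores, 1e-48))   # default-like beam -> [0, 1]
print(beam_prune(scores, 1e-200))  # much wider beam -> [0, 1, 2, 3]
```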
Thank you
EDIT:
I didn't mention the language weight; do you think it could be linked to that?
But I saw in some publications that it doesn't affect the results that much ("Towards a Dynamic Adjustment of the Language Weight" by G. Stemmer, V. Zeissler, E. Nöth, and H. Niemann).
By the way, I don't really see why a change of the language weight would change the results in my case, since my language only contains the 10 words with the same probabilities...
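To illustrate why I think the language weight should not matter here (a simplified sketch with made-up acoustic scores, not the decoder's actual score combination): with a uniform grammar, the language weight scales the same constant for every word, so the ranking of hypotheses stays the same.

```python
# Simplified score combination: acoustic log-probability plus
# language weight times language-model log-probability.
import math

def combined(am_logprob, lm_prob, lw):
    return am_logprob + lw * math.log(lm_prob)

lm_prob = 1.0 / 10                     # 10 equiprobable words
am = {"one": -40.0, "two": -55.0}      # hypothetical acoustic scores

for lw in (0.0, 6.5):
    ranked = sorted(am, key=lambda w: combined(am[w], lm_prob, lw), reverse=True)
    print(lw, ranked)  # the ranking is identical for any lw
```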
Last edit: floboc 2012-10-01
You need to provide your data files if you need a definite answer on this question.
This is the most likely reason.
I will of course provide the data if needed, but I think it could just be a problem with the decoder configuration.
I tried to set -bestpath to 'no' and that seems to correct the problem: I now get a 5% error rate.
What is the meaning of this option?
I searched but I can't find documentation for all the decoder parameters; do you know where I can find it?
Here is my current configuration:
Current configuration:
[NAME] [DEFLT] [VALUE]
-adchdr 0 0
-adcin no no
-agc none none
-agcthresh 2.0 2.000000e+00
-alpha 0.97 9.700000e-01
-argfile
-ascale 20.0 2.000000e+01
-aw 1 1
-backtrace no no
-beam 1e-48 1.000000e-200
-bestpath yes no
-bestpathlw 9.5 9.500000e+00
-bghist no no
-build_outdirs yes yes
-cepdir /Models/myblee_chiffres/feat
-cepext .mfc .mfc
-ceplen 13 13
-cmn current current
-cmninit 8.0 8.0
-compallsen no no
-ctl /Models/myblee_chiffres/etc/myblee_chiffres_test.fileids
-ctlcount -1 38
-ctlincr 1 1
-ctloffset 0 0
-ctm
-debug 0
-dict /Models/myblee_chiffres/etc/myblee_chiffres.dic
-dictcase no no
-dither no no
-doublebw no no
-ds 1 1
-fdict
-feat 1s_c_d_dd s2_4x
-featparams
-fillprob 1e-8 1.000000e-08
-frate 100 100
-fsg
-fsgctl
-fsgdir
-fsgext
-fsgusealtpron yes yes
-fsgusefiller yes yes
-fwdflat yes yes
-fwdflatbeam 1e-64 1.000000e-64
-fwdflatefwid 4 4
-fwdflatlw 8.5 8.500000e+00
-fwdflatsfwin 25 25
-fwdflatwbeam 7e-29 7.000000e-29
-fwdtree yes yes
-hmm /Models/myblee_chiffres/model_parameters/myblee_chiffres.cd_semi_100
-hyp /Models/myblee_chiffres/result/myblee_chiffres-1-1.match
-hypseg
-input_endian little little
-jsgf /Models/myblee_chiffres/etc/myblee_chiffres.gram
-kdmaxbbi -1 -1
-kdmaxdepth 0 0
-kdtree
-latsize 5000 5000
-lda
-ldadim 0 0
-lextreedump 0 0
-lifter 0 0
-lm
-lmctl
-lmname default default
-lmnamectl
-logbase 1.0001 1.000100e+00
-logfn
-logspec no no
-lowerf 133.33334 1.333333e+02
-lpbeam 1e-40 1.000000e-40
-lponlybeam 7e-29 7.000000e-29
-lw 6.5 0.000000e+00
-maxhmmpf -1 -1
-maxnewoov 20 20
-maxwpf -1 -1
-mdef
-mean
-mfclogdir
-min_endfr 0 0
-mixw
-mixwfloor 0.0000001 1.000000e-07
-mllr
-mllrctl
-mllrdir
-mllrext
-mmap yes yes
-nbest 0 0
-nbestdir
-nbestext .hyp .hyp
-ncep 13 13
-nfft 512 512
-nfilt 40 40
-nwpen 1.0 1.000000e+00
-outlatbeam 1e-5 1.000000e-05
-outlatdir
-outlatext .lat .lat
-outlatfmt s3 s3
-pbeam 1e-48 1.000000e-48
-pip 1.0 1.000000e+00
-pl_beam 1e-10 1.000000e-10
-pl_pbeam 1e-5 1.000000e-05
-pl_window 0 0
-rawlogdir
-remove_dc no no
-round_filters yes yes
-samprate 16000 1.600000e+04
-seed -1 -1
-sendump
-senin no no
-senlogdir
-senmgau
-silprob 0.005 5.000000e-03
-smoothspec no no
-svspec
-tmat
-tmatfloor 0.0001 1.000000e-04
-topn 4 4
-topn_beam 0 0
-toprule
-transform legacy legacy
-unit_area yes yes
-upperf 6855.4976 6.855498e+03
-usewdphones no no
-uw 1.0 1.000000e+00
-var
-varfloor 0.0001 1.000000e-04
-varnorm no no
-verbose no no
-warp_params
-warp_type inverse_linear inverse_linear
-wbeam 7e-29 1.000000e-200
-wip 0.65 2.000000e-01
-wlen 0.025625 2.562500e-02
Note: the language weight is set to 0 on purpose, since all the words should have the same weight.
Thank you again for your help; I really think the toolkit you provide and the support are good.
Last edit: floboc 2012-10-01
The bestpath option enables a search for the best possible path in a lattice created after the first-pass search with the FSG lextree.
I searched but I can't find documentation for all the decoder parameters; do you know where I can find it?
Unfortunately there is no good guide on the options, partly because such a guide would have to cover algorithm details and could be book-length. If you are interested in a specific value, you are welcome to ask. It's not a very good idea to modify the default options, though.
Can you tell me more about this bestpath option, please?
How does it work exactly? What is the aim of this lattice, and how is it created?
I really need to understand this.
Thank you
It should work; if you claim the opposite, you need to provide a test case that demonstrates the problem.
What is the aim of this lattice, and how is it created?
The lattice is constructed during the forward Viterbi search. It collects all possible decoding outcomes. A lattice node is created when two word paths join at the same point. Then we continue the search along just one path and store the alternative in the lattice.
The first-pass forward tree search is not very precise; it might join certain paths in the FSG into a single search token and put the history variants in the lattice. The best result returned by the search pass is not necessarily the best result globally.
Bestpath ensures that the decoding result is better among the possible decoding results, because it considers other decoding variants, not just the locally best one constructed by the Viterbi first pass.
Another advantage of lattice construction is that, with sufficiently large grammars, it allows you to compute posterior probabilities.
For a small, simple grammar the bestpath stage is not required and should give no advantage.
You can read about it in a speech recognition book, and also check the thesis "Efficient Algorithms for Speech Recognition" by Mosur Ravishankar.
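To make the difference concrete, here is a toy sketch (a hypothetical three-word lattice with made-up scores, not the decoder's actual data structures): a greedy pass that always extends the locally best edge can miss the globally best path, while a dynamic-programming best-path pass recovers it.

```python
# Toy lattice as a DAG: edges carry log scores. A greedy first pass
# (always extend the locally best edge) can miss the globally best
# path; the bestpath pass finds it by dynamic programming.

# hypothetical lattice from start node <s> to end node </s>
edges = {
    "<s>":  [("oh", -1.0), ("zero", -2.0)],
    "oh":   [("</s>", -9.0)],
    "zero": [("</s>", -1.0)],
    "</s>": [],
}

def greedy(node="<s>"):
    """Follow the locally best edge at every step."""
    path, score = [node], 0.0
    while edges[node]:
        node, s = max(edges[node], key=lambda e: e[1])
        path.append(node)
        score += s
    return path, score

def bestpath(node="<s>"):
    """Exhaustive best path over the lattice (globally optimal)."""
    if not edges[node]:
        return [node], 0.0
    options = []
    for nxt, s in edges[node]:
        p, sc = bestpath(nxt)
        options.append(([node] + p, s + sc))
    return max(options, key=lambda o: o[1])

print(greedy())    # (['<s>', 'oh', '</s>'], -10.0): locally best, globally wrong
print(bestpath())  # (['<s>', 'zero', '</s>'], -3.0): globally best
```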
Thank you so much for your explanation.
I am sorry if my English is sometimes approximate; when I asked "How does it work really?" I was not questioning whether it works, just asking what the math behind it is, and you answered that in the second part.
So using bestpath is recommended if I have a "complex" grammar.
But in the case where my grammar contains only isolated words, there is no need for it.
I will take a look at the thesis you mentioned.