o) 16K conversational speech (subset of ICSI meeting corpus)
o) continuous HMM (3-state no skip, ~3000 senones, 16-mixture)
o) trigram LM
o) major parameters have been tuned for both Sphinx3 and PocketSphinx
Problem:
PocketSphinx shows a 3-4% accuracy drop compared with Sphinx3, no matter how we tune the parameters.
Sphinx3:
59.3% Acc (1 - WER)
lw: 11
beam: 1e-55
pbeam: 1e-55
wbeam: 1e-35
wip: 0.2
Playing around with the parameters, I can achieve at most 60.1% Acc!
PocketSphinx:
55.9% Acc
lw: 7
beam: 1e-53
pbeam: 1e-53
wbeam: 1e-35
wip: 0.2
Playing around with the parameters, I can achieve at most 56.5% Acc!
My Questions:
1) Am I missing something for PocketSphinx?
2) Will a semi-continuous 5-state HMM give better accuracy?
(I am re-training a semi-continuous HMM now, but logically I don't think it will give better accuracy...)
Thanks!
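For reference, the accuracy numbers in this thread are quoted as Acc = 1 - WER. A minimal sketch of how that metric is computed (word-level Levenshtein alignment); the function name and the toy reference/hypothesis sentences below are mine, not from the evaluation scripts used here:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / len(ref),
    computed with a word-level edit-distance alignment."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(r)][len(h)] / len(r)

ref = "the cat sat on the mat"
hyp = "the cat sat on mat"
wer = word_error_rate(ref, hyp)   # one deletion out of six words
print("Acc = %.1f%%" % (100 * (1 - wer)))  # prints Acc = 83.3%
```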
Strange, this is the first time I have seen such a result. A few thoughts on that:
1) The ERRORs in the log are not critical; they have actually been changed to INFO in trunk.
2) Did you train the model with SphinxTrain?
3) What is the accuracy in fwdtree mode in pocketsphinx (-fwdflat no)? It should be more or less the same as sphinx3, which I suppose you also run in fwdtree mode.
4) I'm not sure if it makes sense, but you could check pocketsphinx-0.5 to see if there are any regressions.
> 2) Did you train the model with SphinxTrain?

No. We trained using HTK and converted to Sphinx format. Originally, the HTK decoder achieves 62.*% accuracy.

> 3) What is the accuracy in fwdtree mode in pocketsphinx (-fwdflat no)? It should be more or less the same as sphinx3, which I suppose you also run in fwdtree mode.

Will try that and post an update.

> 4) I'm not sure if it makes sense, but you could check pocketsphinx-0.5 to see if there are any regressions.

Will give this a try as well.
Thanks very much for all the suggestions!
That's a serious threat for us! We should definitely solve it!
Next set of questions:
1) Which converter did you use? htk2s3.py from our trunk, or something home-made?
2) What is the s3 accuracy with very, very wide beams (like 1e-200)? I remember from David's comparison that HDecode beams are actually very wide compared to sphinx3's.
1) Using Wout's converter: http://home.student.utwente.nl/w.j.maaskant/htk2s3conv/
It is written in Python with help from David. I guess it's similar to htk2s3.py in your trunk (will take a look at htk2s3.py).
Model conversion is a complicated problem; we spent weeks on this and there is still a 2% gap.
2) Using a very wide beam by itself didn't help (I tried beam=1e-300 and 1e-500 before); partial results are attached (pbeam always equal to beam).
For a regular beam width, I found that wip=0.001 is a near-optimal value for this corpus. For a very wide beam, I wasn't able to find a magic number yet...
3) Question: did David or anyone else ever achieve the same accuracy after converting an HTK model to S3?
4) Nick, I will send you personal emails for better communication.
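A note on the beam values discussed in (2): in the Sphinx decoders the beam is a relative likelihood threshold, i.e. at each frame a hypothesis survives only if its score is within log(beam) of the current best, so a smaller value like 1e-80 actually means a wider search than 1e-55. A rough illustrative sketch of that pruning rule (the helper function and scores are mine, not from the sphinx3 source):

```python
import math

def prune(scores, beam):
    """Keep hypotheses whose log-likelihood is within log(beam) of the best.
    A smaller beam value (e.g. 1e-80 vs 1e-55) widens the search, since
    log(beam) is more negative, pushing the pruning threshold lower."""
    best = max(scores)
    threshold = best + math.log(beam)
    return [s for s in scores if s >= threshold]

scores = [-10.0, -60.0, -150.0, -200.0]  # per-hypothesis log-likelihoods
print(len(prune(scores, 1e-55)))  # narrower beam keeps 2 hypotheses
print(len(prune(scores, 1e-80)))  # wider beam keeps 3
```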
> Model conversion is a complicated problem; we spent weeks on this and there is still a 2% gap.

Yes, I also suspect something is wrong with the converted model. Probably the transition probs or something like that; I have found pocketsphinx is very sensitive to transition probs. We need to take a closer look at it.
We can also compare scores of the models on data which is recognized incorrectly. If you provide such an utterance and the models, I can look myself.
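If the converted transition probabilities are the suspect, one cheap sanity check is to verify that each row of every transition matrix still sums to ~1 after conversion, since flooring or renormalization bugs in a converter tend to show up there first. A hypothetical sketch, assuming the matrices can be loaded as plain float arrays; `check_tmat` and the example matrix are mine, not part of either toolkit:

```python
def check_tmat(tmat, tol=1e-3):
    """Return (row_index, row_sum) for rows whose outgoing
    probabilities do not sum to ~1. tmat is a list of rows,
    one per emitting state."""
    bad = []
    for i, row in enumerate(tmat):
        total = sum(row)
        # Final/absorbing rows may legitimately be all zeros.
        if total != 0 and abs(total - 1.0) > tol:
            bad.append((i, total))
    return bad

# 3-state no-skip topology, like the model in this thread:
# each emitting state can only self-loop or move to the next state.
tmat = [
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.8, 0.2],
]
print(check_tmat(tmat))  # [] means all rows normalize
```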
> Question: did David or anyone else ever achieve the same accuracy after converting an HTK model to S3?

Unfortunately that text is not available and it's better to ask David directly, but I remember he trained with SphinxTrain on WSJ and just compared the results to Keith's WSJ on HTK.

> Nick, I will send you personal emails for better communication.

Yes, please do. Or use cmusphinx-devel; I'm sure you'll get more feedback there.
It's not clear which decoding mode you are using. I suggest you provide the heads of the decoding logs, where all the parameters are listed.
Hi nshmyrev,
The head of the decoding log is attached.
I noticed an error: ERROR: "ptm_mgau.c", line 801: Number of codebooks exceeds 256: 2783
Thanks!
INFO: cmd_ln.c(512): Parsing command line:
/home/tao/SphinxEval/pocketsphinx/bin/pocketsphinx_batch \
-hmm /home/tao/icsi_meeting/model/hmm/16mix_6 \
-lw 5 \
-feat 1s_c_d_dd \
-beam 1e-55 \
-pbeam 1e-55 \
-wbeam 1e-35 \
-dict /home/tao/icsi_meeting/etc/cmu_nosp_new.dict \
-fdict /home/tao/icsi_meeting/etc/cmu_nosp.filler \
-lm /home/tao/icsi_meeting/model/lm/lm_csr_6k_nvp_3gram.DMP \
-wip 0.2 \
-ctl /home/tao/icsi_meeting/etc/test_mini_no_ext.scp \
-cepdir /home/tao/icsi_meeting \
-cepext .mfc \
-hyp /home/tao/icsi_meeting/test_ps.match_0 \
-agc none \
-varnorm no \
-cmn current \
-ctlcount 615 \
-ctloffset 0
Current configuration:
-adchdr 0 0
-adcin no no
-agc none none
-agcthresh 2.0 2.000000e+00
-alpha 0.97 9.700000e-01
-argfile
-ascale 20.0 2.000000e+01
-backtrace no no
-beam 1e-48 1.000000e-55
-bestpath yes yes
-bestpathlw 9.5 9.500000e+00
-bghist no no
-cepdir /home/tao/icsi_meeting
-cepext .mfc .mfc
-ceplen 13 13
-cmn current current
-cmninit 8.0 8.0
-compallsen no no
-ctl /home/tao/icsi_meeting/etc/test_mini_no_ext.scp
-ctlcount -1 615
-ctlincr 1 1
-ctloffset 0 0
-ctm
-debug 0
-dict /home/tao/icsi_meeting/etc/cmu_nosp_new.dict
-dictcase no no
-dither no no
-doublebw no no
-ds 1 1
-fdict /home/tao/icsi_meeting/etc/cmu_nosp.filler
-feat 1s_c_d_dd 1s_c_d_dd
-featparams
-fillprob 1e-8 1.000000e-08
-frate 100 100
-fsg
-fsgusealtpron yes yes
-fsgusefiller yes yes
-fwdflat yes yes
-fwdflatbeam 1e-64 1.000000e-64
-fwdflatefwid 4 4
-fwdflatlw 8.5 8.500000e+00
-fwdflatsfwin 25 25
-fwdflatwbeam 7e-29 7.000000e-29
-fwdtree yes yes
-hmm /home/tao/icsi_meeting/model/hmm/16mix_6
-hyp /home/tao/icsi_meeting/test_ps.match_0
-hypseg
-input_endian little little
-jsgf
-kdmaxbbi -1 -1
-kdmaxdepth 0 0
-kdtree
-latsize 5000 5000
-lda
-ldadim 0 0
-lextreedump 0 0
-lifter 0 0
-lm /home/tao/icsi_meeting/model/lm/lm_csr_6k_nvp_3gram.DMP
-lmctl
-lmname default default
-lmnamectl
-logbase 1.0001 1.000100e+00
-logfn
-logspec no no
-lowerf 133.33334 1.333333e+02
-lpbeam 1e-40 1.000000e-40
-lponlybeam 7e-29 7.000000e-29
-lw 6.5 5.000000e+00
-maxhmmpf -1 -1
-maxnewoov 20 20
-maxwpf -1 -1
-mdef
-mean
-mfclogdir
-mixw
-mixwfloor 0.0000001 1.000000e-07
-mllr
-mllrctl
-mllrdir
-mmap yes yes
-nbest 0 0
-nbestdir
-nbestext .hyp .hyp
-ncep 13 13
-nfft 512 512
-nfilt 40 40
-nwpen 1.0 1.000000e+00
-outlatdir
-pbeam 1e-48 1.000000e-55
-pip 1.0 1.000000e+00
-pl_beam 1e-10 1.000000e-10
-pl_pbeam 1e-5 1.000000e-05
-pl_window 0 0
-rawlogdir
-remove_dc no no
-round_filters yes yes
-samprate 16000 1.600000e+04
-seed -1 -1
-sendump
-senmgau
-silprob 0.005 5.000000e-03
-smoothspec no no
-svspec
-tmat
-tmatfloor 0.0001 1.000000e-04
-topn 4 4
-topn_beam 0 0
-toprule
-transform legacy legacy
-unit_area yes yes
-upperf 6855.4976 6.855498e+03
-usewdphones no no
-uw 1.0 1.000000e+00
-var
-varfloor 0.0001 1.000000e-04
-varnorm no no
-verbose no no
-warp_params
-warp_type inverse_linear inverse_linear
-wbeam 7e-29 1.000000e-35
-wip 0.65 2.000000e-01
-wlen 0.025625 2.562500e-02
INFO: feat.c(979): Initializing feature stream to type: '1s_c_d_dd',
ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean= 12.00, mean= 0.0
INFO: mdef.c(520): Reading model definition:
/home/tao/icsi_meeting/model/hmm/16mix_6/mdef
INFO: bin_mdef.c(173): Allocating 86389 * 8 bytes (674 KiB) for CD tree
INFO: tmat.c(205): Reading HMM transition probability matrices:
/home/tao/icsi_meeting/model/hmm/16mix_6/transition_matrices
INFO: acmod.c(117): Attempting to use SCHMM computation module
INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
/home/tao/icsi_meeting/model/hmm/16mix_6/means
INFO: ms_gauden.c(292): 2783 codebook, 1 feature, size
16x39
INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
/home/tao/icsi_meeting/model/hmm/16mix_6/variances
INFO: ms_gauden.c(292): 2783 codebook, 1 feature, size
16x39
INFO: ms_gauden.c(356): 0 variance values floored
INFO: acmod.c(119): Attempting to use PTHMM computation module
INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
/home/tao/icsi_meeting/model/hmm/16mix_6/means
INFO: ms_gauden.c(292): 2783 codebook, 1 feature, size
16x39
INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
/home/tao/icsi_meeting/model/hmm/16mix_6/variances
INFO: ms_gauden.c(292): 2783 codebook, 1 feature, size
16x39
INFO: ms_gauden.c(356): 0 variance values floored
ERROR: "ptm_mgau.c", line 801: Number of codebooks exceeds 256: 2783
INFO: acmod.c(121): Falling back to general multi-stream GMM computation
INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
/home/tao/icsi_meeting/model/hmm/16mix_6/means
INFO: ms_gauden.c(292): 2783 codebook, 1 feature, size
16x39
INFO: ms_gauden.c(198): Reading mixture gaussian parameter:
/home/tao/icsi_meeting/model/hmm/16mix_6/variances
INFO: ms_gauden.c(292): 2783 codebook, 1 feature, size
16x39
INFO: ms_gauden.c(356): 0 variance values floored
INFO: ms_senone.c(160): Reading senone mixture weights:
/home/tao/icsi_meeting/model/hmm/16mix_6/mixture_weights
INFO: ms_senone.c(211): Truncating senone logs3(pdf) values by 10 bits
INFO: ms_senone.c(218): Not transposing mixture weights in memory
INFO: ms_senone.c(277): Read mixture weights for 2783 senones: 1 features x 16
codewords
INFO: ms_senone.c(331): Mapping senones to individual codebooks
INFO: ms_mgau.c(123): The value of topn: 4
INFO: dict.c(294): Allocating 11617 * 20 bytes (226 KiB) for word entries
INFO: dict.c(306): Reading main dictionary:
/home/tao/icsi_meeting/etc/cmu_nosp_new.dict
INFO: dict.c(206): Allocated 55 KiB for strings, 88 KiB for phones
INFO: dict.c(309): 7518 words read
INFO: dict.c(314): Reading filler dictionary:
/home/tao/icsi_meeting/etc/cmu_nosp.filler
INFO: dict.c(206): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(317): 3 words read
INFO: dict2pid.c(402): Building PID tables for dictionary
INFO: dict2pid.c(409): Allocating 7521 * 4 bytes (29 KiB) for word-internal
arrays
INFO: dict2pid.c(414): Allocating 41^3 * 2 bytes (134 KiB) for word-initial
triphones
INFO: dict2pid.c(453): Allocating 30332 entries of 2 bytes (59 KiB) for
internal ssids
INFO: dict2pid.c(130): Allocated 20336 bytes (19 KiB) for word-final triphones
INFO: dict2pid.c(193): Allocated 20336 bytes (19 KiB) for single-phone word
triphones
ERROR: "ngram_model_arpa.c", line 76: No \data\ mark in LM file
INFO: ngram_model_dmp.c(141): Will use memory-mapped I/O for LM file
INFO: ngram_model_dmp.c(195): ngrams 1=6197, 2=67406, 3=11661
INFO: ngram_model_dmp.c(241): 6197 = LM.unigrams(+trailer) read
INFO: ngram_model_dmp.c(289): 67406 = LM.bigrams(+trailer) read
INFO: ngram_model_dmp.c(314): 11661 = LM.trigrams read
INFO: ngram_model_dmp.c(338): 21740 = LM.prob2 entries read
INFO: ngram_model_dmp.c(357): 2422 = LM.bo_wt2 entries read
INFO: ngram_model_dmp.c(377): 8881 = LM.prob3 entries read
INFO: ngram_model_dmp.c(405): 132 = LM.tseg_base entries read
INFO: ngram_model_dmp.c(461): 6197 = ascii word strings read
INFO: ngram_search_fwdtree.c(99): 454 unique initial diphones
INFO: ngram_search_fwdtree.c(147): 0 root, 0 non-root channels, 45 single-
phone words
INFO: ngram_search_fwdtree.c(186): Creating search tree
INFO: ngram_search_fwdtree.c(191): before: 0 root, 0 non-root channels, 45
single-phone words
INFO: ngram_search_fwdtree.c(324): after: max nonroot chan increased to 16431
INFO: ngram_search_fwdtree.c(333): after: 454 root, 16303 non-root channels,
44 single-phone words
INFO: ngram_search_fwdflat.c(153): fwdflat: min_ef_width = 4, max_sf_win = 25
INFO: ngram_search.c(407): Resized backpointer table to 10000 entries
INFO: ngram_search_fwdtree.c(1502): 5150 words recognized (21/fr)
INFO: ngram_search_fwdtree.c(1504): 409046 senones evaluated (1676/fr)
INFO: ngram_search_fwdtree.c(1506): 815160 channels searched (3340/fr), 106830
1st, 172547 last
INFO: ngram_search_fwdtree.c(1510): 16256 words for which last channels
evaluated (66/fr)
INFO: ngram_search_fwdtree.c(1513): 41910 candidate words for entering last
phone (171/fr)
INFO: ngram_search_fwdflat.c(295): Utterance vocabulary contains 275 words
INFO: ngram_search_fwdflat.c(912): 1268 words recognized (5/fr)
INFO: ngram_search_fwdflat.c(914): 128933 senones evaluated (528/fr)
INFO: ngram_search_fwdflat.c(916): 199047 channels searched (815/fr)
INFO: ngram_search_fwdflat.c(918): 16803 words searched (68/fr)
INFO: ngram_search_fwdflat.c(920): 16066 word transitions (65/fr)
INFO: ngram_search.c(1132): lattice start node <s>.0 end node </s>.225
INFO: ps_lattice.c(1228): Normalizer P(O) = alpha(</s>:225:242) = -947361
INFO: ps_lattice.c(1266): Joint P(O,S) = -956367 P(S|O) = -9006
INFO: batch.c(659): mfcc_clean_mini/testData/Bmr031/ct-
chan0_fe008_1507.031-1509.476_Bmr031: 2.43 seconds speech, 1.98 seconds CPU,
1.98 seconds wall
INFO: batch.c(661): mfcc_clean_mini/testData/Bmr031/ct-
chan0_fe008_1507.031-1509.476_Bmr031: 0.81 xRT (CPU), 0.81 xRT (elapsed)
INFO: ngram_search_fwdtree.c(1502): 4992 words recognized (26/fr)
INFO: ngram_search_fwdtree.c(1504): 371653 senones evaluated (1906/fr)
INFO: ngram_search_fwdtree.c(1506): 911974 channels searched (4676/fr), 81871
1st, 94145 last
INFO: ngram_search_fwdtree.c(1510): 11509 words for which last channels
evaluated (59/fr)
INFO: ngram_search_fwdtree.c(1513): 71144 candidate words for entering last
phone (364/fr)
INFO: ngram_search_fwdflat.c(295): Utterance vocabulary contains 131 words
INFO: ngram_search_fwdflat.c(912): 1241 words recognized (6/fr)
INFO: ngram_search_fwdflat.c(914): 91853 senones evaluated (471/fr)
INFO: ngram_search_fwdflat.c(916): 106460 channels searched (545/fr)
INFO: ngram_search_fwdflat.c(918): 10184 words searched (52/fr)
INFO: ngram_search_fwdflat.c(920): 8682 word transitions (44/fr)
INFO: ngram_search.c(1132): lattice start node <s>.0 end node </s>.176
INFO: ps_lattice.c(1228): Normalizer P(O) = alpha(</s>:176:193) = -518841
INFO: ps_lattice.c(1266): Joint P(O,S) = -524764 P(S|O) = -5923
Attached partial results (sphinx3, pbeam always equal to beam):

lw     beam    wbeam  wip    dur   acc
11.25  1e-70   1e-35  0.001   857  60.2
11.25  1e-70   1e-40  0.001   927  59.8
11.25  1e-70   1e-50  0.001  1172  59.6
11.25  1e-70   1e-60  0.001  1644  59.5
11.25  1e-70   1e-70  0.001  2340  59.5
11.25  1e-70   1e-80  0.001  2810  59.5
11.25  1e-80   1e-35  0.001  1043  60.3
11.25  1e-80   1e-40  0.001  1167  59.8
11.25  1e-80   1e-50  0.001  1363  59.7
11.25  1e-80   1e-60  0.001  1839  59.5
11.25  1e-80   1e-70  0.001  2650  59.5
11.25  1e-80   1e-80  0.001  3547  59.5
11.25  1e-90   1e-35  0.001  1212  60.4
11.25  1e-90   1e-40  0.001  1278  59.9
11.25  1e-90   1e-50  0.001  1528  59.8
11.25  1e-90   1e-60  0.001  2029  59.7
11.25  1e-90   1e-70  0.001  2860  59.7
11.25  1e-90   1e-80  0.001  3802  59.6
11.25  1e-100  1e-35  0.001  1347  60.4
11.25  1e-100  1e-40  0.001  1423  59.9
11.25  1e-100  1e-50  0.001  1678  59.8
11.25  1e-100  1e-60  0.001  2184  59.7
11.25  1e-100  1e-70  0.001  2991  59.7
11.25  1e-100  1e-80  0.001  3977  59.6
11.25  1e-120  1e-35  0.001  1633  60.4
11.25  1e-120  1e-40  0.001  1703  59.9
11.25  1e-120  1e-50  0.001  1974  59.8
11.25  1e-120  1e-60  0.001  2394  59.7
11.25  1e-120  1e-70  0.001  3206  59.7