I'm using a very small dataset (5 words) to train a Mandarin acoustic model on Windows 10. The .align file shows 100% accuracy (file attached). However, when I pick a .wav file from the training set and feed it to the recognizer, the output is an empty string.
I'd appreciate your help.
The command I issued:
pocketsphinx_continuous -infile .\wav\speaker_1\1_03.wav -hmm .\model_parameters\demo.ci_cont\ -lm .\etc\demo.lm -dict .\etc\demo.dic
output:
INFO: pocketsphinx.c(152): Parsed model-specific feature parameters from .\model_parameters\demo.ci_cont\/feat.params
Current configuration:
[NAME] [DEFLT] [VALUE]
-agc none none
-agcthresh 2.0 2.000000e+000
-allphone
-allphone_ci no no
-alpha 0.97 9.700000e-001
-ascale 20.0 2.000000e+001
-aw 1 1
-backtrace no no
-beam 1e-48 1.000000e-048
-bestpath yes yes
-bestpathlw 9.5 9.500000e+000
-ceplen 13 13
-cmn current current
-cmninit 8.0 8.0
-compallsen no no
-debug 0
-dict .\etc\demo.dic
-dictcase no no
-dither no no
-doublebw no no
-ds 1 1
-fdict
-feat 1s_c_d_dd 1s_c_d_dd
-featparams
-fillprob 1e-8 1.000000e-008
-frate 100 100
-fsg
-fsgusealtpron yes yes
-fsgusefiller yes yes
-fwdflat yes yes
-fwdflatbeam 1e-64 1.000000e-064
-fwdflatefwid 4 4
-fwdflatlw 8.5 8.500000e+000
-fwdflatsfwin 25 25
-fwdflatwbeam 7e-29 7.000000e-029
-fwdtree yes yes
-hmm .\model_parameters\demo.ci_cont\
-input_endian little little
-jsgf
-keyphrase
-kws
-kws_delay 10 10
-kws_plp 1e-1 1.000000e-001
-kws_threshold 1 1.000000e+000
-latsize 5000 5000
-lda
-ldadim 0 0
-lifter 0 22
-lm .\etc\demo.lm
-lmctl
-lmname
-logbase 1.0001 1.000100e+000
-logfn
-logspec no no
-lowerf 133.33334 1.300000e+002
-lpbeam 1e-40 1.000000e-040
-lponlybeam 7e-29 7.000000e-029
-lw 6.5 6.500000e+000
-maxhmmpf 30000 30000
-maxwpf -1 -1
-mdef
-mean
-mfclogdir
-min_endfr 0 0
-mixw
-mixwfloor 0.0000001 1.000000e-007
-mllr
-mmap yes yes
-ncep 13 13
-nfft 512 512
-nfilt 40 25
-nwpen 1.0 1.000000e+000
-pbeam 1e-48 1.000000e-048
-pip 1.0 1.000000e+000
-pl_beam 1e-10 1.000000e-010
-pl_pbeam 1e-10 1.000000e-010
-pl_pip 1.0 1.000000e+000
-pl_weight 3.0 3.000000e+000
-pl_window 5 5
-rawlogdir
-remove_dc no no
-remove_noise yes yes
-remove_silence yes yes
-round_filters yes yes
-samprate 16000 1.600000e+004
-seed -1 -1
-sendump
-senlogdir
-senmgau
-silprob 0.005 5.000000e-003
-smoothspec no no
-svspec
-tmat
-tmatfloor 0.0001 1.000000e-004
-topn 4 4
-topn_beam 0 0
-toprule
-transform legacy dct
-unit_area yes yes
-upperf 6855.4976 6.800000e+003
-uw 1.0 1.000000e+000
-vad_postspeech 50 50
-vad_prespeech 20 20
-vad_startspeech 10 10
-vad_threshold 2.0 2.000000e+000
-var
-varfloor 0.0001 1.000000e-004
-varnorm no no
-verbose no no
-warp_params
-warp_type inverse_linear inverse_linear
-wbeam 7e-29 7.000000e-029
-wip 0.65 6.500000e-001
-wlen 0.025625 2.562500e-002
INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(143): mean[0]= 12.00, mean[1..12]= 0.0
INFO: mdef.c(518): Reading model definition: .\model_parameters\demo.ci_cont\/mdef
INFO: bin_mdef.c(181): Allocating 68 * 8 bytes (0 KiB) for CD tree
INFO: tmat.c(206): Reading HMM transition probability matrices: .\model_parameters\demo.ci_cont\/transition_matrices
INFO: acmod.c(117): Attempting to use PTM computation module
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: .\model_parameters\demo.ci_cont\/means
INFO: ms_gauden.c(292): 48 codebook, 1 feature, size:
INFO: ms_gauden.c(294): 1x39
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: .\model_parameters\demo.ci_cont\/variances
INFO: ms_gauden.c(292): 48 codebook, 1 feature, size:
INFO: ms_gauden.c(294): 1x39
INFO: ms_gauden.c(354): 117 variance values floored
INFO: ptm_mgau.c(805): Number of codebooks doesn't match number of ciphones, doesn't look like PTM: 48 != 16
INFO: acmod.c(119): Attempting to use semi-continuous computation module
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: .\model_parameters\demo.ci_cont\/means
INFO: ms_gauden.c(292): 48 codebook, 1 feature, size:
INFO: ms_gauden.c(294): 1x39
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: .\model_parameters\demo.ci_cont\/variances
INFO: ms_gauden.c(292): 48 codebook, 1 feature, size:
INFO: ms_gauden.c(294): 1x39
INFO: ms_gauden.c(354): 117 variance values floored
INFO: acmod.c(121): Falling back to general multi-stream GMM computation
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: .\model_parameters\demo.ci_cont\/means
INFO: ms_gauden.c(292): 48 codebook, 1 feature, size:
INFO: ms_gauden.c(294): 1x39
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: .\model_parameters\demo.ci_cont\/variances
INFO: ms_gauden.c(292): 48 codebook, 1 feature, size:
INFO: ms_gauden.c(294): 1x39
INFO: ms_gauden.c(354): 117 variance values floored
INFO: ms_senone.c(149): Reading senone mixture weights: .\model_parameters\demo.ci_cont\/mixture_weights
INFO: ms_senone.c(200): Truncating senone logs3(pdf) values by 10 bits
INFO: ms_senone.c(207): Not transposing mixture weights in memory
INFO: ms_senone.c(268): Read mixture weights for 48 senones: 1 features x 1 codewords
INFO: ms_senone.c(320): Mapping senones to individual codebooks
INFO: ms_mgau.c(141): The value of topn: 4
WARN: "ms_mgau.c", line 145: -topn argument (4) invalid or > #density codewords (1); set to latter
INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0
INFO: dict.c(320): Allocating 4104 * 20 bytes (80 KiB) for word entries
INFO: dict.c(333): Reading main dictionary: .\etc\demo.dic
INFO: dict.c(213): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(336): 5 words read
INFO: dict.c(358): Reading filler dictionary: .\model_parameters\demo.ci_cont\/noisedict
INFO: dict.c(213): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(361): 3 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(406): Allocating 16^3 * 2 bytes (8 KiB) for word-initial triphones
INFO: dict2pid.c(132): Allocated 3136 bytes (3 KiB) for word-final triphones
INFO: dict2pid.c(196): Allocated 3136 bytes (3 KiB) for single-phone word triphones
INFO: ngram_model_trie.c(347): Trying to read LM in trie binary format
INFO: ngram_model_trie.c(358): Header doesn't match
INFO: ngram_model_trie.c(176): Trying to read LM in arpa format
INFO: ngram_model_trie.c(192): LM of order 3
INFO: ngram_model_trie.c(194): #1-grams: 8
INFO: ngram_model_trie.c(194): #2-grams: 10
INFO: ngram_model_trie.c(194): #3-grams: 13
INFO: lm_trie.c(473): Training quantizer
INFO: lm_trie.c(481): Building LM trie
INFO: ngram_search_fwdtree.c(99): 5 unique initial diphones
INFO: ngram_search_fwdtree.c(148): 0 root, 0 non-root channels, 4 single-phone words
INFO: ngram_search_fwdtree.c(186): Creating search tree
INFO: ngram_search_fwdtree.c(192): before: 0 root, 0 non-root channels, 4 single-phone words
INFO: ngram_search_fwdtree.c(326): after: max nonroot chan increased to 138
INFO: ngram_search_fwdtree.c(339): after: 5 root, 10 non-root channels, 3 single-phone words
INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25
INFO: continuous.c(307): pocketsphinx_continuous COMPILED ON: Jan 24 2016, AT: 07:35:37
INFO: cmn_prior.c(131): cmn_prior_update: from < 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >
INFO: cmn_prior.c(149): cmn_prior_update: to < 12.78 10.78 -7.19 -7.00 -1.36 1.78 5.53 5.37 1.63 2.67 -2.00 -3.08 -0.49 >
INFO: ngram_search_fwdtree.c(1553): 255 words recognized (2/fr)
INFO: ngram_search_fwdtree.c(1555): 852 senones evaluated (6/fr)
INFO: ngram_search_fwdtree.c(1559): 435 channels searched (3/fr), 153 1st, 278 last
INFO: ngram_search_fwdtree.c(1562): 278 words for which last channels evaluated (2/fr)
INFO: ngram_search_fwdtree.c(1564): 0 candidate words for entering last phone (0/fr)
INFO: ngram_search_fwdtree.c(1567): fwdtree 0.00 CPU 0.000 xRT
INFO: ngram_search_fwdtree.c(1570): fwdtree 0.01 wall 0.011 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 2 words
INFO: ngram_search_fwdflat.c(948): 360 words recognized (3/fr)
INFO: ngram_search_fwdflat.c(950): 393 senones evaluated (3/fr)
INFO: ngram_search_fwdflat.c(952): 381 channels searched (2/fr)
INFO: ngram_search_fwdflat.c(954): 381 words searched (2/fr)
INFO: ngram_search_fwdflat.c(957): 76 word transitions (0/fr)
INFO: ngram_search_fwdflat.c(960): fwdflat 0.00 CPU 0.000 xRT
INFO: ngram_search_fwdflat.c(963): fwdflat 0.00 wall 0.000 xRT
INFO: ngram_search.c(1253): lattice start node <s>.0 end node </s>.24
INFO: ngram_search.c(1279): Eliminated 0 nodes before end node
INFO: ngram_search.c(1384): Lattice has 7 nodes, 4 links
INFO: ps_lattice.c(1380): Bestpath score: -1235
INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(</s>:24:130) = -96950
INFO: ps_lattice.c(1441): Joint P(O,S) = -96950 P(S|O) = 0
INFO: ngram_search.c(875): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(878): bestpath 0.01 wall 0.005 xRT
INFO: cmn_prior.c(131): cmn_prior_update: from < 12.78 10.78 -7.19 -7.00 -1.36 1.78 5.53 5.37 1.63 2.67 -2.00 -3.08 -0.49 >
INFO: cmn_prior.c(149): cmn_prior_update: to < 29.57 8.61 3.01 10.81 0.21 0.81 5.55 6.60 4.90 7.37 -3.54 3.63 -0.64 >
INFO: ngram_search_fwdtree.c(1553): 154 words recognized (1/fr)
INFO: ngram_search_fwdtree.c(1555): 423 senones evaluated (3/fr)
INFO: ngram_search_fwdtree.c(1559): 328 channels searched (2/fr), 0 1st, 328 last
INFO: ngram_search_fwdtree.c(1562): 328 words for which last channels evaluated (2/fr)
INFO: ngram_search_fwdtree.c(1564): 0 candidate words for entering last phone (0/fr)
INFO: ngram_search_fwdtree.c(1567): fwdtree 0.00 CPU 0.000 xRT
INFO: ngram_search_fwdtree.c(1570): fwdtree 0.01 wall 0.006 xRT
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 2 words
INFO: ngram_search_fwdflat.c(948): 157 words recognized (1/fr)
INFO: ngram_search_fwdflat.c(950): 423 senones evaluated (3/fr)
INFO: ngram_search_fwdflat.c(952): 373 channels searched (2/fr)
INFO: ngram_search_fwdflat.c(954): 373 words searched (2/fr)
INFO: ngram_search_fwdflat.c(957): 76 word transitions (0/fr)
INFO: ngram_search_fwdflat.c(960): fwdflat 0.00 CPU 0.000 xRT
INFO: ngram_search_fwdflat.c(963): fwdflat 0.00 wall 0.001 xRT
INFO: ngram_search.c(1253): lattice start node <s>.0 end node </s>.60
INFO: ngram_search.c(1279): Eliminated 0 nodes before end node
INFO: ngram_search.c(1384): Lattice has 6 nodes, 5 links
INFO: ps_lattice.c(1380): Bestpath score: -649
INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(</s>:60:140) = -58852
INFO: ps_lattice.c(1441): Joint P(O,S) = -61367 P(S|O) = -2515
INFO: ngram_search.c(875): bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(878): bestpath 0.01 wall 0.004 xRT
INFO: cmn_prior.c(131): cmn_prior_update: from < 29.57 8.61 3.01 10.81 0.21 0.81 5.55 6.60 4.90 7.37 -3.54 3.63 -0.64 >
INFO: cmn_prior.c(149): cmn_prior_update: to < 29.57 8.61 3.01 10.81 0.21 0.81 5.55 6.60 4.90 7.37 -3.54 3.63 -0.64 >
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 0 words
INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 0.02 CPU 0.006 xRT
INFO: ngram_search_fwdtree.c(435): TOTAL fwdtree 0.04 wall 0.014 xRT
INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 0.00 CPU 0.000 xRT
INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 0.00 wall 0.000 xRT
INFO: ngram_search.c(303): TOTAL bestpath 0.00 CPU 0.000 xRT
INFO: ngram_search.c(306): TOTAL bestpath 0.01 wall 0.004 xRT
Hi Fang,
I'm not an expert and haven't used this for a while, but two possible things sprang to mind on seeing your question: 1) I'd try setting backtrace to yes [-backtrace yes]; I think (I could be wrong) that with it set to "no", as currently shown in your configuration settings, nothing is printed on the command line. You could then at least see whether you're getting any output at all. 2) I remember having a lot of case issues between my dictionary and the LM file; if they're not in the same case, you won't get any output. Hopefully one of those is useful.
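A quick way to test the case-mismatch idea is to compare the words in the .dic file against the 1-gram section of the ARPA LM. This is a minimal sketch, assuming a standard phonetic dictionary (word followed by phones, with alternates marked like `WORD(2)`) and a plain-text ARPA file; the file names are just placeholders:

```python
# Minimal sketch: report dictionary words with no case-sensitive match
# in the \1-grams: section of an ARPA LM. Assumes standard formats.

def dict_words(dic_path):
    """First token of each non-empty line, stripping alternate markers like (2)."""
    words = set()
    with open(dic_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                word = line.split()[0]
                words.add(word.split("(")[0])  # HELLO(2) -> HELLO
    return words

def lm_unigrams(lm_path):
    """Words listed in the \\1-grams: section of an ARPA file."""
    words, in_unigrams = set(), False
    with open(lm_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("\\1-grams:"):
                in_unigrams = True
            elif line.startswith("\\"):  # next section header (\2-grams:, \end\, ...)
                in_unigrams = False
            elif in_unigrams and line:
                # ARPA unigram line: logprob word [backoff]
                words.add(line.split()[1])
    return words

def missing_from_lm(dic_path, lm_path):
    """Dictionary words absent (case-sensitively) from the LM unigrams."""
    return sorted(dict_words(dic_path) - lm_unigrams(lm_path))
```

If this returns a non-empty list for your demo.dic / demo.lm pair, a case (or spelling) mismatch is the likely culprit.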
Thank you, Paul. Unfortunately, neither worked.
1) [-backtrace yes]: I issued the command pocketsphinx_continuous -infile .\wav\goforward.raw -hmm .\model_parameters\en-us\ -lm .\etc\en-us.lm.bin -dict .\etc\cmudict-en-us.dict and got the output "go forward ten meters", even though backtrace is set to no in that configuration as well. So it is not the cause, though it does remind me to compare the two configurations; I'll do that later.
2) I also opened the .lm and .dic files, and they look normal. Besides, the same two files were used for training and testing with SphinxTrain, and the training succeeded, so such an inconsistency seems unlikely.
Realized that it is an issue of too-short audio: the test during training uses pocketsphinx_batch, which handles short audio better than pocketsphinx_continuous.
See the related post here.
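For anyone hitting the same symptom: since the problem turned out to be clip length, it can be worth measuring your training .wav files before feeding them to pocketsphinx_continuous. A minimal standard-library sketch (the 1-second threshold is just a heuristic, not a documented PocketSphinx limit):

```python
import wave

def wav_duration_seconds(path):
    """Duration of a PCM .wav file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / float(w.getframerate())

def flag_short_clips(paths, min_seconds=1.0):
    """Return the clips shorter than min_seconds (heuristic threshold)."""
    return [p for p in paths if wav_duration_seconds(p) < min_seconds]
```

Clips flagged by this check are candidates for the short-audio problem described above; decoding them with pocketsphinx_batch instead, or re-recording with some leading and trailing silence, may help.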