
Recognition quality is bad for first utterance

Help
2016-09-14
2016-09-23
  • Daniel Wolf

    Daniel Wolf - 2016-09-14

    When performing word recognition, the first utterance is often recognized very poorly. After that, accuracy is great.

    I have a recording of a man saying (very clearly):

    Marley was dead to begin with.
    There is no doubt whatever about that.

    When I run this recording through pocketsphinx_continuous (exact invocation and output below), the result is:

    you and
    there is no doubt whatever about that

    The first sentence is recognized as garbage, the second sentence is recognized perfectly.

    I edited the WAVE file, looping it once. So the WAVE file now contains:

    Marley was dead to begin with.
    There is no doubt whatever about that.
    Marley was dead to begin with.
    There is no doubt whatever about that.

    Now, the output becomes:

    you and
    there is no doubt whatever about that
    marley was dead to begin with
    there is no doubt whatsoever about that

    So the same utterance that was recognized as garbage the first time was recognized perfectly later on!

    My questions are:

    1. Is this a bug or is there a reason for this behavior?
    2. Is there a workaround?

    Details

    Here is the looped WAVE file.

    The command line is: pocketsphinx_continuous.exe -infile marley-looped.wav -hmm cmusphinx-en-us-5.2 -lm ..\..\..\model\en-us\en-us.lm.bin -dict ..\..\..\model\en-us\cmudict-en-us.dict

    Here is the full output:

    INFO: pocketsphinx.c(152): Parsed model-specific feature parameters from cmusphinx-en-us-5.2/feat.params
    Current configuration:
    [NAME]          [DEFLT]     [VALUE]
    -agc            none        none
    -agcthresh      2.0     2.000000e+000
    -allphone               
    -allphone_ci        no      no
    -alpha          0.97        9.700000e-001
    -ascale         20.0        2.000000e+001
    -aw         1       1
    -backtrace      no      no
    -beam           1e-48       1.000000e-048
    -bestpath       yes     yes
    -bestpathlw     9.5     9.500000e+000
    -ceplen         13      13
    -cmn            current     current
    -cmninit        8.0     40,3,-1
    -compallsen     no      no
    -debug                  0
    -dict                   ..\..\..\model\en-us\cmudict-en-us.dict
    -dictcase       no      no
    -dither         no      no
    -doublebw       no      no
    -ds         1       1
    -fdict                  
    -feat           1s_c_d_dd   1s_c_d_dd
    -featparams             
    -fillprob       1e-8        1.000000e-008
    -frate          100     100
    -fsg                    
    -fsgusealtpron      yes     yes
    -fsgusefiller       yes     yes
    -fwdflat        yes     yes
    -fwdflatbeam        1e-64       1.000000e-064
    -fwdflatefwid       4       4
    -fwdflatlw      8.5     8.500000e+000
    -fwdflatsfwin       25      25
    -fwdflatwbeam       7e-29       7.000000e-029
    -fwdtree        yes     yes
    -hmm                    cmusphinx-en-us-5.2
    -input_endian       little      little
    -jsgf                   
    -keyphrase              
    -kws                    
    -kws_delay      10      10
    -kws_plp        1e-1        1.000000e-001
    -kws_threshold      1       1.000000e+000
    -latsize        5000        5000
    -lda                    
    -ldadim         0       0
    -lifter         0       22
    -lm                 ..\..\..\model\en-us\en-us.lm.bin
    -lmctl                  
    -lmname                 
    -logbase        1.0001      1.000100e+000
    -logfn                  
    -logspec        no      no
    -lowerf         133.33334   1.300000e+002
    -lpbeam         1e-40       1.000000e-040
    -lponlybeam     7e-29       7.000000e-029
    -lw         6.5     6.500000e+000
    -maxhmmpf       30000       30000
    -maxwpf         -1      -1
    -mdef                   
    -mean                   
    -mfclogdir              
    -min_endfr      0       0
    -mixw                   
    -mixwfloor      0.0000001   1.000000e-007
    -mllr                   
    -mmap           yes     yes
    -ncep           13      13
    -nfft           512     512
    -nfilt          40      25
    -nwpen          1.0     1.000000e+000
    -pbeam          1e-48       1.000000e-048
    -pip            1.0     1.000000e+000
    -pl_beam        1e-10       1.000000e-010
    -pl_pbeam       1e-10       1.000000e-010
    -pl_pip         1.0     1.000000e+000
    -pl_weight      3.0     3.000000e+000
    -pl_window      5       5
    -rawlogdir              
    -remove_dc      no      no
    -remove_noise       yes     yes
    -remove_silence     yes     yes
    -round_filters      yes     yes
    -samprate       16000       1.600000e+004
    -seed           -1      -1
    -sendump                
    -senlogdir              
    -senmgau                
    -silprob        0.005       5.000000e-003
    -smoothspec     no      no
    -svspec                 
    -tmat                   
    -tmatfloor      0.0001      1.000000e-004
    -topn           4       4
    -topn_beam      0       0
    -toprule                
    -transform      legacy      dct
    -unit_area      yes     yes
    -upperf         6855.4976   6.800000e+003
    -uw         1.0     1.000000e+000
    -vad_postspeech     50      50
    -vad_prespeech      20      20
    -vad_startspeech    10      10
    -vad_threshold      2.0     2.000000e+000
    -var                    
    -varfloor       0.0001      1.000000e-004
    -varnorm        no      no
    -verbose        no      no
    -warp_params                
    -warp_type      inverse_linear  inverse_linear
    -wbeam          7e-29       7.000000e-029
    -wip            0.65        6.500000e-001
    -wlen           0.025625    2.562500e-002
    
    INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
    INFO: cmn.c(143): mean[0]= 12.00, mean[1..12]= 0.0
    INFO: acmod.c(154): Reading linear feature transformation from cmusphinx-en-us-5.2/feature_transform
    INFO: mdef.c(518): Reading model definition: cmusphinx-en-us-5.2/mdef
    INFO: bin_mdef.c(181): Allocating 142124 * 8 bytes (1110 KiB) for CD tree
    INFO: tmat.c(206): Reading HMM transition probability matrices: cmusphinx-en-us-5.2/transition_matrices
    INFO: acmod.c(117): Attempting to use PTM computation module
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter: cmusphinx-en-us-5.2/means
    INFO: ms_gauden.c(292): 5138 codebook, 1 feature, size: 
    INFO: ms_gauden.c(294):  32x36
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter: cmusphinx-en-us-5.2/variances
    INFO: ms_gauden.c(292): 5138 codebook, 1 feature, size: 
    INFO: ms_gauden.c(294):  32x36
    INFO: ms_gauden.c(354): 813 variance values floored
    INFO: ptm_mgau.c(801): Number of codebooks exceeds 256: 5138
    INFO: acmod.c(119): Attempting to use semi-continuous computation module
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter: cmusphinx-en-us-5.2/means
    INFO: ms_gauden.c(292): 5138 codebook, 1 feature, size: 
    INFO: ms_gauden.c(294):  32x36
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter: cmusphinx-en-us-5.2/variances
    INFO: ms_gauden.c(292): 5138 codebook, 1 feature, size: 
    INFO: ms_gauden.c(294):  32x36
    INFO: ms_gauden.c(354): 813 variance values floored
    INFO: acmod.c(121): Falling back to general multi-stream GMM computation
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter: cmusphinx-en-us-5.2/means
    INFO: ms_gauden.c(292): 5138 codebook, 1 feature, size: 
    INFO: ms_gauden.c(294):  32x36
    INFO: ms_gauden.c(198): Reading mixture gaussian parameter: cmusphinx-en-us-5.2/variances
    INFO: ms_gauden.c(292): 5138 codebook, 1 feature, size: 
    INFO: ms_gauden.c(294):  32x36
    INFO: ms_gauden.c(354): 813 variance values floored
    INFO: ms_senone.c(149): Reading senone mixture weights: cmusphinx-en-us-5.2/mixture_weights
    INFO: ms_senone.c(200): Truncating senone logs3(pdf) values by 10 bits
    INFO: ms_senone.c(207): Not transposing mixture weights in memory
    INFO: ms_senone.c(268): Read mixture weights for 5138 senones: 1 features x 32 codewords
    INFO: ms_senone.c(320): Mapping senones to individual codebooks
    INFO: ms_mgau.c(141): The value of topn: 4
    INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0
    INFO: dict.c(320): Allocating 138627 * 20 bytes (2707 KiB) for word entries
    INFO: dict.c(333): Reading main dictionary: ..\..\..\model\en-us\cmudict-en-us.dict
    INFO: dict.c(213): Allocated 1014 KiB for strings, 1677 KiB for phones
    INFO: dict.c(336): 134522 words read
    INFO: dict.c(358): Reading filler dictionary: cmusphinx-en-us-5.2/noisedict
    INFO: dict.c(213): Allocated 0 KiB for strings, 0 KiB for phones
    INFO: dict.c(361): 9 words read
    INFO: dict2pid.c(396): Building PID tables for dictionary
    INFO: dict2pid.c(406): Allocating 46^3 * 2 bytes (190 KiB) for word-initial triphones
    INFO: dict2pid.c(132): Allocated 25576 bytes (24 KiB) for word-final triphones
    INFO: dict2pid.c(196): Allocated 25576 bytes (24 KiB) for single-phone word triphones
    INFO: ngram_model_trie.c(347): Trying to read LM in trie binary format
    INFO: ngram_search_fwdtree.c(99): 790 unique initial diphones
    INFO: ngram_search_fwdtree.c(148): 0 root, 0 non-root channels, 61 single-phone words
    INFO: ngram_search_fwdtree.c(186): Creating search tree
    INFO: ngram_search_fwdtree.c(192): before: 0 root, 0 non-root channels, 61 single-phone words
    INFO: ngram_search_fwdtree.c(326): after: max nonroot chan increased to 152075
    INFO: ngram_search_fwdtree.c(339): after: 722 root, 151947 non-root channels, 57 single-phone words
    INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25
    INFO: continuous.c(307): pocketsphinx_continuous.exe COMPILED ON: Jan 24 2016, AT: 07:35:37
    
    INFO: cmn_prior.c(131): cmn_prior_update: from < 40.00  3.00 -1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00 >
    INFO: cmn_prior.c(149): cmn_prior_update: to   < 32.03 -24.01 -7.83 -7.19 -4.55 -3.87 -1.97 -1.61 -1.01 -1.77 -1.75 -1.08 -0.03 >
    INFO: ngram_search_fwdtree.c(1553):      555 words recognized (6/fr)
    INFO: ngram_search_fwdtree.c(1555):    90791 senones evaluated (976/fr)
    INFO: ngram_search_fwdtree.c(1559):    85910 channels searched (923/fr), 60390 1st, 813 last
    INFO: ngram_search_fwdtree.c(1562):      813 words for which last channels evaluated (8/fr)
    INFO: ngram_search_fwdtree.c(1564):       61 candidate words for entering last phone (0/fr)
    INFO: ngram_search_fwdtree.c(1567): fwdtree 0.25 CPU 0.268 xRT
    INFO: ngram_search_fwdtree.c(1570): fwdtree 0.26 wall 0.281 xRT
    INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 1 words
    INFO: ngram_search_fwdflat.c(948):      507 words recognized (5/fr)
    INFO: ngram_search_fwdflat.c(950):     1875 senones evaluated (20/fr)
    INFO: ngram_search_fwdflat.c(952):      804 channels searched (8/fr)
    INFO: ngram_search_fwdflat.c(954):      804 words searched (8/fr)
    INFO: ngram_search_fwdflat.c(957):       26 word transitions (0/fr)
    INFO: ngram_search_fwdflat.c(960): fwdflat 0.02 CPU 0.017 xRT
    INFO: ngram_search_fwdflat.c(963): fwdflat 0.00 wall 0.004 xRT
    INFO: ngram_search.c(1253): lattice start node <s>.0 end node </s>.88
    INFO: ngram_search.c(1279): Eliminated 0 nodes before end node
    INFO: ngram_search.c(1384): Lattice has 131 nodes, 301 links
    you and
    INFO: ps_lattice.c(1380): Bestpath score: -1802
    INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(</s>:88:91) = -106556
    INFO: ps_lattice.c(1441): Joint P(O,S) = -117364 P(S|O) = -10808
    INFO: ngram_search.c(875): bestpath 0.00 CPU 0.000 xRT
    INFO: ngram_search.c(878): bestpath 0.00 wall 0.000 xRT
    INFO: ngram_search_fwdtree.c(952): cand_sf[] increased to 64 entries
    INFO: ngram_search.c(467): Resized score stack to 200000 entries
    INFO: ngram_search.c(459): Resized backpointer table to 10000 entries
    INFO: ngram_search.c(467): Resized score stack to 400000 entries
    INFO: cmn_prior.c(131): cmn_prior_update: from < 32.03 -24.01 -7.83 -7.19 -4.55 -3.87 -1.97 -1.61 -1.01 -1.77 -1.75 -1.08 -0.03 >
    INFO: cmn_prior.c(149): cmn_prior_update: to   < 50.70 -7.39 -8.04  9.92  0.74 -1.07  2.16 -3.96 -3.23  0.15 -4.53 -2.35 -1.18 >
    INFO: ngram_search.c(459): Resized backpointer table to 20000 entries
    INFO: ngram_search_fwdtree.c(1553):    10216 words recognized (33/fr)
    INFO: ngram_search_fwdtree.c(1555):  1088647 senones evaluated (3546/fr)
    INFO: ngram_search_fwdtree.c(1559):  5613145 channels searched (18283/fr), 204819 1st, 258889 last
    INFO: ngram_search_fwdtree.c(1562):    16233 words for which last channels evaluated (52/fr)
    INFO: ngram_search_fwdtree.c(1564):   549369 candidate words for entering last phone (1789/fr)
    INFO: ngram_search_fwdtree.c(1567): fwdtree 4.65 CPU 1.514 xRT
    INFO: ngram_search_fwdtree.c(1570): fwdtree 4.66 wall 1.517 xRT
    INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 215 words
    INFO: ngram_search_fwdflat.c(948):     8379 words recognized (27/fr)
    INFO: ngram_search_fwdflat.c(950):   315901 senones evaluated (1029/fr)
    INFO: ngram_search_fwdflat.c(952):   545193 channels searched (1775/fr)
    INFO: ngram_search_fwdflat.c(954):    27711 words searched (90/fr)
    INFO: ngram_search_fwdflat.c(957):    15715 word transitions (51/fr)
    INFO: ngram_search_fwdflat.c(960): fwdflat 0.78 CPU 0.254 xRT
    INFO: ngram_search_fwdflat.c(963): fwdflat 0.78 wall 0.253 xRT
    INFO: ngram_search.c(1253): lattice start node <s>.0 end node </s>.303
    INFO: ngram_search.c(1279): Eliminated 4 nodes before end node
    INFO: ngram_search.c(1384): Lattice has 978 nodes, 12700 links
    INFO: ps_lattice.c(1380): Bestpath score: -11389
    INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(</s>:303:305) = -604627
    INFO: ps_lattice.c(1441): Joint P(O,S) = -654861 P(S|O) = -50234
    INFO: ngram_search.c(875): bestpath 0.05 CPU 0.015 xRT
    INFO: ngram_search.c(878): bestpath 0.05 wall 0.018 xRT
    INFO: cmn_prior.c(131): cmn_prior_update: from < 50.70 -7.39 -8.04  9.92  0.74 -1.07  2.16 -3.96 -3.23  0.15 -4.53 -2.35 -1.18 >
    INFO: cmn_prior.c(149): cmn_prior_update: to   < 53.28 -3.54 -10.48  8.31 -0.25  0.65  3.00 -3.78 -6.03  2.07 -2.03 -3.30 -1.17 >
    INFO: ngram_search_fwdtree.c(1553):     3042 words recognized (11/fr)
    INFO: ngram_search_fwdtree.c(1555):   729197 senones evaluated (2642/fr)
    INFO: ngram_search_fwdtree.c(1559):  1971764 channels searched (7144/fr), 153505 1st, 138939 last
    INFO: ngram_search_fwdtree.c(1562):     8071 words for which last channels evaluated (29/fr)
    INFO: ngram_search_fwdtree.c(1564):    91938 candidate words for entering last phone (333/fr)
    INFO: ngram_search_fwdtree.c(1567): fwdtree 2.29 CPU 0.831 xRT
    INFO: ngram_search_fwdtree.c(1570): fwdtree 2.29 wall 0.829 xRT
    INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 167 words
    INFO: ngram_search_fwdflat.c(948):     1953 words recognized (7/fr)
    INFO: ngram_search_fwdflat.c(950):   137953 senones evaluated (500/fr)
    INFO: ngram_search_fwdflat.c(952):   166511 channels searched (603/fr)
    INFO: ngram_search_fwdflat.c(954):    11861 words searched (42/fr)
    INFO: ngram_search_fwdflat.c(957):     8561 word transitions (31/fr)
    INFO: ngram_search_fwdflat.c(960): fwdflat 0.31 CPU 0.113 xRT
    INFO: ngram_search_fwdflat.c(963): fwdflat 0.32 wall 0.116 xRT
    INFO: ngram_search.c(1253): lattice start node <s>.0 end node </s>.232
    INFO: ngram_search.c(1279): Eliminated 3 nodes before end node
    INFO: ngram_search.c(1384): Lattice has 449 nodes, 947 links
    INFO: ps_lattice.c(1380): Bestpath score: -8387
    there is no doubt whatever about that
    marley was dead to begin with
    INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(</s>:232:274) = -611164
    INFO: ps_lattice.c(1441): Joint P(O,S) = -687648 P(S|O) = -76484
    INFO: ngram_search.c(875): bestpath 0.00 CPU 0.000 xRT
    INFO: ngram_search.c(878): bestpath 0.00 wall 0.001 xRT
    INFO: cmn_prior.c(99): cmn_prior_update: from < 53.28 -3.54 -10.48  8.31 -0.25  0.65  3.00 -3.78 -6.03  2.07 -2.03 -3.30 -1.17 >
    INFO: cmn_prior.c(116): cmn_prior_update: to   < 54.28 -2.88 -9.62  9.65 -0.34  1.39  4.18 -2.85 -6.48  1.69 -2.54 -3.53 -0.77 >
    INFO: cmn_prior.c(131): cmn_prior_update: from < 54.28 -2.88 -9.62  9.65 -0.34  1.39  4.18 -2.85 -6.48  1.69 -2.54 -3.53 -0.77 >
    INFO: cmn_prior.c(149): cmn_prior_update: to   < 53.77 -3.48 -9.69 10.57  0.95 -0.06  2.72 -4.49 -4.85  1.55 -3.28 -2.92 -1.39 >
    INFO: ngram_search_fwdtree.c(1553):     4591 words recognized (14/fr)
    INFO: ngram_search_fwdtree.c(1555):   746928 senones evaluated (2341/fr)
    INFO: ngram_search_fwdtree.c(1559):  1725585 channels searched (5409/fr), 156642 1st, 191540 last
    INFO: ngram_search_fwdtree.c(1562):    10403 words for which last channels evaluated (32/fr)
    INFO: ngram_search_fwdtree.c(1564):    67474 candidate words for entering last phone (211/fr)
    INFO: ngram_search_fwdtree.c(1567): fwdtree 2.23 CPU 0.699 xRT
    INFO: ngram_search_fwdtree.c(1570): fwdtree 2.24 wall 0.701 xRT
    INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 227 words
    INFO: ngram_search_fwdflat.c(948):     3005 words recognized (9/fr)
    INFO: ngram_search_fwdflat.c(950):   199154 senones evaluated (624/fr)
    INFO: ngram_search_fwdflat.c(952):   258005 channels searched (808/fr)
    INFO: ngram_search_fwdflat.c(954):    17375 words searched (54/fr)
    INFO: ngram_search_fwdflat.c(957):    12108 word transitions (37/fr)
    INFO: ngram_search_fwdflat.c(960): fwdflat 0.45 CPU 0.142 xRT
    INFO: ngram_search_fwdflat.c(963): fwdflat 0.45 wall 0.141 xRT
    INFO: ngram_search.c(1253): lattice start node <s>.0 end node </s>.281
    INFO: ngram_search.c(1279): Eliminated 0 nodes before end node
    INFO: ngram_search.c(1384): Lattice has 658 nodes, 3168 links
    INFO: ps_lattice.c(1380): Bestpath score: -10760
    INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(</s>:281:317) = -717890
    INFO: ps_lattice.c(1441): Joint P(O,S) = -799839 P(S|O) = -81949
    INFO: ngram_search.c(875): bestpath 0.00 CPU 0.000 xRT
    INFO: ngram_search.c(878): bestpath 0.00 wall 0.002 xRT
    INFO: cmn_prior.c(99): cmn_prior_update: from < 53.77 -3.48 -9.69 10.57  0.95 -0.06  2.72 -4.49 -4.85  1.55 -3.28 -2.92 -1.39 >
    INFO: cmn_prior.c(116): cmn_prior_update: to   < 54.77 -2.02 -9.99 11.09  1.25  0.86  3.64 -4.14 -5.56  1.39 -2.35 -3.91 -1.65 >
    INFO: cmn_prior.c(131): cmn_prior_update: from < 54.77 -2.02 -9.99 11.09  1.25  0.86  3.64 -4.14 -5.56  1.39 -2.35 -3.91 -1.65 >
    INFO: cmn_prior.c(149): cmn_prior_update: to   < 55.85 -1.94 -11.23  8.14 -0.51  0.71  2.91 -4.37 -6.67  3.00 -1.53 -3.09 -1.14 >
    INFO: ngram_search_fwdtree.c(1553):     3010 words recognized (11/fr)
    INFO: ngram_search_fwdtree.c(1555):   774655 senones evaluated (2709/fr)
    INFO: ngram_search_fwdtree.c(1559):  2069923 channels searched (7237/fr), 161855 1st, 146208 last
    INFO: ngram_search_fwdtree.c(1562):     8620 words for which last channels evaluated (30/fr)
    INFO: ngram_search_fwdtree.c(1564):   106065 candidate words for entering last phone (370/fr)
    INFO: ngram_search_fwdtree.c(1567): fwdtree 2.37 CPU 0.829 xRT
    INFO: ngram_search_fwdtree.c(1570): fwdtree 2.39 wall 0.835 xRT
    INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 159 words
    INFO: ngram_search_fwdflat.c(948):     1957 words recognized (7/fr)
    INFO: ngram_search_fwdflat.c(950):   132091 senones evaluated (462/fr)
    INFO: ngram_search_fwdflat.c(952):   150625 channels searched (526/fr)
    INFO: ngram_search_fwdflat.c(954):    11249 words searched (39/fr)
    INFO: ngram_search_fwdflat.c(957):     8252 word transitions (28/fr)
    INFO: ngram_search_fwdflat.c(960): fwdflat 0.30 CPU 0.104 xRT
    INFO: ngram_search_fwdflat.c(963): fwdflat 0.28 wall 0.099 xRT
    INFO: ngram_search.c(1253): lattice start node <s>.0 end node </s>.231
    INFO: ngram_search.c(1279): Eliminated 3 nodes before end node
    there is no doubt whatsoever about that
    INFO: ngram_search.c(1384): Lattice has 452 nodes, 848 links
    INFO: ps_lattice.c(1380): Bestpath score: -8264
    INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(</s>:231:284) = -639431
    INFO: ps_lattice.c(1441): Joint P(O,S) = -678850 P(S|O) = -39419
    INFO: ngram_search.c(875): bestpath 0.00 CPU 0.000 xRT
    INFO: ngram_search.c(878): bestpath 0.00 wall 0.001 xRT
    INFO: ngram_search_fwdtree.c(432): TOTAL fwdtree 11.79 CPU 0.924 xRT
    INFO: ngram_search_fwdtree.c(435): TOTAL fwdtree 11.83 wall 0.927 xRT
    INFO: ngram_search_fwdflat.c(176): TOTAL fwdflat 1.86 CPU 0.145 xRT
    INFO: ngram_search_fwdflat.c(179): TOTAL fwdflat 1.83 wall 0.144 xRT
    INFO: ngram_search.c(303): TOTAL bestpath 0.05 CPU 0.004 xRT
    INFO: ngram_search.c(306): TOTAL bestpath 0.06 wall 0.005 xRT
    
     
    • Nickolay V. Shmyrev

      You can see in the log that the initial CMN estimate is off. If you set a better -cmninit value, it will recognize words from the start.
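      To see why a bad prior matters, here is a toy sketch in plain Python (not the actual PocketSphinx code; the update weight and the helper name are assumptions) of how prior/live CMN mis-normalizes early frames when the initial mean is wrong:

```python
def live_cmn(frames, init_mean, weight=200.0):
    """Running-mean CMN: subtract the current estimate, then update it.

    'weight' controls how slowly the estimate moves (an assumed stand-in
    for PocketSphinx's smoothing; the real code differs in detail).
    """
    mean = init_mean
    out = []
    for x in frames:
        out.append(x - mean)
        mean += (x - mean) / (weight + 1.0)  # slow exponential update
    return out, mean

# Audio whose true C0 mean is ~52 (roughly what the log converges to),
# decoded with an initial estimate of 40: the first frames are off by
# the full gap, later frames are nearly correct.
frames = [52.0] * 500
normed, final_mean = live_cmn(frames, init_mean=40.0)
# normed[0] is exactly 12.0 (badly normalized); normed[-1] is close
# to 0, and final_mean has converged near the true mean.
```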

       

      Last edit: Nickolay V. Shmyrev 2016-09-14
      • Daniel Wolf

        Daniel Wolf - 2016-09-15

        Thanks, that makes sense!

        I'd like to automatically determine initial CMN values for a given WAVE file. I found another thread, where you recommend this approach:

        no initial estimate -> record full utterance -> normalize only last CMN (current mode) -> decode
        few decoding cycles are done -> have reliable CMN estimate -> normalize CMN (live_mode)

        Is there any existing code I can look at to see how this is done?
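        As a rough illustration of the quoted two-phase scheme (plain Python with toy numbers; batch_cmn and live_cmn are made-up names, not PocketSphinx API):

```python
def batch_cmn(frames):
    """'current' mode: normalize an utterance against its own full mean."""
    mean = sum(frames) / len(frames)
    return [x - mean for x in frames], mean

def live_cmn(frames, init_mean, weight=200.0):
    """'live' mode: running mean, seeded with a prior estimate."""
    mean = init_mean
    out = []
    for x in frames:
        out.append(x - mean)
        mean += (x - mean) / (weight + 1.0)
    return out, mean

# Phase 1: no reliable prior yet -> buffer the whole first utterance
# and normalize it against its own mean.
utt1 = [51.0, 52.0, 53.0, 52.0]
normed1, estimate = batch_cmn(utt1)      # estimate == 52.0

# Phase 2: once a reliable estimate exists, switch to live mode,
# seeding the running mean with it.
utt2 = [52.5] * 100
normed2, _ = live_cmn(utt2, init_mean=estimate)
```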

         
        • Daniel Wolf

          Daniel Wolf - 2016-09-16

          After a bit more research, my understanding is this:

          • pocketsphinx_continuous continually adapts the CMN values to the input using a variation of a sliding-window approach. This allows for low latency, but can lead to poor results at the start, or immediately after the recording characteristics change mid-recording.
          • pocketsphinx_batch, on the other hand, does not use historic cepstral values. Instead, it analyzes each utterance as a whole, determines the actual mean value for this utterance, and subtracts it.

          So I assume that batch mode will always give better results and should be preferred whenever latency is not an issue. Is this correct?

           
          • Nickolay V. Shmyrev

            pocketsphinx_batch, on the other hand, does not use historic cepstral values. Instead, it analyzes each utterance as a whole, determines the actual mean value for this utterance, and subtracts it.

            Sort of. Batch also has issues: for example, when half of your audio is loud and half is quiet speech (say, a call recording with two speakers mixed together). The best solution would be short-term normalization, which normalizes over a range of about 0.1 seconds. That would have to be part of new acoustic-model research, though, and is a pretty complex problem.
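            The idea of short-term normalization can be sketched like this (illustrative Python, not an existing CMUSphinx feature; window size and frame rate are assumptions):

```python
def short_term_cmn(frames, frame_rate=100, window_s=0.1):
    """Subtract the mean over short (~0.1 s) windows rather than over
    the whole utterance, so level changes mid-recording are tracked."""
    w = max(1, int(frame_rate * window_s))
    out = []
    for i in range(0, len(frames), w):
        chunk = frames[i:i + w]
        m = sum(chunk) / len(chunk)
        out.extend(x - m for x in chunk)
    return out

# Loud first half, quiet second half (e.g. two speakers mixed in a call):
sig = [60.0] * 50 + [40.0] * 50
out = short_term_cmn(sig)
# Whole-utterance CMN would leave residuals of +/-10 here; the
# short-term version centers each window near zero.
```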

             
            • Daniel Wolf

              Daniel Wolf - 2016-09-16

              Thank you, Nickolay; that makes sense. I'm working with dialog recordings for computer games, so there will never be multiple speakers in a single recording. The volume should be pretty stable within an utterance. So I'll look into the code for pocketsphinx_batch.

              Two questions regarding that:

              • I assume that what I described is what pocketsphinx_batch calls 'current' normalization scheme, whereas 'prior' would always use the CMN values from the previous utterance. When does it make sense to use 'prior' mode?
              • pocketsphinx_batch expects a "file listing utterances to be processed". I assume that this file must contain the names of WAVE files plus timecodes of utterances, but I couldn't find any documentation on the exact file format. Could you point me to some documentation or an example file?
               
              • Nickolay V. Shmyrev

                I assume that what I described is what pocketsphinx_batch calls 'current' normalization scheme, whereas 'prior' would always use the CMN values from the previous utterance. When does it make sense to use 'prior' mode?

                There is no 'prior/current' anymore; it is now called 'live' or 'batch'. You can use 'batch' if you don't need continuous processing. But again, it needs testing. For example, if you have many very short utterances like 'yes', batch is not very efficient for them; live gives a more reliable estimate. On the other hand, batch is used for training, so in testing it is closer to the training conditions. If you have high-quality, volume-normalized audio like in games, both methods should be fine; there should be no difference at all.

                pocketsphinx_batch expects a "file listing utterances to be processed". I assume that this file must contain the names of WAVE files plus timecodes of utterances, but I couldn't find any documentation on the exact file format. Could you point me to some documentation or an example file?

                I don't think timecodes are frequently used. http://cmusphinx.sourceforge.net/wiki/tutorialtuning explains how to run batch processing. You can call ps_process_raw from API with final_utt set to TRUE to use batch processing.

                 
                • Daniel Wolf

                  Daniel Wolf - 2016-09-16

                  I understand your argument concerning very short utterances. A similar argument can be made for "utterances" detected by VAD that actually contain only breathing. In batch mode, calculating CMN values based on such an utterance probably won't give ideal results.

                  Given that my recordings are usually very "stable", I had the following idea: I could concatenate the utterances of an entire recording into a single long utterance, then analyze the first 10 seconds or so of this combined utterance to get reliable CMN values for the entire recording. Then I could use these fixed CMN values for all utterances of the recording. That way, anomalies like very short utterances or breath utterances won't affect the CMN values.

                  I'd need to do two things:

                  1. Analyze a number of samples with minimal processing, just to get the CMN values
                  2. Perform word recognition and alignment on an utterance using these fixed CMN values.

                  Is this possible using Pocketsphinx?
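                   The two steps above might be sketched as follows (toy Python illustrating the math only; the function names are hypothetical, and PocketSphinx itself would need code changes to accept a fixed mean):

```python
def estimate_cmn(frames, frame_rate=100, seconds=10.0):
    """Step 1: estimate the cepstral mean from the first ~10 s of the
    concatenated recording (minimal processing, no decoding needed)."""
    head = frames[:int(frame_rate * seconds)]
    return sum(head) / len(head)

def apply_fixed_cmn(utterance, mean):
    """Step 2: normalize every utterance with the same fixed mean."""
    return [x - mean for x in utterance]

# 15 s of frames from a stable game-dialog recording (toy values):
recording = [52.0] * 1500
mean = estimate_cmn(recording)

# Even a short breath "utterance" is normalized with the reliable
# recording-wide mean, instead of skewing its own estimate:
breath = [52.3, 51.8]
normed = apply_fixed_cmn(breath, mean)
```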

                   
                  • Nickolay V. Shmyrev

                    It might be possible with small code modifications.

                     
                    • Daniel Wolf

                      Daniel Wolf - 2016-09-17

                      That's good to hear. I'll look into it, then.

                      One more question before I do: You said: "You can call ps_process_raw from API with final_utt set to TRUE to use batch processing." I don't see a final_utt parameter in ps_process_raw. My understanding was that to use batch CMN mode, I simply specify -cmn batch when creating the decoder configuration.

                      So what is the correct way to use batch CMN in a program that's similar to pocketsphinx_continuous?

                       
  • Jonas Helm

    Jonas Helm - 2016-09-20

    First, thank you Nickolay for the answers on the other thread and also here.
    I got batch mode to work now. In my case, like Daniel, I also had to change "-cmn" in the feat.params file to "batch"; using set_string in the configuration somehow didn't work.
    The only thing is that the very first decoding run still gives slightly different values. After that, decoding the same file over and over (which I did to observe the decoder's behaviour) delivers exactly the same CMN values every time, to my satisfaction, because I now have reproducible results in principle.
    I wonder why this is happening?

     
    • Nickolay V. Shmyrev

      I wonder why this is happening?

      You could provide at least the log to give more information about your problem.

       
  • Jonas Helm

    Jonas Helm - 2016-09-20

    Of course ;)
    You can see that the first cmn.c line shows slightly different values than the second one, for the same file being decoded.

    INFO: pocketsphinx.c(152): Parsed model-specific feature parameters from ./models/ger2/voxforge.cd_cont_3000/feat.params
    Current configuration:
    [NAME]          [DEFLT]     [VALUE]
    -agc            none        none
    -agcthresh      2.0     2.000000e+00
    -allphone               
    -allphone_ci        no      no
    -alpha          0.97        9.700000e-01
    -ascale         20.0        2.000000e+01
    -aw         1       1
    -backtrace      no      no
    -beam           1e-48       1.000000e-48
    -bestpath       yes     yes
    -bestpathlw     9.5     9.500000e+00
    -ceplen         13      13
    -cmn            live        batch
    -cmninit        40,3,-1     40,3,-1
    -compallsen     no      no
    -debug                  0
    -dict                   ./models/ger2/voxforge.dic
    -dictcase       no      no
    -dither         no      no
    -doublebw       no      no
    -ds         1       1
    -fdict                  
    -feat           1s_c_d_dd   1s_c_d_dd
    -featparams             
    -fillprob       1e-8        1.000000e-08
    -frate          100     100
    -fsg                    
    -fsgusealtpron      yes     yes
    -fsgusefiller       yes     yes
    -fwdflat        yes     yes
    -fwdflatbeam        1e-64       1.000000e-64
    -fwdflatefwid       4       4
    -fwdflatlw      8.5     8.500000e+00
    -fwdflatsfwin       25      25
    -fwdflatwbeam       7e-29       7.000000e-29
    -fwdtree        yes     yes
    -hmm                    ./models/ger2/voxforge.cd_cont_3000
    -input_endian       little      little
    -jsgf                   
    -keyphrase              
    -kws                    
    -kws_delay      10      10
    -kws_plp        1e-1        1.000000e-01
    -kws_threshold      1       1.000000e+00
    -latsize        5000        5000
    -lda                    
    -ldadim         0       0
    -lifter         0       22
    -lm                 ./models/ger2/voxforge.lm.dmp
    -lmctl                  
    -lmname                 
    -logbase        1.0001      1.000100e+00
    -logfn                  
    -logspec        no      no
    -lowerf         133.33334   1.300000e+02
    -lpbeam         1e-40       1.000000e-40
    -lponlybeam     7e-29       7.000000e-29
    -lw         6.5     6.500000e+00
    -maxhmmpf       30000       30000
    -maxwpf         -1      -1
    -mdef                   
    -mean                   
    -mfclogdir              
    -min_endfr      0       0
    -mixw                   
    -mixwfloor      0.0000001   1.000000e-07
    -mllr                   
    -mmap           yes     yes
    -ncep           13      13
    -nfft           512     512
    -nfilt          40      25
    -nwpen          1.0     1.000000e+00
    -pbeam          1e-48       1.000000e-48
    -pip            1.0     1.000000e+00
    -pl_beam        1e-10       1.000000e-10
    -pl_pbeam       1e-10       1.000000e-10
    -pl_pip         1.0     1.000000e+00
    -pl_weight      3.0     3.000000e+00
    -pl_window      5       5
    -rawlogdir              
    -remove_dc      no      no
    -remove_noise       yes     yes
    -remove_silence     yes     yes
    -round_filters      yes     yes
    -samprate       16000       1.600000e+04
    -seed           -1      -1
    -sendump                
    -senlogdir              
    -senmgau                
    -silprob        0.005       5.000000e-03
    -smoothspec     no      no
    -svspec                 
    -tmat                   
    -tmatfloor      0.0001      1.000000e-04
    -topn           4       4
    -topn_beam      0       0
    -toprule                
    -transform      legacy      dct
    -unit_area      yes     yes
    -upperf         6855.4976   6.800000e+03
    -uw         1.0     1.000000e+00
    -vad_postspeech     50      50
    -vad_prespeech      20      20
    -vad_startspeech    10      10
    -vad_threshold      2.0     2.000000e+00
    -var                    
    -varfloor       0.0001      1.000000e-04
    -varnorm        no      no
    -verbose        no      no
    -warp_params                
    -warp_type      inverse_linear  inverse_linear
    -wbeam          7e-29       7.000000e-29
    -wip            0.65        6.500000e-01
    -wlen           0.025625    2.562500e-02
    
    INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='batch', VARNORM='no', AGC='none'
    INFO: cmn.c(97): mean[0]= 12.00, mean[1..12]= 0.0
    INFO: acmod.c(156): Reading linear feature transformation from ./models/ger2/voxforge.cd_cont_3000/feature_transform
    INFO: mdef.c(518): Reading model definition: ./models/ger2/voxforge.cd_cont_3000/mdef
    INFO: bin_mdef.c(181): Allocating 82313 * 8 bytes (643 KiB) for CD tree
    INFO: tmat.c(206): Reading HMM transition probability matrices: ./models/ger2/voxforge.cd_cont_3000/transition_matrices
    INFO: acmod.c(117): Attempting to use PTM computation module
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: ./models/ger2/voxforge.cd_cont_3000/means
    INFO: ms_gauden.c(242): 3198 codebook, 1 feature, size: 
    INFO: ms_gauden.c(244):  16x29
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: ./models/ger2/voxforge.cd_cont_3000/variances
    INFO: ms_gauden.c(242): 3198 codebook, 1 feature, size: 
    INFO: ms_gauden.c(244):  16x29
    INFO: ms_gauden.c(304): 3398 variance values floored
    INFO: ptm_mgau.c(804): Number of codebooks exceeds 256: 3198
    INFO: acmod.c(119): Attempting to use semi-continuous computation module
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: ./models/ger2/voxforge.cd_cont_3000/means
    INFO: ms_gauden.c(242): 3198 codebook, 1 feature, size: 
    INFO: ms_gauden.c(244):  16x29
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: ./models/ger2/voxforge.cd_cont_3000/variances
    INFO: ms_gauden.c(242): 3198 codebook, 1 feature, size: 
    INFO: ms_gauden.c(244):  16x29
    INFO: ms_gauden.c(304): 3398 variance values floored
    INFO: acmod.c(121): Falling back to general multi-stream GMM computation
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: ./models/ger2/voxforge.cd_cont_3000/means
    INFO: ms_gauden.c(242): 3198 codebook, 1 feature, size: 
    INFO: ms_gauden.c(244):  16x29
    INFO: ms_gauden.c(127): Reading mixture gaussian parameter: ./models/ger2/voxforge.cd_cont_3000/variances
    INFO: ms_gauden.c(242): 3198 codebook, 1 feature, size: 
    INFO: ms_gauden.c(244):  16x29
    INFO: ms_gauden.c(304): 3398 variance values floored
    INFO: ms_senone.c(149): Reading senone mixture weights: ./models/ger2/voxforge.cd_cont_3000/mixture_weights
    INFO: ms_senone.c(200): Truncating senone logs3(pdf) values by 10 bits
    INFO: ms_senone.c(207): Not transposing mixture weights in memory
    INFO: ms_senone.c(268): Read mixture weights for 3198 senones: 1 features x 16 codewords
    INFO: ms_senone.c(320): Mapping senones to individual codebooks
    INFO: ms_mgau.c(144): The value of topn: 4
    INFO: phone_loop_search.c(114): State beam -225 Phone exit beam -225 Insertion penalty 0
    INFO: dict.c(320): Allocating 31724 * 20 bytes (619 KiB) for word entries
    INFO: dict.c(333): Reading main dictionary: ./models/ger2/voxforge.dic
    INFO: dict.c(213): Dictionary size 27625, allocated 261 KiB for strings, 453 KiB for phones
    INFO: dict.c(336): 27625 words read
    INFO: dict.c(358): Reading filler dictionary: ./models/ger2/voxforge.cd_cont_3000/noisedict
    INFO: dict.c(213): Dictionary size 27628, allocated 0 KiB for strings, 0 KiB for phones
    INFO: dict.c(361): 3 words read
    INFO: dict2pid.c(396): Building PID tables for dictionary
    INFO: dict2pid.c(406): Allocating 66^3 * 2 bytes (561 KiB) for word-initial triphones
    INFO: dict2pid.c(132): Allocated 52536 bytes (51 KiB) for word-final triphones
    INFO: dict2pid.c(196): Allocated 52536 bytes (51 KiB) for single-phone word triphones
    INFO: ngram_model_trie.c(354): Trying to read LM in trie binary format
    INFO: ngram_search_fwdtree.c(74): Initializing search tree
    INFO: ngram_search_fwdtree.c(101): 605 unique initial diphones
    INFO: ngram_search_fwdtree.c(186): Creating search channels
    INFO: ngram_search_fwdtree.c(323): Max nonroot chan increased to 88448
    INFO: ngram_search_fwdtree.c(333): Created 605 root, 88320 non-root channels, 3 single-phone words
    INFO: ngram_search_fwdflat.c(157): fwdflat: min_ef_width = 4, max_sf_win = 25
    INFO: cmn.c(137): CMN: 56.81 10.49 -15.97 -1.20 -11.23  0.55 -2.62 -10.79  5.88 -2.95  0.00  1.65  1.44 
    INFO: ngram_search.c(467): Resized score stack to 200000 entries
    INFO: ngram_search.c(459): Resized backpointer table to 10000 entries
    INFO: ngram_search.c(467): Resized score stack to 400000 entries
    INFO: ngram_search.c(459): Resized backpointer table to 20000 entries
    INFO: ngram_search.c(467): Resized score stack to 800000 entries
    INFO: ngram_search_fwdtree.c(1550):    19310 words recognized (37/fr)
    INFO: ngram_search_fwdtree.c(1552):   866215 senones evaluated (1666/fr)
    INFO: ngram_search_fwdtree.c(1556):  2799399 channels searched (5383/fr), 229667 1st, 344393 last
    INFO: ngram_search_fwdtree.c(1559):    26152 words for which last channels evaluated (50/fr)
    INFO: ngram_search_fwdtree.c(1561):   114699 candidate words for entering last phone (220/fr)
    INFO: ngram_search_fwdtree.c(1564): fwdtree 2.04 CPU 0.393 xRT
    INFO: ngram_search_fwdtree.c(1567): fwdtree 2.04 wall 0.393 xRT
    INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 641 words
    INFO: ngram_search_fwdflat.c(948):    18418 words recognized (35/fr)
    INFO: ngram_search_fwdflat.c(950):   455759 senones evaluated (876/fr)
    INFO: ngram_search_fwdflat.c(952):  1041342 channels searched (2002/fr)
    INFO: ngram_search_fwdflat.c(954):    80559 words searched (154/fr)
    INFO: ngram_search_fwdflat.c(957):    41472 word transitions (79/fr)
    INFO: ngram_search_fwdflat.c(960): fwdflat 0.95 CPU 0.183 xRT
    INFO: ngram_search_fwdflat.c(963): fwdflat 0.95 wall 0.183 xRT
    INFO: ngram_search.c(1250): lattice start node <s>.0 end node </s>.477
    INFO: ngram_search.c(1276): Eliminated 0 nodes before end node
    INFO: ngram_search.c(1381): Lattice has 1000 nodes, 11698 links
    INFO: ps_lattice.c(1380): Bestpath score: -26229
    INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(</s>:477:518) = -1620915
    INFO: ps_lattice.c(1441): Joint P(O,S) = -1879425 P(S|O) = -258510
    INFO: ngram_search.c(1027): bestpath 0.06 CPU 0.012 xRT
    INFO: ngram_search.c(1030): bestpath 0.07 wall 0.014 xRT
    INFO: cmn.c(137): CMN: 56.79  9.97 -15.63 -0.86 -11.32  0.70 -2.79 -10.76  6.16 -2.98 -0.08  1.79  1.28 
    INFO: ngram_search_fwdtree.c(1550):    19070 words recognized (37/fr)
    INFO: ngram_search_fwdtree.c(1552):   849370 senones evaluated (1662/fr)
    INFO: ngram_search_fwdtree.c(1556):  2863305 channels searched (5603/fr), 220538 1st, 343566 last
    INFO: ngram_search_fwdtree.c(1559):    25787 words for which last channels evaluated (50/fr)
    INFO: ngram_search_fwdtree.c(1561):   121067 candidate words for entering last phone (236/fr)
    INFO: ngram_search_fwdtree.c(1564): fwdtree 2.03 CPU 0.397 xRT
    INFO: ngram_search_fwdtree.c(1567): fwdtree 2.03 wall 0.397 xRT
    INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 655 words
    INFO: ngram_search_fwdflat.c(948):    18699 words recognized (37/fr)
    INFO: ngram_search_fwdflat.c(950):   463114 senones evaluated (906/fr)
    INFO: ngram_search_fwdflat.c(952):  1066650 channels searched (2087/fr)
    INFO: ngram_search_fwdflat.c(954):    81393 words searched (159/fr)
    INFO: ngram_search_fwdflat.c(957):    41209 word transitions (80/fr)
    INFO: ngram_search_fwdflat.c(960): fwdflat 1.01 CPU 0.198 xRT
    INFO: ngram_search_fwdflat.c(963): fwdflat 1.02 wall 0.200 xRT
    INFO: ngram_search.c(1250): lattice start node <s>.0 end node </s>.468
    INFO: ngram_search.c(1276): Eliminated 0 nodes before end node
    INFO: ngram_search.c(1381): Lattice has 1021 nodes, 12080 links
    INFO: ps_lattice.c(1380): Bestpath score: -25911
    INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(</s>:468:509) = -1618457
    INFO: ps_lattice.c(1441): Joint P(O,S) = -1843456 P(S|O) = -224999
    INFO: ngram_search.c(1027): bestpath 0.08 CPU 0.015 xRT
    INFO: ngram_search.c(1030): bestpath 0.08 wall 0.015 xRT
    
     
    • Daniel Wolf

      Daniel Wolf - 2016-09-20

      You mention setting -cmn = batch. Did you also change the call to ps_process_raw so that it passes the entire utterance at once, with full_utt set to true?

      I didn't check the log details, but doing so vastly increased the recognition quality for me.
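      For reference, a minimal sketch of that pattern, assuming the classic pocketsphinx Python bindings, where process_raw wraps ps_process_raw; the model and file paths are placeholders, and running it needs the package and model files installed:

```python
# Sketch of batch decoding with the pocketsphinx Python bindings.
# All paths below are placeholders; adjust them to your setup.
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'model/en-us')                # acoustic model dir
config.set_string('-lm', 'model/en-us.lm.bin')          # language model
config.set_string('-dict', 'model/cmudict-en-us.dict')  # pronunciation dict
config.set_string('-cmn', 'batch')                      # batch CMN

decoder = Decoder(config)

with open('utterance.raw', 'rb') as f:  # 16 kHz, 16-bit mono PCM
    audio = f.read()                    # the whole utterance at once

decoder.start_utt()
# full_utt=True marks this buffer as the complete utterance, so batch
# CMN can be computed over all frames before the search runs.
decoder.process_raw(audio, False, True)  # (data, no_search, full_utt)
decoder.end_utt()

hyp = decoder.hyp()
print(hyp.hypstr if hyp else '')
```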

       
  • Jonas Helm

    Jonas Helm - 2016-09-21

    Yes, I did that. It's also very important for me: if I don't set "full_utt=True" in the call, batch mode falls back to live mode for that decoding pass. (You can see cmn_live.c instead of cmn.c updating in the log.)
    Here is just an extract of the log, showing the CMN values and the corresponding WER (it's very low-quality audio and some words are even missing from the dictionary, so the WER here is not representative, but it shows the difference between the first and second decoding pass):

    INFO: cmn.c(137): CMN: 56.81 10.49 -15.97 -1.20 -11.23  0.55 -2.62 -10.79  5.88 -2.95  0.00  1.65  1.44 
    deu#unidir_mic_a3n3.wav                            WER    92.86
    
    INFO: cmn.c(137): CMN: 56.79  9.97 -15.63 -0.86 -11.32  0.70 -2.79 -10.76  6.16 -2.98 -0.08  1.79  1.28 
    deu#unidir_mic_a3n3.wav                            WER    78.57
    

    If I run it again, the second set of CMN values and the WER stay constant.
    At the moment I decode every file twice and only fetch the recognized words on the second pass; that makes my program produce reproducible results.
    But I'm still wondering how the same file can yield different CMN values in batch mode, at least on the first pass. The full log is in my previous post.
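    The effect can be illustrated with a toy model of cepstral mean normalization. This is a simplified sketch, not PocketSphinx's actual cmn.c code, and all numbers are illustrative: a live-style running mean is seeded from a default prior (compare the `mean[0]= 12.00` initialization in the log) and only converges as frames arrive, so the earliest frames are normalized against a bad estimate, while a batch mean computed over the whole utterance centers every frame correctly:

```python
# Toy batch vs. live-style cepstral mean normalization (one coefficient).
# Illustrative only; not the PocketSphinx implementation.

def batch_cmn(frames):
    """Subtract the mean computed over the whole utterance."""
    mean = sum(frames) / len(frames)
    return [f - mean for f in frames]

def live_cmn(frames, prior_mean=12.0, alpha=0.1):
    """Subtract a running mean seeded from a default prior and
    updated exponentially as frames arrive."""
    mean = prior_mean
    out = []
    for f in frames:
        out.append(f - mean)
        mean = (1 - alpha) * mean + alpha * f  # adapts over time
    return out

frames = [50.0] * 20           # a flat "utterance" whose true mean is 50

batch = batch_cmn(frames)      # perfectly centered
live = live_cmn(frames)        # early frames are badly off

print(batch[0], batch[-1])     # 0.0 0.0
print(round(live[0], 2))       # 38.0 -> first frame far from zero
print(round(live[-1], 2))      # much closer to zero once the mean adapts
```

A second pass over the same file starts from an already-adapted mean, which is why the second set of CMN values (and the WER) stays constant.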

     
    • Nickolay V. Shmyrev

      There are also the noise and silence removal steps, which need some time to adapt. You can disable them with

       config.set_boolean('-remove_noise', False)
       config.set_boolean('-remove_silence', False)
      

      if your audio has no noise and silence is already stripped. With noise and silence removal disabled, the results should be identical.
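      The same switches exist on the pocketsphinx_continuous command line used earlier in this thread (both flags appear in the configuration dump above); the model paths here are taken from that dump and the input file name is a placeholder:

```shell
pocketsphinx_continuous -infile utterance.wav \
    -hmm ./models/ger2/voxforge.cd_cont_3000 \
    -lm ./models/ger2/voxforge.lm.dmp \
    -dict ./models/ger2/voxforge.dic \
    -remove_noise no -remove_silence no
```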

       
  • Jonas Helm

    Jonas Helm - 2016-09-23

    Thanks, that's interesting.
    Does this "-remove_noise" only affect the non-speech parts (equivalent to the silence removal, I guess), or is it a general noise-reduction step applied to the whole file?

     
    • Nickolay V. Shmyrev

      remove_noise works on the whole file, including speech parts.

       
