Sphinx3 - My experience

2004-07-07
2012-09-22
  • Michael Horbovetz

    I created a new set of LM files containing just the numbers (one, two, three, etc.). Recognition with this engine is horrible. I also tried the default LM that Sphinx3 ships with, but it was just as bad.

    Has anyone else had a different experience?

    Here is the log from running Sphinx3 with my customized LM.

    D:\speech\sphinx3\win32\batch>.\sphinx3-numbers

    D:\speech\sphinx3\win32\batch>echo off
    " "
    "sphinx3-simple:"
    "  Demo CMU Sphinx-3 decoder called with command line arguments."
    " "
    "<executing $S3CONTINUOUS, please wait>"
    INFO: d:\speech\sphinx3\src\libutil\cmd_ln.c(276): Parsing command line: \
            -mdef ./model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/hub4opensrc.6000.mdef \
            -fdict ./model/lm/numbers/filler.dict \
            -dict ./model/lm/numbers/numbers.dic \
            -mean ./model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/means \
            -var ./model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/variances \
            -mixw ./model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/mixture_weights \
            -tmat ./model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/transition_matrices\
            -upperf 6855.49756 \
            -lowerf 133.33334 \
            -nfilt 40 \
            -feat 1s_c_d_dd \
            -nfft 512 \
            -wlen 0.025625 \
            -samprate 16000 \
            -agc none \
            -varnorm no \
            -cmn current \
            -subvqbeam 1e-02 \
            -epl 4 \
            -fillprob 0.02 \
            -lw 9.5 \
            -maxwpf 1 \
            -beam 1e-40 \
            -pbeam 1e-30 \
            -wbeam 1e-20 \
            -maxhmmpf 1500 \
            -wend_beam 1e-1 \
            -ci_pbeam 1e-3 \
            -ds 2 \
            -lm ./model/lm/numbers/numbers.lm.DMP

    Configuration in effect:
    [NAME]          [DEFLT]         [VALUE]
    -agc            max             none
    -alpha          0.97            9.700000e-001
    -beam           1.0e-55         1.000000e-040
    -bghist         0               0
    -bptbldir
    -cepdir
    -ci_pbeam       1e-80           1.000000e-003
    -cmn            current         current
    -cond_ds        0               0
    -ctl
    -ctlcount       1000000000      1000000000
    -ctloffset      0               0
    -ctl_lm
    -dict                           ./model/lm/numbers/numbers.dic
    -ds             1               2
    -epl            3               4
    -fdict                          ./model/lm/numbers/filler.dict
    -feat           1s_c_d_dd       1s_c_d_dd
    -fillpen
    -fillprob       0.1             2.000000e-002
    -frate          100             100
    -gs
    -gs4gs          1               1
    -hmmdump        0               0
    -hmmhistbinsize 5000            5000
    -hyp
    -hypseg
    -latext         lat.gz          lat.gz
    -lextreedump    0               0
    -lm                             ./model/lm/numbers/numbers.lm.DMP
    -lmctlfn
    -lmdumpdir
    -lminmemory     0               0
    -log3table      1               1
    -logbase        1.0003          1.000300e+000
    -lowerf         200             1.333333e+002
    -lw             8.5             9.500000e+000
    -maxcepvecs     256             256
    -maxhistpf      100             100
    -maxhmmpf       20000           1500
    -maxhyplen      1000            1000
    -maxwpf         20              1
    -mdef                           ./model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/hub4opensrc.6000.mdef
    -mean                           ./model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/means
    -mixw                           ./model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/mixture_weights
    -mixwfloor      0.0000001       1.000000e-007
    -nfft           256             512
    -nfilt          31              40
    -Nlextree       3               3
    -outlatdir
    -outlatoldfmt   1               1
    -pbeam          1.0e-50         1.000000e-030
    -pheurtype      0               0
    -pl_beam        1.0e-80         0.000000e+000
    -pl_window      1               1
    -ptranskip      0               0
    -samprate       8000            16000
    -senmgau        .cont.          .cont.
    -silprob        0.1             1.000000e-001
    -subvq
    -subvqbeam      3.0e-3          1.000000e-002
    -svq4svq        0               0
    -tmat                           ./model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/transition_matrices
    -tmatfloor      0.0001          1.000000e-004
    -treeugprob     1               1
    -upperf         3500            6.855498e+003
    -utt
    -uw             0.7             7.000000e-001
    -var                            ./model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/variances
    -varfloor       0.0001          1.000000e-004
    -varnorm        no              no
    -vqeval         3               3
    -wbeam          1.0e-35         1.000000e-020
    -wend_beam      1.0e-80         1.000000e-001
    -wip            0.7             7.000000e-001
    -wlen           0.0256          2.562500e-002

    INFO: d:\speech\sphinx3\src\libs3decoder\kbcore.c(95): Initializing core models:
    INFO: d:\speech\sphinx3\src\libs3decoder\logs3.c(99): Initializing logbase: 1.000300e+000 (add table: 1)
    INFO: d:\speech\sphinx3\src\libs3decoder\logs3.c(161): Log-Add table size = 29350
    INFO: d:\speech\sphinx3\src\libs3decoder\feat.c(642): Initializing feature stream to type: '1s_c_d_dd', CMN='current', VARNORM='no', AGC='none'
    INFO: d:\speech\sphinx3\src\libs3decoder\mdef.c(594): Reading model definition: ./model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/hub4opensrc.6000.mdef
    INFO: d:\speech\sphinx3\src\libs3decoder\mdef.c(771): 48 CI-phone, 133500 CD-phone, 3 emitstate/phone, 144 CI-sen, 6144 Sen, 32639 Sen-Seq
    INFO: d:\speech\sphinx3\src\libs3decoder\dict.c(358): Reading main dictionary: ./model/lm/numbers/numbers.dic
    ERROR: "d:\speech\sphinx3\src\libs3decoder\dict.c", line 192: Line 7: Bad ciphone: AX; word SEVEN ignored
    INFO: d:\speech\sphinx3\src\libs3decoder\dict.c(361): 11 words read
    INFO: d:\speech\sphinx3\src\libs3decoder\dict.c(366): Reading filler dictionary: ./model/lm/numbers/filler.dict
    INFO: d:\speech\sphinx3\src\libs3decoder\dict.c(369): 3 words read
    INFO: d:\speech\sphinx3\src\libs3decoder\lm.c(739): LM read('./model/lm/numbers/numbers.lm.DMP', lw= 9.50, wip= -1188, uw= 0.70)
    INFO: d:\speech\sphinx3\src\libs3decoder\lm.c(553):       12 ug
    INFO: d:\speech\sphinx3\src\libs3decoder\lm.c(583):       20 bigrams [on disk]
    INFO: d:\speech\sphinx3\src\libs3decoder\lm.c(591):       10 trigrams [on disk]
    INFO: d:\speech\sphinx3\src\libs3decoder\lm.c(613):        3 bigram prob entries
    INFO: d:\speech\sphinx3\src\libs3decoder\lm.c(631):        3 trigram bowt entries
    INFO: d:\speech\sphinx3\src\libs3decoder\lm.c(647):        2 trigram prob entries
    INFO: d:\speech\sphinx3\src\libs3decoder\lm.c(662):        1 trigram segtable entries (512 segsize)
    INFO: d:\speech\sphinx3\src\libs3decoder\lm.c(696):       12 word strings
    ERROR: "d:\speech\sphinx3\src\libs3decoder\wid.c", line 171: SEVEN is not a word in dictionary and it is not a class tag.
    INFO: d:\speech\sphinx3\src\libs3decoder\wid.c(178): 1 LM words not in dictionary; ignored
    INFO: d:\speech\sphinx3\src\libs3decoder\cont_mgau.c(92): Reading mixture gaussian file './model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/means'
    INFO: d:\speech\sphinx3\src\libs3decoder\cont_mgau.c(244): 6144 mixture Gaussians, 8 components, veclen 26688544
    INFO: d:\speech\sphinx3\src\libs3decoder\cont_mgau.c(92): Reading mixture gaussian file './model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/variances'
    INFO: d:\speech\sphinx3\src\libs3decoder\cont_mgau.c(244): 6144 mixture Gaussians, 8 components, veclen 26688496
    INFO: d:\speech\sphinx3\src\libs3decoder\cont_mgau.c(265): Reading mixture weights file './model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/mixture_weights'
    ERROR: "d:\speech\sphinx3\src\libs3decoder\cont_mgau.c", line 346: Weight normalization failed for 3 senones
    INFO: d:\speech\sphinx3\src\libs3decoder\cont_mgau.c(358): Read 6144 x 8 mixture weights
    INFO: d:\speech\sphinx3\src\libs3decoder\cont_mgau.c(374): Removing uninitialized Gaussian densities 6 7 8
    INFO: d:\speech\sphinx3\src\libs3decoder\cont_mgau.c(404): 24 densities removed (3 mixtures removed entirely)
    INFO: d:\speech\sphinx3\src\libs3decoder\cont_mgau.c(412): Applying variance floor
    INFO: d:\speech\sphinx3\src\libs3decoder\cont_mgau.c(424): 0 variance values floored
    INFO: d:\speech\sphinx3\src\libs3decoder\cont_mgau.c(470): Precomputing Mahalanobis distance invariants
    INFO: d:\speech\sphinx3\src\libs3decoder\tmat.c(135): Reading HMM transition probability matrices: ./model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd/transition_matrices
    ERROR: "d:\speech\sphinx3\src\libs3decoder\tmat.c", line 197: Normalization failed for tmat 2 from state 0
    ERROR: "d:\speech\sphinx3\src\libs3decoder\tmat.c", line 197: Normalization failed for tmat 2 from state 1
    ERROR: "d:\speech\sphinx3\src\libs3decoder\tmat.c", line 197: Normalization failed for tmat 2 from state 2
    INFO: d:\speech\sphinx3\src\libs3decoder\tmat.c(217): Read 48 transition matrices of size 3x4
    INFO: d:\speech\sphinx3\src\libs3decoder\dict2pid.c(254): Building PID tables for dictionary
    INFO: d:\speech\sphinx3\src\libs3decoder\dict2pid.c(422): 63 composite states; 21 composite sseq
    INFO: d:\speech\sphinx3\src\libs3decoder\kbcore.c(225): Verifying models consistency:
    INFO: d:\speech\sphinx3\src\libs3decoder\kb.c(197): Building lextrees
    INFO: d:\speech\sphinx3\src\libs3decoder\kb.c(243): Creating Unigram Table
    INFO: d:\speech\sphinx3\src\libs3decoder\kb.c(246): Size of word table after unigram + words in class: 9
    INFO: d:\speech\sphinx3\src\libs3decoder\kb.c(263): Lextrees(3), 112 nodes(ug)
    INFO: d:\speech\sphinx3\src\libs3decoder\kb.c(291): Lextrees(3), 1 nodes(filler)

    INFO: d:\speech\sphinx3\src\libs3decoder\kb.c(317): Beam= -307006, PBeam= -230254, WBeam= -153503, SVQBeam= -15350
    INFO: d:\speech\sphinx3\src\libs3decoder\kb.c(322): Down Sampling Ratio = 2
    INFO: d:\speech\sphinx3\src\libs3decoder\kb.c(328): Conditional Down Sampling Parameter = 0
    INFO: d:\speech\sphinx3\src\libs3decoder\kb.c(333): GS map would be used for Gaussian Selection? = 1
    INFO: d:\speech\sphinx3\src\libs3decoder\kb.c(336): SVQ would be used as Gaussian Score ?= 0
    INFO: d:\speech\sphinx3\src\libs3decoder\kb.c(339): CI phone beam to prune the number of parent CI phones in CI-base GMM Selection = 23025
    INFO: d:\speech\sphinx3\src\libs3decoder\kb.c(345): Word-end pruning beam: 7675
    INFO: d:\speech\sphinx3\src\libs3decoder\kb.c(348): Phoneme look-ahead window size = 1
    WARNING: "d:\speech\sphinx3\src\libs3decoder\logs3.c", line 203: logs3 argument: 0.000000e+000; using S3_LOGPROB_ZERO
    INFO: d:\speech\sphinx3\src\libs3decoder\kb.c(353): Phoneme look-ahead beam = -939524096
    INFO: d:\speech\sphinx3\src\libs3decoder\vithist.c(77): Initializing Viterbi-history module
    Allocating 32 buffers of 2500 samples each

    System will listen for ~ 5.0 sec of speech
    Hit <cr> before speaking:
    INFO: d:\speech\sphinx3\src\libs3decoder\feat.c(971): Feature buffers initialized to 256 vectors
    INFO: d:\speech\sphinx3\src\libs3decoder\cmn_prior.c(72): mean[0]= 12.00, mean[1..12]= 0.0
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 11
    INFO: d:\speech\sphinx3\src\libs3decoder\approx_cont_mgau.c(328): Re-normalizing the previous score
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 15
    INFO: d:\speech\sphinx3\src\programs\main_live_example.c(128): PARTIAL HYP: <sil>
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 16
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 15
    INFO: d:\speech\sphinx3\src\libs3decoder\approx_cont_mgau.c(328): Re-normalizing the previous score
    INFO: d:\speech\sphinx3\src\programs\main_live_example.c(128): PARTIAL HYP: <sil>
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 16
    ERROR: "d:\speech\sphinx3\src\libs3decoder\vithist.c", line 599: No word exits from last frame in block 72
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 16
    INFO: d:\speech\sphinx3\src\programs\main_live_example.c(128): PARTIAL HYP: <sil> NINE
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 15
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 16
    INFO: d:\speech\sphinx3\src\programs\main_live_example.c(128): PARTIAL HYP: <sil> NINE <sil> EIGHT
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 16
    INFO: d:\speech\sphinx3\src\libs3decoder\approx_cont_mgau.c(328): Re-normalizing the previous score
    INFO: d:\speech\sphinx3\src\libs3decoder\approx_cont_mgau.c(328): Re-normalizing the previous score
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 15
    INFO: d:\speech\sphinx3\src\programs\main_live_example.c(128): PARTIAL HYP: <sil> NINE <sil> EIGHT TWO TWO
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 16
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 15
    INFO: d:\speech\sphinx3\src\programs\main_live_example.c(128): PARTIAL HYP: <sil> NINE <sil> EIGHT TWO TWO
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 16
    INFO: d:\speech\sphinx3\src\libs3decoder\approx_cont_mgau.c(328): Re-normalizing the previous score
    INFO: d:\speech\sphinx3\src\libs3decoder\approx_cont_mgau.c(328): Re-normalizing the previous score
    INFO: d:\speech\sphinx3\src\libs3decoder\approx_cont_mgau.c(328): Re-normalizing the previous score
    INFO: d:\speech\sphinx3\src\libs3decoder\approx_cont_mgau.c(328): Re-normalizing the previous score
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 16
    INFO: d:\speech\sphinx3\src\programs\main_live_example.c(128): PARTIAL HYP: <sil> NINE <sil> EIGHT TWO TWO <sil>
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 15
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 16
    INFO: d:\speech\sphinx3\src\programs\main_live_example.c(128): PARTIAL HYP: <sil> NINE <sil> EIGHT TWO TWO <sil>
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 16
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 15
    INFO: d:\speech\sphinx3\src\programs\main_live_example.c(128): PARTIAL HYP: <sil> NINE <sil> EIGHT TWO TWO <sil>
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 16
    INFO: d:\speech\sphinx3\src\programs\live.c(268): live_nfeatvec: 18

    Backtrace(null)
    LatID  SFrm  EFrm        AScr     LScr Type
        57     0    58     -844730   -74100   -1 <sil>
        83    59    90     -461667   -96036    0 NINE
        98    91   105     -417859   -74100   -1 <sil>
       110   106   117     -261291  -120128    0 EIGHT
       128   118   135     -405479  -120128    0 TWO
       181   136   168     -370310  -120128    0 TWO
       348   169   309    -1464553   -74100   -1 <sil>
       350   310   310           0   -23123    0 </s>
               0   310    -4225889  -701843 (Total)

    FWDVIT: NINE EIGHT TWO TWO  (null)

    FWDXCT: null S 0 T -4927732 A -4225889 L -701843 0 -844730 -74100 <sil> 59 -461667 -96036 NINE 91 -417859 -74100 <sil> 106 -261291 -120128 EIGHT 118 -405479 -120128 TWO 136 -370310 -120128 TWO 169 -1464553 -74100 <sil> 310

    INFO: d:\speech\sphinx3\src\libs3decoder\utt.c(281):  310 frm;   120 sen,   946gau/fr, Sen 0.10 CPU 0.11 Clk [Ovrhd 0.00 CPU 0.00 Clk];     55 hmm,   1 wd/fr,0.10 CPU 0.10 Clk (null)
    INFO: d:\speech\sphinx3\src\libs3decoder\utt.c(295): HMMHist[0..0](null): 18(5)
    INFO: d:\speech\sphinx3\src\libs3decoder\lm.c(823):       440 tg(),       425 tgcache,       14 bo;     6 fills,        1 in mem (9.1%)
    INFO: d:\speech\sphinx3\src\libs3decoder\lm.c(826):      127 bg(),       14 bo;    5 fills,       14 in mem (66.7%)
    INFO: d:\speech\sphinx3\src\programs\main_live_example.c(114):

    FINAL HYP: <sil> NINE <sil> EIGHT TWO TWO <sil> </s>
    D:\speech\sphinx3>
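
The ERROR and WARNING lines are easy to miss in a log of this length. As a rough sketch (not part of Sphinx3), a few lines of Python can pull them out, assuming they always follow the `LEVEL: "file", line N: message` shape seen above. The function name `problems` is made up for this example.

```python
import re

# Matches lines such as:
#   ERROR: "d:\...\dict.c", line 192: Line 7: Bad ciphone: AX; word SEVEN ignored
# The pattern is inferred from the log above, not from Sphinx3 documentation.
PROBLEM = re.compile(
    r'^(?P<level>ERROR|WARNING): "(?P<file>.+?)", line (?P<line>\d+): (?P<msg>.*)$'
)


def problems(log_lines):
    """Return (level, source file, message) for every ERROR/WARNING line."""
    found = []
    for raw in log_lines:
        match = PROBLEM.match(raw.strip())
        if match:
            found.append((match.group("level"), match.group("file"), match.group("msg")))
    return found


if __name__ == "__main__":
    import sys

    # Usage: python find_problems.py < sphinx3.log
    for level, source, msg in problems(sys.stdin):
        print(f"{level}: {msg}  [{source}]")
```

Run against the log above, this should surface the "Bad ciphone: AX" message from dict.c along with the normalization errors, without the surrounding INFO lines.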

     
    • Michael Horbovetz

      BTW, I said "Nine Seven Two".

       
    • Michael Horbovetz

      OK, I see why SEVEN isn't recognized. The AX in the generated pronunciation S EH V AX N doesn't exist in the hub4opensrc.6000.mdef file, so the word is dropped when the dictionary is read.

      Do I need to generate this file as well for my LM? Or where do I get an updated model definition file?

      thanks,
      mike
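
A mismatch like this can be caught before running the decoder. The sketch below checks every dictionary pronunciation against the model's phone set. It rests on two assumptions that are not confirmed in this thread: that the mdef file is in text format with each context-independent phone on a row whose next three fields are `-`, and that each dictionary line is a word followed by its phones. The function names and the sample rows are made up for illustration.

```python
def ci_phones_from_mdef(mdef_lines):
    """Collect context-independent phone names from text-format mdef rows.

    Assumption: a CI phone row is the phone name followed by three "-"
    fields (left context, right context, word position).
    """
    phones = set()
    for line in mdef_lines:
        fields = line.split()
        if len(fields) < 4 or fields[0].startswith("#"):
            continue  # header, comment, or blank line
        if fields[1:4] == ["-", "-", "-"]:
            phones.add(fields[0])
    return phones


def unknown_phones(dict_lines, phones):
    """Map each dictionary word to the phones the model does not define."""
    missing = {}
    for line in dict_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line or a word without a pronunciation
        word, pronunciation = fields[0], fields[1:]
        bad = [p for p in pronunciation if p not in phones]
        if bad:
            missing[word] = bad
    return missing


if __name__ == "__main__":
    # Made-up sample rows; a real check would read the two files instead.
    mdef_rows = [
        "# base lft rt p attrib tmat state-ids",
        "EH - - - n/a 1 3 4 5 N",
        "N - - - n/a 2 6 7 8 N",
        "S - - - n/a 3 9 10 11 N",
        "V - - - n/a 4 12 13 14 N",
    ]
    dict_rows = ["SEVEN S EH V AX N"]
    print(unknown_phones(dict_rows, ci_phones_from_mdef(mdef_rows)))
```

Any word this reports has to be re-transcribed with phones the acoustic model actually defines, or the decoder will ignore it as it did SEVEN here.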

       
    • The Grand Janitor

      Hi Mike,

      The included model is a general acoustic model. It was trained for broadcast-news-style language and speech characteristics, so it is obviously much "flatter" than a digit-specific HMM model. That's why it doesn't work well for you.

      The model bundled with Sphinx 3 is basically just for automatic testing of the software. If you use it to build an application, you will always run into a lot of problems. We still recommend that you train your own models for your applications.

      We have given a lot of disclaimers in the README and on the web pages. However, this is still not common knowledge among users. Hopefully we can come up with something later to remind users of this important fact.

      Arthur
           

       
