
Sphinx2 language models

  • Mike Deisher

    Mike Deisher - 2008-01-29

    Hi. I have been using Sphinx3 with the HUB4 "open source" acoustic model and WSJ 5k language model. I wanted to see how Sphinx2 compares in terms of speed, so I built Sphinx2 and tried to run it with the Sphinx2 HUB4 "open source" acoustic models and WSJ 5k language model. Unfortunately, Sphinx2 cannot load wsj5k.DMP. It aborts with the error message:

    INFO: lm_3g.c(864): Reading LM file model/lm/wsj5k.DMP (name "")
    FATAL_ERROR: "lm_3g.c", line 522: No \data\ mark in LM file

    Does Sphinx2 use a different LM format? I could not find anything about this in the documentation.

    Regards,

    Mike

     
    • Nickolay V. Shmyrev

      1) Use sphinx3_lm_convert to convert the binary compressed model back to text (sketched just below).
      2) Or use PocketSphinx instead of Sphinx2; it's even faster and more efficient.
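
      A sketch of the conversion (the output file name is illustrative, and if the flags differ in your build, -help will list them):

      # convert the binary DMP back to ARPA text format
      sphinx3_lm_convert -i wsj5k.DMP -ifmt DMP -o wsj5k.arpa -ofmt TXT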

       
      • Mike Deisher

        Mike Deisher - 2008-01-30

        Thanks! I uncompressed the LM.

        Sphinx2 now complains about the dictionary being too large.

        INFO: lm_3g.c(901): 130615 words in dictionary
        FATAL_ERROR: "lm_3g.c", line 918: #dict-words(130615) > 65534

        Strange that cmudict is too large for CMU Sphinx2. I'll try pocketsphinx.

        --Mike

         
        • Nickolay V. Shmyrev

          I suppose you can either strip cmudict down to the unigrams of wsj5k.DMP (as sketched below) or use the swb model included with PocketSphinx. It should not be any worse.
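
          For the first option, the unigram list can be pulled out of the converted ARPA file; a sketch, assuming the wsj5k.arpa produced earlier in the thread:

          # print each word of the \1-grams: section, one per line
          awk '/\\1-grams:/{f=1;next} /\\2-grams:/{exit} f && NF {print $2}' wsj5k.arpa | sort -u > wsj5k.vocab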

           
          • Mike Deisher

            Mike Deisher - 2008-01-31

            Thanks. That worked. Now I'm able to use the HUB4 LM (not sure why it failed before -- must have been the dictionary issue) and the WSJ 8kHz AM (the one that comes with PocketSphinx) with the swb.dic file. I get 70.9% word accuracy on my WSJ test set. Sphinx3 gets 82.7% on the same set of sentences (but with 16 kHz bandwidth audio and AM). Does this sound like the accuracy you would expect?

            Best regards,

            Mike

             
            • David Huggins-Daines

              Hi,

              There is a pretty big vocabulary and language model mismatch, but that still seems pretty far out of line.

              With the "standard" WSJ5k bigram model and the 8khz AM that comes with PocketSphinx, I get between 8.0 and 8.5% WER depending on the beam settings.

              This is on the si_et_05 test set which is a bit harder than the si_dt_05 development set.

               
              • Mike Deisher

                Mike Deisher - 2008-02-01

                Hmm.. something's wrong then. I'm evaluating on si_et_20. Is the standard WSJ5k bigram model publicly available? Thanks!

                --Mike

                 
              • Mike Deisher

                Mike Deisher - 2008-02-12

                Hi. I didn't have WSJ0 so had to order it from LDC. Now I'm set up to test on si_et_05. If I use the WSJ5k bigram model and the 8khz AM that comes with PocketSphinx, I get 79.0% word accuracy. By contrast, I get 92.4% accuracy with the HUB4 AM and WSJ5k LM on Sphinx3.

                This is for PocketSphinx 0.4.1. The latest version from svn compiles but does not pass "make check". Same with the latest nightly build.

                --Mike

                 
                • David Huggins-Daines

                  Hmm, that's definitely strange. With PocketSphinx 0.4.1, on Linux, I get 8.05% WER (91.95% accuracy). Here is the script I use for testing on si_et_05. I have the unshortened .sph files in the directory ./si_et_05, and wsj_test.fileids looks like this:

                  si_et_05/440/440c0201
                  si_et_05/440/440c0202
                  ...

                  On a 3.0GHz Pentium4, this runs at an average of 0.16 xRT.

                  #!/bin/sh

                  expt=$1
                  if [ x"$expt" = x ]; then
                      >&2 echo "Usage: $0 EXPTID [DECODER]"
                      exit 1
                  fi
                  decode=${2:-../src/programs/pocketsphinx_batch}

                  $decode \
                      -hmm ../model/hmm/wsj1 \
                      -dict bcb05cnp.dic \
                      -lm bcb05cnp.z.DMP \
                      -lw 7.5 -wip 0.5 \
                      -beam 1e-60 -wbeam 1e-40 -bestpathlw 11.5 \
                      -cepdir . -cepext .sph \
                      -adcin yes -adchdr 1024 \
                      -ctl wsj_test.fileids \
                      -hyp $expt.hyp \
                      -latsize 50000 \
                      > $expt.log 2>&1
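
                  Invoked as, say, "sh run_wsj.sh expt1" (the script name is illustrative), it writes the hypotheses to expt1.hyp and the full log to expt1.log.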

                   
        • Christopher Bader

          65534 (or 65536?) is the maximum number of words Sphinx2 and PocketSphinx can accommodate.

          CB

           
    • Mike Deisher

      Mike Deisher - 2008-01-30

      PocketSphinx also complains that CMUdict is too big. Are word frequencies available for CMUdict? Are there tools to prune infrequently used words?

      It looks like Sphinx2 and PocketSphinx cannot handle a dictionary with more than 65534 words.

      Thanks!

      --Mike

       
    • David Huggins-Daines

      Yes, this is an annoying bug in the Sphinx2 language model code, which PocketSphinx inherited up through version 0.4.1.

      The development version of PocketSphinx in the Subversion repository has removed that limit (there is still a limit of 65536 words in a .DMP format language model, since that file format stores word IDs as 16-bit values).

       
    • David Huggins-Daines

      Ahh I just realized that you are using the rather lousy WSJ5k language model that's included for testing purposes with PocketSphinx. That is not the same as the standard (bcb05cnp.Z) language model which comes with the WSJ0 corpus.

      Unfortunately we can't redistribute the bcb05cnp language model, and it's not at all clear what data was used to train it, so I just trained a language model from the acoustic model transcripts to use for testing purposes.
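
      (For reference, training such a testing LM with the classic CMU-Cambridge SLM toolkit might look roughly like this; a sketch with illustrative file names, and flags worth double-checking against your toolkit version:)

      # build a 5k vocabulary and an ARPA LM from the training transcripts
      text2wfreq < transcripts.txt | wfreq2vocab -top 5000 > wsj5k.vocab
      text2idngram -vocab wsj5k.vocab < transcripts.txt > wsj5k.idngram
      idngram2lm -idngram wsj5k.idngram -vocab wsj5k.vocab -arpa wsj5k.arpa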

       
      • Mike Deisher

        Mike Deisher - 2008-02-13

        Thanks, David. With bcb05cnp the accuracy is actually worse (77.4% compared to 79.0% with wsj5k). Perhaps it is an acoustic problem. What parameters do you use for feature extraction?

        --Mike

         
        • David Huggins-Daines

          Hmm, very strange. I am using the default parameters from the wsj1 acoustic model:

          -lowerf 1
          -upperf 4000
          -nfilt 20
          -transform dct
          -round_filters no
          -remove_dc yes
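
          (If in doubt whether they are being applied, they can be given explicitly on the command line; a sketch, assuming a SphinxBase build that accepts these front-end flags, with paths taken from the test script above:)

          # decode with the front-end parameters spelled out explicitly
          pocketsphinx_batch \
              -hmm ../model/hmm/wsj1 \
              -dict bcb05cnp.dic -lm bcb05cnp.z.DMP \
              -samprate 8000 \
              -lowerf 1 -upperf 4000 -nfilt 20 \
              -transform dct -round_filters no -remove_dc yes \
              -ctl wsj_test.fileids -cepdir . -cepext .sph \
              -adcin yes -adchdr 1024 \
              -hyp fe_check.hyp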

           
          • Mike Deisher

            Mike Deisher - 2008-02-14

            Using those parameters doesn't change the score at all.
            I also tried feature extraction directly from the wideband speech (rather than the downsampled speech) and that did not change the score much.
            I think the only thing left is the dictionary. You are using "bcb05cnp.dic" (which does not seem to be included with WSJ0) and I am using "swb.dic". Where did bcb05cnp.dic come from?

            --Mike

             
            • David Huggins-Daines

              Ah, there's your problem. You have a big mismatch between the language model and the dictionary. bcb05cnp.dic is a dictionary I generated from the bcb05cnp language model and cmudict. I used the 'ngram_pronounce' tool from the (unreleased but available from SVN) CMU language modeling toolkit to do this, but for your convenience I've put a copy of it at:

              https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/regression/bcb05cnp.dic

               
              • Mike Deisher

                Mike Deisher - 2008-02-15

                Thanks, David! That was it. WER is now 6.9%. It makes sense that restricting the vocabulary to the proper domain would bring up the accuracy.

                --Mike

                 
                • David Huggins-Daines

                  Great! Actually it's not a matter of restricting the vocabulary, the problem is just that the vocabulary in the language model has to match (or be a subset of) the one in the dictionary. The swb and bcb05cnp language models have different vocabularies (swb is trained on telephone conversations, bcb05cnp is trained on financial news stories), and swb.dic only contains the words that are in the swb language model. So if you use it with the bcb05cnp language model you are actually only able to recognize the intersection of the two vocabularies which is (probably) considerably less than 5000 words.

                   
                  • Mike Deisher

                    Mike Deisher - 2008-02-26

                    I wrote a little Perl script to read in CMUDICT and the bcb20cnp language model, and write out a new bcb20cnp.dic dictionary that is small enough for PocketSphinx to load. Even with this configuration, PocketSphinx achieves 73.4% accuracy while Sphinx3 achieves 82.9% word accuracy on the si_et_20 set. Is it expected that PocketSphinx accuracy is comparable to that of Sphinx3 for smaller vocabularies but worse for larger vocabularies?
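
                    (The Perl script itself wasn't posted; a shell sketch of the same idea, with the cmudict file name and the bcb20cnp.vocab unigram list, extracted as earlier in the thread, both illustrative:)

                    # keep only cmudict entries whose base word, with variant
                    # markers like "(2)" stripped, appears in the LM vocabulary
                    awk 'NR==FNR { vocab[$1]; next }
                         { w = $1; sub(/\([0-9]+\)$/, "", w)
                           if (w in vocab) print }' bcb20cnp.vocab cmudict.0.6d > bcb20cnp.dic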

                    --Mike

                     
                    • David Huggins-Daines

                      Hi,

                      It depends on the acoustic model, but in a general sense (and using the default acoustic models), yes.

                       
    • David Huggins-Daines

      Also, you are using the .wv1 files, not the .wv2 files from WSJ0, right?

       
      • Mike Deisher

        Mike Deisher - 2008-02-14

        Yes, the scores I reported were for the wv1 (Sennheiser) files. One possible difference is that I used Matlab to downsample these files from a 16000 Hz to an 8000 Hz sample rate before performing feature extraction.
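
        (The same downsampling can also be done with SoX; a sketch with illustrative file names, assuming a SoX recent enough to read NIST SPHERE files and provide the "rate" effect:)

        # resample a 16 kHz Sennheiser-channel file to 8 kHz
        sox si_et_05/440/440c0201.sph 440c0201_8k.sph rate 8000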

        --Mike

         
