Hi. I have been using Sphinx3 with the HUB4 "open source" acoustic model and the WSJ 5k language model. I wanted to see how Sphinx2 compares in terms of speed, so I built Sphinx2 and tried to run it with the Sphinx2 HUB4 "open source" acoustic models and the WSJ 5k language model. Unfortunately, Sphinx2 cannot load wsj5k.DMP. It aborts with the error message:
INFO: lm_3g.c(864): Reading LM file model/lm/wsj5k.DMP (name "")
FATAL_ERROR: "lm_3g.c", line 522: No \data\ mark in LM file
Does Sphinx2 use a different LM format? I could not find anything about this in the documentation.
Regards,
Mike
1) Use sphinx3_lm_convert to convert the binary compressed model back to text (see the sketch below).
2) Use PocketSphinx instead of Sphinx2; it's even faster and more efficient.
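For option 1, the invocation is roughly like this (a sketch from memory of the sphinxbase-era conversion tools, so treat the flag names as assumptions and check the tool's own usage output):
# Convert the binary .DMP language model back to ARPA text format.
# The -i/-o flag names are an assumption; run sphinx3_lm_convert
# with no arguments to see its actual usage.
sphinx3_lm_convert -i model/lm/wsj5k.DMP -o model/lm/wsj5k.arpa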
Thanks! I uncompressed the LM.
Sphinx2 now complains about the dictionary being too large.
INFO: lm_3g.c(901): 130615 words in dictionary
FATAL_ERROR: "lm_3g.c", line 918: #dict-words(130615) > 65534
Strange that cmudict is too large for CMU Sphinx2. I'll try pocketsphinx.
--Mike
I suppose you can either strip cmudict down to the unigrams in wsj5k.DMP, or use the swb model included in PocketSphinx. It should not be worse.
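A minimal sketch of the stripping approach, assuming the LM has already been converted to ARPA text as above (file names here are placeholders, and cmudict.dic stands for whatever cmudict file you have):
# Pull the unigram word list out of the ARPA-format LM.
sed -n '/\\1-grams:/,/\\2-grams:/p' wsj5k.arpa \
    | awk 'NF >= 2 { print $2 }' > wsj5k.vocab
# Keep only the cmudict entries whose head word is in that list,
# stripping alternate-pronunciation markers like WORD(2) first.
awk 'NR == FNR { vocab[$1]; next }
     { w = $1; sub(/\([0-9]+\)$/, "", w); if (w in vocab) print }' \
    wsj5k.vocab cmudict.dic > wsj5k.dic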
Thanks. That worked. Now I'm able to use the HUB4 LM (not sure why it failed before -- must have been the dictionary issue) and the WSJ 8kHz AM (the one that comes with PocketSphinx) with the swb.dic file. I get 70.9% word accuracy on my WSJ test set. Sphinx3 gets 82.7% on the same set of sentences (but with 16 kHz bandwidth audio and AM). Does this sound like the accuracy you would expect?
Best regards,
Mike
Hi,
There is a pretty big vocabulary and language model mismatch, but that still seems pretty far out of line.
With the "standard" WSJ5k bigram model and the 8khz AM that comes with PocketSphinx, I get between 8.0 and 8.5% WER depending on the beam settings.
This is on the si_et_05 test set which is a bit harder than the si_dt_05 development set.
Hmm.. something's wrong then. I'm evaluating on si_et_20. Is the standard WSJ5k bigram model publicly available? Thanks!
--Mike
Hi. I didn't have WSJ0, so I had to order it from LDC. Now I'm set up to test on si_et_05. If I use the WSJ5k bigram model and the 8kHz AM that comes with PocketSphinx, I get 79.0% word accuracy. By contrast, I get 92.4% accuracy with the HUB4 AM and WSJ5k LM on Sphinx3.
This is for PocketSphinx 0.4.1. The latest version from svn compiles but does not pass "make check". Same with the latest nightly build.
--Mike
Hmm, that's definitely strange. With PocketSphinx 0.4.1, on Linux, I get 8.05% WER (91.95% accuracy). Here is the script I use for testing on si_et_05. I have the unshortened .sph files in the directory ./si_et_05, and wsj_test.fileids looks like this:
si_et_05/440/440c0201
si_et_05/440/440c0202
...
On a 3.0GHz Pentium4, this runs at an average of 0.16 xRT.
#!/bin/sh
expt=$1
if [ x"$expt" = x ]; then
    >&2 echo "Usage: $0 EXPTID [DECODER]"
    exit 1
fi
decode=${2:-../src/programs/pocketsphinx_batch}
$decode \
    -hmm ../model/hmm/wsj1 \
    -dict bcb05cnp.dic \
    -lm bcb05cnp.z.DMP \
    -lw 7.5 -wip 0.5 \
    -beam 1e-60 -wbeam 1e-40 -bestpathlw 11.5 \
    -cepdir . -cepext .sph \
    -adcin yes -adchdr 1024 \
    -ctl wsj_test.fileids \
    -hyp $expt.hyp \
    -latsize 50000 \
    > $expt.log 2>&1
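(To reproduce this, save the script under any name, e.g. run_si_et_05.sh -- that name is mine, not part of the original setup -- and invoke it as "sh run_si_et_05.sh myexpt". The hypotheses land in myexpt.hyp and the full decoder log in myexpt.log.)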
65534 (or 65536?) is the maximum number of words Sphinx2 and PocketSphinx can accommodate.
CB
PocketSphinx also complains that CMUdict is too big. Are word frequencies available for CMUdict? Are there tools to prune infrequently used words?
It looks like Sphinx2 and PocketSphinx cannot handle a dictionary with more than 65534 words.
Thanks!
--Mike
Yes, this is an annoying bug in the Sphinx2 language model code, which PocketSphinx inherited up through version 0.4.1.
The development version of PocketSphinx in the Subversion repository has removed that limit (there is still a limit of 65536 words in a .DMP-format language model, since that file format stores word IDs as 16-bit integers).
Ahh I just realized that you are using the rather lousy WSJ5k language model that's included for testing purposes with PocketSphinx. That is not the same as the standard (bcb05cnp.Z) language model which comes with the WSJ0 corpus.
Unfortunately we can't redistribute the bcb05cnp language model, and it's not at all clear what data was used to train it, so I just trained a language model from the acoustic model transcripts to use for testing purposes.
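For the record, training an LM from transcripts like that follows the usual CMU-Cambridge LM toolkit recipe; something along these lines, where transcripts.txt and the output names are placeholders and the exact options may differ by toolkit version:
# Count word frequencies and derive a vocabulary from the transcripts.
text2wfreq < transcripts.txt | wfreq2vocab > wsj.vocab
# Map the text to id n-grams, then estimate an ARPA-format model.
text2idngram -vocab wsj.vocab -idngram wsj.idngram < transcripts.txt
idngram2lm -vocab_type 0 -idngram wsj.idngram -vocab wsj.vocab \
    -arpa wsj5k.arpa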
Thanks, David. With bcb05cnp the accuracy is actually worse (77.4% compared to 79.0% with wsj5k). Perhaps it is an acoustic problem. What parameters do you use for feature extraction?
--Mike
Hmm, very strange. I am using the default parameters from the wsj1 acoustic model:
-lowerf 1
-upperf 4000
-nfilt 20
-transform dct
-round_filters no
-remove_dc yes
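If you run feature extraction as a separate step, those parameters go straight onto the sphinx_fe command line, roughly like this (a sketch assuming sphinxbase's sphinx_fe and the file layout from the test script above, so verify the flags against sphinx_fe's usage output):
# Hypothetical sphinx_fe run with the wsj1 defaults listed above.
sphinx_fe -samprate 8000 \
    -lowerf 1 -upperf 4000 -nfilt 20 \
    -transform dct -round_filters no -remove_dc yes \
    -c wsj_test.fileids -di . -ei sph -do feat -eo mfc -nist yes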
Using those parameters doesn't change the score at all.
I also tried feature extraction directly from the wideband speech (rather than the downsampled speech) and that did not change the score much.
I think the only thing left is the dictionary. You are using "bcb05cnp.dic" (which does not seem to be included with WSJ0) and I am using "swb.dic". Where did bcb05cnp.dic come from?
--Mike
Ah, there's your problem. You have a big mismatch between the language model and the dictionary. bcb05cnp.dic is a dictionary I generated from the bcb05cnp language model and cmudict. I used the 'ngram_pronounce' tool from the (unreleased but available from SVN) CMU language modeling toolkit to do this, but for your convenience I've put a copy of it at:
https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/pocketsphinx/regression/bcb05cnp.dic
Thanks, David! That was it. WER is now 6.9%. It makes sense that restricting the vocabulary to the proper domain would bring up the accuracy.
--Mike
Great! Actually it's not a matter of restricting the vocabulary; the problem is just that the vocabulary in the language model has to match (or be a subset of) the one in the dictionary. The swb and bcb05cnp language models have different vocabularies (swb is trained on telephone conversations, bcb05cnp on financial news stories), and swb.dic only contains the words that are in the swb language model. So if you use it with the bcb05cnp language model, you are actually only able to recognize the intersection of the two vocabularies, which is (probably) considerably fewer than 5000 words.
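(A quick way to see how small that intersection is, assuming you have extracted a one-word-per-line vocabulary list from each LM as in the earlier snippet -- swb.vocab and bcb05cnp.vocab are assumed names, not shipped files:)
# Count the words the two LM vocabularies have in common.
sort -u swb.vocab > swb.sorted
sort -u bcb05cnp.vocab > bcb05cnp.sorted
comm -12 swb.sorted bcb05cnp.sorted | wc -l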
I wrote a little Perl script to read in CMUDICT and the bcb20cnp language model, and write out a new bcb20cnp.dic dictionary that is small enough for PocketSphinx to load. Even with this configuration, PocketSphinx achieves 73.4% accuracy while Sphinx3 achieves 82.9% word accuracy on the si_et_20 set. Is it expected that PocketSphinx accuracy is comparable to that of Sphinx3 for smaller vocabularies but worse for larger vocabularies?
--Mike
Hi,
It depends on the acoustic model, but in a general sense (and using the default acoustic models), yes.
Also, you are using the .wv1 files, not the .wv2 files from WSJ0, right?
Yes, the scores I reported were for the wv1 (Sennheiser) files. One possible difference is that I used Matlab to downsample these files from a 16000 Hz sample rate to 8000 Hz before performing feature extraction.
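One way to rule Matlab out would be to downsample with sox instead and re-run; something like this (an untested sketch, and older sox builds need the explicit resample effect spelled out):
# Hypothetical sox equivalent of the Matlab downsampling step.
for f in si_et_05/*/*.sph; do
    sox "$f" -r 8000 "${f%.sph}_8k.sph" resample
done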
--Mike