CMU Sphinx / Forums / Help: sphinx2 dictionary/lm problem

Anonymous - 2003-06-06

Hi, I'm trying to run Sphinx2 with cmudict.0.6d from http://www.speech.cs.cmu.edu/sphinx/models/ and language_model.arpaformat from the same place. (bn.bigram.arpa didn't work either...) I am getting the error below. Has anyone seen this, or does anyone know how to correct the problem?

...
INFO: lm_3g.c(874): lm_3g.c(874): ngrams 1=64001, 2=9382014, 3=13459879
INFO: lm_3g.c(882): lm_3g.c(882): 130608 words in dictionary
lm_3g.c(899): #dict-words(130608) > 65534
...

- Jessica P. Hekman - 2003-06-06
  
  You have too many words in your dictionary: 130,608, while the log shows that Sphinx2 accepts at most 65,534. The xvoice-sphinx project has two dictionaries with fewer words:
  
  http://xvoice.sourceforge.net/xvoice-sphinx/
  
- Anonymous - 2003-06-06
  
  As Jessica has already said, loading the entire cmudict.0.6d exceeds Sphinx2's limit of 65,534 dictionary words. The LM you're using has only 64K+1 words (64,001 unigrams), so you should build a smaller dictionary that's the intersection of a large dictionary (such as cmudict.0.6d) and the words in the LM (the unigrams).
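  
  A minimal sketch of that intersection in C, under some assumptions: each dictionary line starts with the word followed by whitespace and its phones, alternate pronunciations are written as WORD(2), WORD(3), and so on, and the LM unigrams have already been pulled out of the \1-grams: section into an array sorted in strcmp order with the same letter case as the dictionary. The names dict_head_word, lm_has_word and filter_dict are made up for this example; they are not part of Sphinx2.
  
  ```c
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  
  /* Copy the head word of a dictionary line into out, dropping an
   * alternate-pronunciation suffix such as "(2)". Returns 0 on success,
   * -1 if the line has no word or the word does not fit in out. */
  int dict_head_word(const char *line, char *out, size_t outsz)
  {
      size_t len = strcspn(line, " \t\r\n");      /* first token on the line */
      const char *paren = memchr(line, '(', len);
  
      if (paren != NULL && paren != line)          /* keep words that start with '(' */
          len = (size_t)(paren - line);
      if (len == 0 || len >= outsz)
          return -1;
      memcpy(out, line, len);
      out[len] = '\0';
      return 0;
  }
  
  /* bsearch comparator for an array of C strings. */
  static int cmp_str(const void *a, const void *b)
  {
      return strcmp(*(const char *const *)a, *(const char *const *)b);
  }
  
  /* Nonzero if word occurs in the sorted array lm_words[0..n_words-1]. */
  int lm_has_word(const char *word, const char *const *lm_words, size_t n_words)
  {
      return bsearch(&word, lm_words, n_words, sizeof *lm_words, cmp_str) != NULL;
  }
  
  /* Copy to out only the dictionary lines whose head word is in the LM.
   * Returns the number of lines kept. Lines longer than the buffer are
   * not handled specially in this sketch. */
  long filter_dict(FILE *dict, FILE *out,
                   const char *const *lm_words, size_t n_words)
  {
      char line[1024], word[256];
      long kept = 0;
  
      while (fgets(line, sizeof line, dict) != NULL) {
          if (dict_head_word(line, word, sizeof word) != 0)
              continue;                            /* skip blank lines */
          if (lm_has_word(word, lm_words, n_words)) {
              fputs(line, out);
              kept++;
          }
      }
      return kept;
  }
  ```
  
  Calling filter_dict on the big dictionary and writing the result to a new file gives a dictionary restricted to the LM vocabulary. It is worth comparing the returned line count against the 65,534 limit before loading it.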
  
- Anonymous - 2003-06-10
  
  Thanks for the URL, I hadn't seen that project before!
  At least it starts doing something now, but it gets stuck at
  ...
  INFO: lm_3g.c(924): 60001 = #unigrams created
  INFO: lm_3g.c(580): lm_3g.c(580): Reading bigrams
  INFO: lm_3g.c(637): .INFO: lm_3g.c(637): .INFO: lm_3g.c(637): .
  ...
  
  - Anonymous - 2003-06-10
    
    The code responsible for those printouts is:
    
    if ((bgcount & 0x0000ffff) == 0) {
        E_INFO (".");
    }
    
    where bgcount is the number of bigrams read so far. So it prints a "." each time bgcount reaches a multiple of 0x10000 = 65,536, which is when the low 16 bits are all zero. This doesn't look like an error; it's just showing progress in loading the bigrams. Since you appear to have 60K unigrams in your LM, the number of bigrams is probably pretty large. What is it? It should be given near the top of your LM file.
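    
    To estimate how many dots to expect, here is a small sketch that applies the same mask test to every count from 1 up to the number of bigrams. It assumes bgcount is checked once per bigram after being incremented, which I have not verified against the loader; count_progress_dots is a made-up helper, not a Sphinx2 function.
    
    ```c
    /* Hypothetical helper: number of times the progress test fires while
     * bgcount runs from 1 to n_bigrams (assumed check-after-increment). */
    unsigned long count_progress_dots(unsigned long n_bigrams)
    {
        unsigned long bgcount, dots = 0;
    
        for (bgcount = 1; bgcount <= n_bigrams; bgcount++) {
            if ((bgcount & 0x0000ffff) == 0)   /* low 16 bits all zero */
                dots++;
        }
        return dots;
    }
    ```
    
    Under that assumption, the 9,382,014 bigrams reported for the first LM in this thread would produce 143 dots, so a long run of dots is normal for a large LM.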
    
    - Anonymous - 2003-06-11
      
      It says the number of bigrams is 1044719. I guess I was being too impatient... I might also have had some additional problem earlier. It does get past reading the bigrams now, but the program terminates after
      ...
      WARNING: "lm_3g.c", line 1042: lm_3g.c(1043): 11651 LM words not in dict; ignored
      
      The error message in the console reads:
      
      7 [main] sphinx2-continuous 376 handle_exceptions: Exception: STATUS_ACCESS_VIOLATION
      2207 [main] sphinx2-continuous 376 open_stackdumpfile: Dumping stack trace to sphinx2-continuous.exe.stackdump
      
      I was using '1test60k.5.5.arpa' and 'full.dic' from the xvoice project, and the acoustic model 'sphinx_2_format' from CMU.
      
      If anyone has a clue, please let me know!
      
