CMU Sphinx / Forums / Speech Recognition Theory: Building a large vocabulary language model

Anonymous - 2000-03-17

We would like to build a large vocabulary language model. I read over most of the documentation of the CMU/CSL Modeling Toolkit, but I don't see how this will work with sphinx. I've created just about all the types of files that are specified in the documentation. Now, which ones do I need for sphinx? Our vocabulary is at the limit of 65535 words allowed by the tools. We wanted to use the CMU Dictionary, but it's too big, although we could set the 4-bit flag specified in the documentation. Any ideas? Please help!

Thanks, Edgar

- Kevin A. Lenzo - 2000-03-30
  
  What sort of text do you have? Are you planning to just use all the words with equal probability?
  
  The best thing to do is to get a lot of sentences of the sort you want to recognize; many of the gains in accuracy in the last several years have been as a result of language model constraints. Even unigram probabilities would be better than no language model... Maybe take some text from papers or web pages similar to the domain.
  
  Make a big file of example utterances, one per line. On each line, have a <S> at the beginning and an </S> at the end, and the text in UPPERCASE, without punctuation, in between. With that, you can build a language model with the CMU-Cambridge SLM as follows:
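  
  As a rough sketch of that preparation step, assuming plain ASCII text with one sentence per line, a small shell function along these lines could do the normalization. The name normalize_utterances and the file name raw.txt are made up for illustration, and keeping digits and apostrophes (for words like DON'T) is an assumption, not something the toolkit requires.
  
  ```shell
  # Hypothetical helper (not part of Sphinx or the SLM toolkit).
  # Reads raw text on stdin, one sentence per line, and writes lines like
  #   <S> HELLO WORLD </S>
  # Steps: uppercase; turn anything that is not A-Z, 0-9, apostrophe or
  # newline into a space; squeeze and trim spaces; drop empty lines;
  # add the sentence markers.
  # Assumes ASCII input; tr will not uppercase accented letters.
  normalize_utterances() {
    tr 'a-z' 'A-Z' |
      tr -c "A-Z0-9'\n" ' ' |
      sed -e 's/  */ /g' -e 's/^ //' -e 's/ $//' -e '/^$/d' \
          -e 's/^/<S> /' -e 's/$/ <\/S>/'
  }
  
  # Example use (raw.txt is a placeholder name):
  #   normalize_utterances < raw.txt > my.text
  ```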
  
  Suppose your text file is called my.text. Run the following commands:
  
  cat my.text | text2wfreq | wfreq2vocab > my.vocab
  
  cat my.text | text2idngram -vocab my.vocab | \
  idngram2lm -vocab my.vocab -idngram - \
  -arpa my.lm
  
  The 'my.lm' file is the resulting language model; sphinx2-demo looks in a 'task' directory for a .lm file and uses it as the language model.
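  
  A quick sanity check on the result, as a sketch only: an ARPA-format file has a \data\ header that lists how many n-grams of each order it contains, so printing those lines shows whether the model was built with the vocabulary size you expected. The function name arpa_ngram_counts is hypothetical.
  
  ```shell
  # Hypothetical helper: print the "ngram N=count" lines from the header
  # of an ARPA-format language model file.
  # Any preamble before the \data\ line is ignored, and reading stops at
  # the first section marker after the header (for example \1-grams:).
  arpa_ngram_counts() {
    awk '
      /^\\data\\/              { hdr = 1; next }
      hdr && /^ngram [0-9]+=/  { print; next }
      hdr && /^\\/             { exit }
    ' "$1"
  }
  
  # Example use:
  #   arpa_ngram_counts my.lm
  ```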
  
  Now you need a dictionary. CMUDICT uses a slightly different phone set than the default sphinx2 phone set, so you need to convert CMUDICT (or a subset of it) to the right format with the utility 'stress2sphinx'. Make a file called my.dict that contains each WORD and its pronunciation (as output by stress2sphinx), and put it in the same task directory as the my.lm file above.
  
  That's it. Those are the only two files sphinx2 really needs to run in continuous mode.
  
  I can make a Wall Street Journal large vocabulary language model available if that would help. It doesn't have many instances of the pronoun 'I' in it...
  
  - Kevin A. Lenzo - 2000-03-30
    
    That should be
    
    cat my.text | text2wfreq | wfreq2vocab > my.vocab
    cat my.text | text2idngram -vocab my.vocab | idngram2lm -vocab my.vocab -idngram - -arpa my.lm
    
    Also note that you'll need to get at least the words in your .vocab file into your .dict. You can use the top 64000 or so words from CMUDICT, once converted with stress2sphinx, if you like.
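    
    One way to check that coverage, sketched under a few assumptions: the vocab file has one word per line with comment lines starting with '#', the dictionary has the word as the first field on each line, and alternate pronunciations are written like WORD(2). Tokens in angle brackets such as <S> are skipped on the assumption that they are sentence markers and not pronounceable words. The function name missing_from_dict is hypothetical.
    
    ```shell
    # Hypothetical helper: list vocab words that have no dictionary entry.
    # usage: missing_from_dict my.vocab my.dict
    missing_from_dict() {
      awk '
        # First file (the dictionary): remember each headword,
        # stripping an alternate-pronunciation suffix such as (2).
        FILENAME == ARGV[1] {
          word = $1
          sub(/\([0-9]+\)$/, "", word)
          have[word] = 1
          next
        }
        # Second file (the vocab): skip comments, blank lines and <...> markers.
        /^#/ || NF == 0 { next }
        /^<.*>$/        { next }
        !($1 in have)   { print $1 }
      ' "$2" "$1"
    }
    ```
    
    If this prints nothing, every word in the vocabulary has at least one pronunciation in the dictionary.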
    
  - Anonymous - 2001-08-15
    
    Lenzo, have you had the chance to make the WSJ language model (or any other large vocabulary model) available? Also, are the .lm files the language-model format for Sphinx II and the .dmp.Z files the language-model format for Sphinx III?
    
