
Building a large vocabulary language model

Anonymous
2000-03-17
2012-09-22
  • Anonymous

    Anonymous - 2000-03-17

    We would like to build a large vocabulary language model. I have read most of the documentation for the CMU/CSL Modeling Toolkit, but I don't see how it works with sphinx. I've created just about all the types of files that are specified in the documentation. Which of them do I need for sphinx? Our vocabulary is at the 65535-word limit allowed by the tools. We wanted to use the CMU Dictionary, but it's too big, although we could set the 4-bit flag specified in the documentation. Any ideas? Please help!

    Thanks, Edgar

     
    • Kevin A. Lenzo

      Kevin A. Lenzo - 2000-03-30

      What sort of text do you have?  Are you planning to just use all the words with equal probability?

      The best thing to do is to get a lot of sentences of the sort you want to recognize; many of the gains in accuracy in the last several years have come from language model constraints.  Even unigram probabilities would be better than no language model.  You could take some text from papers or web pages similar to your domain.

      Make a big file of example utterances, one per line.  On each line, have a <S> at the beginning and an </S> at the end, and the text in UPPERCASE, without punctuation, in between.  With that, you can build a language model with the CMU-Cambridge SLM as follows:
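The formatting step described above can be sketched in shell. This is only a minimal sketch: `normalize_corpus`, `sample_raw.txt` and `sample.text` are made-up names, and it assumes plain ASCII English input in which apostrophes should be kept (as in DON'T). Anything that is not a letter, digit, apostrophe or space is treated as punctuation.

```shell
# Hypothetical helper: read raw sentences (one per line) on stdin and
# write them in the "<S> UPPERCASE TEXT </S>" form described above.
normalize_corpus() {
    # 1. uppercase everything
    # 2. replace anything that is not a letter, digit, apostrophe or
    #    space with a space, squeeze runs of spaces, trim both ends
    # 3. drop lines that ended up empty, then add the <S> ... </S> markers
    LC_ALL=C tr '[:lower:]' '[:upper:]' \
      | LC_ALL=C sed -e "s/[^A-Z0-9' ]/ /g" -e 's/  */ /g' -e 's/^ //' -e 's/ $//' \
      | sed -e '/^$/d' -e 's/^/<S> /' -e 's/$/ <\/S>/'
}

# Small demonstration with placeholder file names.
printf '%s\n' "Hello, world!" "I don't know." > sample_raw.txt
normalize_corpus < sample_raw.txt > sample.text
cat sample.text
```

For real data you would run it as `normalize_corpus < your_raw_text > my.text`. Numbers, abbreviations and non-ASCII characters would need their own handling before this step.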

      Suppose your text file is called my.text.  Run the following commands:

      cat my.text | text2wfreq | wfreq2vocab > my.vocab

      cat my.text | text2idngram -vocab my.vocab | \
         idngram2lm -vocab my.vocab -idngram - \
         -arpa my.lm

      The 'my.lm' file is the resulting language model; sphinx2-demo looks in a 'task' directory for a .lm file and uses it as the language model.

      Now you need a dictionary.  CMUDICT uses a slightly different phone set than the default sphinx2 phone set, so you need to convert CMUDICT (or a subset of it) to the right format with the 'stress2sphinx' utility.  Save the result as a file called my.dict, containing each WORD and its pronunciation (as output by stress2sphinx), and put it in the same task directory as the my.lm file above.

      That's it.  Those are the only two files sphinx2 really needs to run in continuous mode. 

      I can make a Wall Street Journal large vocabulary language model available if that would help.  It doesn't have many instances of the pronoun 'I' in it...

       
      • Kevin A. Lenzo

        Kevin A. Lenzo - 2000-03-30

        that should be

        cat my.text | text2wfreq | wfreq2vocab > my.vocab
        cat my.text | text2idngram -vocab my.vocab | idngram2lm -vocab my.vocab -idngram - -arpa my.lm

        and note that, at a minimum, every word in your .vocab file needs to be in your .dict.  You can use the top 64000 or so words from CMUDICT, once converted with stress2sphinx, if you like.
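One way to check that every vocabulary word made it into the dictionary is a small awk filter. This is a sketch under assumptions, not part of either toolkit: `missing_from_dict` and the sample file names are made up, the vocab file is assumed to hold one word per line with comment lines starting with `#`, and the dictionary is assumed to have the word as its first field, with alternate pronunciations written as WORD(2). Markers in angle brackets such as <S> and </S> are skipped on the assumption that they do not need ordinary dictionary entries.

```shell
# Hypothetical helper: print vocabulary words with no dictionary entry.
# Usage: missing_from_dict VOCAB_FILE DICT_FILE
missing_from_dict() {
    awk '
        FILENAME == ARGV[1] {            # first file: the dictionary
            w = $1
            sub(/\([0-9]+\)$/, "", w)    # WORD(2) -> WORD
            if (w != "") have[w] = 1
            next
        }
        /^#/ || NF == 0 { next }         # skip comment and blank lines in the vocab
        $1 ~ /^<.*>$/   { next }         # skip markers such as <S> and </S> (assumption)
        !($1 in have)   { print $1 }     # vocab word without a pronunciation
    ' "$2" "$1"
}

# Small demonstration with placeholder data.
printf '%s\n' '## sample vocab' '<S>' '</S>' 'HELLO' 'WORLD' 'ZYZZYVA' > sample.vocab
printf '%s\n' 'HELLO  HH AH L OW' 'HELLO(2)  HH EH L OW' 'WORLD  W ER L D' > sample.dict
missing_from_dict sample.vocab sample.dict
```

Any word it prints would need a pronunciation added to the .dict (or would have to be dropped from the vocabulary) before the recognizer can use it.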

         
      • Anonymous

        Anonymous - 2001-08-15

        Lenzo, have you had the chance to make the WSJ language model (or any other large vocabulary model) available?  Also, are the .lm files the language-model format for Sphinx II and the .dmp.Z files the language-model format for Sphinx III?

         
