
language model size

skatz_teyp
2007-11-01
2012-09-22
  • skatz_teyp

    skatz_teyp - 2007-11-01

    hello everyone... is there a limit to how big your language model can be? as of now, what has been the largest language model created?

     
    • Nickolay V. Shmyrev

The French model distributed here has a 65k vocabulary, HUB4 has 64k. I wonder if a bigger size is practical; you'd have to collect too much training data, speed would be too slow, and the model would be too big.

Actually n-grams are applicable only to a limited number of languages; for Finnish or Turkish, or any language with rich morphology, the usual word-level n-grams aren't useful at all. You can build a 300k n-gram model, but it will still have high perplexity. So if you think 60-80k is not enough, rethink the language modeling approach as a whole.

       
    • skatz_teyp

      skatz_teyp - 2007-11-01

so you mean to say that having a very big language model isn't that useful? i have gathered very big data so far, say 10 gigabytes of n-grams (hehe, so big)....

      btw, im creating an english language model...

       
      • Nickolay V. Shmyrev

Usefulness depends on the task :) What vocabulary size are you going to recognize, and what are the numbers of unigrams, bigrams and trigrams?

         
        • skatz_teyp

          skatz_teyp - 2007-11-01

ummm... i have 200k+ words in my vocabulary so far... about 300 million trigrams (356,443,775 to be exact)... 60+ million bigrams... it is supposed to decode any type of conversation... is it worth doing, or should i limit the size?

           
          • Nagendra Kumar Goel

Just another data point:
I wanted to use the FSA formalism of the Sphinx decoder to apply variable-order n-gram decoding, but it looks like the FSA code in the decoder does a malloc that is the square of the word count, which puts a basic constraint on vocab size of a few thousand.

Hopefully someone will fix the logic and get rid of that malloc.

Nagendra

             
            • skatz_teyp

              skatz_teyp - 2007-11-02

yeah i've had that problem.... i just switched to a 64-bit OS, since malloc can take a full 64-bit size there... and changed some data types to size_t instead of using int or unsigned int..

               
              • Nickolay V. Shmyrev

                > and changed some data types to size_t instead of using int or unsigned int..

Great, it would be nice to get these changes into SVN.

I still wonder whether a big language model is really that much better. What is the perplexity on test data, and was there any WER improvement in the end?

This article discusses a 200k n-gram model:

                http://www.limsi.fr/Rapports/RS2005/chm/tlp/tlp14/index.html

The improvement in WER is only 0.3%; for English I suppose it would be even smaller.

Also have a look at :)

                http://citeseer.ist.psu.edu/443554.html

You can build a small language model with good characteristics; size doesn't guarantee good WER.

But here people argue that a large language model improves translation quality, although their vocabulary is indeed very large (2M words) compared to yours :) And also remember that translation is a bit different from ASR.

                http://acl.ldc.upenn.edu/D/D07/D07-1090.pdf

                 
                • skatz_teyp

                  skatz_teyp - 2007-11-02

thanks for these articles... they'll be useful as supporting evidence for my research...
and about the code changes, i'll try to share them with you... also my WER and perplexity... though i can't open-source my model.. :) thanks again..

                   

