Hello everyone... is there a limit to how big a language model can be? As of now, what is the largest language model that has been created?
The French model distributed here has a 65k-word vocabulary, HUB4 has 64k. I wonder whether a bigger size is practical: you have to collect too much training data, decoding will be too slow, and the model will be too big.
Actually, word n-grams are applicable only to a limited number of languages; for Finnish, Turkish, or any other language with rich morphology, the usual n-grams aren't useful at all. You can build a 300k-word n-gram model and it will still have high perplexity. So if you think 60-80k isn't enough, rethink the language modeling approach as a whole.
So you mean to say that having a very big language model isn't that useful? I have gathered very big data as of now, say 10 gigabytes of n-grams (hehe, so big)...
BTW, I'm creating an English language model...
Usefulness depends on the task :) What vocabulary size are you going to recognize, and how many unigrams, bigrams, and trigrams do you have?
Ummm... I have 200k+ words in my vocabulary as of now... about 300 million trigrams (356,443,775 to be exact)... 60+ million bigrams... it is supposed to decode any type of conversation... is it worth doing, or should I limit the size?
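Just to get a feel for what that means in memory, here is a rough back-of-envelope sketch; the bytes-per-entry figures are pure assumptions for illustration, not the exact layout of any Sphinx or ARPA binary format:

/* Back-of-envelope estimate of the in-memory size of the model above.
 * The bytes-per-entry figures are assumptions for illustration only,
 * not the exact layout of any Sphinx or ARPA binary format. */
#include <stdio.h>

int main(void)
{
    const double unigrams = 200000.0;     /* ~200k word vocabulary      */
    const double bigrams  = 60000000.0;   /* 60+ million bigrams        */
    const double trigrams = 356443775.0;  /* trigram count quoted above */

    /* Assumed cost per entry: word ids + probability + backoff weight. */
    const double uni_bytes = 32.0, bi_bytes = 16.0, tri_bytes = 8.0;

    double total = unigrams * uni_bytes
                 + bigrams  * bi_bytes
                 + trigrams * tri_bytes;

    printf("approx. model size: %.1f GB\n",
           total / (1024.0 * 1024.0 * 1024.0));
    return 0;
}

Even with these optimistic assumptions it comes out to a few gigabytes, which is part of why I'm wondering whether to limit it.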
Just another data point:
I wanted to use the FSA formalism of the Sphinx decoder for variable-order n-gram decoding, but it looks like the FSA code in the decoder does a malloc that is the square of the word count, which constrains the vocabulary size to a few thousand.
Hopefully someone will fix the logic and get rid of that malloc.
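To give a rough feel for the scaling, here is just the N-squared argument with an assumed 4 bytes per cell (this is not the actual decoder code):

/* Why a vocab-squared allocation blows up so quickly: an N x N table
 * with an assumed 4 bytes per cell (not the actual Sphinx FSA code). */
#include <stdio.h>

int main(void)
{
    const unsigned long long bytes_per_cell = 4;
    const unsigned long long vocab_sizes[] = { 1000, 5000, 20000, 65000, 200000 };

    for (int i = 0; i < 5; i++) {
        unsigned long long n = vocab_sizes[i];
        unsigned long long bytes = n * n * bytes_per_cell;  /* quadratic in vocab */
        printf("%7llu words -> %8.2f GB\n",
               n, bytes / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}

So anything much beyond a few thousand words already needs gigabytes for that one table.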
Nagendra
Yeah, I've had that problem... I just switched to a 64-bit OS, since malloc can take a 64-bit size argument there, and changed some data types to size_t instead of using int or unsigned int..
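For what it's worth, the overflow the change works around looks roughly like this (a toy sketch, not the real decoder code; the vocabulary size is just an example value):

/* Toy sketch of the overflow the size_t change works around.
 * Not the real decoder code; n_words is just an illustrative value. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    unsigned int n_words = 65000;   /* example vocabulary size */

    /* 32-bit arithmetic: 65000 * 65000 * 4 wraps modulo 2^32, so malloc
     * would be asked for roughly 4 GB instead of the ~17 GB really needed. */
    unsigned int bad_size = n_words * n_words * (unsigned int)sizeof(int);

    /* size_t arithmetic on a 64-bit OS keeps the full quadratic size,
     * although such a huge request may of course still fail. */
    size_t good_size = (size_t)n_words * n_words * sizeof(int);

    printf("unsigned int size: %u bytes\n", bad_size);
    printf("size_t size:       %zu bytes\n", good_size);

    int *table = malloc(good_size);
    if (table == NULL)
        printf("malloc failed for %zu bytes\n", good_size);
    free(table);
    return 0;
}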
> and changed some data types to size_t instead of using int or unsigned int..
Great, it would be nice to get these changes into SVN.
I still wonder whether a big language model is really that much better. What is the perplexity on test data, and was there any WER improvement in the end?
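Just to be explicit about what I mean by perplexity, this is the usual quantity computed from the per-word log10 probabilities on the test set (a minimal sketch of the standard definition, not code from any particular toolkit):

/* Minimal sketch of the standard test-set perplexity computation,
 * given per-word log10 probabilities from the language model.
 * Not code from any particular toolkit. */
#include <math.h>
#include <stdio.h>

double perplexity(const double *log10_probs, int n_words)
{
    double sum = 0.0;
    for (int i = 0; i < n_words; i++)
        sum += log10_probs[i];            /* total log10 P(w_i | history)   */
    return pow(10.0, -sum / n_words);     /* PPL = 10^(-average log10 prob) */
}

int main(void)
{
    /* Hypothetical per-word scores, just to show the call. */
    double scores[] = { -2.1, -1.7, -3.0, -2.4 };
    printf("perplexity: %.1f\n", perplexity(scores, 4));
    return 0;
}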
This article discusses a 200k-word n-gram model:
http://www.limsi.fr/Rapports/RS2005/chm/tlp/tlp14/index.html
The improvement in WER is only 0.3%; for English I suppose it will be even smaller.
Also have a look at :)
http://citeseer.ist.psu.edu/443554.html
You can build a small language model with good characteristics; size doesn't guarantee you a good WER.
But here people argue that a large language model improves translation quality, although their language model is really large (2M words) compared to yours :) Also remember that translation is a bit different from ASR.
http://acl.ldc.upenn.edu/D/D07/D07-1090.pdf
Thanks for these articles... they'll be useful as supporting evidence for my research...
As for the code changes, I'll try to share them with you, along with my WER and perplexity... though I can't open-source my model.. :) Thanks again..