Hi,
I am using the CMU Sphinx online LMTool to generate a language model. Which type of model does it produce, open or closed vocabulary? I have seen some information on open and closed vocabulary language models.
How does that difference affect recognition, in both performance and accuracy?
Regards,
Kalpana Challagulla
An open vocabulary language model contains the special word <unk> and allows scoring the probability of word sequences over an arbitrary vocabulary: the probability of any unknown word is replaced with the probability of the <unk> tag. This is useful for some applications but not applicable to speech recognition, since a speech recognizer needs to know not just the probability of a word but also its pronunciation.
In our decoders the <unk> word is not used at all, and any open vocabulary language model is essentially converted to a closed vocabulary one.
So it doesn't matter which kind of language model you generate; it will be treated as a closed vocabulary model anyway.
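The distinction can be sketched with a toy unigram model (the words and probabilities below are made up for illustration): an open vocabulary model falls back to P(<unk>) for unseen words, while a closed vocabulary model simply cannot score them.

```python
import math

# Toy unigram model with hypothetical probabilities; "<unk>" is the
# special word an open vocabulary model reserves for unseen words.
probs = {"hello": 0.4, "world": 0.3, "<unk>": 0.05}

def score_open(words):
    """Open vocabulary: unknown words fall back to P(<unk>)."""
    return sum(math.log(probs.get(w, probs["<unk>"])) for w in words)

def score_closed(words):
    """Closed vocabulary: unknown words cannot be scored at all."""
    if any(w not in probs or w == "<unk>" for w in words):
        raise KeyError("out-of-vocabulary word")
    return sum(math.log(probs[w]) for w in words)

print(score_open(["hello", "martians"]))   # OOV word scored via <unk>
```

This is why the open/closed distinction matters for perplexity but not for decoding: the decoder needs a pronunciation for every word it can output, and <unk> has none.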
Thank you Nickolay, that clarifies it for me.
I also need to know how the perplexity score is calculated, how it differs between text that is covered by the language model and text that is not, and what the correct way is to compute the perplexity score of a given language model for evaluation.
Perplexity depends on both the language model and the text; perplexity on the training text is obviously smaller. You cannot evaluate the perplexity of a model alone, you need to submit some text for evaluation. For the best estimate this text should be in-domain but should not contain sentences from the training text.
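For a concrete definition: perplexity is 2 raised to the per-word cross-entropy of the text under the model, i.e. the average negative log2 probability per word. A minimal sketch, assuming you already have the model's probability for each word of the test text:

```python
import math

def perplexity(word_probs):
    """Perplexity = 2 ** (average negative log2 probability per word)."""
    entropy = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** entropy

# A model that assigns 1/4 to every word of a text has perplexity 4:
# it is as "perplexed" as if choosing uniformly among 4 words.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```

Words the model has seen in training get high probabilities, so the training text yields a low perplexity; held-out in-domain text gives the honest number.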
Thank you Nickolay
Using CMUCLMTK, I am able to get a perplexity score for a small, specific language model generated with the CMUCLMTK steps; the result is below:
evallm :
perplexity -text test2.txt
Computing perplexity of the language model with respect
to the text test2.txt
Perplexity = 2.90, Entropy = 1.54 bits
Computation based on 9 words.
Number of 3-grams hit = 2 (22.22%)
Number of 2-grams hit = 2 (22.22%)
Number of 1-grams hit = 5 (55.56%)
0 OOVs (0.00%) and 2 context cues were removed from the calculation.
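As a sanity check on this output, evallm's Perplexity is just 2 raised to the reported Entropy, with small differences due to independent rounding of the two numbers:

```python
entropy_bits = 1.54  # "Entropy = 1.54 bits" from the evallm output above
print(round(2 ** entropy_bits, 2))  # 2.91, matching the reported 2.90 up to rounding
```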
Can we get a perplexity score for the generic English language model available on the CMU site against our own text?
How can we do the same for the language models available in CMUSphinx, such as cmusphinx-5.0-en-us.lm?
Download the LM in ARPA format, unpack it, and evaluate perplexity in the same way:
http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English%20Generic%20Language%20Model/cmusphinx-5.0-en-us.lm.gz/download
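The download is a gzipped ARPA file, so any gzip tool will unpack it; for instance, a small Python helper (the filenames in the comment are taken from the download link):

```python
import gzip
import shutil

def unpack_gz(src, dst):
    """Decompress a gzipped file (e.g. the downloaded ARPA model) to dst."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)

# e.g. unpack_gz("cmusphinx-5.0-en-us.lm.gz", "cmusphinx-5.0-en-us.lm")
```

The unpacked .lm file is plain-text ARPA and can be passed to evallm with -arpa as in the small-model example above.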
On unpacking the above-mentioned model I see a .dmp model, not ARPA format. Can we directly convert the .dmp language model to an .arpa language model using sphinx_lm_convert.exe?
You need to double-check what you are doing. The link above points to a gzipped ARPA model.
Yes
Hi Nickolay,
As already discussed, I am able to calculate the perplexity score for the small, specific language models we have generated, but when I tried to calculate the same for the en-us generic language model downloaded from the CMU site, I got the error below. Can you please help me overcome this?
D:\CMUSphinix\CMUCLMTK\cmuclmtk-0.7-win32\cmuclmtk-0.7-win32>evallm.exe -arpa lm.arpa
Reading in language model from file lm.arpa
Reading in a 3-gram language model.
Number of 1-grams = 19794.
Number of 2-grams = 1377200.
Number of 3-grams = 3178194.
Reading unigrams...
Warning, reading line -
- gave unexpected input.
Reading 2-grams...
....Error - Repeated 2-gram in ARPA format language model.
Regards,
Kalpana
There might be a bug in cmuclmtk. Use SRILM instead.