Hi,
As I understand from the FAQ, the language model lists the probabilities of occurrence of word sequences of length n (n-grams).
My problem is that I don't understand the format of the language model file (.lm file).
For example, this is a sample from a .lm file I have, with n=1:
-2.0170 HELLO -0.2571
-2.0170 HOW -0.2883
How do I convert these numbers into regular probabilities, say from 0 to 1, or into a rank order?
Thanks.
This is a standard format called ARPA. Those numbers are logarithms to the base 10. You can find more info here:
http://msdn.microsoft.com/en-us/library/hh378460(v=office.14).aspx
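For a concrete illustration: in an ARPA unigram entry the first column is log10 of the word's probability and the third column is log10 of its backoff weight. A minimal Python sketch of the conversion, using the sample line from your post:

```python
# ARPA unigram entry: log10(P(word))  WORD  log10(backoff weight)
line = "-2.0170 HELLO -0.2571"
log_prob, word, log_backoff = line.split()

prob = 10 ** float(log_prob)        # P(HELLO) ~= 0.0096
backoff = 10 ** float(log_backoff)  # backoff weight ~= 0.5532

print(f"P({word}) = {prob:.4f}, backoff = {backoff:.4f}")
```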
Thanks.
But I am not sure how this corresponds to the normal probability I expect.
For example, if I have a one-word sentence, like "Hello", and it occurs only once in the text corpus, shouldn't the expected probability be 1/(total number of words, counted with repetitions)?
I tried taking 10^(the numbers on the left or right), but the result is not what I expected.
More clarification would be helpful.
Last edit: ahmedkhalaf92 2013-03-15
The total number of words includes the sentence-end tags (</s>), which you are probably not counting.
You are welcome to provide your exact data for an explanation.
For example, the following text:
The LM:
The prob for hello is 10^(-0.6) ≈ 0.25
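To spell out the counting, here is a toy sketch (the corpus below is hypothetical; it only needs to show where </s> enters the total):

```python
import math
from collections import Counter

# Hypothetical two-sentence corpus; each sentence contributes one </s>
sentences = [["hello"], ["how", "are", "you"]]

counts = Counter()
for sent in sentences:
    counts.update(sent)
    counts["</s>"] += 1  # the sentence-end tag is counted as a word

total = sum(counts.values())  # 6 tokens, not the 4 visible words

# Maximum-likelihood unigram estimates, printed ARPA-style (log10)
for word, c in sorted(counts.items()):
    print(f"{math.log10(c / total):.4f} {word}")
```

Here P(hello) is 1/6 rather than 1/4, purely because the end tags are in the denominator; smoothing (discussed below) shifts the numbers further.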
@ahmedkhalaf92
Unfortunately it's not that straightforward. Various optimizations are done (google for "smoothing" in language models). Which language modelling toolkit are you using? What parameters did you pass to it? You will need to look into the specific implementation in that toolkit.
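To make the smoothing point concrete, here is a minimal sketch using simple add-one (Laplace) smoothing. This is purely illustrative: real toolkits such as SRILM or CMUCLMTK use more sophisticated schemes like Good-Turing or Kneser-Ney discounting, so the numbers in your .lm file will not match raw count ratios either.

```python
from collections import Counter

# Hypothetical one-sentence corpus and vocabulary, for illustration only
tokens = ["hello", "</s>"]
vocab = ["hello", "how", "</s>"]

counts = Counter(tokens)
total = sum(counts.values())

for word in vocab:
    mle = counts[word] / total                           # raw count ratio
    add_one = (counts[word] + 1) / (total + len(vocab))  # smoothed estimate
    print(f"{word}: MLE={mle:.2f}, add-one={add_one:.2f}")
```

The smoothed estimates deliberately take probability mass away from seen words ("hello" drops from 0.50 to 0.40) so that unseen words like "how" get a nonzero probability, which is why 10^(log prob) from the .lm file does not reproduce the simple count ratio.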