
Extracting word probability from the language model file

2013-03-15
2013-03-17
  • ahmedkhalaf92

    ahmedkhalaf92 - 2013-03-15

    Hi,

    As I understand from the FAQ, the language model lists the probabilities of occurrences of sequences of size n (n-gram).

    My problem is that I don't understand the format of the language model file (.lm file).

    For example, this is a sample from a .lm file I have, with n=1:

    -2.0170 HELLO -0.2571
    -2.0170 HOW -0.2883

    How do I convert this format into a regular probability, say from 0 to 1, or a rank order?

    Thanks.

     
    • Pranav Jawale

      Pranav Jawale - 2013-03-15

      This is a standard format called ARPA. Those numbers are base-10
      logarithms of the probabilities. You can find more info here:
      http://msdn.microsoft.com/en-us/library/hh378460(v=office.14).aspx
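
      For example, here is a minimal sketch (plain Python, nothing sphinx4-specific) of converting the two numbers from the sample above back to linear values:

      # First column: log10 of the word's probability.
      # Last column: log10 of the back-off weight (not a probability).
      log_prob = -2.0170   # HELLO's log10 probability
      backoff = -0.2571    # HELLO's log10 back-off weight

      prob = 10 ** log_prob    # linear probability, between 0 and 1
      print(prob)              # ~0.00962
      print(10 ** backoff)     # back-off weight in linear form, ~0.553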


       
      • ahmedkhalaf92

        ahmedkhalaf92 - 2013-03-15

        Thanks.

        But I am not sure how this corresponds to the normal probability I expect.

        For example, if I have a one-word sentence, like "Hello", and it occurs only once in the text corpus, then the expected probability should be 1/(total number of words, with repetitions), right?

        I tried taking 10^(the numbers on the left or right), but the result is not what I expected.

        More clarification would be helpful.

         

        Last edit: ahmedkhalaf92 2013-03-15
  • Nickolay V. Shmyrev

    So, the expected probability should be 1/(total number of words with repetitions), right?

    The total number of words includes the sentence-end tags (</s>), which you probably did not count.

    You are welcome to provide your exact data for a more specific explanation.

    For example the following text:

    <s> hello </s>
    <s> world </s>
    

    The lm:

    \1-grams:
    -0.30103    </s>
    -99 <s>
    -0.60206    hello
    -0.60206    world
    

    The probability for hello is 10^(-0.60206) = 0.25: hello occurs once among the four counted tokens (hello, world, and the two </s> tags). <s> is not counted; its -99 score is just a placeholder, since the probability of the start tag itself is never needed.
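
    As a minimal sketch (plain Python, not the CMUSphinx API; the filename example.lm is a placeholder), here is how you could read the \1-grams: section of an ARPA file and print each word's linear probability:

    def unigram_probs(path):
        """Return {word: linear probability} from the \\1-grams: section."""
        probs = {}
        in_unigrams = False
        with open(path) as f:
            for raw in f:
                line = raw.strip()
                if line == "\\1-grams:":
                    in_unigrams = True
                    continue
                # a new section header or a blank line ends the 1-gram block
                if in_unigrams and (not line or line.startswith("\\")):
                    break
                if in_unigrams:
                    fields = line.split()
                    # fields: log10 probability, word, optional back-off weight
                    probs[fields[1]] = 10 ** float(fields[0])
        return probs

    print(unigram_probs("example.lm"))
    # for the lm above: </s> -> 0.5, <s> -> 1e-99, hello -> 0.25, world -> 0.25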

     
  • Pranav Jawale

    Pranav Jawale - 2013-03-16

    @ahmedkhalaf92

    Unfortunately it's not that straightforward: the toolkit applies various optimizations, notably smoothing (google for "smoothing" in language models). Which language modelling toolkit are you using? What parameters did you pass to it? You will need to look into the specific implementation in that toolkit.
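
    To illustrate, here is a toy sketch using add-one (Laplace) smoothing, chosen only because it is the simplest scheme; real toolkits typically use more elaborate methods such as Good-Turing or Kneser-Ney. Smoothing adjusts the counts before the log is taken, so the stored log10 value no longer equals log10(count/total):

    import math

    # Unigram counts from Nickolay's two-sentence example above.
    counts = {"hello": 1, "world": 1, "</s>": 2}
    total = sum(counts.values())
    vocab = len(counts)

    for word, c in counts.items():
        mle = c / total                       # raw maximum-likelihood estimate
        smoothed = (c + 1) / (total + vocab)  # add-one smoothed estimate
        print(word, math.log10(mle), math.log10(smoothed))
    # hello: log10(0.25) = -0.602 vs log10(2/7) = -0.544, so the numbers differ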

     
