
Extracting word probability from the language model file

2013-03-15
2013-03-17
  • ahmedkhalaf92

    ahmedkhalaf92 - 2013-03-15

    Hi,

    As I understand from the FAQ, the language model lists the probabilities of occurrences of sequences of size n (n-gram).

    My problem is that I don't understand the format of the language model file (.lm file).

    For example, this is a sample from a .lm file I have, with n=1:

    -2.0170 HELLO -0.2571
    -2.0170 HOW -0.2883

    How do I convert this format into a regular probability, say from 0 to 1, or a rank order?

    Thanks.

     
    • Pranav Jawale

      Pranav Jawale - 2013-03-15

      This is a standard format called ARPA. Those numbers are base-10
      logarithms of the probabilities. You can find more info here:
      http://msdn.microsoft.com/en-us/library/hh378460(v=office.14).aspx
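
      For example, here is a minimal sketch (plain Python, nothing sphinx4-specific) of converting the two numbers from the sample above back to linear values:

      # First column: log10 of the word's probability.
      # Last column: log10 of the back-off weight (not a probability).
      log_prob = -2.0170   # HELLO's log10 probability
      backoff = -0.2571    # HELLO's log10 back-off weight

      prob = 10 ** log_prob    # linear probability, between 0 and 1
      print(prob)              # ~0.00962
      print(10 ** backoff)     # back-off weight in linear form, ~0.553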


       
      • ahmedkhalaf92

        ahmedkhalaf92 - 2013-03-15

        Thanks.

        But I am not sure how this corresponds to the normal probability I expect.

        For example, if I have a one-word sentence, like "Hello", and it occurs only once in the text corpus, then the expected probability should be 1/(total number of words, with repetitions), right?

        I tried taking 10^(the numbers on the left or right), but the result is not what I expected.

        More clarification would be helpful.

         

        Last edit: ahmedkhalaf92 2013-03-15
  • Nickolay V. Shmyrev

    So, the expected probability should be 1/(total number of words with repetitions), right?

    The total number of words includes the sentence-end tags (</s>), which you probably did not count.

    You are welcome to provide your exact data for a more specific explanation.

    For example the following text:

    <s> hello </s>
    <s> world </s>
    

    The lm:

    \1-grams:
    -0.30103    </s>
    -99 <s>
    -0.60206    hello
    -0.60206    world
    

    The probability for hello is 10^(-0.60206) = 0.25: hello occurs once among the four counted tokens (hello, world, and the two </s> tags). <s> is not counted; its -99 score is just a placeholder, since the probability of the start tag itself is never needed.
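
    As a minimal sketch (plain Python, not the CMUSphinx API; the filename example.lm is a placeholder), here is how you could read the \1-grams: section of an ARPA file and print each word's linear probability:

    def unigram_probs(path):
        """Return {word: linear probability} from the \\1-grams: section."""
        probs = {}
        in_unigrams = False
        with open(path) as f:
            for raw in f:
                line = raw.strip()
                if line == "\\1-grams:":
                    in_unigrams = True
                    continue
                # a new section header or a blank line ends the 1-gram block
                if in_unigrams and (not line or line.startswith("\\")):
                    break
                if in_unigrams:
                    fields = line.split()
                    # fields: log10 probability, word, optional back-off weight
                    probs[fields[1]] = 10 ** float(fields[0])
        return probs

    print(unigram_probs("example.lm"))
    # for the lm above: </s> -> 0.5, <s> -> 1e-99, hello -> 0.25, world -> 0.25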

     
  • Pranav Jawale

    Pranav Jawale - 2013-03-16

    @ahmedkhalaf92

    Unfortunately it's not that straightforward: the toolkit applies various optimizations, notably smoothing (google for "smoothing" in language models). Which language modelling toolkit are you using? What parameters did you pass to it? You will need to look into the specific implementation in that toolkit.
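
    To illustrate, here is a toy sketch using add-one (Laplace) smoothing, chosen only because it is the simplest scheme; real toolkits typically use more elaborate methods such as Good-Turing or Kneser-Ney. Smoothing adjusts the counts before the log is taken, so the stored log10 value no longer equals log10(count/total):

    import math

    # Unigram counts from Nickolay's two-sentence example above.
    counts = {"hello": 1, "world": 1, "</s>": 2}
    total = sum(counts.values())
    vocab = len(counts)

    for word, c in counts.items():
        mle = c / total                       # raw maximum-likelihood estimate
        smoothed = (c + 1) / (total + vocab)  # add-one smoothed estimate
        print(word, math.log10(mle), math.log10(smoothed))
    # hello: log10(0.25) = -0.602 vs log10(2/7) = -0.544, so the numbers differ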

     
