
probability in arpa file format

Help
Manita
2010-08-03
2012-09-22
  • Manita

    Manita - 2010-08-03

    Hi,
    I have a corpus of 5 sentences:
    he ran away in the forest
    i am the one who found it
    he ran away in the garden
    he is my friend
    i am in the room

    When I compute the probability of "am the" with the bigram formula
    P(Wn | Wn-1) = P(Wn-1, Wn) / P(Wn-1), I get
    prob(the | am) = 1 / 2 = 0.5, which equals -0.3010299956639811 after taking
    the log base 10 of 0.5.
    But using the CMU language modelling toolkit I get prob(the | am) = -0.6812
    in the ARPA file format.
    How is that possible?
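    The maximum-likelihood number above can be reproduced with a short script;
    a sketch using the five sentences from the post:

```python
import math
from collections import Counter

# The five training sentences from the post.
corpus = [
    "he ran away in the forest",
    "i am the one who found it",
    "he ran away in the garden",
    "he is my friend",
    "i am in the room",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

# Maximum-likelihood estimate: P(the | am) = c(am the) / c(am) = 1 / 2
p = bigrams[("am", "the")] / unigrams["am"]
print(p, math.log10(p))  # 0.5 and about -0.30103
```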

     
  • Nickolay V. Shmyrev

    Discounting modifies the probabilities, doesn't it?

     
  • Manita

    Manita - 2010-08-03

    Yes, I did it using the Good-Turing formula P(w2 | w1) = c* / N, where
    c* = (r + 1) * n(r+1) / n(r) is the expected number of times the event
    occurs, N is the total number of bigrams, and r is the number of times the
    bigram occurs. Here I have
    P(the | am) = (2 * 4 / 15) / 20. I then took the log base 10 of this
    estimated probability and got the answer -1.57403126..., which is not the
    correct one. Please tell me what I should do.
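    The -1.574 value can be reproduced from the bigram frequency-of-frequency
    counts that the toolkit prints later in this thread (n1 = 15, n2 = 4); a
    sketch:

```python
import math

# Bigram frequency-of-frequency counts from the toolkit's output:
# n[r] = number of distinct bigrams seen exactly r times.
n = {1: 15, 2: 4}
N = 20          # total number of bigram tokens in the corpus
r = 1           # "am the" occurs once

# Good-Turing expected count: c* = (r + 1) * n[r + 1] / n[r]
c_star = (r + 1) * n[r + 1] / n[r]   # 2 * 4 / 15, about 0.5333
p = c_star / N                       # about 0.0267
print(math.log10(p))                 # about -1.574, the value in the post
```

    This reproduces the post's number, but the toolkit normalizes differently
    (it divides the discounted count by the marginal count c(am), as explained
    further down the thread), which is why the ARPA value differs.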

     
  • Manita

    Manita - 2010-08-05

    I have the snapshot of arpa file when i model these 5 sentences using cmu
    language model toolkit-
    he ran away in the forest
    i am the one who found it
    he ran away in the garden
    he is my friend
    i am in the room

    This is a 3-gram language model, based on a vocabulary of 17 words,
    which begins "am", "away", "forest"...
    This is an OPEN-vocabulary model (type 1)
    (OOVs were mapped to UNK, which is treated as any other vocabulary word)
    Good-Turing discounting was applied.
    1-gram frequency of frequency : 9
    2-gram frequency of frequency : 15 4 1 0 0 0 0
    3-gram frequency of frequency : 20 3 0 0 0 0 0
    1-gram discounting ratios : 0.82
    2-gram discounting ratios : 0.42 0.22
    3-gram discounting ratios : 0.00
    This file is in the ARPA-standard format introduced by Doug Paul.

    p(wd3|wd1,wd2) = if (trigram exists)            p_3(wd1,wd2,wd3)
                     else if (bigram w1,w2 exists)  bo_wt_2(w1,w2) * p(wd3|wd2)
                     else                           p(wd3|wd2)

    p(wd2|wd1) = if (bigram exists)  p_2(wd1,wd2)
                 else                bo_wt_1(wd1) * p_1(wd2)

    All probs and back-off weights (bo_wt) are given in log10 form.
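    A minimal sketch of the bigram lookup rule above, assuming the model is
    held in plain dictionaries; the two bigram entries come from the ARPA
    snapshot in this thread, but the unigram log-probabilities and backoff
    weight below are made-up illustrative values, not the real model's:

```python
# log10 bigram probabilities from the ARPA snapshot in this thread;
# p1 and bo1 are made-up illustrative values, NOT from the real model.
p2 = {("am", "in"): -0.6812, ("am", "the"): -0.6812}   # log10 p_2(wd1, wd2)
p1 = {"in": -1.0, "the": -1.0}                          # log10 p_1(wd2)
bo1 = {"am": 0.1}                                       # log10 bo_wt_1(wd1)

def log_p_bigram(w1, w2):
    """p(wd2 | wd1): use the bigram entry if it exists, else back off."""
    if (w1, w2) in p2:
        return p2[(w1, w2)]
    # Everything is in the log10 domain, so multiplication becomes addition.
    return bo1.get(w1, 0.0) + p1[w2]

print(log_p_bigram("am", "the"))   # -0.6812: bigram entry exists
print(log_p_bigram("the", "in"))   # backs off: bo_wt_1(the) + p_1(in)
```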

    Data formats:

    Beginning of data mark: \data\
    ngram 1=nr # number of 1-grams
    ngram 2=nr # number of 2-grams
    ngram 3=nr # number of 3-grams

    \1-grams:
    p_1 wd_1 bo_wt_1
    \2-grams:
    p_2 wd_1 wd_2 bo_wt_2
    \3-grams:
    p_3 wd_1 wd_2 wd_3

    end of data mark: \end\

    \data\
    ngram 1=18
    ngram 2=20
    ngram 3=23

    \2-grams:
    -0.6812 am in 0.6021
    -0.6812 am the 0.0649
    -0.1249 away in 0.1249
    -0.3802 forest i 0.1072

    As you said I should use discounted probabilities, I calculated the
    probability P(in | am) using the formula

    Pabs(y | x) = (c(xy) - D) / c(x), where D is the discount derived from
    c* = ((c + 1) * N(c+1)) / N(c).

    I got the answer P(in | am) = -0.491844..., which does not match the
    probability given in the snapshot above.

    So please tell me: how are the discount ratios and the backoff weights
    calculated, and what is their significance in calculating probabilities?
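    On the backoff-weight part of the question: the backoff weight is not
    arbitrary; it is a normalizer chosen so that the conditional distribution
    still sums to one after discounting frees some probability mass. A hedged
    sketch with made-up numbers (the exact formulas are in the SRILM
    ngram-discount documentation):

```python
# Made-up illustrative numbers, not the real model's values.
# Discounted probabilities P*(w2 | "am") for the bigrams actually seen:
discounted = {"in": 0.2083, "the": 0.2083}
# Unigram probabilities of those same two words:
unigram_p = {"in": 0.06, "the": 0.10}

leftover = 1.0 - sum(discounted.values())    # mass freed by discounting
unseen_mass = 1.0 - sum(unigram_p.values())  # unigram mass of unseen w2
alpha = leftover / unseen_mass               # backoff weight for "am"
print(alpha)  # scales p_1(w2) for every w2 not seen after "am"
```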

     
  • Nickolay V. Shmyrev

    Sorry, I didn't have time to look at this in detail. My general thought is
    that it's probably easier to use a debugger to find out how the
    calculations are made, or just to look at the code. Just trying to
    reproduce the math isn't really productive.

    Pabs(y | x) = (c(xy) - D) / c(x), where c* = ((c + 1) * N(c+1)) / N(c).
    I got the answer P(in | am) = -0.491844...

    This doesn't look like Good-Turing discounting to me, more like Kneser-Ney.
    By default idngram2lm applies Good-Turing discounting, and your model was
    built with Good-Turing. Kneser-Ney is not implemented.

    It's probably better to refer to formulas here

    http://www-speech.sri.com/projects/srilm/manpages/ngram-discount.7.html

    You can probably check how Kneser-Ney works with SRILM, but don't forget to
    set the cutoffs there manually.

     
  • Nickolay V. Shmyrev

    FYI, 0.6812 is calculated this way:

    -0.6812 = log10(0.416667 / 2)

    where 0.416667 is discounted_ng_count and 2 is the marginal count (the
    count of "am"). The discounted_ng_count 0.416667 is calculated as 0.416667
    (the GT discount ratio for bigrams of count 1) times 1 (the "am in" count).
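    That calculation checks out numerically; a sketch:

```python
import math

d1 = 0.416667    # GT discount ratio for bigrams of count 1 (header shows 0.42)
c_am_in = 1      # c("am in")
c_am = 2         # marginal count c("am")

discounted_ng_count = d1 * c_am_in
log_p = math.log10(discounted_ng_count / c_am)
print(round(log_p, 4))  # -0.6812, the ARPA entry for "am in" (and "am the")
```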

     
  • Manita

    Manita - 2010-08-06

    Thanks for the reply...

     
  • vkumar

    vkumar - 2010-08-06

    Thanks for starting the topic.

    In the above:
    "0.416 (discounted_ng_count) is calculated as 0.416 (GT discount ratio for
    bigrams of count 1) * 1 ("am in" count)."

    How is the value of discounted_ng_count, which is 0.416 here, calculated?
    I mean, which formula should I apply? I tried many but was unable to get
    the answer.

    Please reply.

     
  • Nickolay V. Shmyrev

    The formula is the same as in the ngram-discount document I quoted above. It
    uses frequencies of frequencies:

    http://www-speech.sri.com/projects/srilm/manpages/ngram-discount.7.html

            first_term = ((double) ((r+1) * freq_of_freq[r+1]))
                                 / (r * freq_of_freq[r]);
            D[r] = (first_term - common_term) / (1.0 - common_term);
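    Plugging the bigram frequency-of-frequency counts from the model header
    (15, 4, 1) into this formula reproduces the printed ratios; a sketch
    assuming gtmax = 2 for the bigrams of this model:

```python
# Bigram frequency-of-frequency counts from the header: n1=15, n2=4, n3=1.
freq_of_freq = {1: 15, 2: 4, 3: 1}
gtmax = 2  # assumed from the two ratios printed for bigrams

# common_term = (gtmax + 1) * n(gtmax + 1) / n(1) = 3 * 1 / 15 = 0.2
common_term = (gtmax + 1) * freq_of_freq[gtmax + 1] / freq_of_freq[1]

D = {}
for r in range(1, gtmax + 1):
    first_term = (r + 1) * freq_of_freq[r + 1] / (r * freq_of_freq[r])
    D[r] = (first_term - common_term) / (1.0 - common_term)

print(D)  # D[1] ~ 0.4167, D[2] ~ 0.2188 -> printed as "0.42 0.22"
```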
    
     
  • Manita

    Manita - 2010-08-10

    Here common_term = ((gtmax + 1) * n(gtmax+1)) / n(1).

    To find the probability p(in | am) I took the value of gtmax as 2 and got
    the correct answer, but if I use gtmax = 2 to find the probabilities
    p(ran | he), p(is | he) and p(away | ran), then I get the wrong answer.
    So please tell me how to choose the value of gtmax in each case: unigram,
    bigram and trigram?
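    A hedged observation on gtmax: in this model the number of discount ratios
    printed per order appears to equal gtmax for that order (one ratio for
    unigrams, two for bigrams, one for trigrams). For example, taking
    gtmax = 1 for the trigrams reproduces the printed 0.00:

```python
# Trigram frequency-of-frequency counts from the header: n1=20, n2=3.
n = {1: 20, 2: 3}
gtmax = 1  # assumed from the single ratio printed for trigrams

common_term = (gtmax + 1) * n[gtmax + 1] / n[1]   # 2 * 3 / 20 = 0.3
first_term = (1 + 1) * n[2] / (1 * n[1])          # also 0.3
D1 = (first_term - common_term) / (1.0 - common_term)
print(D1)  # 0.0 -> matches "3-gram discounting ratios : 0.00"
```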

     
  • Manita

    Manita - 2010-08-16

    Hi, I am still stuck on the problem of gtmax.
    Please answer.

     
