Hi,
I have a corpus of 5 sentences:
he ran away in the forest
i am the one who found it
he ran away in the garden
he is my friend
i am in the room
When I find the probability of 'am the' using the bigram formula P(Wn | Wn-1) = C(Wn-1, Wn) / C(Wn-1), I get
prob(the | am) = 1 / 2 = 0.5, which equals -0.3010299956639811 after taking the log base 10 of 0.5.
But using the CMU language modelling toolkit I got prob(the | am) = -0.6812 in the ARPA file.
How is that possible?
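As a sanity check, the raw maximum-likelihood number above can be reproduced in a few lines of Python. This is a sketch that counts over the bare sentences as posted; it ignores any sentence-boundary tokens the toolkit may add internally:

```python
import math
from collections import Counter

corpus = [
    "he ran away in the forest",
    "i am the one who found it",
    "he ran away in the garden",
    "he is my friend",
    "i am in the room",
]

# Count unigrams and bigrams over the raw sentences.
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

# Maximum-likelihood estimate: P(the | am) = C(am the) / C(am) = 1 / 2
p = bigrams[("am", "the")] / unigrams["am"]
print(p, math.log10(p))  # 0.5 -0.3010299956639812
```

This confirms the hand calculation; the discrepancy with the ARPA file comes from discounting, as the replies below explain.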
Discounting modifies the probabilities, doesn't it?
Yes, I did that using the Good-Turing formula P(w2 | w1) = r* / N, where r* is the expected number of times that event occurs, N is the total number of bigrams, and r is the number of times that bigram occurs, with r* = (r + 1) n(r+1) / n(r). Here I have
P(the | am) = r* / 20. I then took the log base 10 of this estimated probability and got the answer -1.57403126..., which is not the correct one. Please tell me what I should do.
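That -1.574 value can be reproduced from the Good-Turing formula above. A quick check, taking the bigram frequency-of-frequency values n(1) = 15 and n(2) = 4 from the toolkit output quoted later in the thread, and N = 20 bigrams total:

```python
import math

N = 20                 # total number of bigram tokens in the corpus
n = {1: 15, 2: 4}      # bigram frequency of frequencies: n[r] = number of bigrams seen r times
r = 1                  # "am the" occurs once

# Good-Turing expected count: r* = (r + 1) * n(r+1) / n(r)
r_star = (r + 1) * n[r + 1] / n[r]   # 2 * 4 / 15 = 0.5333...
p = r_star / N                       # 0.02666...
print(math.log10(p))                 # -1.5740312677277088
```

So the -1.574 figure is the raw Good-Turing estimate; the toolkit's -0.6812 additionally involves the discounting-ratio normalization discussed below.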
Here is a snapshot of the ARPA file produced when I model these 5 sentences with the CMU language model toolkit:
he ran away in the forest
i am the one who found it
he ran away in the garden
he is my friend
i am in the room
This is a 3-gram language model, based on a vocabulary of 17 words,
which begins "am", "away", "forest"...
This is an OPEN-vocabulary model (type 1)
(OOVs were mapped to UNK, which is treated as any other vocabulary word)
Good-Turing discounting was applied.
1-gram frequency of frequency : 9
2-gram frequency of frequency : 15 4 1 0 0 0 0
3-gram frequency of frequency : 20 3 0 0 0 0 0
1-gram discounting ratios : 0.82
2-gram discounting ratios : 0.42 0.22
3-gram discounting ratios : 0.00
This file is in the ARPA-standard format introduced by Doug Paul.
p(wd3|wd1,wd2)= if(trigram exists) p_3(wd1,wd2,wd3)
else if(bigram w1,w2 exists) bo_wt_2(w1,w2)*p(wd3|wd2)
else p(wd3|w2)
p(wd2|wd1)= if(bigram exists) p_2(wd1,wd2)
else bo_wt_1(wd1)*p_1(wd2)
All probs and back-off weights (bo_wt) are given in log10 form.
Data formats:
Beginning of data mark: \data\
ngram 1=nr # number of 1-grams
ngram 2=nr # number of 2-grams
ngram 3=nr # number of 3-grams
\1-grams:
p_1 wd_1 bo_wt_1
\2-grams:
p_2 wd_1 wd_2 bo_wt_2
\3-grams:
p_3 wd_1 wd_2 wd_3
end of data mark: \end\
\data\
ngram 1=18
ngram 2=20
ngram 3=23
\2-grams:
-0.6812 am in 0.6021
-0.6812 am the 0.0649
-0.1249 away in 0.1249
-0.3802 forest i 0.1072
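The backoff rule quoted from the file header can be sketched as a lookup over the log10 tables. In the sketch below the bigram entries come from the file excerpt above, but the unigram probabilities and backoff weights are illustrative placeholders, since the \1-grams section is not shown in the post:

```python
# ARPA-style bigram lookup; all stored values are log10.
# unigrams: word -> (log10 prob p_1, backoff weight bo_wt_1)
# bigrams:  (w1, w2) -> log10 prob p_2
unigrams = {"am": (-1.0, -0.2), "in": (-0.9, 0.0), "the": (-0.8, 0.0)}
bigrams = {("am", "in"): -0.6812, ("am", "the"): -0.6812}

def log_p_bigram(w1, w2):
    """p(w2 | w1) per the ARPA rule: use the stored bigram if it exists,
    otherwise back off to bo_wt_1(w1) * p_1(w2) -- addition in log space."""
    if (w1, w2) in bigrams:
        return bigrams[(w1, w2)]
    return unigrams[w1][1] + unigrams[w2][0]

print(log_p_bigram("am", "in"))   # -0.6812  (stored bigram)
print(log_p_bigram("am", "am"))   # -1.2     (backed off: -0.2 + -1.0)
```

The backoff weight is what makes the model sum to 1 after discounting: the probability mass removed from seen bigrams is redistributed, via bo_wt_1, over the unseen ones.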
As you said that I should use discounted probabilities, I calculated P(in | am) using the formula
Pabs(y | x) = (c(xy) − D) / c(x), where D is calculated from c(xy) and c*, and c* = (c + 1) N(c+1) / N(c).
I got the answer P(in | am) = -0.491844..., which does not match the probability given in the snapshot above.
So please tell me how the discount ratios and backoff weights are calculated. What is their significance in calculating the probabilities?
Sorry, I didn't have time to look at this in detail. My general thought is that it's probably easier to use a debugger to find out how the calculations are made, or just look at the code. Just trying to reproduce the math isn't really productive.
This doesn't look like Good-Turing discounting to me, more like Kneser-Ney. By default idngram2lm applies Good-Turing discounting, and your model was built with Good-Turing. Kneser-Ney is not implemented.
It's probably better to refer to the formulas here:
http://www-speech.sri.com/projects/srilm/manpages/ngram-discount.7.html
You can probably check how Kneser-Ney works with SRILM, but don't forget to set the cutoffs there manually.
FYI, -0.6812 is calculated this way:
-0.6812 = log10(0.416667 / 2)
where 0.416667 is the discounted_ng_count and 2 is the marginal count (the count of "am"). The discounted_ng_count of 0.416667 is calculated as 0.416667 (the GT discount ratio for bigrams of count 1) × 1 (the "am in" count).
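Putting the numbers from this explanation together (discount ratio 0.416667 for count-1 bigrams, bigram count 1 for "am in", marginal count 2 for "am") reproduces the ARPA entry exactly:

```python
import math

gt_discount_ratio = 0.416667   # GT discount ratio for bigrams of count 1 (printed as 0.42 in the header)
bigram_count = 1               # c("am in") -- also c("am the"), hence the identical entries
marginal_count = 2             # c("am")

discounted_ng_count = gt_discount_ratio * bigram_count
log_p = math.log10(discounted_ng_count / marginal_count)
print(round(log_p, 4))         # -0.6812, as stored in the ARPA file
```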
Thanks for the reply.
Thanks for starting the topic.
Regarding the above:
"0.416667 (discounted_ng_count) is calculated as 0.416667 (the GT discount ratio for bigrams of count 1) × 1 (the "am in" count)."
How is the value of discounted_ng_count, here 0.416667, calculated? I mean, which formula applies? I tried many but was unable to get the answer.
Please reply.
The formula is the same as in the ngram-discount document I quoted above. It uses frequencies of frequencies:
http://www-speech.sri.com/projects/srilm/manpages/ngram-discount.7.html
Here the common term is A = ((gtmax + 1) × n(gtmax+1)) / n(1).
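Applying that formula with the bigram frequency-of-frequency counts from the model header (n(1) = 15, n(2) = 4, n(3) = 1) reproduces the printed discount ratios, assuming gtmax = 2 for bigrams in this model (an assumption inferred from the numbers matching, not from toolkit documentation):

```python
# Katz / Good-Turing discounting as in the SRILM ngram-discount page:
#   common term A = (gtmax + 1) * n(gtmax+1) / n(1)
#   discount ratio d_r = (r*/r - A) / (1 - A), with r* = (r + 1) * n(r+1) / n(r)
n = {1: 15, 2: 4, 3: 1}   # bigram frequency of frequencies from the model header
gtmax = 2                  # assumed cutoff; counts above gtmax are left undiscounted

A = (gtmax + 1) * n[gtmax + 1] / n[1]        # 3 * 1 / 15 = 0.2
ratios = {}
for r in (1, 2):
    r_star = (r + 1) * n[r + 1] / n[r]       # Good-Turing expected count
    ratios[r] = (r_star / r - A) / (1 - A)
print(round(ratios[1], 2), round(ratios[2], 2))   # 0.42 0.22, matching the header
```

With d(1) = 0.416667, the "am the" entry then follows as log10(0.416667 × 1 / 2) = -0.6812, exactly the FYI calculation earlier in the thread.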
To find the probability p(in | am) I took the value of gtmax as 2 and got the correct answer, but if I use this value gtmax = 2 to find the probabilities p(ran | he), p(is | he) and p(away | ran), I get the wrong answer.
So please tell me how the value of gtmax is chosen in each case: unigram, bigram and trigram?
Hi, I am still stuck on the gtmax problem.
Please answer.