Hi, I have a question in Section 11.2.2 N-gram language models in SPOKEN LANGUAGE PROCESSING by Huang et al.
On page 553, just above "P(Mary loves that person)" example, authors say that "to make the sum of the probabilities of all strings equal 1, it is necessary to place a distinguished token at the end of the sentence"
Why is that so?
Consider a language where there are only two sentences. "yes no" and "no yes"
S1 = <s> yes no
S2 = <s> no yes
P(S1) = P(yes|<s>)*P(no|yes,<s>)
= Count(<s> yes)/Count(<s>) * Count(<s> yes no)/Count(<s> yes)
= 1/2 * 1/1
P(S2) = P(no|<s>)*P(yes|no,<s>)
= 1/2 * 1/1
There seems to be no need for </s> to make P(S1)+P(S2) = 1. What am I missing?
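For concreteness, here is a small Python sketch of the same counting (my own illustration, not code from the book; the sentence_prob helper is hypothetical):

```python
from collections import Counter

# Toy corpus: both training sentences have the same length.
corpus = [["<s>", "yes", "no"], ["<s>", "no", "yes"]]

# Count every prefix of every sentence so we can form
# Count(history + word) / Count(history) ratios.
counts = Counter()
for sent in corpus:
    for i in range(1, len(sent) + 1):
        counts[tuple(sent[:i])] += 1

def sentence_prob(sent):
    """Chain-rule probability using the prefix counts above."""
    p = 1.0
    for i in range(1, len(sent)):
        p *= counts[tuple(sent[:i + 1])] / counts[tuple(sent[:i])]
    return p

s1 = ["<s>", "yes", "no"]
s2 = ["<s>", "no", "yes"]
print(sentence_prob(s1), sentence_prob(s2))  # 0.5 and 0.5 -> sum to 1
```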
Consider a language where there are sentences of different lengths: you need to be able to estimate the probability of the short sentence as well. Without </s>, the short sentence is also a prefix of a longer one, so their probabilities overlap and the total over all sentences exceeds 1.
Thanks, I got it.
If
S1 = <s> no
S2 = <s> no yes
P(S1)+P(S2) = 1 + 1*0.5 = 1.5
So we need </s> at the end so that
P(S1) = 1 x 0.5
P(S2) = 1 x 0.5 x 1
P(S1)+P(S2) = 1
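A quick Python check of that arithmetic, reusing the same prefix-counting idea as in the sketch above (just an illustration for this toy corpus, not from the book):

```python
from collections import Counter

def sentence_prob(corpus, sent):
    # Count(prefix) for every prefix in the training corpus,
    # then apply the chain rule with those counts.
    counts = Counter(tuple(s[:i]) for s in corpus
                     for i in range(1, len(s) + 1))
    p = 1.0
    for i in range(1, len(sent)):
        p *= counts[tuple(sent[:i + 1])] / counts[tuple(sent[:i])]
    return p

# Without </s>: probabilities of the two sentences sum to 1.5.
no_end = [["<s>", "no"], ["<s>", "no", "yes"]]
print(sum(sentence_prob(no_end, s) for s in no_end))      # 1.0 + 0.5 = 1.5

# With </s>: the mass splits 0.5 + 0.5 and sums to 1.
with_end = [["<s>", "no", "</s>"], ["<s>", "no", "yes", "</s>"]]
print(sum(sentence_prob(with_end, s) for s in with_end))  # 0.5 + 0.5 = 1.0
```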