From: Christian S. <si...@mi...> - 2004-03-29 14:55:06
|
Hi Laird, On Sun, 28 Mar 2004, Laird Breyer wrote: > Here's the proof (a bit long, so I didn't want to include it earlier). hm, there are some points where I'm not sure what you mean. Not sure if we'll find the time to sort this out but maybe... > Let's first assume that D has finite cardinality |D|. This is the case > if the vocabulary is finite and documents have a maximum size. We can > take the limit afterwards. > > Claim: Assume |D| is finite. Define for any t \in D > > > > Prob( t is spam | t = (t_1,...,t_n) ) =(def) > > > P( T is spam | T = \phi(t_1,...,t_n) ) > > Then there exists no probability measure "Prob" defined on D such > that the above are conditional probabilities for all documents t \in D. > > Proof: > > Let M be the 2x|D| matrix, where the row label represents > {spam,notspam} and the column label represents an enumeration of the > documents t \in D. > > M_st =(def) Prob(d is "s" | d = t) =(def) P(T is "s" | T = \phi(t)), > > Let N be the |D|x2 matrix, where the row label is a document and the > column label is in {spam,notspam}: > > N_tr =(def) Prob(d = t | d is "r") = \prod_k P(T_k = \phi(t)_k | T is "r") This looks more or less as if you're applying the transformation function \phi to each feature t_k in the document t? But you can only apply this function to the whole document t, otherwise you'll get wrong results... > The last product is because on D_5, the terms T_k are independent, > conditionally on the spam/notspamc class. I'm only defining these > matrices so I don't have to write long formulas. > > If the measure Prob on D exists and is a probability, then we must > certainly have > > Prob(d is "s") = \sum_t Prob(d is "s"| d = t) Prob(d = t) What exactly is it you're postulating here? > = \sum_{t,r} Prob(d is "s"| d = t)Prob(d = t|d is "r")Prob(d is "r") > v = MNv > > where v is the column vector v = (Prob(d is "spam"), Prob(d is "notspam"))^T What's T here? The set of features T_k in the document T = \phi(t) ? > and we are summing over the real documents t \in D only. > > Let A = MN, then v is an eigenvector with eigenvalue 1, and v has the > special form v = (a, 1 - a)^T with 0 <= a <= 1. Now > > v_i = \sum_j A_ij v_j, and \sum_i v_i = 1 gives > > \sum_j (\sum_i A_ij) v_j = 1, which simplifies to > > (*) a (\sum_i A_{i,spam}) + (1-a) (\sum_i A_{i,notspam}) = 1. > > The elements of A are of course all nonnegative, and I claim there is > no solution for 0 <= a <= 1. If no such a exists, then there is no > measure Prob on D. > > Here is why no such a exists: Let M_5 and N_5 be defined exactly like > M and N, but on D_5, with respect to the probability measure P which > we agree exists. The only difference is that on D_5, there are many > more documents, but still a finite number. > > Let A_5 = (M_5)(N_5). You can see that \sum_i (A_5)_{i,spam} = 1, > because > > \sum_i (A_5)_{i,spam} = > \sum_{T \in D_5} [ P(spam|T)P(T|spam) + P(notspam|T)P(T|spam) ] > = \sum_{T \in D_5} P(T|spam) = 1. > > where we use P(notspam|T) = 1 - P(spam|T). > Similarly, you can show that \sum_i (A_5)_{i,notspam} = 1. > > But now, notice that > > (**) 1 = \sum_i (A_5)_{i,r} > \sum_i A_{i,r}, > > so in (*), both sums are < 1. It follows that 0 <= a <= 1 cannot exist.[] > > Now that we know that no Prob measure which equals P can exist for D > finite, consider the infinite case. Notice that the proof does not > use the finite nature of D at all. The finite sums on t can be > replaced by infinite sums, and can be interchanged by Fubini's > theorem, because all terms are nonnegative. All we need is to be able > to enumerate the t's on D and D_5. Moreover, the inequality in (**) > is strict even in the infinite case (easy, just use a nonreal > document in D_5 with positive P probability). So the same proof works > in the infinite case. ------------ Christian Siefkes ----------------------------------------- | Email: chr...@si... | Web: http://www.siefkes.net/ | Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/ -------------------- Offline P2P: http://www.leihnetzwerk.de/ ---------- The documentation said "For Windows 95 or better", so I used Linux... ;-) |