From: Laird B. <lb...@us...> - 2004-03-30 00:08:46
On Mar 29 2004, Christian Siefkes wrote:
> > Let M be the 2x|D| matrix, where the row label represents
> > {spam,notspam} and the column label represents an enumeration of the
> > documents t \in D.
> >
> > M_st =(def) Prob(d is "s" | d = t) =(def) P(T is "s" | T = \phi(t)),
> >
> > Let N be the |D|x2 matrix, where the row label is a document and the
> > column label is in {spam,notspam}:
> >
> > N_tr =(def) Prob(d = t | d is "r") = \prod_k P(T_k = \phi(t)_k | T is "r")
>
> This looks more or less as if you're applying the transformation function
> \phi to each feature t_k in the document t? But you can only apply this
> function to the whole document t, otherwise you'll get wrong results...
>

Sorry, this was meant to be as succinct as possible, but it is what you
expect. If \phi(t) = U, then \phi(t)_k = U_k where U = (U_1,...,U_|D_5|).

Each element of the matrix N_{tr} has a fixed full document t, and a
fixed label r. I could have written this as

N_{tr} = \prod_k P( G_k = T_k | G is "r"),

where T = \phi(t) = (T_1,...,T_n)

> > The last product is because on D_5, the terms T_k are independent,
> > conditionally on the spam/notspam class. I'm only defining these
> > matrices so I don't have to write long formulas.
> >
> > If the measure Prob on D exists and is a probability, then we must
> > certainly have
> >
> > Prob(d is "s") = \sum_t Prob(d is "s"| d = t) Prob(d = t)
>
> What exactly is it you're postulating here?

I am saying: if a probability measure Prob exists on D, then such a
measure follows all the usual rules of probability theory. The above is
sometimes called the law of total probability, and has the form

Prob(A) = \sum_i Prob(A|B_i)Prob(B_i).

This equation is true for all probabilities on D, whenever the B_i
form an exhaustive partition of D. This is the case with the choice
B_t = {d = t}.
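
For anyone who wants to check the identity numerically, here is a
minimal Python sketch. Everything in it is an assumption made up for
illustration: the three documents t0..t2, the joint probabilities, and
the names joint, p_doc, p_label, total and expanded. It only shows that
the law of total probability with B_t = {d = t}, written in terms of
the matrices M and N quoted above, recovers the class probabilities
for one arbitrary probability on a tiny D.

```python
from fractions import Fraction as F

# Hypothetical toy joint distribution Prob(d = t, d is "r") over three
# documents and two classes. The numbers are invented and sum to 1.
joint = {
    ("t0", "spam"): F(3, 10), ("t0", "notspam"): F(1, 10),
    ("t1", "spam"): F(1, 10), ("t1", "notspam"): F(2, 10),
    ("t2", "spam"): F(1, 10), ("t2", "notspam"): F(2, 10),
}
docs = ["t0", "t1", "t2"]
labels = ["spam", "notspam"]

# Marginals: Prob(d = t) and Prob(d is "r").
p_doc = {t: sum(joint[t, r] for r in labels) for t in docs}
p_label = {r: sum(joint[t, r] for t in docs) for r in labels}

# M_st = Prob(d is "s" | d = t) and N_tr = Prob(d = t | d is "r").
M = {(s, t): joint[t, s] / p_doc[t] for s in labels for t in docs}
N = {(t, r): joint[t, r] / p_label[r] for t in docs for r in labels}

# Law of total probability with the partition B_t = {d = t}:
# Prob(d is "s") = sum_t Prob(d is "s" | d = t) Prob(d = t).
total = {s: sum(M[s, t] * p_doc[t] for t in docs) for s in labels}

# Same sum, with Prob(d = t) itself expanded over the classes r.
expanded = {
    s: sum(M[s, t] * N[t, r] * p_label[r] for t in docs for r in labels)
    for s in labels
}

assert total == p_label
assert expanded == p_label
print(total["spam"], total["notspam"])
```

Exact fractions are used instead of floats so that the equalities hold
exactly rather than up to rounding.
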
> >
> > = \sum_{t,r} Prob(d is "s"| d = t)Prob(d = t|d is "r")Prob(d is "r")
> > v = MNv
> >
> > where v is the column vector v = (Prob(d is "spam"), Prob(d is "notspam"))^T
>
> What's T here? The set of features T_k in the document T = \phi(t) ?
>

Oops, that T is for "transpose", because v is a column vector and I can't
write columns in an email. Sorry, you can probably ignore that.

I'm also using a latex convention that _ indicates a subscript, ^ indicates
a superscript. Writing mathematical notation in emails is terribly
inefficient.

--
Laird Breyer.