Re: [Crm114-discuss] Re: continuation: current model in crm114

On Mar 29 2004, Christian Siefkes wrote:

> > Let M be the 2x|D| matrix, where the row label represents
> > {spam,notspam} and the column label represents an enumeration of the
> > documents t \in D.
> >
> > M_st =(def) Prob(d is "s" | d = t) =(def) P(T is "s" | T = \phi(t)),
> >
> > Let N be the |D|x2 matrix, where the row label is a document and the
> > column label is in {spam,notspam}:
> >
> > N_tr =(def) Prob(d = t | d is "r") = \prod_k P(T_k = \phi(t)_k | T is "r")
> 
> This looks more or less as if you're applying the transformation function
> \phi to each feature t_k in the document t? But you can only apply this
> function to the whole document t, otherwise you'll get wrong results...
> 

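Sorry, I was trying to be as succinct as possible, but it is what you
expect: \phi is applied to the whole document t, and the subscript k then
selects a component of the result. If \phi(t) = U, then \phi(t)_k = U_k,
where U = (U_1,...,U_|D_5|).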

Each element of the matrix N_{tr} has a fixed full document t, 
and a fixed label r. I could have written this as

N_{tr} = \prod_k P( G_k = T_k | G is "r"), where T = \phi(t) = (T_1,...,T_n)
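
To make the product concrete, here is a minimal sketch of one entry N_{tr}
for a toy case. Everything in it is an assumption for illustration only:
n = 3 binary features, the made-up table p_feature, and the helper name
n_entry. None of this is taken from crm114.

```python
from math import prod

# Hypothetical conditionals P(G_k = 1 | G is "r") for n = 3 binary features.
p_feature = {
    "spam":    [0.8, 0.6, 0.1],
    "notspam": [0.2, 0.3, 0.7],
}

def n_entry(T, r):
    """N_{tr} = prod_k P(G_k = T_k | G is "r"), with T = phi(t) fixed."""
    return prod(p if T_k == 1 else 1.0 - p
                for T_k, p in zip(T, p_feature[r]))

# phi is applied once to the whole document t; only then is T indexed by k.
T = (1, 0, 1)
print(n_entry(T, "spam"), n_entry(T, "notspam"))
```

For a fixed label r, the entries sum to 1 over all possible feature
vectors T, which is what you would expect of Prob(d = t | d is "r").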

> 
> > The last product is because on D_5, the terms T_k are independent,
> > conditionally on the spam/notspam class. I'm only defining these
> > matrices so I don't have to write long formulas.
> >
> > If the measure Prob on D exists and is a probability, then we must
> > certainly have
> >
> > Prob(d is "s") = \sum_t Prob(d is "s"| d = t) Prob(d = t)
> 
> What exactly is it you're postulating here?

I am saying: if a probability measure Prob exists on D, then such a
measure follows all the usual rules of probability theory. The above is
sometimes called the law of total probability, and has the form

Prob(A) = \sum_i Prob(A|B_i)Prob(B_i).

This equation is true for every probability on D, whenever the B_i form an
exhaustive partition of D. That is the case here with the choice
B_t = {d = t}, one event per document t.
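
As a quick sanity check, here is a minimal sketch of the law of total
probability with the partition B_t = {d = t}. The three-document set and
the joint probabilities are made up for illustration, and the helper names
are hypothetical.

```python
# Hypothetical joint Prob(d = t, d is label) on a toy D with three documents.
joint = {
    ("t1", "spam"): 0.30, ("t1", "notspam"): 0.10,
    ("t2", "spam"): 0.05, ("t2", "notspam"): 0.25,
    ("t3", "spam"): 0.15, ("t3", "notspam"): 0.15,
}
docs = ["t1", "t2", "t3"]

def prob_doc(t):
    """Prob(d = t), i.e. Prob(B_t)."""
    return joint[(t, "spam")] + joint[(t, "notspam")]

def prob_spam_given_doc(t):
    """Prob(d is "spam" | d = t)."""
    return joint[(t, "spam")] / prob_doc(t)

# Left side: Prob(d is "spam") read directly off the joint.
lhs = sum(joint[(t, "spam")] for t in docs)
# Right side: \sum_t Prob(d is "spam" | d = t) Prob(d = t).
rhs = sum(prob_spam_given_doc(t) * prob_doc(t) for t in docs)

print(lhs, rhs)  # the two agree up to floating-point rounding
```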

> 
> 
> >     = \sum_{t,r} Prob(d is "s"| d = t)Prob(d = t|d is "r")Prob(d is "r")
> > v   = MNv
> >
> > where v is the column vector v = (Prob(d is "spam"), Prob(d is "notspam"))^T
> 
> What's T here? The set of features T_k in the document T = \phi(t) ?
> 

Oops, that T is for "transpose", because v is a column vector and I
can't write columns in an email. Sorry, you can probably ignore that. 
I'm also using the LaTeX convention that _ indicates a subscript and ^
a superscript. Writing mathematical notation in emails is terribly
inefficient.
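
Since the column vector is awkward to write here, a minimal sketch in code
may be clearer. It builds M (2x|D|), N (|D|x2) and v from a made-up joint
distribution on three documents and checks v = MNv numerically. The
numbers and the helper names matmul and matvec are assumptions for
illustration only.

```python
labels = ["spam", "notspam"]
docs = ["t1", "t2", "t3"]

# Hypothetical joint Prob(d = t, d is label); any consistent joint would do.
joint = {
    ("t1", "spam"): 0.30, ("t1", "notspam"): 0.10,
    ("t2", "spam"): 0.05, ("t2", "notspam"): 0.25,
    ("t3", "spam"): 0.15, ("t3", "notspam"): 0.15,
}

# v = (Prob(d is "spam"), Prob(d is "notspam"))^T, stored as a flat list.
v = [sum(joint[(t, r)] for t in docs) for r in labels]
p_doc = {t: sum(joint[(t, r)] for r in labels) for t in docs}

# M_st = Prob(d is "s" | d = t): rows are labels, columns are documents.
M = [[joint[(t, s)] / p_doc[t] for t in docs] for s in labels]
# N_tr = Prob(d = t | d is "r"): rows are documents, columns are labels.
N = [[joint[(t, r)] / v[j] for j, r in enumerate(labels)] for t in docs]

def matmul(A, B):
    """Plain-list matrix product A @ B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, x):
    """Plain-list matrix-vector product A @ x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

MNv = matvec(matmul(M, N), v)
print(v, MNv)  # equal up to floating-point rounding
```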

-- 
Laird Breyer.