Re: [Crm114-discuss] Re: continuation of current model in the crm114

Hi Laird,

On Tue, 30 Mar 2004, Laird Breyer wrote:
> (*) you might want to read the exchange I'm having with Christian
> Siefkes, who is trying to get around the difficulties in a promising
> way. He builds a parallel universe D_5 in which features are
> independent exactly as you assume, but (I'm fairly certain) he's in trouble
> when he wants to bring back the result into the real universe D.

Maybe you're right and it's not possible to consider the probabilities
calculated in D_5 (after the SBPH transformation \phi) as probabilities on
D (prior to this transformation, i.e. with single-word features only). But
it might be enough to stay in D_5 and use the probabilities as calculated
there. After all, each document has a unique representation in D_5 just as
in D, and even the untransformed feature vector in D is not the "real
document", because prior to that you decided how to tokenize, what to
discard (e.g. whitespace), etc.

OK, the documents we'll encounter in D_5 are sparse in the sense that
there are documents we'll never see because they cannot be generated by
the SBPH transformation (\phi(D) is a strict subset of D_5). So the Naive
Bayes classifier thinks it is operating on D_5 while in truth it is
operating on the subset \phi(D) only. That's bad, but I think it's not
entirely unreasonable to hope that the effects of this wrong assumption
will roughly cancel each other out, because all classes are affected the
same way.
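
To make the sparseness concrete, here is a minimal Python sketch of an
SBPH-style transformation. It assumes a sliding window of five tokens in
which the newest token is always kept and each older token is either kept
or replaced by a skip marker. The names sbph_features and SKIP, and the
handling of the first few tokens, are assumptions made for illustration,
not CRM114's actual implementation.

```python
from itertools import product

SKIP = "<skip>"  # hypothetical placeholder for a skipped window position


def sbph_features(tokens, window=5):
    """Return SBPH-style features for a token list (illustrative sketch).

    For each token, look back at up to window-1 older tokens and emit one
    feature per keep/skip combination of those older tokens. The newest
    token is always kept.
    """
    feats = []
    for i, newest in enumerate(tokens):
        older = tokens[max(0, i - window + 1):i]
        for mask in product((False, True), repeat=len(older)):
            kept = [t if keep else SKIP for t, keep in zip(older, mask)]
            feats.append(tuple(kept + [newest]))
    return feats


feats = sbph_features("a b c".split())
print(len(feats))  # 1 + 2 + 4 = 7 features
print(feats)
```

In this sketch, any token sequence that yields the feature ('a', 'b')
also yields (SKIP, 'b'). A point of D_5 that contains the first feature
but not the second is therefore outside \phi(D), although a classifier
that treats the features as independent assigns it nonzero probability.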

Here is a paper that seems to confirm this hope:
http://www.intellektik.informatik.tu-darmstadt.de/~tom/IJCAI01/Rish.pdf .
They state that Naive Bayes works best if _either_ the features are really
independent (that's obvious) _or_ if the dependencies between features are
functional (deterministic). Now the feature dependencies introduced by
SBPH are completely deterministic -- so this might help to understand why
CRM can introduce these dependencies and still proceed as if they didn't
exist without suffering a performance drop.
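
A toy illustration of the simplest deterministic dependency, an exact
copy of a feature, may help here. It is not taken from the paper, and the
likelihood values and the function name nb_posterior are made up for the
example. With equal priors, counting the same evidence several times makes
the Naive Bayes posterior more extreme but leaves the winning class
unchanged.

```python
def nb_posterior(likelihoods, priors, copies=1):
    """Naive Bayes posterior when one feature is counted `copies` times.

    likelihoods: P(feature | class) for each class (made-up numbers below)
    priors:      P(class) for each class
    copies:      how many deterministic duplicates of the feature are seen
    """
    scores = {c: priors[c] * likelihoods[c] ** copies for c in priors}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


likelihoods = {"spam": 0.8, "ham": 0.3}  # hypothetical values
priors = {"spam": 0.5, "ham": 0.5}       # equal priors assumed

once = nb_posterior(likelihoods, priors, copies=1)
five = nb_posterior(likelihoods, priors, copies=5)
print(once)
print(five)
```

The posterior for "spam" is overstated in the second case, so the numbers
are not trustworthy as probabilities, yet the decision is the same. With
unequal priors or a mix of duplicated and non-duplicated features the
decision can change, so this shows only why deterministic dependencies
need not hurt accuracy, not that they never do.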

Bye
	Christian

------------ Christian Siefkes -----------------------------------------
|     Email: chr...@si...    |     Web: http://www.siefkes.net/
|  Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
Freedom is being able to make decisions that affect mainly you. Power is
being able to make decisions that affect others more than you. If we
confuse power with freedom, we will fail to uphold real freedom.
          -- Bradley M. Kuhn and Richard M. Stallman, Freedom Or Power?