From: Christian S. <si...@mi...> - 2004-03-30 16:49:34
Hi Laird,

On Tue, 30 Mar 2004, Laird Breyer wrote:
> (*) you might want to read the exchange I'm having with Christian
> Siefkes, who is trying to get around the difficulties in a promising
> way. He builds a parallel universe D_5 in which features are
> independent exactly as you assume, but (I'm fairly certain) he's in trouble
> when he wants to bring back the result into the real universe D.

Maybe you're right and it's not possible to treat the probabilities calculated in D_5 (after the SBPH transformation \phi) as probabilities on D (prior to this transformation, i.e. with single-word features only). But it might be enough to stay in D_5 and use the probabilities as calculated there. After all, each document has a unique representation in D_5 just as in D, and even the untransformed feature vector in D is not the "real document", because prior to that you decided how to tokenize, what to discard (e.g. whitespace) etc.

OK, the documents we'll encounter in D_5 are sparse in the sense that there are documents we'll never see because they cannot be generated by the SBPH transformation (\phi(D) is a strict subset of D_5). So the Naive Bayes classifier thinks it is operating on D_5 while in truth it is operating only on the subset \phi(D). That's bad, but I think it's not entirely unreasonable to hope that the effects of this wrong assumption will roughly cancel each other out, because all classes are affected the same way.

Here is a paper that seems to confirm this hope:
http://www.intellektik.informatik.tu-darmstadt.de/~tom/IJCAI01/Rish.pdf

They state that Naive Bayes works best if _either_ the features are really independent (that's obvious) _or_ the dependencies between features are functional (deterministic). Now the feature dependencies introduced by SBPH are completely deterministic -- so this might help to understand why CRM can introduce these dependencies and still proceed as if they didn't exist without suffering a performance drop.
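To make the "deterministic dependencies" point concrete, here is a rough Python sketch of an SBPH-style expansion. It is a simplification for illustration only, not CRM114's actual code: the function name sbph_features, the "<skip>" placeholder and the string form of the features are all my own assumptions. It only shows that the features are a pure function of the token stream, and that some features can never occur without others, so \phi(D) cannot fill all of D_5.

```python
from itertools import product


def sbph_features(tokens, window=5):
    """Simplified SBPH-style expansion (illustrative, not CRM114's code).

    For each position, take the newest token plus up to `window - 1`
    preceding tokens, and emit one feature per subset of the preceding
    tokens. The newest token is always kept; dropped tokens are
    replaced by the hypothetical placeholder "<skip>".
    """
    features = []
    for i, newest in enumerate(tokens):
        older = tokens[max(0, i - window + 1):i]
        # Every keep/drop pattern over the older tokens: 2**len(older) features.
        for mask in product((False, True), repeat=len(older)):
            kept = [tok if keep else "<skip>" for tok, keep in zip(older, mask)]
            features.append(" ".join(kept + [newest]))
    return features


if __name__ == "__main__":
    for feature in sbph_features(["buy", "cheap", "pills"]):
        print(feature)
```

Whenever the feature "buy cheap" is generated, "<skip> cheap" is generated as well, so a feature vector in D_5 that contains the first but not the second can never be produced by the transformation. With a full window of 5 tokens, each new token contributes 2^4 = 16 features in this sketch.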

Bye
Christian

------------ Christian Siefkes -----------------------------------------
| Email: chr...@si... | Web: http://www.siefkes.net/
| Graduate School in Distributed IS: http://www.wiwi.hu-berlin.de/gkvi/
-------------------- Offline P2P: http://www.leihnetzwerk.de/ ----------
Freedom is being able to make decisions that affect mainly you. Power is
being able to make decisions that affect others more than you. If we
confuse power with freedom, we will fail to uphold real freedom.
        -- Bradley M. Kuhn and Richard M. Stallman, Freedom Or Power?