GNaiveBayes + GAutoFilter help

  • Peter Figliozzi - 2014-12-19

    I'm comparing the output of GNaiveBayes with that of the e1071 Naive Bayes classifier in R, using the same training data and queries. In this case my input vectors are 2-D: (PC1, PC2). It seems the Waffles version changes its output (the likelihood) only at large intervals:

    Query            P_good (Waffles)   P_good (R)
    (-80.5, -10.5)   0.545192           0.552
    (-80.5, -11.5)   0.545192           0.560
    (-80.5, -12.5)   0.545192           0.568
    (-80.5, -13.5)   0.545192           0.577
    (-80.5, -14.5)   0.642217           0.586

    Is this a consequence of GNaiveBayes or the GAutoFilter I am using in front of it? Is it possible to get finer-grained behavior like in R?


    Last edit: Peter Figliozzi 2014-12-19
  • Mike Gashler - 2014-12-20

    Short answer: yes, GAutoFilter discretizes continuous features for GNaiveBayes.

    Long answer: naive Bayes is designed for categorical features. There are two common ways to make it work with continuous features:

    1. Assume that, for every class, the values of each feature follow a Normal distribution, and use the Normal PDF to calculate probabilities. This approach responds continuously to small changes in the features, but it is less accurate when that assumption about the feature distributions does not hold. (A sketch of this approach follows below.)

    2. Discretize the continuous training data into buckets and treat it as categorical data. This approach can model any distribution, as long as you have plenty of training data, but it will behave as you have observed when the training data is not huge.
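
    To make approach (1) concrete, here is a minimal standalone sketch of the Gaussian scoring step. The names here (Gaussian, logNormalPdf, logScore) are mine for illustration; this is not existing Waffles code:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Per-class, per-feature Normal distribution parameters.
    struct Gaussian { double mean; double var; };

    // log N(x; mean, var): the Normal PDF evaluated in log space for
    // numerical stability. 6.2831... is 2*pi.
    double logNormalPdf(double x, const Gaussian& g)
    {
        double d = x - g.mean;
        return -0.5 * (std::log(6.283185307179586 * g.var) + d * d / g.var);
    }

    // Naive Bayes log-score for one class:
    //   log P(c) + sum over features i of log P(x_i | c).
    // Classify by picking the class with the largest score.
    double logScore(const std::vector<double>& x, double logPrior,
                    const std::vector<Gaussian>& featureStats)
    {
        double s = logPrior;
        for(std::size_t i = 0; i < x.size(); i++)
            s += logNormalPdf(x[i], featureStats[i]);
        return s;
    }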

    In general, naive Bayes is a very poor learning algorithm, but it has the nice property of being very fast. Consequently, it is primarily used in applications where huge amounts of high-dimensional training data are available, and long training times are unacceptable. (For example, it was used for many years in spam filters.) In such contexts, the large amount of training data compensates for the poor learning algorithm. So, assuming it will be used in such contexts, I thought it made sense to go with the discretization approach in Waffles.

    I assume that R decided to go with the "Normal" approach. Maybe they know something I don't. It wouldn't be very much work to implement it, but I'm not yet convinced it would really add any value either.

  • Peter Figliozzi - 2014-12-22

    For my current application, Naive Bayes is beating out everything else I've tried. There are 100-200 training vectors, all 2D, somewhat normally distributed. Basically the opposite of what you'd imagined. :) I understand what needs to be done here mathematically. I don't have a clear picture of how it should fit into the current Waffles design. (A subclass of GAutoFilter?)

  • Mike Gashler - 2014-12-22

    Interesting. Empirical results certainly trump my pompous theoretical speculations!

    In src/GClasses/GNaiveBayes.h, if you change line 89 from

    virtual bool canImplicitlyHandleContinuousFeatures() { return false; }
    

    to

    virtual bool canImplicitlyHandleContinuousFeatures() { return true; }
    

    then GAutoFilter will no longer automatically discretize continuous features for naive Bayes. (Actually, you can just remove that line altogether because it overrides a method in a parent class that returns true.) Then, you would want to modify the code in src/GClasses/GNaiveBayes.cpp so it actually can handle continuous features.
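
    For the training side, you would accumulate a per-class mean and variance for each continuous feature. Here is a rough sketch using Welford's method for the running statistics (again, illustrative names only, not existing Waffles code):

    #include <cstddef>

    // Running mean and variance of one continuous feature for one class,
    // updated one sample at a time with Welford's method.
    struct RunningStats
    {
        std::size_t n;
        double mean;
        double m2; // sum of squared deviations from the current mean

        RunningStats() : n(0), mean(0.0), m2(0.0) {}

        void add(double x)
        {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        double variance() const
        {
            // Fall back to unit variance when there are too few samples.
            return n > 1 ? m2 / (n - 1) : 1.0;
        }
    };

    At prediction time, plug those statistics into the Normal PDF and add the result into the class's log-score, as in the earlier sketch.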
