GNaiveBayes + GAutoFilter help

  • Peter Figliozzi - 2014-12-19

    I'm comparing the output of GNaiveBayes with that of the e1071 Naive Bayes classifier in R, using the same training data and queries. In this case my input vectors are 2-D: (PC1, PC2). It seems the Waffles version changes its output (the likelihood) only at large intervals:

    Query            P_good (Waffles)   P_good (R)
    (-80.5, -10.5)   0.545192           0.552
    (-80.5, -11.5)   0.545192           0.560
    (-80.5, -12.5)   0.545192           0.568
    (-80.5, -13.5)   0.545192           0.577
    (-80.5, -14.5)   0.642217           0.586

    Is this a consequence of GNaiveBayes or the GAutoFilter I am using in front of it? Is it possible to get finer-grained behavior like in R?


    Last edit: Peter Figliozzi 2014-12-19
  • Mike Gashler - 2014-12-20

    Short answer: yes, GAutoFilter discretizes continuous features for GNaiveBayes.

    Long answer: naive Bayes is designed for categorical features. There are two common ways to make it work with continuous features:

    1. Assume that, for every class, the values of each feature follow a Normal distribution, and use the Normal PDF to calculate probabilities. This approach responds continuously to small changes in the features, but it is less accurate when that assumption about the feature distributions does not hold. (A sketch of this approach follows below.)

    2. Discretize the continuous training data into buckets and treat it as categorical data. This approach can model any distribution, as long as you have plenty of training data, but it will behave as you have observed when the training data is not huge.
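
    To make approach (1) concrete, here is a minimal standalone sketch of the Gaussian scoring step. The names here (Gaussian, logNormalPdf, logScore) are mine for illustration; this is not existing Waffles code:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Per-class, per-feature Normal distribution parameters.
    struct Gaussian { double mean; double var; };

    // log N(x; mean, var): the Normal PDF evaluated in log space for
    // numerical stability. 6.2831... is 2*pi.
    double logNormalPdf(double x, const Gaussian& g)
    {
        double d = x - g.mean;
        return -0.5 * (std::log(6.283185307179586 * g.var) + d * d / g.var);
    }

    // Naive Bayes log-score for one class:
    //   log P(c) + sum over features i of log P(x_i | c).
    // Classify by picking the class with the largest score.
    double logScore(const std::vector<double>& x, double logPrior,
                    const std::vector<Gaussian>& featureStats)
    {
        double s = logPrior;
        for(std::size_t i = 0; i < x.size(); i++)
            s += logNormalPdf(x[i], featureStats[i]);
        return s;
    }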

    In general, naive Bayes is a very poor learning algorithm, but it has the nice property of being very fast. Consequently, it is primarily used in applications where huge amounts of high-dimensional training data are available, and long training times are unacceptable. (For example, it was used for many years in spam filters.) In such contexts, the large amount of training data compensates for the poor learning algorithm. So, assuming it will be used in such contexts, I thought it made sense to go with the discretization approach in Waffles.

    I assume that R decided to go with the "Normal" approach. Maybe they know something I don't. It wouldn't be very much work to implement it, but I'm not yet convinced it would really add any value either.

  • Peter Figliozzi - 2014-12-22

    For my current application, Naive Bayes is beating out everything else I've tried. There are 100-200 training vectors, all 2D, somewhat normally distributed. Basically the opposite of what you'd imagined. :) I understand what needs to be done here mathematically. I don't have a clear picture of how it should fit into the current Waffles design. (A subclass of GAutoFilter?)

  • Mike Gashler - 2014-12-22

    Interesting. Empirical results certainly trump my pompous theoretical speculations!

    In src/GClasses/GNaiveBayes.h, if you change line 89 from

    virtual bool canImplicitlyHandleContinuousFeatures() { return false; }
    

    to

    virtual bool canImplicitlyHandleContinuousFeatures() { return true; }
    

    then GAutoFilter will no longer automatically discretize continuous features for naive Bayes. (Actually, you can just remove that line altogether because it overrides a method in a parent class that returns true.) Then, you would want to modify the code in src/GClasses/GNaiveBayes.cpp so it actually can handle continuous features.
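
    For the training side, you would accumulate a per-class mean and variance for each continuous feature. Here is a rough sketch using Welford's method for the running statistics (again, illustrative names only, not existing Waffles code):

    #include <cstddef>

    // Running mean and variance of one continuous feature for one class,
    // updated one sample at a time with Welford's method.
    struct RunningStats
    {
        std::size_t n;
        double mean;
        double m2; // sum of squared deviations from the current mean

        RunningStats() : n(0), mean(0.0), m2(0.0) {}

        void add(double x)
        {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        double variance() const
        {
            // Fall back to unit variance when there are too few samples.
            return n > 1 ? m2 / (n - 1) : 1.0;
        }
    };

    At prediction time, plug those statistics into the Normal PDF and add the result into the class's log-score, as in the earlier sketch.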
