
Why "cannot undiscretize a distribution"?

Help
2013-10-18
2013-10-22
  • Saswat Padhi

    Saswat Padhi - 2013-10-18

    Hi,

    I am getting this error: "cannot undiscretize a distribution" when I try to
    calibrate() a GNaiveBayes model.
    However, it trains alright and works as expected.

    I couldn't figure out why this error arises only during calibrate().

     
  • Saswat Padhi

    Saswat Padhi - 2013-10-18

    I get the same error when I use predictDistribution on the model.

    Please help me figure out the issue.

     
  • Mike Gashler

    Mike Gashler - 2013-10-18

    Naive Bayes inherently operates only on categorical values. Your labels apparently contain one or more continuous attributes, so I would not expect naive Bayes to give very good results with your data. I would recommend using algorithms that are designed for regression.

    Well, my implementation of naive Bayes tries its best to work anyway by automatically using a discretization filter to enable it to work with continuous values. So, after naive Bayes predicts a categorical label, the discretization filter is asked to "unconvert" that prediction to a continuous label. This step is lossy because there is not really enough information in a categorical value to specify a precise continuous value, so it just uses the center of the discretization bucket as an estimate. However, if you ask it to predict a full distribution, instead of just a continuous value, you are essentially asking it to make up a variance as well as a mean. That seemed like too much fudging, so I implemented it like this (in GClasses/GTransform.cpp):

    void GDiscretize::untransformToDistribution(const double* pIn, GPrediction* pOut)
    {
        throw Ex("Sorry, cannot undiscretize to a distribution");
    }
    

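    The bucket-center arithmetic described above can be sketched on its own (the function and variable names here are illustrative, not the Waffles API):

    ```cpp
    #include <cassert>
    #include <cstddef>

    // Sketch of the lossy "undiscretize" step described above: a categorical
    // bucket index is mapped back to the center of its bucket on the original
    // continuous scale. min/range/buckets are illustrative names, not Waffles API.
    double bucketCenter(double bucketIndex, double min, double range, std::size_t buckets)
    {
        // (index + 0.5) picks the midpoint of the bucket; scaling by
        // range/buckets converts bucket units back to continuous units.
        return (bucketIndex + 0.5) * range / buckets + min;
    }

    int main()
    {
        // 10 buckets over [0, 100): bucket 0 covers [0, 10), so its center is 5
        assert(bucketCenter(0, 0.0, 100.0, 10) == 5.0);
        // bucket 9 covers [90, 100), so its center is 95
        assert(bucketCenter(9, 0.0, 100.0, 10) == 95.0);
        return 0;
    }
    ```

    Any value that originally fell anywhere in a bucket comes back as that single center point, which is why the step is lossy.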
    Now that I think about it, I suppose it could use the width of the bucket as an estimate for variance. So, if you really want to make this work, you could change that code to

    void GDiscretize::untransformToDistribution(const double* pIn, GPrediction* pOut)
    {
        if(!m_pMins)
            throw Ex("Train was not called");
        size_t attrCount = before().size();
        for(size_t i = 0; i < attrCount; i++)
        {
            size_t nValues = before().valueCount(i);
            if(nValues > 0)
                pOut[i].makeCategorical()->setSpike(nValues, pIn[i], 0);
            else
                pOut[i].makeNormal()->setMeanAndVariance((((double)pIn[i] + .5) * m_pRanges[i]) / m_bucketsOut + m_pMins[i], m_pRanges[i] * m_pRanges[i]);
        }
    }
    

    I would not expect this to give very good results, since it is so lossy, but at least it will fix the exception that you were seeing.

     
  • Saswat Padhi

    Saswat Padhi - 2013-10-21

    Thank you Mike, for the clear explanation.

    I tried to trace why Waffles might be treating the labels as continuous, but I couldn't.

    The file that I am importing the data from doesn't have continuous labels. I am reading the data into a GMatrix and splitting it into features and labels as follows:

        GMatrix *features = training->cloneSub(0, 0, training->rows(), training->cols() - 1);
        Holder<GMatrix> holdFeaturesMatrix(features);
        GMatrix *labels = training->cloneSub(0, training->cols() - 1, training->rows(), 1);
        Holder<GMatrix> holdLabelsMatrix(labels);
    

    Why does Waffles think the labels are continuous? How do I make sure that they are not treated as continuous? Is there any parameter that I should set or something?

     

    Last edit: Saswat Padhi 2013-10-21
  • Mike Gashler

    Mike Gashler - 2013-10-21

    That looks like it should work. Here is some code that will print the number of values in each column in your label matrix. 0 indicates continuous. Values greater than zero indicate categorical.

    for(size_t i = 0; i < labels->cols(); i++){
        cout << i << ":" << labels->relation().valueCount(i) << "\n";
    }
    

    I may have made some false assumptions in trying to determine what was happening. If you could provide a call stack for that exception, that would really show what is happening. A good way to get one is to put a breakpoint in Ex::setMessage in GClasses/GError.cpp.

     
  • Saswat Padhi

    Saswat Padhi - 2013-10-21

    Hi,

    I tried the snippet and then it gave me

    error: ‘GClasses::sp_relation’ has no member named ‘valueCount’
    

    I changed the second line to:

        cout << i << ":" << labels->relation()->valueCount(i) << "\n";
    

    and it worked. I get "0:0" as output, which means the labels column is a continuous one! But why should it be?

    The call stack shows:

    0:0
    
    Breakpoint 1, GClasses::Ex::setMessage (this=0x7d97540, message=...) at GError.cpp:37
    37      if(g_exceptionExpected)
    (gdb) bt
    #0  GClasses::Ex::setMessage (this=0x7d97540, message=...) at GError.cpp:37
    #1  0x0000000000407e74 in GClasses::Ex::Ex (this=0x7d97540, a=...) at GError.h:49
    #2  0x00000000004a600e in GClasses::GDiscretize::untransformToDistribution (this=0x3193b00, pIn=0x7d974a0, pOut=0x7fffffffd4c0) at GTransform.cpp:1565
    #3  0x000000000041ea4c in GClasses::GSupervisedLearner::predictDistribution (this=0x7fffffffd7a0, pIn=0x7985b0, pOut=0x7fffffffd4c0) at GLearner.cpp:880
    #4  0x000000000041e2a4 in GClasses::GSupervisedLearner::calibrate (this=0x7fffffffd7a0, features=..., labels=...) at GLearner.cpp:819
    #5  0x0000000000403d84 in main (argc=<optimized out>, argv=0x7fffffffd918) at src/trainer_NB.cpp:30
    (gdb) 
    
     

    Last edit: Saswat Padhi 2013-10-21
  • Mike Gashler

    Mike Gashler - 2013-10-21

    What if you print the valueCounts for "training"? Are they also continuous?
    Are you loading "training" from an ARFF file? If so, what does the header portion (the lines beginning with "@ATTRIBUTE") say?

    I think answering these questions should narrow down the problem.

     
  • Saswat Padhi

    Saswat Padhi - 2013-10-22

    The training file has 50 attributes, all of them continuous, and the 51st column has the label.
    I am loading training from a CSV file.

     
  • Mike Gashler

    Mike Gashler - 2013-10-22

    When you load from a CSV file, Waffles must guess the attribute types. It does this by assuming that every attribute is continuous unless one or more values in its column contain a character not in the set {0-9,-,.,e}.
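
    That guessing rule can be sketched roughly like this (a simplified, self-contained illustration of the heuristic, not the actual Waffles CSV parser):

    ```cpp
    #include <cassert>
    #include <string>
    #include <vector>

    // Rough sketch of the type-guessing rule described above: a column is
    // treated as continuous unless some value in it contains a character
    // outside {0-9, '-', '.', 'e'}. Illustrative only, not the Waffles parser.
    bool looksContinuous(const std::vector<std::string>& column)
    {
        for (const std::string& value : column)
            for (char c : value)
                if (!(c >= '0' && c <= '9') && c != '-' && c != '.' && c != 'e')
                    return false; // any other character forces categorical
        return true;
    }

    int main()
    {
        assert(looksContinuous({"3.14", "-2e5", "42"})); // numeric: continuous
        assert(!looksContinuous({"cat", "dog"}));        // letters: categorical
        assert(looksContinuous({"2", "3", "5"}));        // bare integers still look continuous
        return 0;
    }
    ```

    This also explains the behavior you saw: a column of bare integer class labels contains only numeric characters, so it is guessed to be continuous.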

    So, solutions include:
    1- use ARFF format, which explicitly specifies the metadata,
    2- insert an alphabetic character into one of the values in the 51st column,
    3- call training.relation()->setAttrValueCount(50, n) to specify the number of categorical values in your label column after you load the data. (Note that this call assumes the values in this column are {0,1,...}. If you use any other values, then unexpected behavior may occur.)

     
  • Saswat Padhi

    Saswat Padhi - 2013-10-22

    Thanks

    And what should the argument to likelihood() of GPrediction.asCategorical() be? It expects a double, but my nominal values might not be doubles.

    Edit: I think I should pass a 0-indexed double for a nominal value. But how do I know which value maps to which index in the GMatrix? Can I retrieve it somehow?

    Edit2: My labels column has integer (categorical) labels but not 0 to N. I tried passing that integer, it says Out of Range :-/

     

    Last edit: Saswat Padhi 2013-10-22
  • Mike Gashler

    Mike Gashler - 2013-10-22

    GArffRelation::findEnumeratedValue will map from a string to the corresponding enumeration value.
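
    As a rough sketch of what such a lookup does (plain standard C++, not the actual Waffles implementation), each declared value maps to the 0-based index of its position in the attribute declaration:

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Sketch of a string-to-enumeration lookup: each declared value maps to
    // the 0-based index of its position in the ARFF attribute declaration.
    // Illustrative only, not the Waffles implementation.
    std::size_t findValueIndex(const std::vector<std::string>& declaredValues,
                               const std::string& value)
    {
        for (std::size_t i = 0; i < declaredValues.size(); i++)
            if (declaredValues[i] == value)
                return i;
        throw std::out_of_range("value not declared for this attribute");
    }

    int main()
    {
        // e.g. "@attribute attr51 {2,3,5,7}" declares these values in order
        std::vector<std::string> declared = {"2", "3", "5", "7"};
        assert(findValueIndex(declared, "2") == 0); // "2" is the first declared value
        assert(findValueIndex(declared, "7") == 3); // "7" is the fourth
        return 0;
    }
    ```

    This is essentially the label-to-index map built by hand in the post below; findEnumeratedValue should do the same lookup for you.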

    If your values are not 0 to N, then solution 3 in my previous post will not work for you. I recommend using

    waffles_transform import mydata.csv > mydata.arff
    

    then edit mydata.arff. Specifically, change "@attribute attr51 real" to explicitly mention the values you use. Example: "@attribute attr51 {2,3,5,7,11,13,19,23,29,31}"

     
  • Saswat Padhi

    Saswat Padhi - 2013-10-22

    Right .. I wrote a bash script to do exactly that. And it works now :-)

    As much as I hated it, just to watch it run, I created a map with all the class labels and then mapped them to 0 to N.

    findEnumeratedValue() would be nicer.
    Thanks a lot Mike.

    I just had a few more questions:

    1] How long should calibrate() take? I have a data set of over 100K points, and calibrating a GNaiveBayes model takes a long time.

    2] Are ensemble and random forest training multi-threaded? If not, can I make them so?

     
  • Mike Gashler

    Mike Gashler - 2013-10-22

    1] Calibrate uses logistic regression. I usually use "predict" instead of "predictDistribution", so I have little intuition for how long it should take.

    2] The latest version of Waffles in our Git repository supports multi-threaded ensembles. However, my implementation does not seem to yield much speed-up. This new feature could greatly benefit from some more attention, but I am currently occupied with other pursuits.

     
