Hi,
I am getting this error: "cannot undiscretize a distribution" when I try to
calibrate() a GNaiveBayes model.
However, it trains fine and works as expected.
I couldn't figure out why this error arises only during calibrate().
I get the same error when I use predictDistribution on the model.
Please help me figure out the issue.
Naive Bayes inherently operates only on categorical values. Your labels apparently contain one or more continuous attributes, so I would not expect naiveBayes to give very good results with your data. I would recommend using algorithms that are designed for regression.
Well, my implementation of naive Bayes tries its best to work anyway by automatically using a discretization filter so that it can handle continuous values. So, after naive Bayes predicts a categorical label, the discretization filter is asked to "unconvert" that prediction to a continuous label. This step is lossy, because there is not really enough information in a categorical value to specify a precise continuous value, so it just uses the center of the discretization bucket as an estimate. However, if you ask it to predict a full distribution, instead of just a continuous value, you are essentially asking it to make up a variance as well as a mean. That seemed like too much fudging, so I implemented it like this (in GClasses/GTransform.cpp):

    void GDiscretize::untransformToDistribution(const double* pIn, GPrediction* pOut)
    {
        throw Ex("Sorry, cannot undiscretize to a distribution");
    }
Now that I think about it, I suppose it could use the width of the bucket as an estimate for variance. So, if you really want to make this work, you could change that code to:

    void GDiscretize::untransformToDistribution(const double* pIn, GPrediction* pOut)
    {
        if(!m_pMins)
            throw Ex("Train was not called");
        size_t attrCount = before().size();
        for(size_t i = 0; i < attrCount; i++)
        {
            size_t nValues = before().valueCount(i);
            if(nValues > 0)
                pOut[i].makeCategorical()->setSpike(nValues, pIn[i], 0);
            else
                pOut[i].makeNormal()->setMeanAndVariance(
                    ((pIn[i] + 0.5) * m_pRanges[i]) / m_bucketsOut + m_pMins[i],
                    m_pRanges[i] * m_pRanges[i]);
        }
    }
I would not expect this to give very good results, since it is so lossy, but at least it will fix the exception that you were seeing.
Thank you Mike, for the clear explanation.
I tried to trace why Waffles might think the labels are continuous, but I couldn't.
The file I am importing the data from doesn't have continuous labels. I am reading the data into a GMatrix and splitting it into features and labels as follows:
Why does Waffles think the labels are continuous? How do I make sure that they are not treated as continuous? Is there any parameter that I should set or something?
Last edit: Saswat Padhi 2013-10-21
That looks like it should work. Here is some code that will print the number of values in each column in your label matrix. 0 indicates continuous. Values greater than zero indicate categorical.
I may have made some false assumptions in trying to determine what was happening. If you could give a call stack to that exception, that would really tell what is happening. A good way to do this is to put a breakpoint in Ex::setMessage in GClasses/GError.cpp.
Hi,
I tried the snippet and then it gave me
I changed the second line to:
and it worked. I get "0:0" as output, which means the label column is continuous! But why should it be?
The call stack shows:
Last edit: Saswat Padhi 2013-10-21
What if you print the valueCounts for "training"? Are they also continuous?
Are you loading "training" from an ARFF file? If so, what does the header portion (the lines beginning with "@ATTRIBUTE") say?
I think answering these questions should narrow down the problem.
The training file has 50 attributes, all of them continuous, and the 51st column has the label.
I am loading training from a CSV file.
When you load from a CSV file, Waffles must guess the attribute types. It does this by assuming that every attribute is continuous unless one or more values in that column contain a character not in the set {0-9,-,.,e}.
So, solutions include:
1- use ARFF format, which explicitly specifies the metadata,
2- insert an alphabetic character into one of the values in the 51st column,
3- call training.relation()->setAttrValueCount(50, n) to specify the number of categorical values in your label column after you load the data. (Note that this call assumes the values in this column are {0,1,...}. If you use any other values, then unexpected behavior may occur.)
Thanks
And what should the argument to likelihood() of GPrediction.asCategorical() be? It expects a double, but my nominal values might not be doubles.
Edit: I think I should pass a 0-indexed double for a nominal value. But how do I know which value maps to which index in the GMatrix? Can I retrieve it somehow?
Edit2: My labels column has integer (categorical) labels but not 0 to N. I tried passing that integer, it says Out of Range :-/
Last edit: Saswat Padhi 2013-10-22
GArffRelation::findEnumeratedValue will map from a string to the corresponding enumeration value.
If your values are not 0 to N, then solution 3 in my previous post will not work for you. I recommend using

    waffles_transform import mydata.csv > mydata.arff
then edit mydata.arff. Specifically, change "@attribute attr51 real" to explicitly mention the values you use. Example: "@attribute attr51 {2,3,5,7,11,13,19,23,29,31}"
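For concreteness, the edited header of a file like the one in this thread (50 continuous attributes plus the categorical label) would look something like this, using the example values above:

```
@RELATION mydata
@ATTRIBUTE attr1 real
@ATTRIBUTE attr2 real
...
@ATTRIBUTE attr50 real
@ATTRIBUTE attr51 {2,3,5,7,11,13,19,23,29,31}
@DATA
```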
Right, I wrote a bash script to do exactly that, and it works now :-)
As much as I hated it, just to watch it run, I created a map of all the class labels and mapped them to 0 to N.
findEnumeratedValue() would be nicer.
Thanks a lot Mike.
I just had a few more questions:
1] How long should calibrate() take? I have a data set of over 100K points, and calibrate() on a GNaiveBayes model takes a long time.
2] Are ensemble and random forest training multi-threaded? If not, can I make them so?
1] Calibrate uses logistic regression. I usually use "predict" instead of "predictDistribution", so I have little intuition for how long it should take.
2] The latest version of Waffles in our Git repository supports multi-threaded ensembles. However, my implementation does not seem to yield much speed-up. This new feature could greatly benefit from some more attention, but I am currently occupied with other pursuits.