I'm running a test on a baseline model I created for regression (continuous
values for output). The output is 237.21. What does that mean? It doesn't look
like the average error or the sum of all the errors. Could you please advise?
Also, I'm not really clear on which algorithms can be used for regression and
which can be used for classification. Could you clarify this as well?
Thanks.
Sorry for the delayed reply. (SourceForge used to notify me when the forum was
updated, but I haven't been receiving any notifications lately.)
Baseline is probably the poorest of all learners. (This is by design, as it is
used as a "baseline" for comparison.) For regression, it always predicts the
centroid value in the training set, and for classification it always predicts
the most common label.
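To make that concrete, here is a minimal sketch of that behavior in plain C++
(my own illustration, not the actual Waffles source): memorize the mean of the
training labels for regression, or the most frequent training label for
classification, and return that same value for every query.

#include <iostream>
#include <map>
#include <numeric>
#include <string>
#include <vector>

// Baseline for regression: predict the mean (centroid) of the training labels.
double baselineRegression(const std::vector<double>& labels)
{
    double sum = std::accumulate(labels.begin(), labels.end(), 0.0);
    return sum / labels.size();
}

// Baseline for classification: predict the most common training label.
std::string baselineClassification(const std::vector<std::string>& labels)
{
    std::map<std::string, int> counts;
    std::string best;
    int bestCount = 0;
    for(const std::string& l : labels)
    {
        if(++counts[l] > bestCount)
        {
            bestCount = counts[l];
            best = l;
        }
    }
    return best;
}

int main()
{
    std::vector<double> y = {138.3, 102.9, 139.3}; // hypothetical training labels
    std::cout << baselineRegression(y) << "\n";    // same prediction for every test row
    return 0;
}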
The best models for regression are GKNN (if you have very few feature
dimensions) and GNeuralNet (if you have a lot of feature dimensions).
GNeuralNet is the best choice if you know what you're doing. Unfortunately,
neural nets have a lot of parameters. If you want something easy to use, you
might try a bagging ensemble of decision trees or of mean-margins trees (use
GBag with GDecisionTree or GMeanMarginsTree).
You may notice that some of the models (like GNeuralNet) expect all values to
be continuous, and other models (like GNaiveBayes) expect all values to be
nominal. You can use the GFilter class to solve this problem. For example, if
you wrap GNeuralNet in a filter with the GNominalToCat transform, then it can
operate on any type of data. Likewise, if you wrap GNaiveBayes in a filter
with the GDiscretize transform, then it can operate on both nominal and
continuous data. Thus, the GFilter class can make all of my models suitable
for doing classification or regression. (GKNN and GDecisionTree implicitly
handle both types without using a filter.)
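For intuition, here is a schematic sketch of the filter idea in plain C++ (a
hypothetical interface for illustration, not the actual GFilter API): the
wrapper transforms each row into the representation the inner learner expects,
both at training time and at prediction time.

#include <cmath>
#include <vector>

// Hypothetical learner interface, for illustration only (not the Waffles API).
struct Learner
{
    virtual void train(const std::vector<std::vector<double>>& rows,
                       const std::vector<double>& labels) = 0;
    virtual double predict(const std::vector<double>& row) = 0;
    virtual ~Learner() {}
};

// A filter wraps an inner learner and converts every row on the way in.
// This one discretizes continuous values into bin indexes so that a
// nominal-only learner can consume them (the role GDiscretize plays);
// a nominal-to-continuous transform would do the reverse for GNeuralNet.
struct DiscretizeFilter : public Learner
{
    Learner& inner;
    double binWidth;
    DiscretizeFilter(Learner& l, double w) : inner(l), binWidth(w) {}

    std::vector<double> transform(const std::vector<double>& row) const
    {
        std::vector<double> out(row.size());
        for(size_t i = 0; i < row.size(); i++)
            out[i] = std::floor(row[i] / binWidth); // bin index, treated as nominal
        return out;
    }

    void train(const std::vector<std::vector<double>>& rows,
               const std::vector<double>& labels)
    {
        std::vector<std::vector<double>> transformed;
        for(size_t i = 0; i < rows.size(); i++)
            transformed.push_back(transform(rows[i]));
        inner.train(transformed, labels);
    }

    double predict(const std::vector<double>& row)
    {
        return inner.predict(transform(row));
    }
};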
Here is a command-line example of how to test a neural net on a regression
problem:
waffles_learn crossvalidate mydata.arff nominaltocat neuralnet -addlayer 32
Here is a command-line example of how to test a bagging ensemble of
mean-margins trees on a regression problem:
waffles_learn crossvalidate mydata.arff bag 50 nominaltocat meanmarginstree
end
Thanks for the reply; however, I'm really looking for how the error is
calculated, specifically. For example, here are my actual values and
corresponding predicted values from a baseline regression:
Actual value Prediction
138.3206093 119.2596578
102.8708374 119.2596578
139.3020113 119.2596578
141.8271008 119.2596578
139.6765169 119.2596578
98.11396617 119.2596578
108.4569783 119.2596578
116.8517775 119.2596578
141.6157746 119.2596578
112.674653 119.2596578
119.3599495 119.2596578
119.9591068 119.2596578
127.3965653 119.2596578
127.0143511 119.2596578
81.17963669 119.2596578
90.81633395 119.2596578
115.0208661 119.2596578
115.6759148 119.2596578
132.5177658 119.2596578
102.8526855 119.2596578
129.059482 119.2596578
120.6570755 119.2596578
123.2283525 119.2596578
113.7573259 119.2596578
153.1416136 119.2596578
121.3513454 119.2596578
112.929818 119.2596578
114.9028338 119.2596578
126.4990625 119.2596578
125.491234 119.2596578
111.4105838 119.2596578
117.5874593 119.2596578
114.6671417 119.2596578
91.29407153 119.2596578
126.6072217 119.2596578
And I'm getting a reported error of 237.21418379122. Where does this number
come from? I cannot figure out how it relates to the predicted values vs. the
actual values. Here are the commands:
237.21418379122
Thanks.
The waffles_learn tool reports mean-squared-error (MSE) by default. In LaTeX,
the formula would be: \frac{1}{n}\sum_{i=1}^{n}(target_i - prediction_i)^2.
(The square root of this value is closely related to Euclidean distance.) So,
in this case, the prediction is about 15.4 away from the ideal, on average
(\sqrt{237.214} \approx 15.4).
Another common metric is mean-absolute-error (MAE), which is the average of
the absolute difference between the target and prediction. Perhaps this is the
value you were expecting. I chose to report MSE because it has nicer
properties, and because it is more commonly used in my field. Now that you
mention it, I should probably add a switch so you can specify how you want the
error to be reported.
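To tie this back to the numbers posted above, here is a small self-contained
C++ sketch (my own illustration, not waffles_learn's code) that recomputes
both metrics from those actual/predicted pairs. Its MSE should reproduce the
reported 237.214..., and the MAE shows what the other convention would have
printed.

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    // Actual values from the table above; the baseline predicted
    // 119.2596578 for every row.
    std::vector<double> actual = {
        138.3206093, 102.8708374, 139.3020113, 141.8271008, 139.6765169,
        98.11396617, 108.4569783, 116.8517775, 141.6157746, 112.674653,
        119.3599495, 119.9591068, 127.3965653, 127.0143511, 81.17963669,
        90.81633395, 115.0208661, 115.6759148, 132.5177658, 102.8526855,
        129.059482, 120.6570755, 123.2283525, 113.7573259, 153.1416136,
        121.3513454, 112.929818, 114.9028338, 126.4990625, 125.491234,
        111.4105838, 117.5874593, 114.6671417, 91.29407153, 126.6072217};
    double prediction = 119.2596578;

    double sse = 0.0; // sum of squared errors
    double sae = 0.0; // sum of absolute errors
    for(size_t i = 0; i < actual.size(); i++)
    {
        double err = actual[i] - prediction;
        sse += err * err;
        sae += std::fabs(err);
    }
    printf("MSE  = %f\n", sse / actual.size());            // about 237.214
    printf("RMSE = %f\n", std::sqrt(sse / actual.size())); // about 15.4
    printf("MAE  = %f\n", sae / actual.size());
    return 0;
}

Since the prediction is constant here, the MSE is simply the average squared
deviation of the actual values around 119.2596578.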
That makes sense, thanks.