Hi,
let me first tell you that waffles is an astonishing piece of software. It took me only about a day to reach the point where waffles_learn provides me with a random forest to classify some data.
I read the usage information and noticed that (for random forests) you basically have two options: the number of trees and the samples. What I am currently calling is:
waffles_learn train stuff.arff randomforest 100 -samples 12 > tree.out
My stuff.arff file contains only 6 data-sets with 137 attributes each (this is just a small set for testing though; the real set I will use later will be much, much larger).
Now, I was under the impression that the "-samples" parameter controls how many attributes are randomly chosen in order to create one tree, so with -samples 12 I expected the trees to reach a depth of 12. But the depth of a single tree never reaches 12; the maximum I encountered was 4.
I can give two possible explanations for this and I would like to know which one is correct :D
1. The tree depth of a single tree in the forest is so small because the tree already perfectly separates all the data-sets at that depth (simply because there are only very few data-sets, which are easily separated).
2. The -samples parameter does not control the maximum tree depth.
If the correct answer is 2: is there some way of controlling how "deep" a single tree in the random forest is allowed to grow?
Oliver
PS: Keep up the great work!
Thanks!
The -samples parameter does not control the maximum tree depth. I think what you want is:
waffles_learn train stuff.arff bag 100 decisiontree -random 1 -maxlevels 12 end
The -maxlevels parameter on the decisiontree model limits the depth of the tree. Another option is to tell each tree to stop dividing when it chops the data down to fewer than n points. For example,
waffles_learn train stuff.arff bag 100 decisiontree -random 1 -leafthresh 20 end
Here's what the -samples parameter does: Each time the tree makes a division, it chooses k random attributes, measures how much dividing on each attribute would reduce entropy in the labels, and then divides on whichever of those k attributes reduces entropy the most. The -samples parameter sets the value for k. Usually, I leave k at 1. Bigger values make it more robust to irrelevant attributes, but they reduce diversity in the ensemble and make training take longer.
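Here is a rough Python sketch of that split-selection step, just to illustrate the idea (a simplification, not the actual Waffles code; it assumes the data is a list of rows with discrete attribute values):

import math
import random

def entropy(labels):
    # Shannon entropy of a list of class labels.
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def pick_split_attribute(rows, labels, num_attrs, k):
    # Choose k random candidate attributes, then keep the one whose
    # split leaves the least weighted entropy in the labels, i.e. the
    # one with the greatest information gain.
    best_attr, best_remainder = None, float("inf")
    for attr in random.sample(range(num_attrs), k):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[attr], []).append(y)
        remainder = sum(len(g) / len(labels) * entropy(g)
                        for g in groups.values())
        if remainder < best_remainder:
            best_attr, best_remainder = attr, remainder
    return best_attr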
Hi Mike,
thank you very much for your detailed answer. I experimented a bit more and have another question that I cannot seem to answer.
If I do not limit maxlevels or leafthresh, how does waffles_learn determine how "deep" a single tree in the random forest has to be? Does it keep splitting until a definitive answer is found in all leaves of the tree?
That is correct. Random forest grows its trees until the labels are homogeneous.
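Continuing the sketch from my previous reply, that stopping rule would sit at the top of the recursive tree builder, something like this (again an illustration, not the actual Waffles code):

def build_tree(rows, labels, num_attrs, k):
    # Stop when the labels are homogeneous: every row at this node
    # has the same class, so the leaf can answer definitively.
    if len(set(labels)) == 1:
        return ("leaf", labels[0])
    attr = pick_split_attribute(rows, labels, num_attrs, k)
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr], []).append((row, y))
    if len(partitions) == 1:
        # The sampled attribute cannot separate these rows; fall back
        # to a majority-vote leaf rather than recursing forever.
        return ("leaf", max(set(labels), key=labels.count))
    # One child per observed value of the chosen attribute.
    return ("split", attr, {
        value: build_tree([r for r, _ in pairs],
                          [y for _, y in pairs], num_attrs, k)
        for value, pairs in partitions.items()
    })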
Mike, it seems that you are the only person within my reach who has proper knowledge of all this machine learning. I hope you don't mind me asking yet another question. Assume I am calling
waffles_learn train stuff.arff randomforest [TREES] > tree.out
Then, the only parameter that I need to figure out how to set is TREES. Is there some clever way to find out what the optimal number of TREES is?
I assume that this is not an easy question -- and probably related to what "optimal" means.
The idea is to get a good correspondence between the classes given in stuff.arff and a second set of data, stuff2.arff, which contains basically the same classes but was created using different "problems". Here is a quick explanation of how the .arff files are created.
I have a large collection of Boolean formulas in conjunctive normal form (CNF), and an algorithm that computes tons of attributes related to these formulas (like the number of variables, the number of clauses, the clause/variable ratio, all sorts of information on the clause graph...). The CNF formulas are separated into classes (basically, there is a generator for CNF formulas for all sorts of problems, like uniform random 3-SAT, edge matching,...). Then I translate the attribute vectors for all the CNF formulas into an .arff file: the training set stuff.arff. Now I can create a second set of different CNF formulas using the same generators. The resulting CNF classes are the same, but the formulas themselves are not. I compute the attribute vectors for this second set, and they go into the test set stuff2.arff.
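To make this concrete, here is a simplified Python sketch of what such an attribute extractor could look like (the attributes and parsing here are a toy version for illustration, not my actual tool):

def cnf_attributes(path):
    # Parse a DIMACS CNF file and compute a few simple attributes:
    # number of variables, number of clauses, clause/variable ratio,
    # and average clause length.
    num_vars = num_clauses = literals = 0
    with open(path) as f:
        for line in f:
            if line.startswith("c"):
                continue  # comment line
            if line.startswith("p cnf"):
                _, _, v, c = line.split()
                num_vars, num_clauses = int(v), int(c)
            else:
                # Count the literals in this clause (clauses end with 0).
                literals += len([t for t in line.split() if t != "0"])
    ratio = num_clauses / num_vars if num_vars else 0.0
    avg_len = literals / num_clauses if num_clauses else 0.0
    return [num_vars, num_clauses, ratio, avg_len]

# Each formula becomes one comma-separated data row in the .arff file,
# labeled with the class of the generator that produced it, e.g.:
# print(",".join(map(str, cnf_attributes("formula.cnf"))) + ",uniform-random-3sat")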
Given that I have both files available, what do I need to do to determine TREES such that training the random forest on stuff.arff results in "optimal" (correct) detection of the classes in stuff2.arff?
Hmmm... the post is quite long. I hope you don't get annoyed.
Oliver
PS: I have linked to Waffles on my website. It is really cool. Is there a way to donate?
Crossvalidation is a good way to estimate how well a learner will generalize. Example:
waffles_learn crossvalidate stuff.arff randomforest [TREES]
This will divide stuff.arff into n folds. It will train on n-1 of the folds and test on the one fold that was withheld. It repeats this process until all n folds have been tested, and reports the average accuracy. The typical way to tune parameters is to run crossvalidation with several candidate parameter values, and finally choose the one that produced the best score. If stuff2.arff came from the same source as stuff.arff, then the crossvalidation score you obtain using only stuff.arff is usually a good predictor of how well it will do on stuff2.arff.
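For example, a small script could call waffles_learn once per candidate forest size and compare the reported scores (a hypothetical Python sketch; how you read off the score depends on the exact report format):

import subprocess

# Try several candidate forest sizes; keep whichever report shows
# the best crossvalidation accuracy.
for trees in [10, 30, 100, 300]:
    result = subprocess.run(
        ["waffles_learn", "crossvalidate", "stuff.arff",
         "randomforest", str(trees)],
        capture_output=True, text=True)
    print("TREES =", trees)
    print(result.stdout)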
I once attempted to automate this whole process with the autotune feature. Example:
waffles_learn autotune stuff.arff decisiontree
Unfortunately, I have been busy, so I never hooked up the autotune feature to the randomforest algorithm.
You can also use crossvalidation to compare multiple algorithms. For example, you might try
waffles_learn crossvalidate stuff.arff bmc 100 decisiontree -random 1 end
If it scores higher than randomforest, then it is probably a better algorithm for your data. If it scores lower, that should increase your confidence that randomforest is the right model for this data.
Thanks for the link. Promotion is always helpful. I don't really want monetary donations, but you can pay it forward by declining to make some of your future works proprietary.
Hi Mike,
thanks so much once again. This is exactly what I was looking for.
Since you do not want any money, I will promote your work at the next conference that I attend.
Since my projects are always open source, people will always be able to study what I did. In this case, they can see how Waffles is used to learn a random forest to classify CNF formulas. I will post a link here as soon as the work is finished.
Hi Mike,
as promised, I hereby provide the link to the Random Forest Generator (RandForGen) that is used to learn a random forest for CNF formula classification.
https://www.gableske.net/randforgen