I found that several algorithms exist for building decision trees.
Do you have one or several of them in waffles_learn?
Which specific algorithm is used by waffles_learn decisiontree?
Ideally, we'd like to add switches so that the user could specify which
technique is used. Currently, however, we only support two techniques: random
splits, and splitting to maximize information gain, as in ID3, which is the
default. Since ID3 doesn't really specify how continuous attributes are
handled, we use an approach similar to C4.5 for those. Also, when missing
values exist in the data, we replace them lazily with the most common value
in the data, although I might change that to use a stochastically selected
value, since that seems to yield better results in bagging ensembles. If you'd
like to help us implement alternative methods of building the trees, that
would be a welcome contribution.
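To make that concrete, here is a rough sketch in Python (not Waffles source
code) of the three ideas described above: ID3-style information gain for
nominal splits, a C4.5-like midpoint-threshold search for continuous
attributes, and replacing missing values with the most common observed value.
The function names and signatures are illustrative only, not the library's API.

    # Conceptual sketch of the splitting and imputation ideas discussed above.
    import math
    from collections import Counter

    def entropy(labels):
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def info_gain_nominal(rows, labels, attr):
        # ID3-style gain: split on every distinct value of a nominal attribute.
        base = entropy(labels)
        groups = {}
        for row, lab in zip(rows, labels):
            groups.setdefault(row[attr], []).append(lab)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return base - remainder

    def best_threshold_continuous(rows, labels, attr):
        # C4.5-style handling of a continuous attribute: try thresholds midway
        # between consecutive distinct sorted values, keep the highest-gain one.
        pairs = sorted(zip((r[attr] for r in rows), labels))
        base = entropy(labels)
        best_gain, best_thresh = 0.0, None
        for i in range(1, len(pairs)):
            if pairs[i - 1][0] == pairs[i][0]:
                continue
            thresh = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [lab for val, lab in pairs if val <= thresh]
            right = [lab for val, lab in pairs if val > thresh]
            remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if base - remainder > best_gain:
                best_gain, best_thresh = base - remainder, thresh
        return best_thresh, best_gain

    def impute_with_mode(rows, attr):
        # Replace missing values (None) with the most common observed value.
        observed = [r[attr] for r in rows if r[attr] is not None]
        mode = Counter(observed).most_common(1)[0][0]
        for r in rows:
            if r[attr] is None:
                r[attr] = mode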
I think the main reason I have not taken time to implement alternative decision
tree algorithms is that I have not been convinced they would be significantly
better. Bagging, however, seems to improve accuracy with decision trees quite
dramatically, so I think it may be a better use of time to work on improved
ensemble techniques rather than to spend a lot of time fine-tuning the decision
tree algorithm itself.
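For context on why bagging helps, here is a minimal sketch of bootstrap
aggregating: each tree is trained on a bootstrap sample of the data drawn with
replacement, and predictions are combined by majority vote. The train_tree and
predict_tree callables are hypothetical placeholders for whatever decision-tree
learner is plugged in; this is not how Waffles implements it internally.

    # Minimal bagging sketch: bootstrap samples plus majority-vote prediction.
    import random
    from collections import Counter

    def bag(rows, labels, train_tree, n_trees=30, seed=0):
        # Train n_trees models, each on a bootstrap sample drawn with replacement.
        rng = random.Random(seed)
        n = len(rows)
        models = []
        for _ in range(n_trees):
            idx = [rng.randrange(n) for _ in range(n)]
            models.append(train_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx]))
        return models

    def predict(models, predict_tree, row):
        # Combine the ensemble's predictions by majority vote.
        votes = Counter(predict_tree(m, row) for m in models)
        return votes.most_common(1)[0][0]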