I trained a model on training data full of tied ranks and got bad results. Then I broke the ties in the training set according to some extra information, and I got better results.
Now I wonder: are the tied training samples discarded during training, or not? What is the reason for this improvement?
Note that my tied instances are not necessarily similar. Their feature vectors might differ.
I'm using Random Forest with the pairwise strategy option.
Any idea?
Which bagging ranker (the rtype parameter) did you use? The default is MART (ranker 0), but you might use LambdaMART if you're doing pairwise comparisons.
The algorithms look over all the features individually (and in random sets), keeping track of the max, min, and unique values, along with the variance and deviation of the samples for each feature.
Furthermore, ensemble weights are based on the label values of samples, skipping over pairs that have matching labels. This can be a problem if you have a large number of identical labels in the training set, since it becomes more difficult to generalize over the features.
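To make the "skipping over pairs" point concrete, here is a minimal sketch (not RankLib's actual implementation) of how a pairwise ranker typically enumerates training pairs from label values; the function name and structure are my own for illustration. Tied labels produce no usable pair, so a training set dominated by ties yields very few pairs to learn from:

```python
from itertools import combinations

def pairwise_training_pairs(labels):
    """Generate index pairs (i, j) where labels[i] > labels[j].

    A pair with equal labels carries no ordering information for a
    pairwise objective, so it is skipped entirely.
    """
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] > labels[j]:
            pairs.append((i, j))
        elif labels[j] > labels[i]:
            pairs.append((j, i))
        # labels[i] == labels[j]: tied, contributes nothing
    return pairs

# With heavy ties, only the pairs involving the distinct label survive:
print(pairwise_training_pairs([1, 1, 1, 2]))
```

This is why breaking ties can help: each tie you break converts a discarded pair into a usable training signal.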
If a feature has the same, or very nearly the same, values across many samples, it's not especially useful for learning. The artificially low variance/deviation for that feature may help produce a model that underfits the data.
If your model evaluated OK against the training data but poorly against the test/validation data, it likely overfit; if it did poorly against both training and test/validation data, underfitting is a possibility and you might consider adding more or better features.
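If you want to check your own training set for this, a quick variance screen is easy to write. This is just an illustrative sketch (plain Python, hypothetical threshold value), not anything built into the library:

```python
def low_variance_features(X, threshold=1e-8):
    """Return column indices whose variance falls below threshold.

    X: list of feature vectors (rows = samples).  Near-constant
    columns give a tree learner almost nothing to split on.
    """
    n = len(X)
    flagged = []
    for col in range(len(X[0])):
        values = [row[col] for row in X]
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n
        if var < threshold:
            flagged.append(col)
    return flagged

# Column 0 is constant across samples, so it gets flagged:
print(low_variance_features([[1.0, 5.0], [1.0, 6.0], [1.0, 7.0]]))
```

Dropping or replacing flagged features before training is often cheaper than hoping the ensemble routes around them.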