I am building a LambdaMART model using k-fold cross-validation. I was hoping to get some advice on interpreting these results. Which model should I choose as the best performing? Fold 3 has the highest Train score but Fold 4 has the highest Test score. Where are the validation scores? What is the difference between test and validation?
Are these scores any good? What are some things I can tweak to improve these scores?
Validation is really part of the training step. It helps indicate whether the resulting model has been over-trained or over-fitted. The validation data doesn't directly change any model parameters, though it can be used to decide when to stop training or which intermediate model to keep.
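To make the validation role concrete, here is a small Python sketch (the scores are entirely made up, not RankLib output): the validation set picks the stopping point or model version, but it never feeds into the parameter updates themselves.

```python
# Hypothetical validation scores per boosting round (made-up numbers).
# Training keeps improving on the training set, but validation NDCG
# peaks and then declines once the model starts over-fitting.
val_ndcg_by_round = [0.21, 0.24, 0.26, 0.265, 0.262, 0.258]

# Early stopping keeps the round where validation NDCG peaked;
# no model parameter is changed by the validation data itself.
best_round = max(range(len(val_ndcg_by_round)),
                 key=lambda i: val_ndcg_by_round[i])
print(best_round)
print(val_ndcg_by_round[best_round])
```

The test set, by contrast, is only looked at once, after training and validation are finished, to estimate performance on unseen data.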
Your training and test results are extremely close. Usually the training metrics are noticeably higher than the validation or test values. In the output, the model with the best test result is probably the one to choose. Certainly focus on the model versions that do better than the average.
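Using the fold scores from the summary table in this thread, picking by held-out test score rather than train score looks like this:

```python
# Fold scores copied from the kCV summary table in this thread:
# fold -> (train NDCG@12, test NDCG@12)
folds = {
    1: (0.281,  0.252),
    2: (0.268,  0.2579),
    3: (0.2971, 0.2474),
    4: (0.2967, 0.277),
    5: (0.2906, 0.2638),
}

# Choose by held-out (test) performance, not by train score.
best_fold = max(folds, key=lambda f: folds[f][1])

# Folds that beat the average test score are the ones worth focusing on.
avg_test = sum(t for _, t in folds.values()) / len(folds)
above_avg = [f for f, (_, t) in folds.items() if t > avg_test]

print(best_fold)   # fold with the highest test NDCG@12
print(above_avg)   # folds above the average test score
```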
Try individual validation and test runs on your kCV models. You can pass the -validate and -test arguments to RankLib if you have separate validation and test data. If not, you'll have to split the data using the -tvs and -tts arguments. The kCV process should give you good results, since it randomizes so much of the training, validation and test process.
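For the exact semantics of -tvs and -tts, check RankLib's usage text; as a rough Python sketch of the idea (the fractions and function name here are hypothetical, and the split is done by query so that all documents for one query stay together):

```python
import random

def tvs_tts_split(queries, tvs=0.8, tts=0.7, seed=42):
    """Hypothetical illustration of a train/validation/test split:
    first carve off a test set (tts = fraction kept for train+validation),
    then split the remainder into train vs validation (tvs = train
    fraction). Splitting is by query id, never by individual document."""
    rng = random.Random(seed)
    q = list(queries)
    rng.shuffle(q)
    n_trainval = round(len(q) * tts)
    trainval, test = q[:n_trainval], q[n_trainval:]
    n_train = round(len(trainval) * tvs)
    return trainval[:n_train], trainval[n_train:], test

train, val, test = tvs_tts_split(range(100))
print(len(train), len(val), len(test))
```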
I have to wonder how good your features are, given such close training and test performance. Can you play around with adding or dropping features? It seems the models either aren't learning very well, or are learning as well as they can with the features they have, but those features aren't very effective.
Would dropping features help? My reasoning is that if a feature isn't very helpful, it doesn't do any harm; it's just not a good indicator of relevance, and the model accounts for that by essentially ignoring it or giving it very little weight.
I'm not certain about your reasoning. These algorithms make use of min, max and average within feature values, so a "bad" feature could end up skewing those statistics a bit. You can look to see what sort of range of values features have. If they are all pretty close to one another and don't really change much between label values, maybe the feature isn't that useful, but I can't really say.
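A quick way to eyeball this is to compare the per-label mean of each feature: if the means barely move across relevance labels, the feature probably isn't separating anything. A toy Python sketch with made-up values for two hypothetical features:

```python
from statistics import mean, pstdev

# Toy data: feature values grouped by relevance label.
# feature_a barely changes across labels; feature_b separates them.
by_label = {
    0: {"feature_a": [0.50, 0.51, 0.49], "feature_b": [0.10, 0.20, 0.15]},
    1: {"feature_a": [0.50, 0.52, 0.51], "feature_b": [0.60, 0.70, 0.65]},
    2: {"feature_a": [0.49, 0.50, 0.51], "feature_b": [1.10, 1.20, 1.15]},
}

def spread_across_labels(feature):
    """Std-dev of the per-label means: a value near zero means the
    feature looks the same regardless of the relevance label."""
    return pstdev(mean(vals[feature]) for vals in by_label.values())

print(spread_across_labels("feature_a"))  # near zero: weak indicator
print(spread_across_labels("feature_b"))  # clearly nonzero: informative
```

This is only a coarse screen; a feature with little marginal spread can still matter in combination with others, so treat it as a hint, not a verdict.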
Have there been others who did LTR on the data you are using? If so, you can look to see what sort of values they get for performance metrics.
I might add that your values aren't necessarily "bad". I recall seeing results on different data, with different algorithms, where values in the 0.3 - 0.4 range were considered "good". So you might have a result where the training was genuinely good (and closely matched by the test results), but the ranking task itself is just difficult.
Fold-5 model saved to: lambdamart_model.xml
Summary:
NDCG@12 |  Train |   Test
Fold 1  | 0.2810 | 0.2520
Fold 2  | 0.2680 | 0.2579
Fold 3  | 0.2971 | 0.2474
Fold 4  | 0.2967 | 0.2770
Fold 5  | 0.2906 | 0.2638
Avg.    | 0.2867 | 0.2596
Total   |        | 0.2596