
k-fold Cross Validation

RankLib
Brian Yee
2017-08-21
  • Brian Yee

    Brian Yee - 2017-08-21

    If I'm reading the documentation for the -kcv param correctly, I can supply just one training_data.txt and it will be split appropriately for training and testing/validation. Is that correct? In that case, I would not have to define a separate test_data.txt or validation_data.txt.

    Is there any advantage to defining my own separate train/test/validate data sets? Why would one ever do that?

     
  • Lemur Project

    Lemur Project - 2017-08-22

    You can use the -kcv parameter for k-fold cross-validation, in which the data set is broken up into k parts, with one part reserved for testing and the others used for training. The parts are rotated so that every part is eventually used for both training and testing.
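
    For example, a five-fold run on a single file might look something like this (the LambdaMART ranker, the NDCG@10 metric, and the file/directory names are only illustrative; -kcvmd and -kcvmn tell RankLib where to save the k models and what to name them):

        java -jar RankLib.jar -train training_data.txt -ranker 6 -kcv 5 \
             -kcvmd models/ -kcvmn lm -metric2t NDCG@10 -metric2T NDCG@10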

    If you have one large data set, you can also use the -tvs and -tts arguments to split it into validation and test sets.
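
    For example (again, the ranker, metric, and file names are just placeholders, and it is worth checking RankLib's help output for exactly how the -tvs and -tts ratios are interpreted):

        java -jar RankLib.jar -train training_data.txt -ranker 6 \
             -tvs 0.8 -tts 0.75 -metric2t NDCG@10 -save mymodel.txt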

    Typically you want the bulk of your data for training, with less for validation and/or testing. Split values I've typically seen are 50/25/25 (train/validate/test), 60/20/20, and 70/30 (train/test).

    I don't know if there are any benefits to having separate train/validation/test data sets. I suppose if you have a set that was developed with great effort and great attention to detail (relevance judgments), you would want to keep it as a separate test set.

    Some voices feel that nothing is really gained by cross-validation, since it is no more precise in estimating prediction accuracy than random splits of all the data for building and testing the model.

     
  • Brian Yee

    Brian Yee - 2017-08-25

    So k-fold cross-validation is working and I end up saving k models, but aside from the console output, my application has no way to know which model performs best. Is there a way to save only the best model?

     
  • Lemur Project

    Lemur Project - 2017-08-28

    The kcv output should include a summary of each model created, in terms of the selected evaluation metric, although depending on the ranking algorithm used it can be difficult to read. Save the output so you can sort through it; the summary should be at the bottom.

    You can use the Evaluator to directly compare models, preferably against a baseline (which could be one of your models). This will give you direct comparisons, complete with statistical tests on the significance of the differences.

    This is a case where it might be good to have a separate test set that has never been seen by the models during their creation. Use it for the comparison tests. You can compare evaluation metrics on a per query basis if desired.
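
    As a rough sketch, assuming a held-out test_data.txt and two models saved from the kcv run (the file names and the output directory are placeholders), you could write per-query metric files with the Evaluator via -idv and then run the Analyzer on that directory against a baseline:

        java -jar RankLib.jar -load models/fold1.model -test test_data.txt -metric2T NDCG@10 -idv output/fold1.ndcg.txt
        java -jar RankLib.jar -load models/fold2.model -test test_data.txt -metric2T NDCG@10 -idv output/fold2.ndcg.txt
        java -cp RankLib.jar ciir.umass.edu.eval.Analyzer -all output/ -base fold1.ndcg.txt

    The Analyzer then reports how each run in the directory compares to the baseline, including significance tests on the differences.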

    You can create hard copies of the training, validation, and test sets used for the kcv runs using the FeatureManager.
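
    A minimal sketch, assuming the FeatureManager flags shown in the wiki's examples (the input file, output directory, and number of folds are placeholders):

        java -cp RankLib.jar ciir.umass.edu.features.FeatureManager -input training_data.txt -output folds/ -shuffle -k 5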

    See the RankLib Wiki page (https://sourceforge.net/p/lemur/wiki/RankLib%20How%20to%20use/) for an example.

     
