Comparing learnt and standard retrieval

  • George Paltoglou

I am trying to examine whether some new features I've implemented are effective for retrieval. For this, I am comparing a standard retrieval algorithm against a learnt one that uses the new features (in addition to the score produced by the standard retrieval algorithm).

If I understand the process correctly, for the learnt approach I need to first run a standard retrieval algorithm, save the returned documents, calculate the values of the new features I've designed for those documents, cross-reference those documents with the qrels of the dataset to produce the RankLib datafile, and then run some of the LETOR algorithms provided.
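    The steps above could be sketched roughly as follows. This is only an illustration: the file names, the shape of the extra feature values, and the "unjudged means non-relevant" assumption are mine, not part of RankLib itself. The output follows the SVMlight-style format RankLib reads (`<label> qid:<qid> 1:<v1> 2:<v2> ... # <docid>`).

    ```python
    # Hypothetical sketch: build a RankLib-format datafile from a baseline
    # retrieval run, the new features, and a TREC-style qrels file.

    def load_qrels(path):
        """Parse a TREC qrels file: one 'qid iter docid label' per line."""
        qrels = {}
        with open(path) as f:
            for line in f:
                qid, _, docid, label = line.split()
                qrels[(qid, docid)] = int(label)
        return qrels

    def write_ranklib_file(run, qrels, features, out_path):
        """run: list of (qid, docid, baseline_score) from the standard algorithm;
        features: dict mapping (qid, docid) to a list of new feature values."""
        with open(out_path, "w") as out:
            for qid, docid, score in run:
                # Documents not in the qrels are treated as non-relevant (label 0).
                label = qrels.get((qid, docid), 0)
                # Feature 1 is the baseline score; the new features follow.
                feats = [score] + features[(qid, docid)]
                cols = " ".join(f"{i}:{v}" for i, v in enumerate(feats, start=1))
                out.write(f"{label} qid:{qid} {cols} # {docid}\n")
    ```

    Note that this is exactly where the problem described below arises: only documents present in `run` ever make it into the datafile.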

    The problem with this process is that the learnt approach will be evaluated using a subset of the relevant documents that are available, since any relevant document that wasn't retrieved by the standard retrieval algorithm won't be used in the LETOR process.

    Am I missing something? Is there a step that alleviates the problem? For precision-based metrics this may not be an issue, but for MAP or recall-oriented ones the comparison becomes invalid.

    Any advice would be appreciated. Thanks for the help.

    Kind regards,

  • Van Dang

    Van Dang - 2013-09-25

    You are absolutely right about the relevant documents that are not present in the training/test data. Those who use RankLib on standard LTR datasets usually don't have to worry about this problem because the LTR task is set up to compare one algorithm to another. From this perspective, the comparison is fair as long as they all use the same training/test data. This is, in fact, the standard practice.

    Getting back to your case where you generate the training/test data yourself, I assume you have the "complete" qrels file. What you have to do is:

    • Make sure it's in TREC format. Also make sure the relevance label starts from 0 (which indicates non-relevant).
    • Add "-qrel your-qrels-file" to every RankLib command that you usually run

    That will make sure the calculation of MAP/recall and NDCG takes into account those additional relevant documents. Watch out for the early output messages: it should say "external relevance judgment loaded" or something like that.
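    A quick way to catch format problems before handing the file to "-qrel" is a small sanity check along these lines (the function name and the strictness of the checks are my own; RankLib does not ship this):

    ```python
    # Hypothetical pre-flight check for a qrels file: TREC format means four
    # whitespace-separated columns (qid, iteration, docid, label), and the
    # relevance labels should be non-negative integers with 0 = non-relevant.

    def check_qrels(path):
        """Raise ValueError on a malformed line; return the sorted set of labels."""
        labels = set()
        with open(path) as f:
            for n, line in enumerate(f, start=1):
                parts = line.split()
                if len(parts) != 4:
                    raise ValueError(f"line {n}: expected 4 columns, got {len(parts)}")
                label = int(parts[3])
                if label < 0:
                    raise ValueError(f"line {n}: negative relevance label {label}")
                labels.add(label)
        return sorted(labels)
    ```

    If the returned label set does not include 0, the file likely uses a shifted labelling scheme and should be remapped first.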

  • George Paltoglou

    Thanks! Superb advice!

  • George Paltoglou

    A related question: if I am doing a held-out run (training on one dataset and testing on another), which qrels will the '-qrel' option refer to? The one used for training or the one for testing? Ideally, one could use the complete qrels for both training and testing.

    If I create a single qrel file containing all the relevant documents for all queries, will that help in using the full qrels for both?


    • Van Dang

      Van Dang - 2013-10-01

      It's best to think of it this way: whenever "-qrel" is specified, the calculation of MAP and NDCG will take that qrel file into account (other measures are by nature unaffected by "-qrel"). As a result, if you do:

      java -jar RankLib.jar -train ... -test ... -qrel ...

      "-qrel" will apply to both training and testing. On the other hand, if you do:

      java -jar RankLib.jar -train ... -save model.txt -qrel ...
      java -jar RankLib.jar -test ... -load model.txt

      then "-qrel" will only apply to the training phase, but not to the testing phase. If you want "-qrel" to apply to testing as well, you need to add "-qrel" again.

      "If I create a single qrel file containing all the relevant documents for all queries, will that help in using the full qrels for both?"

      Yes, you only need a single qrel file. RankLib only loads the portion of the qrel file that matches the queries in the training data ;)
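      In other words, the behaviour is as if the combined qrels were filtered down to the query IDs present in each split, something like this (an illustration of the idea, not RankLib's actual code):

      ```python
      # Selecting from a combined qrels dict only the judgments whose query
      # IDs appear in a given data split (train or test).

      def qrels_for_split(qrels, split_qids):
          """qrels: dict (qid, docid) -> label; split_qids: query IDs in the split."""
          qids = set(split_qids)
          return {k: v for k, v in qrels.items() if k[0] in qids}
      ```

      So one file covering all queries is safe to pass to "-qrel" in both the training and the testing command.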

      Last edit: Van Dang 2013-10-01
