The Lemur Project Wiki

Search engine and data mining applications and ClueWeb datasets.

Brought to you by: cammiemw, david_fisher, gregorybrooks, jamiecallan, sm-harding

RankLib File Format

RankLib is a learning-to-rank library; it does not perform document retrieval. If you want to use RankLib with your own data rather than a provided data set, you will need to generate the query set, relevance information, and feature values yourself. Note also that some metrics used in generating an algorithm's model require a complete set of relevance judgments, not only judgments for the documents that appear in your ranked result lists.
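To illustrate why complete judgments can matter, here is a minimal sketch using average precision, which divides by the total number of relevant documents for a query. This is a generic illustration, not RankLib's implementation; the function name `average_precision` and the sample labels are assumptions made for the example.

```python
def average_precision(ranked_labels, total_relevant):
    """Average precision for one query.

    ranked_labels: relevance labels in ranked order (label > 0 means relevant).
    total_relevant: number of relevant documents known for the query,
                    which may exceed the number present in the ranked list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label > 0:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / total_relevant if total_relevant else 0.0


# Hypothetical ranked list: relevant documents at ranks 1 and 3.
ranked = [1, 0, 1, 0]

# Counting only the relevant documents that appear in the list.
ap_list_only = average_precision(ranked, total_relevant=2)

# Assuming the full judgments record 4 relevant documents for the query.
ap_complete = average_precision(ranked, total_relevant=4)

print(ap_list_only, ap_complete)
```

With the same ranked list, the score drops when the judgments include relevant documents that were never retrieved, so a metric computed from the list alone would be overstated.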

The basic procedure is to start from an initial set of rankings for a set of queries, complete with feature values. You will need to come up with the set of features and their values on your own. Features can be few or numerous. Some common ones are document tf-idf, BM25 scores, document length, number of matching query terms, number of query terms in important sections of a document (such as the title or web page anchor text), and in-links or out-links.
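As a rough sketch of turning your own feature values into input lines, the helper below writes one document in the `<target> qid:<qid> <feature>:<value> ... # <info>` layout specified further down this page. The function name `format_line`, the feature numbering, and the sample values are assumptions for illustration only.

```python
def format_line(target, qid, features, info=None):
    """Build one line: <target> qid:<qid> <feature>:<value> ... # <info>.

    features: dict mapping positive integer feature ids to numeric values.
    """
    parts = [str(target), f"qid:{qid}"]
    # Write features in increasing feature-id order.
    parts += [f"{fid}:{float(value)}" for fid, value in sorted(features.items())]
    line = " ".join(parts)
    if info is not None:
        line += f" # {info}"
    return line


# Hypothetical feature ids: 1 = BM25 score, 2 = document length,
# 3 = number of matching query terms.
print(format_line(2, 7, {1: 12.5, 2: 340, 3: 2}, info="doc-42"))
# prints: 2 qid:7 1:12.5 2:340.0 3:2.0 # doc-42
```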

Using RankLib is much easier if you have access to one or more of the available data sets published by various research groups rather than generating your own data.

The file format for the training data (and for testing/validation data) is the same as for SVM-Rank. This is also the format used in LETOR datasets. Each line represents one training example and has the following format:

<line> .=. <target> qid:<qid> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. <positive integer>
<qid> .=. <positive integer>
<feature> .=. <positive integer>
<value> .=. <float>
<info> .=. <string>
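The grammar above can be read with a few lines of code. The following is a minimal sketch of a parser for a single line, assuming well-formed input; `parse_line` is a hypothetical helper, not part of RankLib.

```python
def parse_line(line):
    """Parse '<target> qid:<qid> <feature>:<value> ... # <info>'.

    Returns (target, qid, features, info), where features maps
    integer feature ids to float values.
    """
    # Split off the optional trailing comment after '#'.
    body, _, info = line.partition("#")
    tokens = body.split()
    if len(tokens) < 2 or not tokens[1].startswith("qid:"):
        raise ValueError(f"expected '<target> qid:<qid> ...', got: {line!r}")

    target = int(tokens[0])
    qid = int(tokens[1][len("qid:"):])

    features = {}
    for token in tokens[2:]:
        fid, _, value = token.partition(":")
        features[int(fid)] = float(value)

    return target, qid, features, info.strip()


target, qid, features, info = parse_line("3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A")
print(target, qid, features, info)
```

The text after `#` is returned separately so that it can be kept as a document label without affecting the feature values.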

Here's an example (taken from the SVM-Rank website). Note that everything after "#" is ignored.

3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B 
1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D  
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A  
2 qid:2 1:1 2:0 3:1 4:0.4 5:0 # 2B 
1 qid:2 1:0 2:0 3:1 4:0.1 5:0 # 2C 
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2D  
2 qid:3 1:0 2:0 3:1 4:0.1 5:1 # 3A 
3 qid:3 1:1 2:1 3:0 4:0.3 5:0 # 3B 
4 qid:3 1:1 2:0 3:0 4:0.4 5:1 # 3C 
1 qid:3 1:0 2:1 3:1 4:0.5 5:0 # 3D
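To make the structure of the example concrete, the sketch below groups its lines by `qid` and lists each query's documents from highest to lowest target value. The helper name `group_by_query` is an assumption for illustration, and the text after "#" is used here only as a document label.

```python
from collections import defaultdict

EXAMPLE = """\
3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B
1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A
2 qid:2 1:1 2:0 3:1 4:0.4 5:0 # 2B
1 qid:2 1:0 2:0 3:1 4:0.1 5:0 # 2C
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2D
2 qid:3 1:0 2:0 3:1 4:0.1 5:1 # 3A
3 qid:3 1:1 2:1 3:0 4:0.3 5:0 # 3B
4 qid:3 1:1 2:0 3:0 4:0.4 5:1 # 3C
1 qid:3 1:0 2:1 3:1 4:0.5 5:0 # 3D
"""


def group_by_query(text):
    """Return {qid: [(target, label), ...]} in file order."""
    groups = defaultdict(list)
    for line in text.splitlines():
        body, _, info = line.partition("#")
        tokens = body.split()
        if not tokens:
            continue  # skip blank lines
        target = int(tokens[0])
        qid = int(tokens[1].split(":", 1)[1])
        groups[qid].append((target, info.strip()))
    return dict(groups)


groups = group_by_query(EXAMPLE)
for qid, docs in groups.items():
    # Stable sort: documents with equal targets keep their file order.
    ranked = sorted(docs, key=lambda doc: doc[0], reverse=True)
    print(qid, [label for _, label in ranked])
# prints:
# 1 ['1A', '1B', '1C', '1D']
# 2 ['2B', '2A', '2C', '2D']
# 3 ['3C', '3B', '3A', '3D']
```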
