
JSON data format for RankLib

RankLib
Itai Gabay
2018-05-22
2018-05-30
  • Itai Gabay

    Itai Gabay - 2018-05-22

    Hello,

    I am working on an IR project, involving Yahoo! Webscope L6 collection dataset which is given in a JSON format.
    The dataset has a Q&A structure, built in the following way:

    {
      "main catagory": "Education & Reference",
      "question": "What is the meaning filipino physicists?",
      "n_best_answers": ["Filipino physicist are physicists in the Philippines or physicists of Filipino decent..."],
      "answer": "Filipino physicist are physicists in the...",
      "id": "530045"
    }

    where "question" is the query, and "answer" is the best ranked answer.

    I would like to serve this dataset to the ranklib LambdaMART algorithm. My questions are:

    1. How can I serve my JSON Yahoo dataset to RankLib, which uses its own unique data format? Is there any way to convert my JSON data into the format the RankLib library expects?
    2. How can I specify the basic structure of the dataset to the learning algorithm? I want to let it know that the best answer to the query is the one in the "answer" field.
    3. After training the model and stabilizing it, I want to test it by entering a user query as input and seeing what results are returned. Where can I enter a specific query for the trained model? Is there a command line for it?

    Thanks in advance to all helpers.

     
  • Lemur Project

    Lemur Project - 2018-05-22

    RankLib is a learning to rank library, not a retrieval library. You need to provide a list of documents that have been retrieved via some means along with useful features and their values for the re-ranking. RankLib doesn't do the retrieval.

    See https://sourceforge.net/p/lemur/wiki/RankLib%20How%20to%20use/ for a usage overview.

    Ideally, one would like to incorporate the ranking algorithm developed into the actual retrieval, but that is beyond the scope of RankLib.

     
    • Itai Gabay

      Itai Gabay - 2018-05-22

      I know it's a basic question, but as a first-time user I don't get where and how the algorithm eventually gets the text itself to learn from. The file format in the tutorial does not indicate the actual data we want the algorithm to learn from.

       
  • Lemur Project

    Lemur Project - 2018-05-23

    The numbers in the input file are the relevance label, the query ID and, mostly, feature values (plus a comment if you want). It is up to the user to come up with good features and their values.

    One can have a few or quite a few features defined. Some common ones are document tf-idf values, BM25 scores, document length, number of matching query terms, matching query terms in some parts of a document (e.g. the title), and more. You'll have to look around the web to see what sort of features others have used in LTR research, or come up with some of your own.

    The ranked list is obtained from submitting queries to some retrieval system, such as Indri or Galago or something else. You then submit those documents, with your calculated feature values to the ranking algorithm of your choice. Hopefully, you end up with an improved ranking from your model.

    The retrieval and the RankLib model [re]ranking are entirely separate processes.
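    To make the format described above concrete, here is a minimal, hypothetical Python sketch that turns a (query, candidate answer) pair into one RankLib-format line. The two features (matching-term count and answer length) are toy placeholders, not anything RankLib itself provides; a real setup would use tf-idf, BM25, etc. computed by a retrieval system.

    ```python
    # Sketch only: emit one "<target> qid:<qid> <feature>:<value> ... # comment" line.
    # Feature choices here are illustrative assumptions, not part of RankLib.

    def ranklib_line(label, qid, query, answer):
        q_terms = set(query.lower().split())
        a_terms = answer.lower().split()
        features = {
            1: sum(1 for t in a_terms if t in q_terms),  # matching query terms
            2: len(a_terms),                             # answer length in tokens
        }
        feats = " ".join(f"{fid}:{val}" for fid, val in sorted(features.items()))
        return f"{label} qid:{qid} {feats} # {answer[:30]}"

    print(ranklib_line(1, 530045,
                       "What is the meaning filipino physicists?",
                       "Filipino physicist are physicists in the Philippines"))
    ```

    One such line would be written per retrieved document, grouped by query ID, into the training file that RankLib reads.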

     
  • Itai Gabay

    Itai Gabay - 2018-05-27

    Thank you for the information,

    1) How can I build the training data in the format
    <target> qid:<qid> <feature>:<value> <feature>:<value> ... <feature>:<value>
    if I have 80,000 questions and answers with a lot of features to compute? Is there a script for that?

    2) What is the meaning of the "target" value in the format?

    example of Q&A data set file:

    {
      "question": "What is the meaning filipino physicists?",
      "n_best_answers": ["Filipino physicist are physicists in the Philippines or physicists of Filipino decent..."],
      "answer": "Filipino physicist are physicists in the...",
      "id": "530045"
    }

    {
      "question": "How can I finance investment property? ",
      "n_best_answers": ["You could try to get the owners of the property..."],
      "answer": "I can provide you non-owner financing all..",
      "id": "530046"
    }
    

    features file:

      {
        "store" : "myEfiFeatureStore",
        "name" : "originalScore",
        "class" : "org.apache.solr.ltr.feature.OriginalScoreFeature",
        "params" : { }
      },
      {
        "store" : "myEfiFeatureStore",
        "name" : "titleLength",
        "class" : "org.apache.solr.ltr.feature.FieldLengthFeature",
        "params" : {
          "field" : "question"
        }
      },
    

    Thanks

     

    Last edit: Itai Gabay 2018-05-27
  • Alessandro

    Alessandro - 2018-05-30

    Hi, I answered your other message as well, which is pretty much the same.
    1) You need to build your training set: identify the set of features of interest and build the training-set vectors accordingly.

    2) Target is the relevance label, normally an integer 0-5 where 0 means not relevant and 5 means extremely relevant. It is assigned to a <query, document> pair.
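    As a sketch of the two points above, one simple (hypothetical) labeling scheme for this Q&A data is binary: the record's "answer" field gets target 1 and the other candidates in "n_best_answers" get target 0. The field names follow the examples posted earlier in this thread, and the single feature below is a placeholder.

    ```python
    # Sketch: build RankLib training rows from one Q&A record.
    # Binary targets are an assumption; RankLib also accepts graded labels.

    def to_training_rows(record):
        qid = record["id"]
        rows = []
        for cand in record["n_best_answers"]:
            label = 1 if cand == record["answer"] else 0
            # placeholder feature 1: token count of the candidate answer
            rows.append(f"{label} qid:{qid} 1:{len(cand.split())}")
        return rows

    record = {
        "question": "How can I finance investment property?",
        "n_best_answers": ["You could try to get the owners of the property...",
                           "I can provide you non-owner financing all.."],
        "answer": "I can provide you non-owner financing all..",
        "id": "530046",
    }
    for row in to_training_rows(record):
        print(row)
    ```

    Running this over all 80,000 records and concatenating the rows would give a training file in the required format.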

     
    • Itai Gabay

      Itai Gabay - 2018-05-30

      Thank you so much!

       
