
MAP score not working?

RankLib
tfSpark
2017-03-31
  • tfSpark

    tfSpark - 2017-03-31

    I'm running RankLib and am finding my MAP score is the same (and high) irrespective of what parameters I put in. So a LambdaMART with 1 tree and 1 leaf gives me the same MAP score as the defaults.

    Furthermore, upon inspecting a small validation set I can see the predictions are wrong. It's giving me scores of ~0.05 when my relevance labels range from 0 to 7, and the validation set has only a handful of 0's -- yet the MAP score during training on this validation set is 1 (i.e. 100%).

    I am using the following command:
    java -jar javaLib/RankLib.jar -train df_LETOR_train.csv -test df_LETOR_testPublic.csv -validate df_LETOR_testPrivate.csv -ranker 6 -metric2t MAP -gmax 7 -metric2T MAP -save mymodel.txt

    My data looks like this:
    5 qid:13992 1:0.3383766223195068 2:0.0365138200666939 3:58 4:58 5:0.6535008201674869 6:0.6535008201674869 7:0.443495865111506 8:0.28390882043043475 9:10 10:0.8571428571428571 11:0.991869918699187 12:0.5 13:0.38095238095238093 14:3 15:2 16:1

    What is going on?

     
  • Lemur Project

    Lemur Project - 2017-03-31

    You should try using the -qrels argument, which points to a full relevance-judgments file for the query set. MAP (and NDCG) need full relevance information for best results. The MAP metric assumes the user is interested in many relevant docs for each query (note there is no @k for MAP; if you provide one, it is ignored in the scorer).
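
    As a sketch only, the command from the question with a judgments file added might look like the following. Note the flag is written -qrels above, while RankLib's usage text lists it as -qrel; the file name qrels.txt is a placeholder, not from the original post.

```shell
# Sketch: original command plus a TREC-style relevance-judgments file.
# "-qrel" follows RankLib's usage text; "qrels.txt" is a placeholder name.
java -jar javaLib/RankLib.jar -train df_LETOR_train.csv \
     -test df_LETOR_testPublic.csv -validate df_LETOR_testPrivate.csv \
     -ranker 6 -metric2t MAP -metric2T MAP -gmax 7 \
     -qrel qrels.txt -save mymodel.txt
```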

    If you have access to LETOR data, there should be relevance judgments included in the data.

    I thought LETOR used only three relevance labels (but perhaps I'm misremembering). You seem to have 7, not that it matters much, since MAP is really geared for binary relevance: any judgment value > 0 counts as a 1.
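
    To illustrate the point about binary relevance, here is a minimal sketch (not RankLib's code) of average precision for one query, binarizing graded labels the way MAP does. It also shows why the behavior in the question can occur: if nearly every document for a query has a label > 0, AP is 1 for any ordering, so a 1-tree model can score as well as the default.

```python
# Minimal sketch (not RankLib's implementation): average precision for one
# query, treating any graded label > 0 as relevant, per the MAP convention.

def average_precision(labels_in_ranked_order):
    """AP for one query; labels are graded judgments sorted by the
    model's predicted score, best first."""
    binary = [1 if lbl > 0 else 0 for lbl in labels_in_ranked_order]
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(binary, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Every label > 0, so every ordering is "perfect" under MAP:
print(average_precision([7, 3, 1, 5, 2]))  # -> 1.0
# One non-relevant document ranked first lowers AP:
print(average_precision([0, 7, 3, 1, 5]))
```

    With a validation set containing only a handful of 0-labeled documents, almost every query falls into the first case, which would explain a MAP of 1 regardless of the model.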

    Do you have a large query set? The QIDs in your example data seem to indicate a large set, but if you're only using a subset of LETOR queries, MAP reliability falls off with small query sets. Larger is better.

    Do the suspicious evaluation scores appear with other metrics, or just with MAP?

     
  • tfSpark

    tfSpark - 2017-04-05

    Thanks for the reply.
    So the data I'm using is actually from the Kaggle Home Depot challenge (https://www.kaggle.com/c/home-depot-product-search-relevance) for a university assignment, and I've converted it to the LETOR format. You are correct that LETOR has 3 relevance labels, but for my data I wanted to have 7, with three of these being <= 0.

    The dataset is about 80,000 rows of data.

    Will have to try other metrics and get back to you.

     
  • Lemur Project

    Lemur Project - 2017-04-05

    MAP calculations consider label values > 0 as relevant (i.e. 1), so any negative-valued labels are essentially 0 (non-relevant).

    It seems your label values would contribute to questionable MAP scores.
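
    A one-line sketch of that binarization (the assumed MAP convention, not RankLib source): all three low grades collapse to "non-relevant", so the distinctions among them are invisible to MAP.

```python
# Sketch of the assumed MAP convention: any label <= 0 (including the
# negative grades in this label scheme) counts as non-relevant.
def binarize(labels):
    return [1 if lbl > 0 else 0 for lbl in labels]

print(binarize([-3, -1, 0, 2, 5, 7]))  # -> [0, 0, 0, 1, 1, 1]
```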

     

