
QueryId generation for NOT full text search query

RankLib
Alessandro
2016-04-29
2016-05-11
  • Alessandro

    Alessandro - 2016-04-29

Hi guys,
let me describe my use case:

Scenario
Not a full-text search engine, so every search is a set of boolean filter queries.
All my signals have both document-level features and query-level features, e.g.
user_device_smartphone:1.0
user_device_tablet:0.0
etc.
(one-hot encoded categorical features).

    Algorithm
    LambdaMART

    Metric
    NDCG@k

Training Set generation
I am struggling a bit to decide how to build the queryId.
Let's assume we use all the query-level features, calculate a hash of them, and store it as the queryId.
This should work well in the case where we generate the training set, the validation set and the test set manually,
because we are able to collect all the impressions and build a proper model that will be evaluated queryId by queryId in a consistent way (indeed it makes sense to calculate NDCG per queryId if the queryId is the hash).

But in the case where we use cross validation to generate all the sets, I think using the hash as the queryId could be risky.
The reason for my concern is that we are going to miss entire categories of observations (query-level feature combinations),
so the model risks being of lower quality.
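A minimal sketch of the hashing idea in Java (a hypothetical helper with assumed names; RankLib itself does not provide this). The point is only that identical query-level feature combinations map deterministically to the same id:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class QueryIdHash {
    // Derive a stable queryId from the query-level features only.
    // The TreeMap imposes a deterministic key order, so identical
    // query-level feature combinations always map to the same id,
    // regardless of insertion order.
    public static int queryId(Map<String, Double> queryLevelFeatures) {
        TreeMap<String, Double> sorted = new TreeMap<>(queryLevelFeatures);
        return sorted.toString().hashCode() & 0x7fffffff; // keep it non-negative
    }

    public static void main(String[] args) {
        Map<String, Double> q = new HashMap<>();
        q.put("search_number_adults", 2.0);
        q.put("search_departure_airport_LGW", 1.0);
        q.put("search_number_children", 0.0);
        System.out.println(QueryIdHash.queryId(q));
    }
}
```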

In the case of cross validation, is it better to generate a random queryId?
For example, generate random clusters of 1000 entries each and then run the different folds?
Also in this case I have the concern that the metric would not be that accurate, as we are going to rank together documents that could potentially belong to different result sets.

    Hope my question is clear, would be nice to discuss this!

    Cheers

     
  • Nicolas Fiorini

    Nicolas Fiorini - 2016-05-03

    Hi!

    I'm interested in your problem, but I don't understand your concern. What are the search level feature combinations? Do you have an example?

     
  • Alessandro

    Alessandro - 2016-05-03

I admit I have to study LambdaMART in detail, so sorry if the question does not make sense.
Let's simplify and assume we have:
    Document Level Features
    hotel_star_rating
    hotel_tripadvisor_rating
    hotel_tripadvisor_review_count
    Query Level Features
    search_number_adults
    search_departure_airport_LGW
    search_number_children

At this point I have my training set, each entry with a feature vector like this:
<relevancyScore> <queryId> hotel_star_rating:3 hotel_tripadvisor_rating:3.5 hotel_tripadvisor_review_count:234 search_number_adults:2 search_departure_airport_LGW:1.0 search_number_children:0

Now we have millions of entries and I want to train my model.
How should I choose my queryId?
Will this have any impact?
The fact is that when using cross validation, the sample set is split per queryId.
In the case where we manually build the training set, validation set and test set, I assume it makes sense to generate the queryId as a hash of the query-level features, but I am still not sure if this is going to make any difference; for sure it will group the training vectors in a more meaningful way...
I am not sure if having the queryId as a random identifier (changing, for example, every 10,000 entries) will affect the algorithm.
And what about the evaluation metric? If I am using NDCG@50, would it make sense to have the test set with more than 50 entries per queryId?
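For reference, a line in the format above could be emitted like this (a small sketch assuming integer feature ids, as RankLib's SVMlight-style input expects; `format` is a hypothetical helper, not part of RankLib):

```java
public class RankLibLine {
    // Format one training example as "<label> qid:<queryId> 1:<v1> 2:<v2> ...".
    // Feature names never appear in the file; the learner only sees integer ids.
    public static String format(int label, int queryId, double[] features) {
        StringBuilder sb = new StringBuilder();
        sb.append(label).append(" qid:").append(queryId);
        for (int i = 0; i < features.length; i++) {
            sb.append(' ').append(i + 1).append(':').append(features[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // e.g. hotel_star_rating=3, hotel_tripadvisor_rating=3.5, search_number_adults=2
        System.out.println(format(4, 12, new double[]{3.0, 3.5, 2.0}));
    }
}
```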

Let me know if you understood my doubts and if I can help by describing my concerns in more detail!
Thanks for the interest!

     
  • Lemur Project

    Lemur Project - 2016-05-04

    I think you are over-thinking the process Alessandro (or perhaps I'm still not following you).

First of all, the LTR process does not care about feature names. They are just integer values
to the process; a simple way to keep the mathematical book-keeping clean on a per-feature basis
and to make use of a subset of features if one chooses to.

    Similarly, query IDs are also just simple numbers that make it easy to partition up query sets
    for randomized use (data splits, validation, folding). A query ID of 12345678 isn't really any
    more meaningful than query ID 1 to the process.

    Now it is possible that meaningful query IDs are useful to YOU, but the process itself isn't really
    going to care. You would have to do some sort of post processing yourself to extract some
    sort of meaning from query ranking results, if that is what you were looking to do.

There are some LTR processes that allow labels to be placed on query or feature IDs, but
that is merely for the convenience of users when reading printouts of results.

This might be a fair feature to add to RankLib as a sort of convenience function, but I don't
see how it would make any difference to the actual ranking process.

     
  • Nicolas Fiorini

    Nicolas Fiorini - 2016-05-04

    I'd say the same thing as Stephen.

Maybe I'm not following you, but normally the query IDs have no impact at all. How would splitting a training set while keeping the qids together lead to a worse model?
If it has no impact (which is what I think, unless you specify why), then a hash is perfectly fine. It guarantees uniqueness and gives you some meaning – although I'd store the queries/facets somewhere else anyway.

As for the evaluation/optimization metric, it depends on the use case. For example, I'm working on a system that displays 20 results on the first page, so I will maximize NDCG@20. Google would maximize NDCG@10, or maybe even NDCG@5. You just have to keep in mind that you want to optimize what the user looks at. If you have any clue about how many results users tend to look at, then you can use it to define k in NDCG@k. I'd recommend keeping k low though, unless you have specific reasons to use 50. If a challenge evaluates the first 50 results then you should optimize them; otherwise, 1, 5 and 10 are the most common ones.

    And finally, I think it's even better if you have more than 50 instances! Keep in mind that LTR is useful to optimize the top k results, no matter how many results you have. In my use case again, I have up to 500 instances for each query, and I optimize NDCG@20. The reason why I do this is we saw that some very relevant results are around the 300th-400th position, and we need to bring them at the top.
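For concreteness, NDCG@k with the exponential gain (2^rel − 1) can be sketched like this (a minimal illustration of the metric, not RankLib's actual implementation):

```java
public class NdcgAtK {
    // DCG@k with exponential gains: sum over the top k of (2^rel_i - 1) / log2(i + 2)
    static double dcg(int[] rels, int k) {
        double s = 0.0;
        int n = Math.min(k, rels.length);
        for (int i = 0; i < n; i++) {
            s += (Math.pow(2, rels[i]) - 1) / (Math.log(i + 2) / Math.log(2));
        }
        return s;
    }

    // NDCG@k = DCG@k of the given ranking / DCG@k of the ideal (descending) ranking.
    public static double ndcg(int[] rels, int k) {
        int[] ideal = rels.clone();
        java.util.Arrays.sort(ideal);
        // reverse into descending order
        for (int i = 0, j = ideal.length - 1; i < j; i++, j--) {
            int t = ideal[i];
            ideal[i] = ideal[j];
            ideal[j] = t;
        }
        double idcg = dcg(ideal, k);
        return idcg == 0.0 ? 0.0 : dcg(rels, k) / idcg;
    }

    public static void main(String[] args) {
        // a relevant document ranked second instead of first lowers NDCG@2 below 1
        System.out.println(ndcg(new int[]{0, 3}, 2));
    }
}
```

An ideally ordered list scores exactly 1.0, which is why the per-query values can be averaged across queries.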

    I hope this is clear and you'll be able to get this working very soon :-)

     
  • Lemur Project

    Lemur Project - 2016-05-05

    Excellent information.

However, I would add that this all holds when the emphasis is on precision.

There are some use cases (for example, an intelligence analyst) where someone is going to look through all 1000 documents returned because the cost of missing something is very high, so recall is of greater overall/long-term importance.

    And of course, one never quite knows where that last relevant document is going to appear!

     
  • Alessandro

    Alessandro - 2016-05-05

    Thanks for all the useful information guys, really appreciate it !

    Let me try to explain better:

I am not interested in the queryId label itself at all; I am clear on the fact that it exists only to partition the training set :)
So my follow-up questions:
1) Does this partitioning affect the training? Let's assume I use a validation set.
The validation set will be used during the training phase, and the NDCG will drive improvements in the model. The way I choose the queryId will affect how I build the RankLists, and NDCG will be calculated as the average of the NDCG of each queryId group.
If I am right, it is important that the queryId in the validation set groups the samples in a meaningful way.
Am I right so far?
What about the training set? Will the way I group the records matter in some way?

2) My concern is not related to the label of the queryId, but to how to build the groups.
Let's assume I have a training set with 3 binary features and a very simple scenario:

a) Random queryId, let's say it changes every 2 records:
5 qid:1 1:1 2:0 3:1
4 qid:1 1:0 2:1 3:1
3 qid:2 1:1 2:0 3:1
4 qid:2 1:1 2:0 3:1
5 qid:3 1:1 2:1 3:1
4 qid:3 1:1 2:0 3:0

b) The queryId is a calculated hash, so each group contains exactly the same values for the query-level features:

5 qid:4 1:1 2:0 3:0
4 qid:4 1:0 2:0 3:0
3 qid:5 1:1 2:0 3:1
4 qid:5 1:1 2:0 3:1
5 qid:7 1:1 2:1 3:1
4 qid:7 1:1 2:1 3:1

    Will the training change for those 2 scenarios ?

Regarding NDCG@k, I studied how it works and I agree with you that 20 is probably a good k for my use case.
Sorry if I didn't explain myself properly; please let me know if this time I have been clearer :)
I am still learning Learning to Rank :)

     
  • Alessandro

    Alessandro - 2016-05-06

I think this code in the LambdaMART implementation answers my questions:

    // count the total number of data points across all RankLists
    int dpCount = 0;
    for (int i = 0; i < samples.size(); i++) {
        RankList rl = samples.get(i);
        dpCount += rl.size();
    }
    int current = 0;
    martSamples = new DataPoint[dpCount];
    modelScores = new double[dpCount];
    pseudoResponses = new double[dpCount];
    weights = new double[dpCount];
    // flatten every RankList into one contiguous array of data points
    for (int i = 0; i < samples.size(); i++) {
        RankList rl = samples.get(i);
        for (int j = 0; j < rl.size(); j++) {
            martSamples[current + j] = rl.get(j);
            modelScores[current + j] = 0.0F;
            pseudoResponses[current + j] = 0.0F;
            weights[current + j] = 0;
        }
        current += rl.size();
    }
    

This makes it clear that during the training we merge all the data points, so we don't care anymore about the original groups.
On the other hand, these groups are still used for evaluating the metrics (on the validation set, the training set and the test set).
This means that choosing a proper queryId will produce a meaningful grouping that will help in evaluating the performance properly.
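A minimal sketch of that per-query grouping (a hypothetical helper, not RankLib's internals), showing that the qid column alone decides which lines end up ranked and evaluated together:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class QueryGrouping {
    // Group SVMlight-style lines ("<label> qid:<id> ...") by qid, mirroring
    // how one RankList is built per distinct query id before evaluation.
    public static Map<Integer, List<String>> group(List<String> lines) {
        Map<Integer, List<String>> byQid = new LinkedHashMap<>();
        for (String line : lines) {
            String[] tokens = line.trim().split("\\s+");
            int qid = Integer.parseInt(tokens[1].substring("qid:".length()));
            byQid.computeIfAbsent(qid, k -> new ArrayList<>()).add(line);
        }
        return byQid;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "5 qid:1 1:1 2:0 3:1",
                "4 qid:1 1:0 2:1 3:1",
                "3 qid:2 1:1 2:0 3:1");
        // the three lines form two RankLists, one per distinct qid
        System.out.println(group(lines).keySet());
    }
}
```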

Let me know if my understanding is ok :)

    Cheers

     
  • Alessandro

    Alessandro - 2016-05-06

P.S. When I was referring to generating the queryId as a hash, I was of course referring to using only the query-related features.

     
  • Lemur Project

    Lemur Project - 2016-05-09

    Keep in mind that there also can be some query set randomization going on depending on training/test or training/validation splits and any cross-fold validation settings.

    Each iteration of an algorithm can end up with different sets of queries.

     
  • Alessandro

    Alessandro - 2016-05-10

Yes Stephen, but the RankLists are never broken up;
to my knowledge, when we split for cross validation or train/validation we always split at the RankList level, we never break up the single RankLists :)
Anyway, I contributed a patch that could be useful; take a look when you have time. I will commit it only if approved!
    Cheers

     
  • Lemur Project

    Lemur Project - 2016-05-11

    Correct, single ranked lists don't get split or shuffled. That wouldn't be good!

    I confess I still don't understand your need for encoding query IDs. I want to make sure that your patch enables you to do what you need to do with query IDs, yet does not alter effects for persons who just use query IDs of "1", "2", etc., with no particular meaning beyond a simple integer valued name.

    I'll look closer at your patch and maybe try some testing, but won't get to it until next week. I'll talk with you then.

     
