Training data

Brought to you by: cammiemw, david_fisher, gregorybrooks, jamiecallan, sm-harding

Forum: RankLib

Creator: Brian Yee

Created: 2017-07-21

Updated: 2017-07-21

Brian Yee - 2017-07-21

I am gathering my training data from user history logs and using their actions to determine the labels.

Consider this scenario:
User 1 searches X and sees 10 products. This user purchases product ABC.
User 2 searches the same query X and sees the same 10 products. This user purchases product DEF.

Is that 20 rows of training data, all with relevance label 0 except for two rows with label 1: one for product ABC and one for product DEF?

Or, since each row of training data is a query-document pair, should I consolidate this into 10 rows and assign some kind of score to ABC and DEF that accounts for one user buying ABC while the other did not?

Put another way, should each row of training data be a unique "query-feature set"? Or can I have many rows with the same query and the same features but different labels?
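
To make the consolidation option concrete, here is a minimal Python sketch. The log layout (tuples of user, query, products shown, product purchased) and the function name `consolidate` are assumptions invented for this example, not anything RankLib defines, and only three products are listed instead of ten to keep it short. It collapses the two sessions above into one entry per query-product pair and records the fraction of impressions that ended in a purchase.

```python
from collections import defaultdict

# Hypothetical log records: (user, query, products shown, product purchased).
sessions = [
    ("user1", "X", ["ABC", "DEF", "GHI"], "ABC"),
    ("user2", "X", ["ABC", "DEF", "GHI"], "DEF"),
]

def consolidate(sessions):
    """Return {(query, product): purchase rate}, one entry per pair."""
    shown = defaultdict(int)   # how often each pair was displayed
    bought = defaultdict(int)  # how often each pair ended in a purchase
    for _user, query, products, purchased in sessions:
        for product in products:
            shown[(query, product)] += 1
        if purchased is not None:
            bought[(query, purchased)] += 1
    return {pair: bought[pair] / shown[pair] for pair in shown}

rates = consolidate(sessions)
for (query, product), rate in rates.items():
    print(query, product, rate)
```

How a purchase rate like this would be turned into a relevance label (a threshold, a few graded buckets, or something else) is a separate design choice that the sketch leaves open.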

Lemur Project - 2017-07-28

I don't think much learning is going to happen if the same features produce different labels.

I am unclear on exactly what is going to be learned. You want a model that predicts the best possible ranking (relevant "documents") for each query. So what is a "relevant" document for a query? A specific product?

If the same query produces a relevant document of product ABC and also of product DEF, then for that query both products would be in the relevant document set and have a label of 1. Any feature set for that query that belongs to a different product would have a label of 0 (assuming ABC and DEF really are the "relevant products" for the query).
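
A rough sketch of that labeling in Python, for illustration only. The helper name `to_ranklib_lines`, the product list, and the feature values are all made up; the only thing taken from RankLib is the input line layout, `<label> qid:<id> <index>:<value> ... # <comment>`.

```python
def to_ranklib_lines(qid, candidates, purchased):
    """Format one query's candidates as '<label> qid:<id> 1:<v1> 2:<v2> ... # <product>'."""
    lines = []
    for product, features in candidates.items():
        # Label 1 if any user bought this product for the query, else 0.
        label = 1 if product in purchased else 0
        feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
        lines.append(f"{label} qid:{qid} {feats} # {product}")
    return lines

# Hypothetical feature values, invented for illustration only.
candidates = {
    "ABC": [0.9, 0.2],
    "DEF": [0.7, 0.4],
    "GHI": [0.1, 0.3],
}
# Union of the purchases made by all users who searched query X.
purchased = {"ABC", "DEF"}

lines = to_ranklib_lines(1, candidates, purchased)
for line in lines:
    print(line)
```

Each query-product pair appears exactly once here, so the same feature vector never shows up with two different labels.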

There must be a million-zillion possible products, so you'd probably have a very large non-relevant product set, which is OK, but you need to have some idea of what the relevant product set would be for each query.

Relevance judgments are hard to come by!
