You need a phrase operator to get statistics of -- and then the text operators might need to be extents, e.g.:

    Node n = new Node("od", 1);
    n.addChild(new Node("extents", "natural"));
    n.addChild(new Node("extents", "language"));

In the StructuredQuery/Galago QL language, this would be:

    #od:1(#extents:natural() #extents:language())

This is basically pseudocode, but the thing you want to count is an "ordered window", or phrase, over the "positions" (called "extents") of your words. The generic "text" operator sometimes...
It's a redefinition of the metric under a single swap of ordering: how much the metric changes if two documents trade places in the ranking. This is used to accelerate some of the learning algorithms, which can then quickly compare the benefit of different rankings based on which documents they would swap. It is difficult to implement and test, but it is critical to LambdaMART, I believe.
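To make the "single swap" idea concrete, here is a minimal sketch for DCG. It is not RankLib's code: the class name `SwapChangeSketch` and the gain/discount choices (2^rel - 1 and a log2 discount) are assumptions made for illustration. The point is that the change from swapping two ranks can be computed from just those two positions, without re-scoring the whole list.

```java
// Sketch only: illustrates the swap-change idea for DCG; not RankLib's implementation.
class SwapChangeSketch {
    // Gain of a relevance label (assumed: 2^rel - 1).
    static double gain(int rel) {
        return Math.pow(2, rel) - 1;
    }

    // Discount for 0-based rank i (assumed: 1 / log2(i + 2)).
    static double discount(int i) {
        return 1.0 / (Math.log(i + 2) / Math.log(2));
    }

    // Full DCG over a ranked list of relevance labels.
    static double dcg(int[] rels) {
        double sum = 0.0;
        for (int i = 0; i < rels.length; i++) {
            sum += gain(rels[i]) * discount(i);
        }
        return sum;
    }

    // Change in DCG if the documents at ranks i and j trade places.
    // Only the two affected terms differ, so this is O(1).
    static double swapChange(int[] rels, int i, int j) {
        return (gain(rels[i]) - gain(rels[j])) * (discount(j) - discount(i));
    }

    public static void main(String[] args) {
        int[] rels = {0, 2, 1};
        double fast = swapChange(rels, 0, 1);
        double slow = dcg(new int[] {2, 0, 1}) - dcg(rels);
        System.out.println("fast=" + fast + " slow=" + slow);
    }
}
```

The two numbers printed by `main` should agree up to floating-point error, which is also a cheap way to test a swap-change implementation against the full metric.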
Yep, that's what I get for answering without trying.
Just by query type -- RankLib's loading is naive about that: it assumes adjacent lines with the same qid are the same query.
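A rough sketch of what "adjacent lines with the same qid" means in practice. This is not RankLib's loader; `AdjacentQidGrouping`, `qidOf`, and `group` are hypothetical names. It only shows that if grouping is done by adjacency, lines that share a qid but are separated by another query end up in different groups, so the input should be sorted by qid.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: groups feature lines by adjacency of qid; not RankLib's loader.
class AdjacentQidGrouping {
    // Pull the qid out of a line by finding the token that starts with "qid:".
    static String qidOf(String line) {
        for (String tok : line.trim().split("\\s+")) {
            if (tok.startsWith("qid:")) {
                return tok.substring(4);
            }
        }
        throw new IllegalArgumentException("no qid in: " + line);
    }

    // Start a new group whenever the qid differs from the previous line's qid.
    static List<List<String>> group(List<String> lines) {
        List<List<String>> groups = new ArrayList<>();
        String prev = null;
        for (String line : lines) {
            String qid = qidOf(line);
            if (!qid.equals(prev)) {
                groups.add(new ArrayList<>());
                prev = qid;
            }
            groups.get(groups.size() - 1).add(line);
        }
        return groups;
    }
}
```

With lines whose qids run 1, 1, 2, 1, this sketch produces three groups rather than two, because the last line is not adjacent to the first two.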
Doesn't matter, because RankLib ignores it. I usually put a zero, e.g.: 0 qid:001 1:0.5 2:0.7 #docid
Threshold candidates are within a feature: it's how many times a feature is allowed to be split. -1 says that any difference in floating point values may be used. If you have fewer than 256 distinct values for a feature, -1 is equivalent to a limit of 256.
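Here is a small sketch of why -1 and 256 coincide when a feature has few distinct values. It is not RankLib's code: `ThresholdCandidatesSketch` and `candidates` are made-up names, and the evenly spaced fallback for features with too many distinct values is an assumption chosen only to keep the example complete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch only: picks candidate split thresholds for a single feature.
class ThresholdCandidatesSketch {
    // maxCandidates == -1 means "every distinct value may be a threshold".
    static List<Double> candidates(double[] values, int maxCandidates) {
        TreeSet<Double> distinct = new TreeSet<>();
        for (double v : values) {
            distinct.add(v);
        }
        // Few enough distinct values: the limit never kicks in.
        if (maxCandidates == -1 || distinct.size() <= maxCandidates) {
            return new ArrayList<>(distinct);
        }
        // Assumed fallback: evenly spaced thresholds starting at the minimum.
        double min = distinct.first();
        double max = distinct.last();
        double step = (max - min) / maxCandidates;
        List<Double> out = new ArrayList<>();
        for (int k = 0; k < maxCandidates; k++) {
            out.add(min + k * step);
        }
        return out;
    }
}
```

For a feature with only three distinct values, `candidates(values, -1)` and `candidates(values, 256)` return the same list in this sketch; they differ only once the number of distinct values exceeds the limit.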
Correct. The key (e.g., #1A) is how RankLib stores the document names.
qrel is in trec_eval format, e.g., from the answer below: https://stackoverflow.com/questions/4275825/how-to-evaluate-a-search-retrieval-engine-using-trec-eval
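For reference, a qrel line in trec_eval format has four whitespace-separated columns: query id, an iteration column (conventionally 0), document id, and relevance. A minimal parsing sketch follows; `QrelLine` and its fields are hypothetical names, and the sample line in the usage note is made up.

```java
// Sketch only: parses one line of a trec_eval-style qrels file.
class QrelLine {
    final String queryId;
    final String docId;
    final int relevance;

    QrelLine(String queryId, String docId, int relevance) {
        this.queryId = queryId;
        this.docId = docId;
        this.relevance = relevance;
    }

    // Expected columns: <query id> <iteration> <doc id> <relevance>
    static QrelLine parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 4) {
            throw new IllegalArgumentException("expected 4 columns: " + line);
        }
        // f[1] is the iteration column; it is skipped here.
        return new QrelLine(f[0], f[2], Integer.parseInt(f[3]));
    }
}
```

For example, `QrelLine.parse("001 0 docA 1")` gives query `001`, document `docA`, relevance `1`.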