The Lemur Project Wiki

Search engine and data mining applications and ClueWeb datasets.

Brought to you by: cammiemw, david_fisher, gregorybrooks, jamiecallan, sm-harding

IndriRunQuery

Specifying Retrieval Parameters

Retrieval is performed with the IndriRunQuery application. The basic command-line usage is:

`  $ ./IndriRunQuery <parameter_file>`

See the IndriRunQuery parameter documentation for the full set of accepted parameters.

Preparing Queries

For IndriRunQuery, the input queries are specified in the parameters file:

query: specifies a query to process. This is a complex element consisting of:
* number: The query number or identifier, which may be non-numeric. By default, queries are numbered in the order they appear in the parameters, starting with 0. This element may appear 0 or 1 times.
* text: The query text, e.g., "#combine(query terms)". This element may appear 0 or 1 times and must be used if any of the other elements are supplied.
* type: Either indri, to use the Indri query language, or nexi, to use the NEXI query language. The default is indri. This element may appear 0 or 1 times.
* workingSetDocno: The external document id of a document to add to the working set for the query. This element may appear 0 or more times. When specified, query evaluation is restricted to the document ids specified.
* feedbackDocno: The external document id of a document to add to the relevance feedback set for the query. This element may appear 0 or more times. When specified, query expansion is performed using only the document ids specified. A non-zero value for the fbDocs parameter is still required when feedbackDocno elements are specified.

For example, the following query has id 503 and evaluates "#combine(prime factor)" on only the 3 listed documents.

`<parameters>
  <query>
    <number>503</number>
    <text>#combine(prime factor)</text>
    <workingSetDocno>clueweb09-en0000-00-00004</workingSetDocno>
    <workingSetDocno>clueweb09-en0000-00-00005</workingSetDocno>
    <workingSetDocno>clueweb09-en0000-00-00006</workingSetDocno>
  </query>
</parameters>`
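
When many queries share this structure, the parameter file can be generated rather than written by hand. The following is a minimal Python sketch that uses only the element names described above (query, number, text, workingSetDocno). The helper name build_query_parameters is hypothetical and not part of Indri.

```python
import xml.etree.ElementTree as ET


def build_query_parameters(queries):
    """Build a <parameters> element from (number, text, working_set) tuples.

    Hypothetical helper: only the element names come from the wiki text.
    """
    root = ET.Element("parameters")
    for number, text, working_set in queries:
        query = ET.SubElement(root, "query")
        ET.SubElement(query, "number").text = str(number)
        ET.SubElement(query, "text").text = text
        # Each workingSetDocno restricts evaluation to that external document id.
        for docno in working_set:
            ET.SubElement(query, "workingSetDocno").text = docno
    return root


params = build_query_parameters([
    (503, "#combine(prime factor)", [
        "clueweb09-en0000-00-00004",
        "clueweb09-en0000-00-00005",
        "clueweb09-en0000-00-00006",
    ]),
])
xml_text = ET.tostring(params, encoding="unicode")
print(xml_text)
```

The printed XML can be saved to a file and passed to IndriRunQuery as the parameter file.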

Querying Multiple/Distributed Indexes

You can query multiple indexes by specifying them in a parameter file:

`<parameters>
  <index>/path/to/index1</index>
  <index>/path/to/index2</index>
</parameters>`
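
A minimal Python sketch of writing such a parameter file and running IndriRunQuery on it follows. The index paths are the placeholders from the example above and the query is illustrative. The sketch assumes the binary is installed on PATH under the name IndriRunQuery, and it skips the run if it is not found.

```python
import shutil
import subprocess
import tempfile
import xml.etree.ElementTree as ET

# Placeholder index paths taken from the example above.
index_paths = ["/path/to/index1", "/path/to/index2"]

root = ET.Element("parameters")
for path in index_paths:
    ET.SubElement(root, "index").text = path
query = ET.SubElement(root, "query")
ET.SubElement(query, "text").text = "#combine(prime factor)"

# Write the parameter file to a temporary location.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write(ET.tostring(root, encoding="unicode"))
    param_file = handle.name

# Assumption: the binary is installed as "IndriRunQuery" on PATH.
binary = shutil.which("IndriRunQuery")
if binary is not None:
    result = subprocess.run([binary, param_file], capture_output=True, text=True)
    print(result.stdout)
else:
    print("IndriRunQuery not found; parameter file written to", param_file)
```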

Smoothing

Optionally, you can specify a smoothing rule that selects the smoothing method and its parameters. For example, any one of the following:

`  <rule>method:linear,collectionLambda:0.4,documentLambda:0.2</rule>
  <rule>method:dirichlet,mu:1000</rule>
  <rule>method:twostage,mu:1500,lambda:0.4</rule>`
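
In the examples above, each rule is a comma-separated list of key:value pairs that begins with the method. A small Python sketch that assembles such strings is shown here. The helper format_rule is hypothetical, and it does not check that the method or parameter names are ones Indri accepts.

```python
def format_rule(method, **params):
    """Return a smoothing rule string such as 'method:dirichlet,mu:1000'."""
    parts = [f"method:{method}"]
    # Keyword arguments keep their call order, so the output order is predictable.
    parts.extend(f"{key}:{value}" for key, value in params.items())
    return ",".join(parts)


print(format_rule("dirichlet", mu=1000))
print(format_rule("linear", collectionLambda=0.4, documentLambda=0.2))
# "lambda" is a Python keyword, so it has to be passed through a dict.
print(format_rule("twostage", **{"mu": 1500, "lambda": 0.4}))
```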

You can also specify different smoothing rules for different types of fields.
The following set of rules uses two-level Dirichlet smoothing and smooths
sentence fields differently from the default. The default rule smooths a document
with the collection by Dirichlet smoothing with mu=50, and then smooths any
field (that is not a sentence) with the smoothed document model by Dirichlet smoothing with mu=5:

`<parameters>
  <rule>method:d,mu:50,documentMu:5</rule>
  <rule>method:d,mu:1200,documentMu:150,field:sentence</rule>
</parameters>`
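
To make the two-level description concrete, here is a Python sketch of the arithmetic. It assumes the standard Dirichlet estimate, (count + mu * background) / (length + mu), at each level. The function names and the counts are illustrative, and Indri's internal computation may differ in detail.

```python
def dirichlet(count, length, background, mu):
    """Dirichlet-smoothed probability of a term against a background model."""
    return (count + mu * background) / (length + mu)


def two_level(field_count, field_len, doc_count, doc_len, coll_prob,
              mu=50, document_mu=5):
    # Level 1: smooth the document with the collection (mu).
    doc_prob = dirichlet(doc_count, doc_len, coll_prob, mu)
    # Level 2: smooth the field with the smoothed document model (documentMu).
    return dirichlet(field_count, field_len, doc_prob, document_mu)


# Illustrative counts: the term occurs 2 times in a 10-token field and
# 5 times in a 100-token document, with collection probability 0.01.
p = two_level(2, 10, 5, 100, 0.01)
print(round(p, 4))
```

A larger documentMu pulls the field estimate toward the document model, and a larger mu pulls the document model toward the collection.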

If you do not specify smoothing rules, the default is Dirichlet smoothing with mu=2500,
which may not be the best setting for your collection and set of queries.
Tables 7 and 8 of Fang et al. 2004 list optimal mu and lambda values for different collections and queries.

Additionally, Zhao and Callan 2008 and Zhao and Callan 2009 include field smoothing setup guidelines.

Formatting Your Query and Results

To produce results in TREC format, use the following parameters:
* runID: a string specifying the id for a query run, used in TREC scorable output.
* trecFormat: true to produce TREC scorable output, otherwise false (default).

You can also format results for INEX processing:
* participant-id: specifies the participant-id attribute used in submissions.
* task: specifies the task attribute (default CO.Thorough).
* query: specifies the query attribute (default automatic).
* topic-part: specifies the topic-part attribute (default T).
* description: specifies the contents of the description tag.

Interpreting Retrieval Results

The default output from IndriRunQuery is a list of results, one per line, with four columns:
* score: the score of the returned document. An Indri query always returns a negative score for a result.
* docID: the document ID
* extent_begin: the starting token number of the extent that was retrieved
* extent_end: the ending token number of the extent that was retrieved

As an example:

  -4.83646 AP890101-0001 0 485
  -7.06236 AP890101-0015 0 385
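
The four columns can be read back with a few lines of Python. This is a minimal sketch that assumes whitespace-separated columns, as in the example above, and document ids that contain no spaces. The Result type and parse_result function are illustrative names.

```python
from typing import NamedTuple


class Result(NamedTuple):
    score: float
    doc_id: str
    extent_begin: int
    extent_end: int


def parse_result(line):
    """Parse one line of default output: score, docID, extent_begin, extent_end."""
    score, doc_id, begin, end = line.split()
    return Result(float(score), doc_id, int(begin), int(end))


lines = [
    "  -4.83646 AP890101-0001 0 485",
    "  -7.06236 AP890101-0015 0 385",
]
results = [parse_result(line) for line in lines]
for r in results:
    # Extent length in tokens is the difference between the two positions.
    print(r.doc_id, r.score, r.extent_end - r.extent_begin)
```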

If the results were formatted with TREC formatting as described above, the output will be in the format:

`  <queryID> Q0 <DocID> <rank> <score> <runID>`
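
A line in this format can be split into its fields as follows. This is a Python sketch that assumes the six whitespace-separated fields shown above. The function name and the sample line are hypothetical.

```python
def parse_trec_line(line):
    """Split '<queryID> Q0 <DocID> <rank> <score> <runID>' into a dict."""
    query_id, _q0, doc_id, rank, score, run_id = line.split()
    return {
        "query_id": query_id,
        "doc_id": doc_id,
        "rank": int(rank),
        "score": float(score),
        "run_id": run_id,
    }


# Hypothetical line: the query id, rank, and run id are made up for illustration.
row = parse_trec_line("503 Q0 AP890101-0001 1 -4.83646 myrun")
print(row)
```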