The Lemur Project Wiki

Search engine and data mining applications and ClueWeb datasets.

Brought to you by: cammiemw, david_fisher, gregorybrooks, jamiecallan, sm-harding

Quick Start

This Quick Start guide covers indexing and retrieval with out-of-the-box Indri. It assumes a Linux or Mac command line; the GUI and Windows are not covered.

Downloading and Compiling

Download the source code from http://sourceforge.net/projects/lemur/files/lemur/indri-5.0/indri-5.0.tar.gz/download

Run the following commands:

tar -zxf indri-5.0.tar.gz
cd indri-5.0
./configure
make

For more information about configuration options and installing libraries, see [Compiling and Installing].

Indexing

Before running the indexer, make sure the output directory is empty or does not exist. Indexing is then as simple as running the following command:

indri-5.0/buildindex/IndriBuildIndex parameter_file

Example Formats for input data files

TRECTEXT format looks like this:

    <DOC>
    <DOCNO>1</DOCNO>
    <TEXT>
    document content
    </TEXT>
    </DOC>
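The TRECTEXT layout above can be produced from plain text with a few lines of Python. This is a minimal sketch: the function name `to_trectext` is hypothetical, and the code assumes the document text needs no escaping, which should be checked against your own collection.

```python
def to_trectext(docno, text):
    """Wrap one document in the TRECTEXT layout shown above.

    Assumption: the text is written as-is, without any escaping.
    """
    return (
        "<DOC>\n"
        f"<DOCNO>{docno}</DOCNO>\n"
        "<TEXT>\n"
        f"{text}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


if __name__ == "__main__":
    # Print one wrapped document; redirect to a file to build a collection.
    print(to_trectext("1", "document content"), end="")
```

Concatenating the output for several documents into one file gives a collection file that can be placed in the corpus directory.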

TRECWEB format looks like this:

    <DOC>
    <DOCNO>...</DOCNO>
    <DOCHDR>
    ... e.g. URL
     and other metadata information
    </DOCHDR>
    ... HTML content
    </DOC>

Example Parameter Files

Example parameter file for indexing:

<parameters>
     <memory>200m</memory>
     <index>/path/to/outputIndex</index>

     <stemmer>    
       <name>krovetz</name>
     </stemmer>

     <corpus>
       <path>/path/to/collection1/</path>        
       <class>trectext</class>
     </corpus>
     <corpus>
       <path>/path/to/collection2/</path>
       <class>trecweb</class>
     </corpus>

     <field><name>title</name></field>
     <field><name>date</name><numeric>true</numeric><parserName>DateFieldAnnotator</parserName></field> 

</parameters>

Notes:
* The memory parameter sets a rough limit on the memory consumption of the indexer. In most cases, total memory usage stays at or below three times the parameter value.
* The krovetz stemmer is conservative and rarely overgeneralizes, whereas the porter stemmer is more aggressive and may strip too long a suffix from a word.
* Indexing the title field makes it searchable through the Indri query language. For example, "#combine[title](query)" returns title fields as results, ranked by "query".
* When indexing and querying fields, make sure the field names are in lower case, e.g. "title" instead of "TITLE".
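When several collections need to be indexed, the parameter file can be generated instead of written by hand. The sketch below builds the same elements as the example above using Python's standard `xml.etree.ElementTree`. The function name `build_index_params` and its arguments are hypothetical, and only plain (non-numeric) fields are handled.

```python
import xml.etree.ElementTree as ET


def build_index_params(index_path, corpora, memory="200m",
                       stemmer="krovetz", fields=("title",)):
    """Return an IndriBuildIndex parameter file as an XML string.

    corpora is a list of (path, class) pairs, e.g. ("/path/to/c1/", "trectext").
    """
    root = ET.Element("parameters")
    ET.SubElement(root, "memory").text = memory
    ET.SubElement(root, "index").text = index_path

    stem = ET.SubElement(root, "stemmer")
    ET.SubElement(stem, "name").text = stemmer

    for path, doc_class in corpora:
        corpus = ET.SubElement(root, "corpus")
        ET.SubElement(corpus, "path").text = path
        ET.SubElement(corpus, "class").text = doc_class

    for name in fields:
        field = ET.SubElement(root, "field")
        # Field names are lower-cased, following the note above.
        ET.SubElement(field, "name").text = name.lower()

    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_index_params(
        "/path/to/outputIndex",
        [("/path/to/collection1/", "trectext"),
         ("/path/to/collection2/", "trecweb")],
    ))
```

The printed text can be saved to a file and passed to IndriBuildIndex as parameter_file.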

Example parameter file for .GOV or .GOV2 collections (including inlink anchor texts):

First use [harvestlinks] to extract the inlink anchor text:

indri-5.0/harvestlinks/harvestlinks -corpus=/path/to/.GOVcollection/ -output=/path/to/gov/anchor_text/

The indexing parameter file looks like the following:

<parameters>
  <memory>200m</memory>
  <index>/path/to/GOVindex</index>

  <stemmer>
    <name>krovetz</name>
  </stemmer>

  <corpus>
    <path>/path/to/.GOVcollection/</path>
    <inlink>/path/to/gov/anchor_text/sorted</inlink>
    <class>trecweb</class>
  </corpus>

  <field><name>inlink</name></field>
  <field><name>title</name></field>
  <field><name>date</name><numeric>true</numeric><parserName>DateFieldAnnotator</parserName></field> 

</parameters>

Note: indexing the inlink field makes it searchable through the Indri query language. For example, "#combine(query.inlink)" returns documents matching "query" in the inlink field.

More about these IndriBuildIndex parameter files can be found in [IndriBuildIndex Parameters].

For more about indexing fields annotated either inline with the text or separately by offset, see [Inline and Offset Annotations].

For indexing XML documents, see [Indexing XML document].

Retrieval

To run retrieval, execute:

indri-5.0/runquery/IndriRunQuery query_parameter_file -count=1000 -index=/path/to/index -trecFormat=true > result_file

Command line options include:
* -count=N restricts the number of results returned for each query.
* -trecFormat=true formats the output so that trec_eval and [ireval] can read the results.

You can also specify the query on the command line, e.g. -query="apple juice" or -query="#combine(apple juice)". These two forms should return the same results and scores.

Results for all queries in the query_parameter_file are saved in the file result_file.
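To inspect result_file programmatically, the lines can be grouped by query. The sketch below assumes the usual six-column run layout read by trec_eval (query number, the literal Q0, document number, rank, score, run name); that layout is not spelled out in this guide, so check it against your own output. The function name `parse_trec_run` and the sample document numbers are made up for illustration.

```python
from collections import defaultdict


def parse_trec_run(lines):
    """Group run lines by query number.

    Returns {query_number: [(rank, docno, score), ...]} sorted by rank.
    Assumes six whitespace-separated columns per line; other lines are skipped.
    """
    ranked = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # blank or malformed line
        qid, _q0, docno, rank, score, _run_name = parts
        ranked[qid].append((int(rank), docno, float(score)))
    for qid in ranked:
        ranked[qid].sort()
    return dict(ranked)


if __name__ == "__main__":
    sample = [
        "751 Q0 DOC-B 2 -5.5 indri",  # made-up document numbers and scores
        "751 Q0 DOC-A 1 -4.25 indri",
        "752 Q0 DOC-C 1 -3.0 indri",
    ]
    for qid, docs in parse_trec_run(sample).items():
        print(qid, docs[0])
```

In practice, pass an open file handle for result_file instead of the sample list.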

Example Query Parameter File

<parameters>
  <query>
    <type>indri</type>
    <number>751</number>
    <text>
      #combine( popular scrabble players )
    </text>
  </query>
  <query>
    <type>indri</type>
    <number>752</number>
    <text>
      #combine( dam removal environmental impact )
    </text>
  </query>
</parameters>
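For a larger topic set, the query parameter file above can be generated from a list of query numbers and terms. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `build_query_params` is hypothetical, and writing the query text on one line (rather than on its own line as in the example) is an assumption about how whitespace is treated.

```python
import xml.etree.ElementTree as ET


def build_query_params(queries):
    """Return a query parameter file as an XML string.

    queries is a list of (number, terms) pairs; the terms are assumed to be
    already cleaned of punctuation and are wrapped in #combine( ... ).
    """
    root = ET.Element("parameters")
    for number, terms in queries:
        query = ET.SubElement(root, "query")
        ET.SubElement(query, "type").text = "indri"
        ET.SubElement(query, "number").text = str(number)
        ET.SubElement(query, "text").text = f"#combine( {terms} )"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_query_params([
        (751, "popular scrabble players"),
        (752, "dam removal environmental impact"),
    ]))
```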

For more about queries, see [The Indri Query Language].

For XML element retrieval in NEXI language, see example.

Smoothing parameters for Language Modeling

The default smoothing method for Indri is Dirichlet smoothing with mu parameter set to 2500.

You can specify your own smoothing parameter on the command line (e.g. -rule=method:d,mu:2500) or in the parameter file passed to IndriRunQuery (<rule>method:d,mu:2500</rule>).
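To get a feel for what mu does, the textbook Dirichlet-smoothed estimate is P(w|D) = (c(w,D) + mu * P(w|C)) / (|D| + mu), where c(w,D) is the term count in the document, |D| the document length, and P(w|C) the term's probability in the whole collection. The sketch below evaluates that formula and builds the rule string; it illustrates the standard formula only and is not taken from Indri's source code. The function names are hypothetical.

```python
def dirichlet_prob(term_count, doc_length, collection_prob, mu=2500.0):
    """Textbook Dirichlet-smoothed probability of a term in a document."""
    return (term_count + mu * collection_prob) / (doc_length + mu)


def rule_string(mu):
    """Build the value used with -rule= or inside <rule>...</rule>."""
    return f"method:d,mu:{mu}"


if __name__ == "__main__":
    # A term seen 5 times in a 500-word document, with collection probability 0.001.
    for mu in (500, 2500, 10000):
        print(rule_string(mu), dirichlet_prob(5, 500, 0.001, mu))
```

A larger mu pulls the estimate toward the collection probability, while a smaller mu trusts the counts in the document more.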

For more about optimal smoothing parameters, see [IndriRunQuery].

For querying multiple indexes, see [IndriRunQuery] or [The Indri Daemon].

Query Tokenization

TREC queries cannot be fed into Indri directly: punctuation must be removed first. One simple strategy is to replace every character that is not a digit (0x30-0x39) or a letter with a space (0x20). However, queries should be tokenized the same way the indexer tokenizes text. For example, the Indri indexer converts "U.S." into "us".
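The simple strategy can be sketched in Python as follows. Two parts are assumptions rather than Indri behavior: only ASCII letters and digits are kept, and dotted acronyms such as "U.S." are joined before punctuation is stripped so that they do not split into separate letters. Case is left unchanged. The function name `clean_query` is hypothetical, and the result should be compared with how your index tokenizes the same text.

```python
import re

# Assumption: a dotted acronym is two or more single letters, each followed by a period.
_ACRONYM = re.compile(r"\b(?:[A-Za-z]\.){2,}")
# Anything that is not an ASCII letter or digit (0x30-0x39) is treated as a separator.
_NON_ALNUM = re.compile(r"[^A-Za-z0-9]+")


def clean_query(text):
    """Strip punctuation from a raw TREC query so it can be passed to Indri."""
    # Join acronyms first: "U.S." becomes "US" instead of "U S".
    text = _ACRONYM.sub(lambda m: m.group(0).replace(".", ""), text)
    # Replace every remaining run of other characters with a single space.
    return _NON_ALNUM.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_query("U.S. oil-price policy, 1990s?"))
```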

Evaluation

Simply use trec_eval or [ireval].

trec_eval -q QREL_file Retrieval_Results > eval_output

This gives per-query evaluation results and a summary averaged over all queries.

java -jar lemur/ireval/src/ireval.jar baseline_result treatment_result QREL_file > comparison_output

This also gives statistical significance tests comparing treatment_result with baseline_result.

Wiki: Compiling and Installing
Wiki: Home
Wiki: IndriBuildIndex Parameters
Wiki: IndriRunQuery
Wiki: Inline and Offset Annotations
Wiki: The Indri Daemon
Wiki: The Indri Query Language
Wiki: Toolkit Usage Overview
Wiki: harvestlinks
Wiki: ireval