This Quick Start guide explains how to index and retrieve documents with an out-of-the-box Indri installation. It assumes a Linux or Mac command line (no GUI, no Windows).
Download the source code from http://sourceforge.net/projects/lemur/files/lemur/indri-5.0/indri-5.0.tar.gz/download
Run the following commands:
    tar -zxf indri-5.0.tar.gz
    cd indri-5.0
    ./configure
    make
For more information about configuration options and installing libraries, see [Compiling and Installing].
Before running the indexer, make sure the output directory is empty or does not exist. Indexing then takes a single command:
indri-5.0/buildindex/IndriBuildIndex parameter_file
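The precondition on the output directory can be checked before launching the indexer. The following is a minimal Python sketch; the helper name `index_dir_is_safe` is hypothetical and not part of Indri.

```python
from pathlib import Path


def index_dir_is_safe(path):
    """Return True if `path` does not exist or is an empty directory.

    Hypothetical helper (not part of Indri): it only mirrors the
    "empty or does not exist" rule stated in this guide.
    """
    p = Path(path)
    if not p.exists():
        return True
    # An existing regular file or a non-empty directory is rejected.
    return p.is_dir() and not any(p.iterdir())
```

A wrapper script could call this check on the `<index>` path and refuse to start IndriBuildIndex when it returns False.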
TRECTEXT format looks like this:
    <DOC>
    <DOCNO>1</DOCNO>
    <TEXT>
    document content
    </TEXT>
    </DOC>
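If your collection is not yet in this format, a small script can write it. The sketch below renders (docno, text) pairs in the TRECTEXT layout shown above; the function name `to_trectext` is hypothetical, and the sketch assumes the text contains no literal `</TEXT>` or `</DOC>` strings.

```python
def to_trectext(docs):
    """Render an iterable of (docno, text) pairs as TRECTEXT records.

    Hypothetical helper, not part of Indri. The text is written verbatim;
    no escaping is applied.
    """
    records = []
    for docno, text in docs:
        records.append(
            "<DOC>\n"
            f"<DOCNO>{docno}</DOCNO>\n"
            "<TEXT>\n"
            f"{text}\n"
            "</TEXT>\n"
            "</DOC>\n"
        )
    return "".join(records)
```

The returned string can be written to a file inside the directory named by the `<path>` element of a `trectext` corpus.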
TRECWEB format looks like this:
    <DOC>
    <DOCNO>...</DOCNO>
    <DOCHDR>
    ... e.g. URL and other metadata information
    </DOCHDR>
    ... HTML content
    </DOC>
Example parameter file for indexing:
    <parameters>
      <memory>200m</memory>
      <index>/path/to/outputIndex</index>
      <stemmer>
        <name>krovetz</name>
      </stemmer>
      <corpus>
        <path>/path/to/collection1/</path>
        <class>trectext</class>
      </corpus>
      <corpus>
        <path>/path/to/collection2/</path>
        <class>trecweb</class>
      </corpus>
      <field><name>title</name></field>
      <field><name>date</name><numeric>true</numeric><parserName>DateFieldAnnotator</parserName></field>
    </parameters>
Notes:
* The memory parameter sets a rough limit on the memory consumption of the indexer. In most cases, total memory usage stays at or below 3x the parameter value.
* The Krovetz stemmer rarely overgeneralizes, whereas the Porter stemmer does (e.g. it can strip too long a suffix from a word).
* Indexing the title field makes it searchable through the Indri query language. For example, "#combine[title](query)" returns titles as results, ranked by "query".
* When indexing and querying fields, make sure the field names are in lower case, e.g. "title" instead of "TITLE".
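When many collections are indexed with similar settings, the parameter file can be generated rather than written by hand. The sketch below uses Python's standard `xml.etree.ElementTree` and covers only the elements shown in the example above (the numeric date field is left out); the function name `build_index_params` and its defaults are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def build_index_params(index_path, corpora, memory="200m",
                       stemmer="krovetz", fields=("title",)):
    """Build an IndriBuildIndex parameter file as a string.

    `corpora` is a list of (path, class) pairs, e.g.
    [("/path/to/collection1/", "trectext")]. Hypothetical helper.
    """
    root = ET.Element("parameters")
    ET.SubElement(root, "memory").text = memory
    ET.SubElement(root, "index").text = index_path
    ET.SubElement(ET.SubElement(root, "stemmer"), "name").text = stemmer
    for path, doc_class in corpora:
        corpus = ET.SubElement(root, "corpus")
        ET.SubElement(corpus, "path").text = path
        ET.SubElement(corpus, "class").text = doc_class
    for name in fields:
        # Field names are lower-cased, following the note above.
        ET.SubElement(ET.SubElement(root, "field"), "name").text = name.lower()
    return ET.tostring(root, encoding="unicode")
```

Writing the returned string to a file gives a parameter file that can be passed to IndriBuildIndex as in the command above.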
Example parameter file for .GOV or .GOV2 collections (including inlink anchor texts):
First, use [harvestlinks] to extract the inlink anchor text:
indri-5.0/harvestlinks/harvestlinks -corpus=/path/to/.GOVcollection/ -output=/path/to/gov/anchor_text/
The indexing parameter file looks like the following:
    <parameters>
      <memory>200m</memory>
      <index>/path/to/GOVindex</index>
      <stemmer>
        <name>krovetz</name>
      </stemmer>
      <corpus>
        <path>/path/to/.GOVcollection/</path>
        <inlink>/path/to/gov/anchor_text/sorted</inlink>
        <class>trecweb</class>
      </corpus>
      <field><name>inlink</name></field>
      <field><name>title</name></field>
      <field><name>date</name><numeric>true</numeric><parserName>DateFieldAnnotator</parserName></field>
    </parameters>
Note that indexing the inlink field makes it searchable through the Indri query language. For example, "#combine(query.inlink)" returns documents matching "query" in the inlink field.
More about these IndriBuildIndex parameter files can be found in [IndriBuildIndex Parameters].
More about indexing fields either inline with the text or offline: [Inline and Offset Annotations].
For indexing XML documents, see [Indexing XML document].
Retrieval is simply running:
indri-5.0/runquery/IndriRunQuery query_parameter_file -count=1000 -index=/path/to/index -trecFormat=true > result_file
Command line options include:
-count=N restricts the number of results returned for each query.
-trecFormat=true formats the output so that trec_eval and [ireval] can read the results.
You can also specify the query on the command line, e.g. -query="apple juice" or -query="#combine(apple juice)". These two should return the same results and scores.
Results for all queries in the query_parameter_file are saved in the file result_file.
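For a quick look at result_file without running an evaluation tool, the results can be loaded into a dictionary. The sketch below assumes the usual six-column layout read by trec_eval (query number, the literal Q0, document ID, rank, score, run tag); the function name `read_trec_run` is hypothetical, and the column layout should be checked against your own output.

```python
from collections import defaultdict


def read_trec_run(lines):
    """Group TREC-format result lines by query number.

    Assumes six whitespace-separated columns per line:
    query number, Q0, document ID, rank, score, run tag.
    Lines with a different number of columns are skipped.
    """
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue
        qid, _q0, docno, rank, score, _tag = parts
        run[qid].append((int(rank), docno, float(score)))
    return dict(run)
```

Usage: `with open("result_file") as f: run = read_trec_run(f)`, after which `run["751"]` holds the ranked list for query 751.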
Example query parameter file:

    <parameters>
      <query>
        <type>indri</type>
        <number>751</number>
        <text> #combine( popular scrabble players ) </text>
      </query>
      <query>
        <type>indri</type>
        <number>752</number>
        <text> #combine( dam removal environmental impact ) </text>
      </query>
    </parameters>
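A query parameter file like the one above can be generated from a list of (number, query text) pairs. This is a sketch using Python's standard `xml.etree.ElementTree`; the function name `build_query_params` is hypothetical, and the query text is passed through as given, so any Indri operators such as #combine must already be part of it.

```python
import xml.etree.ElementTree as ET


def build_query_params(queries):
    """Build an IndriRunQuery parameter file from (number, text) pairs.

    Hypothetical helper. ElementTree escapes &, < and > in the text;
    queries cleaned down to letters, digits and Indri operators
    are not affected by this.
    """
    root = ET.Element("parameters")
    for number, text in queries:
        query = ET.SubElement(root, "query")
        ET.SubElement(query, "type").text = "indri"
        ET.SubElement(query, "number").text = str(number)
        ET.SubElement(query, "text").text = text
    return ET.tostring(root, encoding="unicode")
```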
More about [The Indri Query Language].
For XML element retrieval in NEXI language, see example.
The default smoothing method for Indri is Dirichlet smoothing with the mu parameter set to 2500.
You can specify your own smoothing parameter on the command line (e.g. -rule=method:d,mu:2500) or in the parameter file passed to IndriRunQuery (<rule>method:d,mu:2500</rule>).
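To get a feel for what mu does, the textbook Dirichlet-prior estimate can be computed directly. The sketch below is the standard formula, (term count + mu * collection probability) / (document length + mu); it is an illustration only, not Indri's source code, and the function name `dirichlet_prob` is hypothetical.

```python
def dirichlet_prob(term_count, doc_length, collection_prob, mu=2500.0):
    """Dirichlet-smoothed probability of a term given a document.

    term_count      -- occurrences of the term in the document
    doc_length      -- total number of terms in the document
    collection_prob -- probability of the term in the whole collection
    mu              -- smoothing parameter (2500 is the default named above)
    """
    return (term_count + mu * collection_prob) / (doc_length + mu)
```

With a larger mu the estimate moves toward the collection probability; with a smaller mu it moves toward the raw in-document frequency. For an empty document the estimate equals the collection probability.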
More about optimal smoothing parameters: [IndriRunQuery].
For more on querying multiple indexes, see [IndriRunQuery] or [The Indri Daemon].
TREC queries cannot be fed into Indri directly; punctuation must be removed first. One simple strategy is to replace every character that is not a digit (0x30-0x39) or a letter with a space (0x20). However, queries should be tokenized the same way the indexer tokenizes text. For example, the Indri indexer turns "U.S." into "us".
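The replacement strategy described above can be sketched in a few lines of Python. The function name `clean_query` is hypothetical. Two choices here are assumptions rather than documented Indri behavior: dotted abbreviations are joined before punctuation is stripped (so "U.S." becomes "us" instead of "u s"), and the result is lower-cased. Only ASCII letters and digits are kept.

```python
import re

# Two or more single letters each followed by a period, e.g. "U.S."
_ABBREVIATION = re.compile(r"\b(?:[A-Za-z]\.){2,}")
# Any run of characters that are not ASCII letters or digits
_NON_ALNUM = re.compile(r"[^0-9A-Za-z]+")


def clean_query(text):
    """Strip punctuation from a raw TREC query (hypothetical helper)."""
    # Join dotted abbreviations first: "U.S." -> "US".
    text = _ABBREVIATION.sub(lambda m: m.group(0).replace(".", ""), text)
    # Replace everything that is not a letter or digit with a space.
    text = _NON_ALNUM.sub(" ", text)
    return text.strip().lower()
```

The cleaned text can then be wrapped in #combine( ... ) and placed in the `<text>` element of a query parameter file.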
To evaluate the results, use trec_eval or [ireval].
trec_eval -q QREL_file Retrieval_Results > eval_output
This gives a per-query evaluation and a summary averaged over all queries.
Or
java -jar lemur/ireval/src/ireval.jar baseline_result treatment_result QREL_file > comparison_output
This also gives statistical significance tests comparing treatment_result with baseline_result.
Wiki: Compiling and Installing
Wiki: Home
Wiki: IndriBuildIndex Parameters
Wiki: IndriRunQuery
Wiki: Inline and Offset Annotations
Wiki: The Indri Daemon
Wiki: The Indri Query Language
Wiki: Toolkit Usage Overview
Wiki: harvestlinks
Wiki: ireval