The Lemur Project Wiki

Search engine and data mining applications and ClueWeb datasets.

Brought to you by: cammiemw, david_fisher, gregorybrooks, jamiecallan, sm-harding

IndriBuildIndex Parameters

Parameter files for IndriBuildIndex are well-formed XML documents whose contents must be wrapped in <parameters> </parameters> tags. To use one or more parameter files, name them on the command line:

  $ IndriBuildIndex <parameter_file> [<parameter_file_2> ... <parameter_file_n>]

Note that you can specify more than one parameter file (say, if you have a standard set of stopwords you wish to use for all the indexes you build).
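
As a rough sketch of that setup, the shared stopwords could live in one file and the per-index settings in another. The file names stopwords.xml and myindex.xml, the paths, and the two stopwords are made up for illustration. Both files would then be named together on the command line, for example IndriBuildIndex myindex.xml stopwords.xml.

```xml
<!-- stopwords.xml (hypothetical name): reused for every index you build -->
<parameters>
  <stopper>
    <word>the</word>
    <word>of</word>
  </stopper>
</parameters>
```

```xml
<!-- myindex.xml (hypothetical name): settings specific to one index -->
<parameters>
  <!-- placeholder paths, not taken from a real installation -->
  <index>/path/to/repository</index>
  <corpus>
    <path>/path/to/file_or_directory</path>
    <class>trecweb</class>
  </corpus>
</parameters>
```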

Alternatively, you can specify individual parameter values directly on the command line, using the forms given for each parameter below.

Specifying Source Data

corpus : a complex element containing parameters related to a corpus. This element can be specified multiple times. For each corpus parameter, you can specify the following items:
- path : The pathname of the file or directory containing documents to index. Specified as <corpus><path>/path/to/file_or_directory</path></corpus> in the parameter file and as -corpus.path=/path/to/file_or_directory on the command line.
- class : The FileClassEnvironment of the file or directory containing documents to index. Specified as <corpus><class>trecweb</class></corpus> in the parameter file and as -corpus.class=trecweb on the command line. For a list of default known classes, see the [Indexer File Formats].
- annotations : The pathname of the file containing offset annotations for the documents specified in path. Specified as <corpus><annotations>/path/to/file</annotations></corpus> in the parameter file and as -corpus.annotations=/path/to/file on the command line. For a full description of how to use offset annotations, see [Inline and Offset Annotations].
- metadata : The pathname of the file or directory containing offset metadata for the documents specified in path. Specified as <corpus><metadata>/path/to/file</metadata></corpus> in the parameter file and as -corpus.metadata=/path/to/file on the command line.
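
Putting the four corpus items together, a single corpus entry might look like the sketch below. All paths are placeholders, and the annotations and metadata entries only make sense if such files actually exist for your collection.

```xml
<parameters>
  <corpus>
    <!-- documents to index (placeholder path) -->
    <path>/path/to/file_or_directory</path>
    <!-- file class of the documents -->
    <class>trecweb</class>
    <!-- offset annotations for the documents in path (placeholder path) -->
    <annotations>/path/to/annotations_file</annotations>
    <!-- offset metadata for the documents in path (placeholder path) -->
    <metadata>/path/to/metadata_file</metadata>
  </corpus>
</parameters>
```

Because corpus can be specified multiple times, a second <corpus> element with its own path and class can follow the first.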

Index Parameters

index : path to the Indri Repository to create or to add to. Specified as <index>/path/to/repository</index> in the parameter file and as -index=/path/to/repository on the command line.

Memory and Optimizations

memory : an integer value specifying the number of bytes to use for the indexing process. The value can include a scaling factor by adding a suffix. Valid suffixes (case insensitive) are K=1000, M=1000000, and G=1000000000, so 100M is equivalent to 100000000. The value should contain only decimal digits and the optional suffix. Specified as <memory>100M</memory> in the parameter file and as -memory=100M on the command line.
offsetannotationhint : an optional parameter that gives the indexer a hint to speed up indexing of offset annotations when offset annotation files are specified in the <corpus> parameter. Valid values are "unordered" and "ordered". An "unordered" hint (the default) informs the indexer that the document IDs of the annotations are not necessarily in the same order as the documents in the corpus. The indexer will then pre-allocate enough memory before reading in the whole annotations file. If you are absolutely certain that the annotations in the offset annotation file are in exactly the same order as the documents, you can use the "ordered" hint. This tells the indexer not to read the entire file at once, but to read the offset annotations file as needed, loading only the annotations for the document currently being indexed.
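
A minimal sketch of these two settings in a parameter file follows. The text above does not spell out the XML form of offsetannotationhint, so writing it as a top-level element alongside memory is an assumption here, as are the 512M value and the placeholder paths.

```xml
<parameters>
  <!-- 512M = 512 * 1000000 = 512000000 bytes -->
  <memory>512M</memory>
  <!-- assumed placement: top-level element, like memory -->
  <!-- use "ordered" only if annotations follow the document order exactly -->
  <offsetannotationhint>ordered</offsetannotationhint>
  <corpus>
    <path>/path/to/file_or_directory</path>
    <class>trecweb</class>
    <annotations>/path/to/annotations_file</annotations>
  </corpus>
</parameters>
```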

Stopwords and Stemming

stopper : a complex element containing one or more subelements named word, specifying the stopword list to use. Specified as <stopper><word>stopword</word></stopper> in the parameter file and as -stopper.word=stopword on the command line. This is an optional parameter with the default of no stopping.
stemmer : a complex element specifying the stemming algorithm to use in the subelement name. Default valid options are 'Porter' or 'Krovetz' (case insensitive). Specified as <stemmer><name>stemmername</name></stemmer> in the parameter file and as -stemmer.name=stemmername on the command line. This is an optional parameter with the default of no stemming.
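
For illustration, a parameter file that enables both stopping and stemming could look like this. The three stopwords are arbitrary examples, not a recommended list.

```xml
<parameters>
  <stopper>
    <!-- one word element per stopword -->
    <word>a</word>
    <word>an</word>
    <word>the</word>
  </stopper>
  <stemmer>
    <!-- 'Porter' or 'Krovetz', case insensitive -->
    <name>porter</name>
  </stemmer>
</parameters>
```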

Specifying Metadata and Fields

metadata : a complex element containing one or more entries specifying the metadata fields to index, e.g. title, headline. There are three options:
- field : Make the named field available for retrieval as metadata. Specified as <metadata><field>fieldname</field></metadata> in the parameter file and as -metadata.field=fieldname on the command line.
- forward : Make the named field available for retrieval as metadata and build a lookup table to make retrieving the value more efficient. Specified as <metadata><forward>fieldname</forward></metadata> in the parameter file and as -metadata.forward=fieldname on the command line. The external document id field "docno" is automatically added as a forward metadata field.
- backward : Make the named field available for retrieval as metadata and build a lookup table for inverse lookup of documents based on the value of the field. Specified as <metadata><backward>fieldname</backward></metadata> in the parameter file and as -metadata.backward=fieldname on the command line. The external document id field "docno" is automatically added as a backward metadata field.
field : a complex element specifying the fields to index as data, e.g. TITLE. This parameter can appear multiple times in a parameter file. If provided on the command line, only the first field specified will be indexed. The subelements are:
- name : a required field specifying the field name, specified as <field><name>fieldname</name></field> in the parameter file and as -field.name=fieldname on the command line.
- numeric : an optional parameter specifying whether the field contains integer numeric data ("true") or not ("false"). Specified as <field><numeric>true</numeric></field> in the parameter file and as -field.numeric=true on the command line. Defaults to false. Note that 0 can be used for false and 1 can be used for true.
- parserName : an optional parameter that contains the name of the parser to use to convert a numeric field to an unsigned integer value. The default is NumericFieldAnnotator. If numeric field data is provided via offset annotations, you should use the value OffsetAnnotationAnnotator. If the field contains a formatted date (see [Numeric and Date Fields in Indri]) you should use the value DateFieldAnnotator.
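
The metadata and field options above can be combined in one parameter file, roughly as sketched here. The field names title, headline, url, and date are illustrative only, and the date field assumes your documents carry a formatted date in a tag of that name.

```xml
<parameters>
  <metadata>
    <!-- retrievable as metadata, no lookup table -->
    <field>headline</field>
    <!-- lookup table for fast retrieval of the value -->
    <forward>title</forward>
    <!-- lookup table for finding documents by value -->
    <backward>url</backward>
  </metadata>
  <!-- a plain text field indexed as data -->
  <field>
    <name>title</name>
  </field>
  <!-- a numeric field holding a formatted date (hypothetical field name) -->
  <field>
    <name>date</name>
    <numeric>true</numeric>
    <parserName>DateFieldAnnotator</parserName>
  </field>
</parameters>
```

Since only the first field given on the command line is indexed, a parameter file is the practical choice whenever more than one field is needed.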

Example Parameter File

The example parameter file below will create (or add to) an index at /home/lemur/testindex. The indexer will use a soft limit of 1GB of RAM before flushing its internal indexing buffers to disk. The source data for the example comes from two different corpora, one at /home/lemur/testdata/firstCorpus and the other at /home/lemur/testdata/secondCorpus. Note that the classes of the two corpora are different. The parameter file also specifies that stemming is to be performed using the Krovetz method, and that one field (the HTML paragraph tag "p") should be made available for searching.

  <parameters>
    <index>/home/lemur/testindex</index>
    <memory>1G</memory>
    <corpus>
      <path>/home/lemur/testdata/firstCorpus</path>
      <class>trectext</class>
    </corpus>
    <corpus>
      <path>/home/lemur/testdata/secondCorpus</path>
      <class>trecweb</class>
    </corpus>
    <stemmer><name>krovetz</name></stemmer>
    <field>
      <name>p</name>
    </field>
  </parameters>