The Lemur Project Wiki

Search engine and data mining applications and ClueWeb datasets.

Brought to you by: cammiemw, david_fisher, gregorybrooks, jamiecallan, sm-harding

Indexer File Formats

It is our hope that for most indexing needs (especially for research purposes), IndriBuildIndex should be sufficient. If you think IndriBuildIndex is missing a critical feature, please let us know. If you want something easier to use than the IndriBuildIndex tool, consider using the Java interface.

IndriBuildIndex understands the following file types:
* html (HTML formatted data, one document per file)
* xml (XML formatted data, one document per file - same as html, but without link processing)
* trecweb (TREC web collections, such as WT10G or GOV2, with many documents per file)
* trectext (TREC newswire collections, such as AP89, with many documents per file)
* warc (WARC (Web ARChive) format, such as that output by the Heritrix web crawler)
* warcchar (WARC (Web ARChive) format, as above, but tokenized into individual characters, which enables indexing of unsegmented text)
* mbox (Unix mailbox files)
* doc (Microsoft Word documents - Windows only, requires Microsoft Office)
* ppt (Microsoft PowerPoint documents - Windows only, requires Microsoft Office)
* pdf (Adobe PDF)
* txt (Text documents)

If you don't specify a corpus type (using the corpus.class parameter), Indri chooses how to index each file based on its extension, and skips any file whose extension it doesn't recognize. If you do include a corpus.class parameter, Indri assumes all files in the directory are of that type.
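As a rough illustration of setting the corpus type explicitly, the sketch below generates a small parameter file from Python. Only the corpus.class parameter is named on this page; the `parameters`, `index`, and `corpus.path` element names follow the usual IndriBuildIndex parameter layout and should be treated as assumptions to check against the IndriBuildIndex documentation. The function name and paths are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_parameters(index_path, corpus_path, corpus_class=None):
    """Return an IndriBuildIndex-style parameter file as an XML string.

    Assumption: the element names <parameters>, <index>, and <corpus><path>
    are not taken from this page; only corpus.class is mentioned here.
    """
    root = ET.Element("parameters")
    ET.SubElement(root, "index").text = index_path
    corpus = ET.SubElement(root, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    # Leaving out <class> lets the indexer fall back to file extensions.
    if corpus_class is not None:
        ET.SubElement(corpus, "class").text = corpus_class
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(build_parameters("/tmp/my_index", "/tmp/my_corpus", "trectext"))
```

Passing `corpus_class=None` omits the class element, which corresponds to the extension-based behavior described above.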

Many tasks that users want to perform at index time are probably best handled by pre-processing scripts or programs that add tags to your corpus before Indri indexes it.

There are at least two broad categories of tasks where IndriBuildIndex won't work for you:

* You want to index structured documents that don't look anything like SGML/XML/HTML documents, or
* You want to index documents within another application (like a desktop search tool).

In the first case, you'll need to write your own parser. Make a parser that can output a ParsedDocument structure, then call IndexEnvironment::addParsedDocument() to add your document to the index. In the second case, you can use the IndexEnvironment::addDocument() or IndexEnvironment::addString() calls, and let Indri do the parsing for you.

What does a trectext file look like?

A trectext file contains one or more documents, separated by <DOC> tags. Each document has a unique document number, specified by the <DOCNO> tag, which comes right after the opening <DOC> tag. The text of the document is contained within <TEXT> tags. Here is an example document:

<DOC>
<DOCNO> AP890101-0005 </DOCNO> 
<TEXT>
The Associated Press reported erroneously on
Dec. 29 that Sen. James Sasser, D-Tenn., wrote a letter to the
chairman of the Federal Home Loan Bank Board, M. Danny Wall, that
questioned the bailouts of insolvent savings and loan associations.
The letter was written by Sen. Timothy Wirth, D-Colo.
</TEXT>
</DOC>
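
A pre-processing script can produce this layout from plain text. The following is a minimal sketch, assuming each document is available as a (document number, text) pair; the function name and the sample identifiers are hypothetical, and the sketch does not escape text that itself contains tags such as </TEXT> or </DOC>.

```python
def to_trectext(docs):
    """Render (docno, text) pairs as a single trectext-formatted string."""
    parts = []
    for docno, text in docs:
        parts.append(
            "<DOC>\n"
            f"<DOCNO> {docno} </DOCNO>\n"
            "<TEXT>\n"
            f"{text.strip()}\n"
            "</TEXT>\n"
            "</DOC>\n"
        )
    return "".join(parts)


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = [
        ("EX-0001", "First example document."),
        ("EX-0002", "Second example document."),
    ]
    print(to_trectext(sample), end="")
```

Writing the returned string to a file and indexing it with the trectext class should give one indexed document per pair, provided the document numbers are unique.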

What does a trecweb file look like?

A trecweb file is similar to a trectext file, except for the additional DOCHDR section, and the missing TEXT tags.

A trecweb file contains one or more documents, separated by <DOC> tags. Each document has a unique document number, specified by the <DOCNO> tag, which comes right after the opening <DOC> tag. After a few optional tags, the <DOCHDR> section contains information about the HTTP request, including the URL and the response headers. Indri uses the <DOCHDR> section to extract the URL. The HTML text of the document immediately follows the <DOCHDR> section. The </DOC> tag signifies the end of the document.

<DOC> 
 <DOCNO>WTX001-B01-10</DOCNO> 
 <DOCOLDNO>IA001-000000-B008-97</DOCOLDNO>
 <DOCHDR> 
 http://sd48.mountain-inter.net:80/hss/teachers/Prothero.html 204.244.59.33 19970101013145 text/html 440 
 HTTP/1.0 200 OK
 Date: Wed, 01 Jan 1997 01:21:13 GMT 
 Server: Apache/1.0.3 
 Content-type: text/html 
 Content-length: 270 
 Last-modified: Mon, 25 Nov 1996 05:31:24 GMT 
 </DOCHDR>
 <HTML>
 <BODY>
 <a href="teachers.html">Back to Teachers' Home Page</a> 
 </BODY>
 </HTML> 
 </DOC>
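
To make the layout concrete, here is a small sketch that splits trecweb data into records and pulls out the document number, URL, and HTML. It is not Indri's parser. It assumes the URL is the first whitespace-delimited token on the first line of the <DOCHDR> section, as in the example above, and that the HTML does not itself contain a literal </DOC> tag. The function name is hypothetical.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
# Group 1 is the header block, group 2 is everything after </DOCHDR>.
DOCHDR_RE = re.compile(r"<DOCHDR>\s*(.*?)\s*</DOCHDR>\s*(.*)", re.DOTALL)


def parse_trecweb(data):
    """Yield a dict with docno, url, and html for each <DOC> record."""
    for match in DOC_RE.finditer(data):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        header = DOCHDR_RE.search(body)
        if docno is None or header is None:
            continue  # skip records that lack a DOCNO or DOCHDR section
        # Assumption: the URL is the first token of the first header line.
        first_line = header.group(1).split("\n", 1)[0]
        tokens = first_line.split()
        url = tokens[0] if tokens else ""
        yield {
            "docno": docno.group(1),
            "url": url,
            "html": header.group(2).strip(),
        }
```

Calling `list(parse_trecweb(open(path).read()))` on a file laid out like the example would give one dictionary per document; for large collections a streaming reader would be a better fit than reading the whole file at once.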
