Indri

David Fisher

About Indri

Indri is a text search engine developed at UMass. It is a part of the Lemur project.

From an academic perspective, Indri is interesting because it combines inference networks with language modeling. The query language, which is reminiscent of the Inquery query language, allows researchers to experiment with proximity, document structure, text passages, and other document features without writing code. Like other academic engines, Indri can parse TREC newswire and web collections, and it can return results in the TREC standard format.
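As a rough illustration of the query language, the sketch below assembles a few query strings in Java. The operators shown (#combine, #N ordered windows, #uwN unordered windows, and term.field restrictions) are part of the Indri query language; the helper class and its method names are hypothetical and exist only for this example, and the field restriction assumes that a title field was indexed.

```java
// Sketch only: IndriQuerySketch and its methods are hypothetical helpers,
// not part of Indri. They just build query strings.
final class IndriQuerySketch {
    // #N(...) is an ordered window; #1(...) matches an exact phrase.
    static String orderedWindow(int n, String... terms) {
        return "#" + n + "(" + String.join(" ", terms) + ")";
    }

    // #uwN(...) is an unordered window: the terms may appear in any
    // order within a window of N words.
    static String unorderedWindow(int n, String... terms) {
        return "#uw" + n + "(" + String.join(" ", terms) + ")";
    }

    // term.field restricts a term to occurrences inside an indexed field.
    static String inField(String term, String field) {
        return term + "." + field;
    }

    // #combine(...) merges the evidence from its sub-queries.
    static String combine(String... parts) {
        return "#combine(" + String.join(" ", parts) + ")";
    }

    public static void main(String[] args) {
        String query = combine(
                orderedWindow(1, "white", "house"),
                unorderedWindow(8, "president", "speech"),
                inField("election", "title"));
        // prints: #combine(#1(white house) #uw8(president speech) election.title)
        System.out.println(query);
    }
}
```

A string built this way can be passed to IndriRunQuery or to the QueryEnvironment class like any hand-written query.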

From an industrial perspective, Indri is interesting because it is efficient, supported, and easy to integrate. Indri is freely available from UMass under a flexible BSD-inspired license. Indri includes an API that is accessible from C++, Java, C# and PHP. Indri can also be distributed across a cluster of nodes for high-speed query performance. Version 2.0 adds true multithreaded operation, so documents can be added, queried and deleted concurrently.

About the Indri Applications

The IndriBuildIndex application can build Indri repositories from TREC formatted documents, HTML documents, text documents, and PDF files. Additionally, on Windows it can index Word and PowerPoint documents. IndriBuildIndex understands tags in HTML/XML documents, and it can be instructed to index them as well.

The IndriRunQuery application evaluates queries against one or more Indri repositories, and returns the results in a ranked list of documents. IndriRunQuery can be instructed to print the document text as well, or the text of passages if the query is a passage retrieval query.
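For readers unfamiliar with the TREC standard result format, each result is one line with six whitespace-separated columns: query ID, the literal Q0, document ID, rank, score, and a run tag. The sketch below formats one such line in Java. The class name, the document ID, the score, and the run tag are made-up values for illustration, not output from Indri.

```java
// Sketch only: TrecRunSketch is a hypothetical helper, not part of Indri.
final class TrecRunSketch {
    // Formats one ranked result as a TREC run line:
    //   <queryId> Q0 <docno> <rank> <score> <runTag>
    static String trecLine(String queryId, String docno, int rank,
                           double score, String runTag) {
        // Locale.ROOT keeps the decimal separator a '.' on every system.
        return String.format(java.util.Locale.ROOT, "%s Q0 %s %d %.4f %s",
                queryId, docno, rank, score, runTag);
    }

    public static void main(String[] args) {
        // All values below are illustrative placeholders.
        // prints: 51 Q0 AP890101-0001 1 -4.8321 indri
        System.out.println(trecLine("51", "AP890101-0001", 1, -4.8321, "indri"));
    }
}
```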

The IndriDaemon application is a repository server. It waits for connections from IndriRunQuery (or from other applications using the QueryEnvironment interface) and processes queries from network requests. One copy of IndriRunQuery can connect to many IndriDaemon instances at once, making retrieval using a cluster of machines possible.

Using the Indri API

Indri provides the QueryEnvironment and IndexEnvironment classes, which can be used from C++, Java, C# or PHP (although indexing is not supported from PHP). The IndriBuildIndex and IndriRunQuery applications use these classes exclusively. Please keep in mind that we reserve the right to change any classes within Indri that are not in the indri::api namespace. If you write your code to use only indri::api classes, we will do our best to make sure they still work in future versions of Indri.

IndexEnvironment understands many different file types. You can also define your own file type, as long as it is XML-like, and tell IndexEnvironment how to index it. Then, using the addFile method, IndexEnvironment can index your document(s). If you want to do more complex processing on your data, or if your data is arriving in real time, you may parse your document into a ParsedDocument structure. The IndexEnvironment object can index these structures directly.

QueryEnvironment allows you to run queries and retrieve a ranked list of results. You can use runAnnotatedQuery to retrieve match information (annotations), which is useful for highlighting matched words in documents. By using the addIndex method with an instance of IndexEnvironment, you can evaluate queries on an index that is currently being built. The addServer method allows you to connect to IndriDaemon processes for distributed retrieval.

How do I use the Indri API from Java?

First, you need to build Indri including the Java wrappers. On Unix, you do this by adding the --enable-java option when running the configure script. The script should find your Java installation automatically, but if it doesn't, you can show it where to find Java by using the --with-javahome parameter. If you are using Windows, use the swig project file from Visual Studio to build the Java API. You may need to change the include path on the project to point to your Java installation.

Once that's built, indri.jar and liblemur_jni.so should be in your lemur/swig/obj/java directory. If you are using Mac OS X, liblemur_jni.so will be called liblemur_jni.jnilib. The indri.jar file contains all of the Java support files for Indri, while liblemur_jni.so contains the Indri C++ code.

If you run an application that uses the indri.jar file, it will attempt to load the liblemur_jni.so file automatically. For this to work, you need to set the java.library.path system property correctly. You can do this on the java command line:

java -cp indri.jar -Djava.library.path=lemur/swig/obj/java MyIndriApplication
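If the native library fails to load, it can help to confirm that the directory holding it really is on java.library.path. The following is a minimal diagnostic sketch, not part of Indri: the class and method names are hypothetical, and the library name lemur_jni is an assumption inferred from the file name liblemur_jni.so.

```java
// Sketch only: LibraryPathCheck is a hypothetical diagnostic, not part of Indri.
final class LibraryPathCheck {
    // Returns the first directory in pathList that contains fileName,
    // or null if no directory on the list contains it.
    static String findInPath(String pathList, String fileName) {
        if (pathList == null) {
            return null;
        }
        for (String dir : pathList.split(java.io.File.pathSeparator)) {
            if (!dir.isEmpty() && new java.io.File(dir, fileName).isFile()) {
                return dir;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumption: the library is loaded under the name "lemur_jni".
        // mapLibraryName adds the platform's prefix and suffix; on Mac OS X
        // the suffix chosen by the JVM may differ from .jnilib.
        String libFile = System.mapLibraryName("lemur_jni");
        String dir = findInPath(System.getProperty("java.library.path"), libFile);
        System.out.println(dir == null
                ? libFile + " not found on java.library.path"
                : "found " + libFile + " in " + dir);
    }
}
```

Running it with the same -Djava.library.path value as your application shows whether the JVM would be able to see the library file.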

For more information on the structure of Indri indexes, see: [Indri Repository Structure].

For information on building queries with the Indri Query Language, see: [The Indri Query Language].


Related

Wiki: Home
Wiki: Indri Repository Structure
Wiki: Overview
Wiki: The Indri Query Language