Most people are familiar with in-line annotations, or in-line field definitions. These can be found in tagged text such as HTML or XML. For example, if you have the following HTML snippet:
```html
<h1>The Lemur Toolkit</h1>
<h2>for Language Modeling and Information Retrieval</h2>
<p>Language modeling has recently emerged as an attractive new framework
for text information retrieval, leveraging work on language modeling from
other areas such as speech recognition and statistical natural language
processing.</p>
```
When this document is parsed and readied for indexing, the text within the HTML markup tags is indexed, but we can also index the markup tags themselves. For instance, in HTML the <h1> tag is typically used for title text. If we tell the indexer to mark where any <h1> fields exist, we can then perform queries against that field.
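For example, assuming an index where the text inside <h1> has been indexed as a field named h1, an Indri query could restrict a search term to that field:

```
#combine( lemur.h1 )
```

This would match documents whose h1 field contains the term lemur.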
This is definitely useful, but what if you have a document with no markup in it, and you wish to run it through, say, a named-entity tagger or a part-of-speech tagger? In that case, you can certainly have the tagger edit and mark up the original document, and then index that. Alternatively, you can use an "offset annotation" file to tell the indexer where the fields would exist if they were in the source text.
An offset annotation file contains annotations to be used with a document, but the annotations are not in-lined in the document itself. That is to say, the original document does not have to be modified; instead, you create an offset annotation file that tells the indexer what tag and attribute annotations to add to a document while indexing.
Creating an offset annotation file basically consists of two steps. First, the appropriate annotation tags must be generated for the source text, and second, a process must align those tags with the byte offsets of the original text.
''Note: the following code on this page has not been thoroughly tested. It is only intended to give the reader an example of how processing offset annotations might happen. Use at your own risk.''
For this example scenario, we will be using the Monty Tagger, a simple part-of-speech tagger.
A simple Java wrapper around the Monty Tagger might look like this:
```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class MontyTaggerWrapper {
    public static void main(String[] args) {
        // create a new tagger
        // (JMontyTagger is provided by the Monty Tagger's Java API and
        //  must be on the classpath)
        JMontyTagger mt = new JMontyTagger();

        String inLine;
        String document = "";

        try {
            // assumes a plain text file
            // args[0] is the filename of the file to tag
            BufferedReader in = new BufferedReader(new FileReader(args[0]));

            // read in the file, appending the lines to the overall document
            while ((inLine = in.readLine()) != null) {
                document += inLine + " ";
            }
            in.close();

            // tag it
            String taggedDocument = mt.Tag(document);

            // print the tagged document to stdout
            System.out.println(taggedDocument);
        } catch (java.io.FileNotFoundException e) {
            System.err.println("!! Cannot find file: " + args[0]);
        } catch (IOException e) {
            System.err.println("!! I/O Error reading: " + args[0]);
        }
    }
}
```
Running a plain-text document through this wrapper will print the tagged tokens to stdout. For example, suppose we had a file named testfile.txt that contained the following text:
```
Lemur is a toolkit designed to facilitate research in language modeling and
information retrieval.
```
The output of running "java MontyTaggerWrapper testfile.txt" may look like:
```
Lemur/NNP is/VBZ a/DT toolkit/NN designed/VBN to/TO facilitate/VB research/NN in/IN language/NN modeling/NN and/CC information/NN retrieval/NN ./.
```
If you are working with TREC text, you will want to strip the surrounding TREC markup tags (<DOC>, <DOCNO> and <TEXT>) from the text to be tagged.
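A minimal sketch of one way to do this is shown below; stripTrecTags.pl is a hypothetical helper (not part of Lemur), and it assumes one TREC document per input file:

```perl
#!/usr/bin/perl
# stripTrecTags.pl -- remove the <DOC>, <DOCNO> and <TEXT> wrappers from a
# TREC document so that only the document text is sent to the tagger.
# usage: stripTrecTags.pl <trec_doc_file>

$trecText = "";
open(TRECIN, $ARGV[0]) || die("Cannot open TREC document.\n");
while (<TRECIN>) {
    $trecText .= $_;
}
close(TRECIN);

# drop the <DOCNO> element (including its contents) and the wrapper tags
$trecText =~ s/<DOCNO>.*?<\/DOCNO>/ /gs;
$trecText =~ s/<\/?(DOC|TEXT)>/ /g;

print $trecText;
```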
Once the data has been tagged, it needs to be lined up with the original byte offsets of the text and output to a properly formatted offset annotation file.
The structure of an offset annotation file consists of 9 tab-delimited columns. The columns, in order, are:

1. the document number (DOCNO) of the document being annotated
2. the annotation type, either TAG or ATTRIBUTE
3. a unique identifier for the annotation
4. the name of the tag or attribute (for a TAG, this is the field name that will be indexed)
5. the starting byte offset of the annotation in the original text
6. the length of the annotated text, in bytes
7. a value associated with the annotation (for an ATTRIBUTE, its value; the script below simply reuses the annotation identifier for each TAG)
8. the identifier of the parent annotation, or 0 if there is none
9. the annotated text itself (useful mainly for debugging)
To align the original text with the part-of-speech tagged text and create an offset annotation file, we can use a Perl script much like the following:
```perl
#!/usr/bin/perl
# usage: lineupAnnotations.pl <docno> <orig_text> <pos_tagged_text>

$thisDocNo = $ARGV[0];
$inputFilename = $ARGV[1];
$posTagFilename = $ARGV[2];

# load the POS-tagged text
$posTaggedText = "";
open(POSIN, $posTagFilename) || die("Cannot open POS tagged file.\n");
while (<POSIN>) {
    $posTaggedText .= $_;
}
close(POSIN);

# read in the original text
$origDocText = "";
open(ORIGDOC, $inputFilename) || die("Cannot open original document.\n");
while (<ORIGDOC>) {
    $origDocText .= $_;
}
close(ORIGDOC);

# split the tagged text on whitespace into word/TAG tokens
@taggedTokens = split(/\s+/, $posTaggedText);

# loop through the tokens and print our annotations to stdout
$currentOffset = 0;
$currentTagID = 1;
foreach $taggedToken (@taggedTokens) {
    # split the token at the "/" into the word and its part-of-speech tag
    my ($token, $tag) = $taggedToken =~ m/^(.*)\/(.*)$/;
    next unless (defined($token) && length($token) > 0);

    # lowercase the tag; skip punctuation-only tags, which are not
    # valid Indri field names
    $tag = lc($tag);
    next unless ($tag =~ /^[a-z]+$/);

    # find where this token starts in the original text
    my $startOffset = index($origDocText, $token, $currentOffset);
    next if ($startOffset < 0);

    # get the length of this token
    my $tagLen = length($token);

    # print the offset annotation information (9 tab-delimited columns)
    print "$thisDocNo\tTAG\t$currentTagID\t$tag\t$startOffset\t$tagLen\t$currentTagID\t0\t$token\n";

    # continue the search after this token
    $currentOffset = $startOffset + $tagLen;
    $currentTagID++;
}
```
If the above Perl script were called with a DOCNO of "01", the original text file, and the output from the tagger, it would produce output like the following:
```
01  TAG  1   nnp  0   5   1   0  Lemur
01  TAG  2   vbz  7   2   2   0  is
01  TAG  3   dt   10  1   3   0  a
01  TAG  4   nn   12  7   4   0  toolkit
01  TAG  5   vbn  20  8   5   0  designed
01  TAG  6   to   29  2   6   0  to
01  TAG  7   vb   32  10  7   0  facilitate
01  TAG  8   nn   43  8   8   0  research
01  TAG  9   in   52  2   9   0  in
01  TAG  10  nn   55  8   10  0  language
01  TAG  11  nn   64  8   11  0  modeling
01  TAG  12  cc   77  3   12  0  and
01  TAG  13  nn   81  11  13  0  information
01  TAG  14  nn   93  9   14  0  retrieval
```
Looking at the first line in the data above, it represents an offset annotation with the following attributes:

- document number: 01
- annotation type: TAG
- annotation identifier: 1
- tag (field) name: nnp
- starting byte offset: 0
- length: 5 bytes
- value: 1
- parent: 0 (no parent annotation; the field sits directly under the <DOC> tag)
- annotated text: Lemur

You can then use a similar methodology to process your whole corpus, continuously appending the stdout output to your final offset annotations file.
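A small driver script can automate this. The sketch below is hypothetical and untested (tagCorpus.pl is not part of Lemur); it assumes one plain-text document per file and uses each filename as the DOCNO:

```perl
#!/usr/bin/perl
# tagCorpus.pl -- tag every document in a directory and append the aligned
# offset annotations to a single annotations file.
# usage: tagCorpus.pl <corpus_dir> <annotations_file>

$corpusDir = $ARGV[0];
$annotationsFile = $ARGV[1];

opendir(CORPUS, $corpusDir) || die("Cannot open corpus directory.\n");
foreach $docFile (sort(readdir(CORPUS))) {
    next if ($docFile =~ /^\./);    # skip . and ..
    $origPath = "$corpusDir/$docFile";

    # 1. part-of-speech tag the document with the wrapper from above
    system("java MontyTaggerWrapper $origPath > /tmp/$docFile.tagged");

    # 2. align the tags with the original byte offsets, appending the
    #    annotations to the final offset annotations file
    system("perl lineupAnnotations.pl $docFile $origPath /tmp/$docFile.tagged >> $annotationsFile");
}
closedir(CORPUS);
```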
Once you have your offset annotations file created, indexing your corpus with the annotations is easy.
IndriBuildIndex accepts an annotations parameter within the corpus tag to specify a file containing offset annotations for the documents in a collection. It is specified as:
```
<corpus>
  <annotations>/path/to/file</annotations>
</corpus>
```
in the parameter file. This parameter may name either a single annotations file or a directory containing a separate annotations file for each input file in the corpus path entry. For numeric fields given in offset annotations, the field parameter for the given field needs to specify a different parserName parameter, e.g.:
```
<parserName>OffsetAnnotationAnnotator</parserName>
```
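As an illustration, a hypothetical numeric field named year supplied through offset annotations might be declared along these lines (the field name is only an assumption for the example; the numeric and parserName elements are the parts that matter):

```
<field>
  <name>year</name>
  <numeric>true</numeric>
  <parserName>OffsetAnnotationAnnotator</parserName>
</field>
```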
For your offset annotation fields to be searchable, you must provide a <field> entry with the name of your annotation tag in the parameter file. This tells the indexer to include the annotation tags as indexable fields.
Using our offset annotation example from above, we would want to add the following field definitions to our indexing parameter file:
```
<field><name>nnp</name></field>
<field><name>vbz</name></field>
<field><name>dt</name></field>
<field><name>nn</name></field>
<field><name>vbn</name></field>
<field><name>to</name></field>
<field><name>vb</name></field>
<field><name>in</name></field>
<field><name>cc</name></field>
```
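Putting the pieces together, a minimal IndriBuildIndex parameter file for this example might look like the following sketch (the index and corpus paths are placeholders, and a TREC-text corpus is assumed):

```
<parameters>
  <index>/path/to/index</index>
  <corpus>
    <path>/path/to/corpus</path>
    <class>trectext</class>
    <annotations>/path/to/annotations.offsets</annotations>
  </corpus>
  <field><name>nnp</name></field>
  <field><name>vbz</name></field>
  <field><name>dt</name></field>
  <field><name>nn</name></field>
  <field><name>vbn</name></field>
  <field><name>to</name></field>
  <field><name>vb</name></field>
  <field><name>in</name></field>
  <field><name>cc</name></field>
</parameters>
```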
If you have included your offset annotation fields in your indexing parameters, then retrieval with offset annotations works exactly the same as it would with any field type using the Indri query language.
Using our sample text, if we wanted to search for all documents that contain the word "Lemur" as a proper noun (nnp) rather than as a common noun (nn), we could issue the following query:
```
#combine( lemur.nnp )
```
We would expect our search results to include our sample text. We would not expect to see results from documents in our corpus that merely mention lemurs, the animals native to Madagascar (unless, of course, our part-of-speech tagger tagged the word "lemur" in one of those documents as a proper noun).
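Field restrictions can be combined with other terms and fields just like any other query component. For example, a query against our sample index (again, purely an illustration) that asks for documents where lemur appears as a proper noun and modeling appears as a common noun:

```
#combine( lemur.nnp modeling.nn )
```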