Most people are familiar with in-line annotations, or in-line field definitions. These can be found in tagged text such as HTML or XML. For example, if you have the following HTML snippet:
```html
<h1>The Lemur Toolkit</h1>
<h2>for Language Modeling and Information Retrieval</h2>
<p>Language modeling has recently emerged as an attractive new framework
for text information retrieval, leveraging work on language modeling from
other areas such as speech recognition and statistical natural language
processing.</p>
```
When this document is parsed and readied for indexing, the text within the HTML markup tags is indexed, but we can also index the markup tags themselves. For instance, in HTML the <h1> tag is typically used for title text. If we tell the indexer to mark where any <h1> fields exist, we can then perform queries against that field.
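For example, assuming an index where the text inside <h1> has been indexed as a field named h1, an Indri query could restrict a search term to that field:

```
#combine( lemur.h1 )
```

This would match documents whose h1 field contains the term lemur.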
This is definitely useful, but what if you have a document with no markup in it, and you wish to run it through, say, a named-entity tagger or a part-of-speech tagger? In that case, you can certainly have the tagger edit and mark up the original document, and then index that. Alternatively, you can use an "offset annotation" file to tell the indexer where the fields would exist if they were in the source text.
An offset annotation file contains annotations to be used with a document, but the annotations are not in-lined in the document itself. That is to say, the original document does not have to be modified; instead, you create an offset annotation file that tells the indexer what tag and attribute annotations to add to a document while indexing.
Creating an offset annotation file basically consists of two steps. First, the appropriate annotation tags must be generated for the source text, and second, a process must align those tags with the byte offsets of the original text.
''Note: the following code on this page has not been thoroughly tested. It is only intended to give the reader an example of how processing offset annotations might happen. Use at your own risk.''
For this example scenario, we will be using the Monty Tagger, a simple part-of-speech tagger.
A simple Java wrapper around the Monty Tagger might look like this:
```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class MontyTaggerWrapper {
    public static void main(String[] args) {
        // create a new tagger
        // (JMontyTagger is provided by the Monty Tagger's Java API and
        //  must be on the classpath)
        JMontyTagger mt = new JMontyTagger();

        String inLine;
        String document = "";

        try {
            // assumes a plain text file
            // args[0] is the filename of the file to tag
            BufferedReader in = new BufferedReader(new FileReader(args[0]));

            // read in the file, appending the lines to the overall document
            while ((inLine = in.readLine()) != null) {
                document += inLine + " ";
            }
            in.close();

            // tag it
            String taggedDocument = mt.Tag(document);

            // print the tagged document to stdout
            System.out.println(taggedDocument);
        } catch (java.io.FileNotFoundException e) {
            System.err.println("!! Cannot find file: " + args[0]);
        } catch (IOException e) {
            System.err.println("!! I/O Error reading: " + args[0]);
        }
    }
}
```
Running a plain-text document through this wrapper will print the tagged tokens to stdout. For example, suppose we had a file named testfile.txt that contained the following text:
```
Lemur is a toolkit designed to facilitate research in language modeling and
information retrieval.
```
The output of running "java MontyTaggerWrapper testfile.txt" may look like:
```
Lemur/NNP is/VBZ a/DT toolkit/NN designed/VBN to/TO facilitate/VB research/NN in/IN language/NN modeling/NN and/CC information/NN retrieval/NN ./.
```
If you are working with TREC text, you will want to strip the surrounding TREC markup tags (<DOC>, <DOCNO> and <TEXT>) from the text to be tagged.
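A minimal sketch of one way to do this is shown below; stripTrecTags.pl is a hypothetical helper (not part of Lemur), and it assumes one TREC document per input file:

```perl
#!/usr/bin/perl
# stripTrecTags.pl -- remove the <DOC>, <DOCNO> and <TEXT> wrappers from a
# TREC document so that only the document text is sent to the tagger.
# usage: stripTrecTags.pl <trec_doc_file>

$trecText = "";
open(TRECIN, $ARGV[0]) || die("Cannot open TREC document.\n");
while (<TRECIN>) {
    $trecText .= $_;
}
close(TRECIN);

# drop the <DOCNO> element (including its contents) and the wrapper tags
$trecText =~ s/<DOCNO>.*?<\/DOCNO>/ /gs;
$trecText =~ s/<\/?(DOC|TEXT)>/ /g;

print $trecText;
```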
Once the data has been tagged, it needs to be lined up with the original byte offsets of the text and output to a properly formatted offset annotation file.
The structure of an offset annotation file consists of 9 tab-delimited columns. The columns, in order, are:

1. the document number (DOCNO) of the document being annotated
2. the annotation type, either TAG or ATTRIBUTE
3. a unique identifier for the annotation
4. the name of the tag or attribute (for a TAG, this is the field name that will be indexed)
5. the starting byte offset of the annotation in the original text
6. the length of the annotated text, in bytes
7. a value associated with the annotation (for an ATTRIBUTE, its value; the script below simply reuses the annotation identifier for each TAG)
8. the identifier of the parent annotation, or 0 if there is none
9. the annotated text itself (useful mainly for debugging)
To align the original text with the part-of-speech tagged text and create an offset annotation file, we can use a Perl script much like the following:
```perl
#!/usr/bin/perl
# usage: lineupAnnotations.pl <docno> <orig_text> <pos_tagged_text>

$thisDocNo = $ARGV[0];
$inputFilename = $ARGV[1];
$posTagFilename = $ARGV[2];

# load the POS-tagged text
$posTaggedText = "";
open(POSIN, $posTagFilename) || die("Cannot open POS tagged file.\n");
while (<POSIN>) {
    $posTaggedText .= $_;
}
close(POSIN);

# read in the original text
$origDocText = "";
open(ORIGDOC, $inputFilename) || die("Cannot open original document.\n");
while (<ORIGDOC>) {
    $origDocText .= $_;
}
close(ORIGDOC);

# split the tagged text on whitespace into word/TAG tokens
@taggedTokens = split(/\s+/, $posTaggedText);

# loop through the tokens and print our annotations to stdout
$currentOffset = 0;
$currentTagID = 1;
foreach $taggedToken (@taggedTokens) {
    # split the token at the "/" into the word and its part-of-speech tag
    my ($token, $tag) = $taggedToken =~ m/^(.*)\/(.*)$/;
    next unless (defined($token) && length($token) > 0);

    # lowercase the tag; skip punctuation-only tags, which are not
    # valid Indri field names
    $tag = lc($tag);
    next unless ($tag =~ /^[a-z]+$/);

    # find where this token starts in the original text
    my $startOffset = index($origDocText, $token, $currentOffset);
    next if ($startOffset < 0);

    # get the length of this token
    my $tagLen = length($token);

    # print the offset annotation information (9 tab-delimited columns)
    print "$thisDocNo\tTAG\t$currentTagID\t$tag\t$startOffset\t$tagLen\t$currentTagID\t0\t$token\n";

    # continue the search after this token
    $currentOffset = $startOffset + $tagLen;
    $currentTagID++;
}
```
If the above Perl script were called with a DOCNO of "01", the original text file, and the output from the tagger, it would produce output like the following:
```
01  TAG  1   nnp  0   5   1   0  Lemur
01  TAG  2   vbz  7   2   2   0  is
01  TAG  3   dt   10  1   3   0  a
01  TAG  4   nn   12  7   4   0  toolkit
01  TAG  5   vbn  20  8   5   0  designed
01  TAG  6   to   29  2   6   0  to
01  TAG  7   vb   32  10  7   0  facilitate
01  TAG  8   nn   43  8   8   0  research
01  TAG  9   in   52  2   9   0  in
01  TAG  10  nn   55  8   10  0  language
01  TAG  11  nn   64  8   11  0  modeling
01  TAG  12  cc   77  3   12  0  and
01  TAG  13  nn   81  11  13  0  information
01  TAG  14  nn   93  9   14  0  retrieval
```
Looking at the first line in the data above, it represents an offset annotation with the following attributes:

- document number: 01
- annotation type: TAG
- annotation identifier: 1
- tag (field) name: nnp
- starting byte offset: 0
- length: 5 bytes
- value: 1
- parent: 0 (no parent annotation; the field sits directly under the <DOC> tag)
- annotated text: Lemur

You can then use a similar methodology to process your whole corpus, continuously appending the stdout output to your final offset annotations file.
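A small driver script can automate this. The sketch below is hypothetical and untested (tagCorpus.pl is not part of Lemur); it assumes one plain-text document per file and uses each filename as the DOCNO:

```perl
#!/usr/bin/perl
# tagCorpus.pl -- tag every document in a directory and append the aligned
# offset annotations to a single annotations file.
# usage: tagCorpus.pl <corpus_dir> <annotations_file>

$corpusDir = $ARGV[0];
$annotationsFile = $ARGV[1];

opendir(CORPUS, $corpusDir) || die("Cannot open corpus directory.\n");
foreach $docFile (sort(readdir(CORPUS))) {
    next if ($docFile =~ /^\./);    # skip . and ..
    $origPath = "$corpusDir/$docFile";

    # 1. part-of-speech tag the document with the wrapper from above
    system("java MontyTaggerWrapper $origPath > /tmp/$docFile.tagged");

    # 2. align the tags with the original byte offsets, appending the
    #    annotations to the final offset annotations file
    system("perl lineupAnnotations.pl $docFile $origPath /tmp/$docFile.tagged >> $annotationsFile");
}
closedir(CORPUS);
```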
Once you have your offset annotations file created, indexing your corpus with the annotations is easy.
IndriBuildIndex accepts an annotations parameter within the corpus tag to specify a file containing offset annotations for the documents in a collection. It is specified as:
```
<corpus>
  <annotations>/path/to/file</annotations>
</corpus>
```
in the parameter file. This parameter may name either a single annotations file or a directory containing a separate annotations file for each input file in the corpus path entry. For numeric fields given in offset annotations, the field parameter for the given field needs to specify a different parserName parameter, e.g.:
```
<parserName>OffsetAnnotationAnnotator</parserName>
```
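As an illustration, a hypothetical numeric field named year supplied through offset annotations might be declared along these lines (the field name is only an assumption for the example; the numeric and parserName elements are the parts that matter):

```
<field>
  <name>year</name>
  <numeric>true</numeric>
  <parserName>OffsetAnnotationAnnotator</parserName>
</field>
```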
For your offset annotation fields to be searchable, you must provide a <field> entry with the name of your annotation tag in the parameter file. This tells the indexer to include the annotation tags as indexable fields.
Using our offset annotation example from above, we would want to add the following field definitions to our indexing parameter file:
```
<field><name>nnp</name></field>
<field><name>vbz</name></field>
<field><name>dt</name></field>
<field><name>nn</name></field>
<field><name>vbn</name></field>
<field><name>to</name></field>
<field><name>vb</name></field>
<field><name>in</name></field>
<field><name>cc</name></field>
```
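Putting the pieces together, a minimal IndriBuildIndex parameter file for this example might look like the following sketch (the index and corpus paths are placeholders, and a TREC-text corpus is assumed):

```
<parameters>
  <index>/path/to/index</index>
  <corpus>
    <path>/path/to/corpus</path>
    <class>trectext</class>
    <annotations>/path/to/annotations.offsets</annotations>
  </corpus>
  <field><name>nnp</name></field>
  <field><name>vbz</name></field>
  <field><name>dt</name></field>
  <field><name>nn</name></field>
  <field><name>vbn</name></field>
  <field><name>to</name></field>
  <field><name>vb</name></field>
  <field><name>in</name></field>
  <field><name>cc</name></field>
</parameters>
```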
If you have included your offset annotation fields in your indexing parameters, then retrieval with offset annotations works exactly the same as it would with any field type using the Indri query language.
Using our sample text, if we wanted to search for all documents that contain the word "Lemur" as a proper noun (nnp) rather than as a common noun (nn), we could issue the following query:
```
#combine( lemur.nnp )
```
We would expect our search results to include our sample text. We would not expect to see results from documents in our corpus that merely mention lemurs, the animals native to Madagascar (unless, of course, our part-of-speech tagger tagged the word "lemur" in one of those documents as a proper noun).
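Field restrictions can be combined with other terms and fields just like any other query component. For example, a query against our sample index (again, purely an illustration) that asks for documents where lemur appears as a proper noun and modeling appears as a common noun:

```
#combine( lemur.nnp modeling.nn )
```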