Creating your own Parser

David Fisher

Overview

The basic steps for parsing a document are:
1. reading in and tokenizing the raw text input
2. filling in a ParsedDocument structure from the input
3. sending the ParsedDocument to the indexer

Getting the Input

The ParsedDocument Structure

The simplest way of getting a parsed document into an index is to have your parser transform the raw data into an indri::api::ParsedDocument object.

ParsedDocument Elements

The basic ParsedDocument structure looks like the following:

  struct ParsedDocument {
    const char* text;
    size_t textLength;

    indri::utility::greedy_vector<char*> terms;
    indri::utility::greedy_vector<indri::parse::TagExtent*> tags;
    indri::utility::greedy_vector<indri::parse::TermExtent> positions;
    indri::utility::greedy_vector<indri::parse::MetadataPair> metadata;
  };
  • text: this is a pointer to a null-terminated string (char array) containing the raw characters of the original source document. The indexer will compress this and store it for retrieval purposes.
  • textLength: this is the size (in characters) of the text in the text array.
  • terms: this is a vector containing the parsed terms of the text, one term per entry, in document order. Your parser decides what constitutes a term and what does not. When terms are indexed, some punctuation will be removed and all letters will be converted to lower-case. Do not stem terms here; let the indexer perform stemming by way of the <stemmer> parameter.
  • tags: the tags are the fields used by the indexer to create field extents within a document. They are stored as a vector of pointers to indri::parse::TagExtent objects, which are also used in [Inline and Offset Annotations].
  • positions: the term positions within the document itself. These are stored as indri::parse::TermExtent elements with one entry for each item in the terms vector.
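  • metadata: text about the document that should be kept but not indexed, such as a document ID or URL. It is stored as a vector of indri::parse::MetadataPair elements.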
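
As an illustration of how the terms and positions vectors line up, here is a minimal sketch of a tokenizer. It does not use the Indri headers: SketchDocument and SketchTermExtent are stand-ins defined only for this example, std::vector replaces indri::utility::greedy_vector, terms are held as std::string rather than char*, and the tokenize helper and its "runs of alphanumeric characters" rule are assumptions. The sketch also assumes that end is the byte offset just past the last character of the term.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for indri::parse::TermExtent (assumption: end is one past the
// last byte of the term).
struct SketchTermExtent {
    int begin;
    int end;
};

// Simplified stand-in for the ParsedDocument structure shown above.
struct SketchDocument {
    const char* text = nullptr;
    std::size_t textLength = 0;
    std::vector<std::string> terms;           // one entry per term, in order
    std::vector<SketchTermExtent> positions;  // one entry per term, same order
};

// Hypothetical helper: treats every run of alphanumeric characters as a term.
// The returned document points into `raw`, so `raw` must outlive it.
SketchDocument tokenize(const std::string& raw) {
    SketchDocument doc;
    doc.text = raw.c_str();
    doc.textLength = raw.size();

    std::size_t i = 0;
    while (i < raw.size()) {
        // Skip characters that are not part of a term.
        while (i < raw.size() &&
               !std::isalnum(static_cast<unsigned char>(raw[i]))) {
            ++i;
        }
        const std::size_t start = i;
        // Consume one term.
        while (i < raw.size() &&
               std::isalnum(static_cast<unsigned char>(raw[i]))) {
            ++i;
        }
        if (i > start) {
            doc.terms.push_back(raw.substr(start, i - start));
            SketchTermExtent extent;
            extent.begin = static_cast<int>(start);
            extent.end = static_cast<int>(i);
            doc.positions.push_back(extent);
        }
    }
    return doc;
}
```

Calling tokenize on a raw string yields one positions entry for every terms entry, which is the invariant the indexer relies on according to the description above.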

TagExtent Elements

The TagExtent object has the following attributes:

  • name: the tag name.
  • begin: the beginning token number (word offset relative to the content).
  • end: the ending token number (word offset relative to the content).
  • number: the optional numeric component of the tag (if it is a numeric-valued tag).
  • parent: the parent of this tag (or NULL if there is none).
  • attributes: any tag attributes. These are much like XML attributes, but they are not indexed and cannot be searched. They can, however, be retrieved programmatically.

For example, consider the following snippet of source text:

  <document>
    This <a href="anotherdoc.html">is some source text</a> from <date>2007</date>.
  </document>

There are three distinct tags:

  • <document> would begin at word offset 0, and end at word offset 6; have no parent and no attributes.
  • <a> would begin at offset 1 and end at offset 4. Its parent would be <document>, and it would have one attribute pair: (href, anotherdoc.html).
  • <date> would begin at offset 6 and end at offset 6. Its parent would be <document> and would have no attributes, but it would have 2007 in the number property.
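
The three tags above could be represented roughly as follows. This is a sketch only: SketchTagExtent is a stand-in defined for this example rather than the real indri::parse::TagExtent, its field types (std::string, std::int64_t, a vector of string pairs) are assumptions, and ExampleTags and fillExampleTags are hypothetical names. The offsets are copied from the list above.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Stand-in for indri::parse::TagExtent with the attributes described above.
struct SketchTagExtent {
    std::string name;
    unsigned int begin = 0;             // first word offset covered by the tag
    unsigned int end = 0;               // last word offset, as in the example
    std::int64_t number = 0;            // numeric value, if the tag has one
    SketchTagExtent* parent = nullptr;  // nullptr when there is no parent
    std::vector<std::pair<std::string, std::string>> attributes;
};

// Holds the three tags so that parent pointers stay valid while it lives.
struct ExampleTags {
    SketchTagExtent document;
    SketchTagExtent anchor;
    SketchTagExtent date;
};

// Hypothetical helper: fills in the tags for the snippet above.
// Filling in place (rather than returning by value) keeps the parent
// pointers aimed at the caller's own objects.
void fillExampleTags(ExampleTags& tags) {
    tags.document.name = "document";
    tags.document.begin = 0;
    tags.document.end = 6;

    tags.anchor.name = "a";
    tags.anchor.begin = 1;
    tags.anchor.end = 4;
    tags.anchor.parent = &tags.document;
    tags.anchor.attributes.emplace_back("href", "anotherdoc.html");

    tags.date.name = "date";
    tags.date.begin = 6;
    tags.date.end = 6;
    tags.date.number = 2007;
    tags.date.parent = &tags.document;
}
```

Because the ParsedDocument tags vector stores pointers, a real parser would likewise need to keep the TagExtent objects alive for as long as the document is in use.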

TermExtent Elements

An indri::parse::TermExtent element has two attributes:

  • begin: the byte offset of the beginning of the term.
  • end: the byte offset of the ending of the term.
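
Given the raw text and a term extent, the original characters of a term can be recovered by slicing the text. The sketch below assumes that end is the offset one past the last byte of the term; the page does not say whether end is inclusive or exclusive, so check this against your Indri version. SketchTermExtent and termText are names made up for this example.

```cpp
#include <cstddef>
#include <string>

// Stand-in for indri::parse::TermExtent.
struct SketchTermExtent {
    int begin;  // byte offset of the first byte of the term
    int end;    // assumed: byte offset one past the last byte of the term
};

// Hypothetical helper: returns the bytes of `text` covered by `extent`,
// or an empty string if the extent does not fit inside the text.
std::string termText(const char* text, std::size_t textLength,
                     const SketchTermExtent& extent) {
    if (extent.begin < 0 || extent.end < extent.begin ||
        static_cast<std::size_t>(extent.end) > textLength) {
        return std::string();
    }
    const std::size_t length =
        static_cast<std::size_t>(extent.end - extent.begin);
    return std::string(text + extent.begin, length);
}
```

Keeping byte offsets rather than copies of the terms lets a caller map any indexed term back to its exact location in the stored source text.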

MetadataPair Elements

Metadata is text about a document that should be kept but not indexed, such as a document ID, a URL, or the crawl date. Each indri::parse::MetadataPair element has three parts:

  • key: the metadata key value (name)
  • value: the actual value of the metadata item
  • valueLength: the length (in bytes) of the metadata item
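
A metadata entry for a string value might be built and looked up as sketched below. SketchMetadataPair is a stand-in for indri::parse::MetadataPair with assumed field types, and makeStringMetadata and findMetadata are hypothetical helpers. Counting the terminating null byte in valueLength is a choice made for this sketch, not something the description above specifies.

```cpp
#include <cstring>
#include <vector>

// Stand-in for indri::parse::MetadataPair (field types are assumptions).
struct SketchMetadataPair {
    const char* key;
    const void* value;
    int valueLength;  // length of the value in bytes
};

// Hypothetical helper: wraps a C string as a metadata pair.
// The pair only stores pointers, so the caller must keep `key` and
// `value` alive for as long as the pair is used.
SketchMetadataPair makeStringMetadata(const char* key, const char* value) {
    SketchMetadataPair pair;
    pair.key = key;
    pair.value = value;
    // Assumption for this sketch: the terminating null byte is counted.
    pair.valueLength = static_cast<int>(std::strlen(value)) + 1;
    return pair;
}

// Hypothetical helper: returns the first pair whose key matches, or nullptr.
const SketchMetadataPair* findMetadata(
        const std::vector<SketchMetadataPair>& metadata, const char* key) {
    for (const SketchMetadataPair& pair : metadata) {
        if (std::strcmp(pair.key, key) == 0) {
            return &pair;
        }
    }
    return nullptr;
}
```

A parser would typically add one such pair per item it wants to keep, for example a document ID and a URL, before handing the document to the indexer.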

Related

Wiki: Home
Wiki: Inline and Offset Annotations
