Re: [larm-dev] Dublin Core Metadata Engine
From: Clemens M. <Cle...@in...> - 2003-06-15 10:04:48
Thanks for these thoughts, Jeff. Let's see how we can fit them into the
current design. Maybe we'll find out that the design has to be changed. So
I'll just go through what I have written and ponder whether it fits
together. OK, let's see. I take the documents in larm-cvs/doc as a basis.
There, under Part III of contents.txt, you see the record processors.

A "record" is something to be indexed, in this case a document that
contains DC metadata to be extracted. At the start of the pipeline, the
content can be anything from binary PDF to HTML. The record processors are
responsible for converting this into a format the indexer understands: a
Lucene document already divided into fields that are marked as stored,
tokenized, and indexed. So I suppose the Dublin Core Metadata Indexing
(DCMI) can be developed as a record processor.

Record processors may form a pipeline like

          +------------ PDFToText --> TextToField ---------+
          !                                                !
  [contentType=pdf]                                        !
          !                                                !
  ------->+--[=html]--- HTMLLinkExtractor -> HTMLToField ->+--> ProcessorX --> ...
          !                                                !
          +--[=xml]--------------------------------------->+

The output of a processor may be a converted version of the original
document plus a set of fields to be indexed. I suppose you have to do
different things for an RDF file than for HTML or other formats, which
means different extractors.

RecordProcessors are passed instances of the following classes:

    import java.net.URI;
    import java.util.ArrayList;
    import java.util.Date;

    class IndexRecord {
        // enum Command
        final static byte CMD_ADD    = (byte)'a';
        final static byte CMD_UPDATE = (byte)'u';   // maybe unnecessary
        final static byte CMD_DELETE = (byte)'d';

        byte      command;          // type: Command
        URI       primaryURI;       // identifier
        ArrayList secondaryURIs;    // an ArrayList<URI>
        MD5Hash   md5Hash;
        Date      indexedDate;
        Date      lastChangedDate;
        float     documentWeight;
        String    MIMEtype;
        ArrayList fields;           // an ArrayList<FieldInfo>
    }

Maybe we should add the original document here as well:

    Object record;

    class FieldInfo {
        // enum MethodTypes (?)
        final static byte MT_INDEX    = 0x01;
        final static byte MT_STORE    = 0x02;
        final static byte MT_TOKENIZE = 0x04;

        // enum FieldType
        final static byte FT_TEXT = (byte)'t';
        final static byte FT_DATE = (byte)'d';

        byte   methods;     // type: MethodTypes
        byte   type;        // type: FieldType
        String fieldName;
        float  weight;
        char[] contents;
    }

Now it is crucial that we define what the input and output of the
different record processors look like, since this is not modelled at the
Java level but forms an important, non-negligible dependency.

> 1. The Metadata Retriever
>
> This retriever can read the Dublin Core metadata from a content element.
> It will support HTML, XML using the Dublin Core schema, and RDF files
> using the Dublin Core schema. It will not be responsible for getting
> the content element or RDF file from its location, but it will extract
> the relevant metadata from the pages. The retriever will be pluggable to
> support additional content formats.

These are DC Extractors put into the branches of the pipeline that cope
with the different file formats. They produce the fields that are stored
in the IndexRecord.fields list.
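To make this more concrete: the processor interface itself could be as
simple as the following sketch. Nothing here is fixed yet; the method name
and the convention of returning null to drop a record are just my
assumptions.

    interface RecordProcessor {
        // Processes one record: may convert the document it carries,
        // append FieldInfo entries to record.fields, or return null
        // to drop the record from the pipe.
        IndexRecord process(IndexRecord record);
    }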
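A DC extractor for the HTML branch might then look roughly like this. It
is untested; the regular expression only stands in for a real HTML parser,
and I assume the tentative "Object record" field from above carries the
page as a String at this point in the pipeline.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch of a DC extractor for the HTML branch: pulls
    // <meta name="DC.xxx" content="..."> tags out of the page and
    // appends a FieldInfo for each of them to record.fields.
    class HTMLDCExtractor implements RecordProcessor {

        // matches e.g. <meta name="DC.Title" content="Some title">
        private static final Pattern DC_META = Pattern.compile(
            "<meta\\s+name=\"DC\\.(\\w+)\"\\s+content=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

        public IndexRecord process(IndexRecord record) {
            Matcher m = DC_META.matcher((String) record.record);
            while (m.find()) {
                FieldInfo f = new FieldInfo();
                f.fieldName = m.group(1).toLowerCase();  // "title", "creator", ...
                f.contents  = m.group(2).toCharArray();
                f.type      = FieldInfo.FT_TEXT;
                f.methods   = (byte) (FieldInfo.MT_INDEX
                                    | FieldInfo.MT_STORE
                                    | FieldInfo.MT_TOKENIZE);
                f.weight    = 1.0f;
                record.fields.add(f);
            }
            return record;
        }
    }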
> 2. The Metadata Engine
>
> The retriever will feed the data to the engine, which is responsible for
> any validation rules that may be configured for the metadata to prevent
> spamming the search engine or inappropriate results. In addition, some
> metadata elements may not be allowed, and they can be removed here.
> Other metadata elements may only be relevant for a certain subset of
> URLs, and that filter may be applied here as well.

This would be a processor applied after the format conversion is done,
which may alter IndexRecords or delete them from the pipe.

> 3. The Metadata Builder
>
> The builder retrieves the metadata from the engine and adds it to the
> Lucene document as a set of fields. The fields on the document will be
> mapped to metadata elements using a configuration, or defaults will be
> used.

This would be the generic Lucene indexer; no need to develop that. Or am I
wrong?

> Title
> Creator
> Subject
> Description
> Publisher
> Contributor
> Date
> Type
> Format
> Identifier
> Source
> Language
> Relation
> Coverage
> Rights

I could imagine extracting some of these fields from the text itself,
using linguistic analysis. A primitive example would be "Subject", which
could be extracted from the HTML title or H1 tags. Language is also a
feature that can be detected by comparing the words in a text against
lexicons of different languages. I think this will become necessary at
some point, since meta tags are only used in very restricted areas (news,
medical information, etc.).

Clemens
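P.S. To make the engine idea a bit more tangible: such a filtering step
could be just another record processor. Here is a rough sketch against the
RecordProcessor interface from above; the two rules in it are invented
examples, real ones would have to come from a configuration.

    import java.util.Iterator;

    // Sketch of a metadata engine step: removes disallowed fields
    // and drops records that look like metadata spam.
    class MetadataFilterProcessor implements RecordProcessor {

        private static final int MAX_FIELD_LENGTH = 10000;  // invented spam heuristic

        public IndexRecord process(IndexRecord record) {
            for (Iterator it = record.fields.iterator(); it.hasNext(); ) {
                FieldInfo f = (FieldInfo) it.next();
                if ("rights".equals(f.fieldName)) {
                    it.remove();     // element not allowed, strip it
                } else if (f.contents.length > MAX_FIELD_LENGTH) {
                    return null;     // looks like spam, drop the whole record
                }
            }
            return record;
        }
    }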
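And a very naive version of the lexicon-based language detection could
look like the sketch below: the language whose word list covers the most
tokens of the text wins. The lexicons would of course be loaded from real
word lists, not hard-coded.

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.Set;

    // Toy sketch of lexicon-based language detection.
    class LanguageGuesser {

        private final Map lexicons = new HashMap();  // language -> Set of common words

        void addLexicon(String language, Set words) {
            lexicons.put(language, words);
        }

        String guess(String text) {
            String[] tokens = text.toLowerCase().split("\\W+");
            String best = null;
            int bestHits = 0;
            for (Iterator it = lexicons.entrySet().iterator(); it.hasNext(); ) {
                Map.Entry e = (Map.Entry) it.next();
                Set lexicon = (Set) e.getValue();
                int hits = 0;
                for (int i = 0; i < tokens.length; i++) {
                    if (lexicon.contains(tokens[i])) hits++;
                }
                if (hits > bestHits) {
                    bestHits = hits;
                    best = (String) e.getKey();
                }
            }
            return best;  // null if no lexicon matched at all
        }
    }

A record processor could call guess() on the extracted text and add the
result as a "language" field whenever the metadata does not provide one.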