[larm-dev] Dublin Core Metadata Engine

Hi,

This would go under record processing.  Let me know what you think about 
the design.

Jeff

Dublin Core Metadata Indexing

The record processor should optionally handle Dublin Core metadata
elements, whether they are embedded in the content being indexed or
supplied in an RDF record that is external to the content.  Because
these metadata elements are standardized, each one can be mapped to its
own field on the Lucene Document.  This support is entirely optional
and configurable.

1.  The Metadata Retriever

This retriever reads the Dublin Core metadata from a content element.
It will support HTML, as well as XML and RDF files that use the Dublin
Core schema.  It will not be responsible for fetching the content
element or RDF file from its location; it only extracts the relevant
metadata once the content has been fetched.  The retriever will be
pluggable to support additional content formats.
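
To make the retriever idea concrete, here is a minimal sketch for the
HTML case only. It is written in Python for brevity and uses just the
standard library; it is not the proposed implementation. It assumes the
common convention of embedding Dublin Core in HTML as
<meta name="DC.xxx" content="..."> tags. The names
DublinCoreHtmlRetriever, RETRIEVERS and retrieve are hypothetical and
only illustrate how a pluggable lookup by content type might work.

```python
from html.parser import HTMLParser


class DublinCoreHtmlRetriever(HTMLParser):
    """Collects <meta name="DC.xxx" content="..."> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        # Element name (lowercased, "dc." prefix removed) -> list of values.
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name.startswith("dc.") and content:
            element = name[len("dc."):]
            self.metadata.setdefault(element, []).append(content)


def retrieve_from_html(html_text):
    """Return the Dublin Core metadata found in already-fetched HTML."""
    parser = DublinCoreHtmlRetriever()
    parser.feed(html_text)
    parser.close()
    return parser.metadata


# Hypothetical plug-in registry: content type -> retriever function.
# XML and RDF retrievers would be registered here in the same way.
RETRIEVERS = {"text/html": retrieve_from_html}


def retrieve(content_type, text):
    """Dispatch to the retriever for this content type, if one is registered."""
    retriever = RETRIEVERS.get(content_type)
    return retriever(text) if retriever else {}
```

Note that the sketch takes the page text as an argument, in line with
the point above that fetching the content is someone else's job.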

2. The Metadata Engine

The retriever will feed the data to the engine, which is responsible for
applying any validation rules configured for the metadata, to prevent
spamming of the search engine or inappropriate results.  In addition,
metadata elements that are not allowed can be removed here.  Other
metadata elements may be relevant only for a certain subset of URLs,
and that filter may be applied here as well.
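
The three kinds of rules described above (validation, disallowed
elements, URL-restricted elements) could look roughly like the sketch
below. This is an illustration under assumptions, not a spec: the
function name apply_rules, its parameters, and the choice of a maximum
value length as the anti-spam check are all made up for the example.
Metadata is assumed to arrive as a dict mapping an element name to a
list of values.

```python
def apply_rules(url, metadata, allowed, max_length, url_prefixes):
    """Filter metadata for one document.

    url          -- URL of the document the metadata belongs to
    metadata     -- dict: element name -> list of string values
    allowed      -- set of element names that may be indexed
    max_length   -- values longer than this are dropped (assumed spam check)
    url_prefixes -- dict: element name -> URL prefix the document must have
                    for that element to be accepted
    """
    accepted = {}
    for element, values in metadata.items():
        # Rule 1: drop elements that are not allowed at all.
        if element not in allowed:
            continue
        # Rule 2: drop elements restricted to URLs this document is not under.
        prefix = url_prefixes.get(element)
        if prefix is not None and not url.startswith(prefix):
            continue
        # Rule 3: drop empty or overlong values.
        kept = [v.strip() for v in values if 0 < len(v.strip()) <= max_length]
        if kept:
            accepted[element] = kept
    return accepted
```

For example, with allowed = {"title", "subject", "publisher"} and
url_prefixes = {"publisher": "http://example.org/press/"}, a page
outside /press/ would keep its title and subject but lose its publisher
element.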

3. The Metadata Builder

The builder retrieves the metadata from the engine and adds it to the
Lucene document as a set of fields.  Metadata elements will be mapped
to document fields according to the configuration; where no mapping is
configured, defaults will be used.
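
A small sketch of the mapping step, again in Python and again only an
illustration. The default of naming a field "dc." plus the element name
is an assumption made for the example, as are the names build_fields
and field_mapping. The result is a plain list of (field name, value)
pairs; in the real builder each pair would become one field added to
the Lucene document.

```python
DEFAULT_FIELD_PREFIX = "dc."  # assumed default, not part of the proposal


def build_fields(metadata, field_mapping=None):
    """Turn metadata into (field name, value) pairs for one document.

    metadata      -- dict: element name -> list of string values
    field_mapping -- optional dict: element name -> configured field name
    """
    field_mapping = field_mapping or {}
    fields = []
    # Sort the element names so the output order is predictable.
    for element in sorted(metadata):
        # Use the configured field name, or fall back to the default.
        field_name = field_mapping.get(element, DEFAULT_FIELD_PREFIX + element)
        for value in metadata[element]:
            fields.append((field_name, value))
    return fields
```

An element with several values (two creators, say) yields several
fields with the same name, one per value.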

4. Dublin Core metadata elements (from 
http://www.dublincore.org/documents/dces/)

Title
Creator
Subject
Description
Publisher
Contributor
Date
Type
Format
Identifier
Source
Language
Relation
Coverage
Rights

5. References
http://www.dublincore.org/
http://www.dublincore.org/documents/dces/