Re: [larm-dev] Dublin Core Metadata Engine
From: Clemens M. <Cle...@in...> - 2003-06-15 10:04:48
Thanks for these thoughts, Jeff. Let's see how we can fit them into the
current design. Maybe we'll find out that the design has to be changed. So
I'll just go through what I have written and ponder whether it fits
together. OK, let's see. I take the documents in larm-cvs/doc as a basis.
There, under Part III of contents.txt, you see the record processors.

A "record" is something to be indexed, in this case a document that
contains DC metadata to be extracted. At the start of the pipeline, the
content can be anything from binary PDF to HTML. The record processors are
responsible for converting this into a format the indexer understands: a
Lucene document already divided into fields that are marked as stored,
tokenized, and indexed. So I suppose the Dublin Core Metadata Indexing
(DCMI) can be developed as a record processor.

Record processors may form a pipeline like

          +------------ PDFToText --> TextToField ---------+
          !                                                !
  [contentType=pdf]                                        !
          !                                                !
  ------->+--[=html]--- HTMLLinkExtractor -> HTMLToField ->+--> ProcessorX --> ...
          !                                                !
          +--[=xml]--------------------------------------->+

The output of a processor may be a converted version of the original
document plus a set of fields to be indexed. I suppose you have to do
different things for an RDF file than for HTML or other formats, which
means different extractors.

RecordProcessors are passed instances of the following classes:

    import java.net.URI;
    import java.util.ArrayList;
    import java.util.Date;

    class IndexRecord {
        // enum Command
        final static byte CMD_ADD    = (byte)'a';
        final static byte CMD_UPDATE = (byte)'u';   // maybe unnecessary
        final static byte CMD_DELETE = (byte)'d';

        byte      command;          // type: Command
        URI       primaryURI;       // identifier
        ArrayList secondaryURIs;    // an ArrayList<URI>
        MD5Hash   md5Hash;
        Date      indexedDate;
        Date      lastChangedDate;
        float     documentWeight;
        String    MIMEtype;
        ArrayList fields;           // an ArrayList<FieldInfo>
    }

Maybe we should add the original document here as well:

    Object record;

    class FieldInfo {
        // enum MethodTypes (?)
        final static byte MT_INDEX    = 0x01;
        final static byte MT_STORE    = 0x02;
        final static byte MT_TOKENIZE = 0x04;

        // enum FieldType
        final static byte FT_TEXT = (byte)'t';
        final static byte FT_DATE = (byte)'d';

        byte   methods;     // type: MethodTypes
        byte   type;        // type: FieldType
        String fieldName;
        float  weight;
        char[] contents;
    }

Now it is crucial that we define what the input and output of the
different record processors look like, since this is not modelled at the
Java level but forms an important, non-negligible dependency.

> 1. The Metadata Retriever
>
> This retriever can read the Dublin Core metadata from a content element.
> It will support HTML, XML using the Dublin Core schema, and RDF files
> using the Dublin Core schema. It will not be responsible for getting
> the content element or RDF file from its location, but it will extract
> the relevant metadata from the pages. The retriever will be pluggable to
> support additional content formats.

These are DC Extractors put into the branches of the pipeline that cope
with the different file formats. They produce the fields that are stored
in the IndexRecord.fields list.
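To make this more concrete: the processor interface itself could be as
simple as the following sketch. Nothing here is fixed yet; the method name
and the convention of returning null to drop a record are just my
assumptions.

    interface RecordProcessor {
        // Processes one record: may convert the document it carries,
        // append FieldInfo entries to record.fields, or return null
        // to drop the record from the pipe.
        IndexRecord process(IndexRecord record);
    }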
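A DC extractor for the HTML branch might then look roughly like this. It
is untested; the regular expression only stands in for a real HTML parser,
and I assume the tentative "Object record" field from above carries the
page as a String at this point in the pipeline.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch of a DC extractor for the HTML branch: pulls
    // <meta name="DC.xxx" content="..."> tags out of the page and
    // appends a FieldInfo for each of them to record.fields.
    class HTMLDCExtractor implements RecordProcessor {

        // matches e.g. <meta name="DC.Title" content="Some title">
        private static final Pattern DC_META = Pattern.compile(
            "<meta\\s+name=\"DC\\.(\\w+)\"\\s+content=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

        public IndexRecord process(IndexRecord record) {
            Matcher m = DC_META.matcher((String) record.record);
            while (m.find()) {
                FieldInfo f = new FieldInfo();
                f.fieldName = m.group(1).toLowerCase();  // "title", "creator", ...
                f.contents  = m.group(2).toCharArray();
                f.type      = FieldInfo.FT_TEXT;
                f.methods   = (byte) (FieldInfo.MT_INDEX
                                    | FieldInfo.MT_STORE
                                    | FieldInfo.MT_TOKENIZE);
                f.weight    = 1.0f;
                record.fields.add(f);
            }
            return record;
        }
    }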
> 2. The Metadata Engine
>
> The retriever will feed the data to the engine, which is responsible for
> any validation rules that may be configured for the metadata to prevent
> spamming the search engine or inappropriate results. In addition, some
> metadata elements may not be allowed, and they can be removed here.
> Other metadata elements may only be relevant for a certain subset of
> URLs, and that filter may be applied here as well.

This would be a processor applied after the format conversion is done,
which may alter IndexRecords or delete them from the pipe.

> 3. The Metadata Builder
>
> The builder retrieves the metadata from the engine and adds it to the
> Lucene document as a set of fields. The fields on the document will be
> mapped to metadata elements using a configuration, or defaults will be
> used.

This would be the generic Lucene indexer; no need to develop that. Or am I
wrong?

> Title
> Creator
> Subject
> Description
> Publisher
> Contributor
> Date
> Type
> Format
> Identifier
> Source
> Language
> Relation
> Coverage
> Rights

I could imagine extracting some of these fields from the text itself,
using linguistic analysis. A primitive example would be "Subject", which
could be extracted from the HTML title or H1 tags. Language is also a
feature that can be detected by comparing the words in a text against
lexicons of different languages. I think this will become necessary at
some point, since meta tags are only used in very restricted areas (news,
medical information, etc.).

Clemens
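P.S. To make the engine idea a bit more tangible: such a filtering step
could be just another record processor. Here is a rough sketch against the
RecordProcessor interface from above; the two rules in it are invented
examples, real ones would have to come from a configuration.

    import java.util.Iterator;

    // Sketch of a metadata engine step: removes disallowed fields
    // and drops records that look like metadata spam.
    class MetadataFilterProcessor implements RecordProcessor {

        private static final int MAX_FIELD_LENGTH = 10000;  // invented spam heuristic

        public IndexRecord process(IndexRecord record) {
            for (Iterator it = record.fields.iterator(); it.hasNext(); ) {
                FieldInfo f = (FieldInfo) it.next();
                if ("rights".equals(f.fieldName)) {
                    it.remove();     // element not allowed, strip it
                } else if (f.contents.length > MAX_FIELD_LENGTH) {
                    return null;     // looks like spam, drop the whole record
                }
            }
            return record;
        }
    }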
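And a very naive version of the lexicon-based language detection could
look like the sketch below: the language whose word list covers the most
tokens of the text wins. The lexicons would of course be loaded from real
word lists, not hard-coded.

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.Set;

    // Toy sketch of lexicon-based language detection.
    class LanguageGuesser {

        private final Map lexicons = new HashMap();  // language -> Set of common words

        void addLexicon(String language, Set words) {
            lexicons.put(language, words);
        }

        String guess(String text) {
            String[] tokens = text.toLowerCase().split("\\W+");
            String best = null;
            int bestHits = 0;
            for (Iterator it = lexicons.entrySet().iterator(); it.hasNext(); ) {
                Map.Entry e = (Map.Entry) it.next();
                Set lexicon = (Set) e.getValue();
                int hits = 0;
                for (int i = 0; i < tokens.length; i++) {
                    if (lexicon.contains(tokens[i])) hits++;
                }
                if (hits > bestHits) {
                    bestHits = hits;
                    best = (String) e.getKey();
                }
            }
            return best;  // null if no lexicon matched at all
        }
    }

A record processor could call guess() on the extracted text and add the
result as a "language" field whenever the metadata does not provide one.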