[larm-dev] Re: [larm-developer] Specs: Indexer
From: otisg <ot...@ur...> - 2003-06-17 05:23:31
Clemens,

Brief and minor comments.

primaryURI is normalized (e.g. host lowercased, port included only if != 80, I suppose, the path part of the URL cleaned up, etc.). Maybe add this, although it's pretty obvious.

> 3. secondaryURIs: Collection: A list of secondary URIs of the document. If
> the URI is ambiguous (e.g. if a document is represented by more than one
> URL) this contains a list of all URIs (found) under which this document is
> accessible.

This can also be null. Obvious, too.

> 4. MD5Hash: MD5Hash: The MD5 hash of the doc. In case of a recrawl this hash
> will be sent to the gatherer to determine whether the document contents have
> changed.

I'd call this a 'fingerprint'. That term is implementation-agnostic.

> 6. lastChangedDate: Date: The time the last change has occurred as

Something seems to be missing here...

> 7. documentWeight: float. It is left to the processing pipeline to

What is this going to be used for? Lucene field boosting?

Comments about the Crawler document follow.

CHECK_FOR_SERVER_RUNNING sounded bogus to me. What is that method for?

Regarding "deferedURL", I'd call the two URLs "initialURL" and "finalURL" or some such.

Isn't the "lastChanged" date the same as "indexedDate"?

In the Fetcher paragraph there is a hint about Fetchers polling the FetcherManager. I am not sure what you have in mind, but if that is not an important feature, I'd remove the references to it, to keep things simple and avoid over-engineering.

That's all for now.

Otis
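P.S. To make the normalization comment concrete, here is a rough sketch of what I have in mind, using only java.net.URI. The class and method names (UriNormalizer, normalize) are made up for illustration and are not from the spec. Dropping the fragment and the user-info part is my own assumption, and so is treating 80 as the default port for "http" only.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of the spec: one possible URI normalization.
final class UriNormalizer {

    static String normalize(String url) {
        // URI.normalize() removes "." and ".." segments from the path.
        URI uri = URI.create(url).normalize();

        String scheme = uri.getScheme() == null
                ? null : uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost() == null
                ? null : uri.getHost().toLowerCase(Locale.ROOT);

        // Keep the port only if it is not the default (assumed: 80 for http).
        int port = uri.getPort();
        if ("http".equals(scheme) && port == 80) {
            port = -1; // -1 means "no port" for java.net.URI
        }

        // An empty path and "/" name the same resource; pick "/".
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }

        try {
            // Assumption: user-info and fragment are dropped, the query is kept.
            return new URI(scheme, null, host, port, path, uri.getQuery(), null)
                    .toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot normalize: " + url, e);
        }
    }
}
```

With this, UriNormalizer.normalize("HTTP://WWW.Example.COM:80/a/./b/../c.html") gives "http://www.example.com/a/c.html", so two spellings of the same address end up as one primary URI instead of a primary plus a secondary.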