From: Shay L. <sea...@gm...> - 2006-12-06 15:31:49
Hi all,

I'm using NutchWax (Version 0.7.0-200611082313) and Wera (Version 0.5.0-200611082313) to index a collection of ARC files generated by a web crawl with the Heritrix web crawler (Version 1.4.0).

When I check the metadata tag in the Wera front end, the following tags are displayed:

    ARC Identifier, URL, Time of Archival, Last Modified Time, Mime-Type, File Status, Content Checksum, HTTP Header

When I click on the explain link in the NutchWax front end, the following tags are displayed:

    Segment, Digest, Date, ARCDate, Encoding, Collection, ARCName, ARCOffset, ContentLength, PrimaryType, subType, URL, Title, Boost

Is there a full list of the metadata fields that NutchWax/Nutch creates when indexing? I'm particularly interested in tags relating to the actual content of each page, e.g. content type, description, etc.

When searching, does NutchWax/Nutch search across such tags, or only across the parsed text of each page for occurrences of keywords?

Any help you can provide would be greatly appreciated!

Shay