From: Maximilian S. <sch...@ci...> - 2006-07-31 12:22:39
Hi *,

I want to do regular crawls of a larger website. I've already crawled it successfully with Heritrix, indexed the resulting ARC files with NutchWAX, and searched/browsed them with WERA. Works pretty well!

Now I wanted to do a second crawl, but I've read that incremental indexing is broken in NutchWAX 0.6 (which I am using). I guess I need incremental indexing if I want to be able to search across all versions of the site?

As far as I can see, I have three options:

1. wait until incremental indexing is fixed
2. use the 4.3 branch
3. index only the newly crawled ARC files and let the user select which date she wants to search

So my questions are:

- Is it foreseeable when incremental indexing will be fixed, and if so, what performance can I expect compared to completely reindexing all ARC files?
- Will the 4.3 branch be maintained alongside the 0.6 branch, and will it be possible to convert the webdb/indices later (that doesn't seem to be the case right now)?

What solution would you suggest?

Thanks & best regards,
Max