From: James G. <jg...@si...> - 2006-11-15 20:37:53
Forwarded at St.Ack's request.

Michael Stack wrote:
> I'm glad it's working for you now. Suggestions for improving doc. so
> others don't fall into your little wormhole?
> Thanks James,
> St.Ack

The "Getting Started" document was great for initial testing of the system. I had a problem with Hadoop early on, but I was using the Hadoop that came packaged with Nutch 0.8.1, which turned out to be version 0.4. I had incorrectly assumed that Nutch itself would be using a recent version of Hadoop.

When I wanted to begin working with WERA and keep multiple versions of a page around, however, my only resources were St.Ack's response to someone else on this list, the bug report about keeping multiple versions of a webpage, and the "Getting Started" document (which I revisited because it lists the commands in order). So I'd say a guide outlining the steps needed to preserve multiple versions of a webpage would have been a plus.

Current documentation on how to do incremental indexing would be nice too, as this is something I'll be working on soon (I suppose the old FAQ solution applies?).

Outside of documentation, most of my wishes for Heritrix/NutchWAX/WERA concern automation and integration:

- I'm looking forward to the automatic recrawling that I've seen on the Heritrix roadmap.
- A non-manual way of importing new crawls from Heritrix into NutchWAX would be desirable.
- It would have been really nice if WERA were Tomcat friendly, so that WERA, NutchWAX, ArcRetriever, and even Heritrix could coexist on one server.
- It would also have been nice if ArcRetriever took the same args as the wayback machine, so that either could be used with NutchWAX (though perhaps they are compatible and I missed it?).

But, as I said, I realize that most of these tools are pre-version-1.0, and I'm happy that they're around to begin with.

jamesG