Forwarded at St.Ack's request.
Michael Stack wrote:
> I'm glad it's working for you now. Suggestions for improving doc. so
> others don't fall into your little wormhole?
> Thanks James,
> St.Ack
The "Getting Started" document was great for initial testing of the system.
I had a problem with Hadoop early on, but that was because I was using the
Hadoop that came packaged with Nutch 0.8.1, which turned out to be version
0.4. I had incorrectly assumed that Nutch itself would be using a recent
version of Hadoop.
When I wanted to begin working with WERA and keep multiple versions of a
page around, however, my resources were: St.Ack's response to someone
else on this list, the bug report about keeping multiple versions of a
webpage, and revisiting the "Getting Started" document (since it
contained the listing of commands in order).
So I'd say a guide outlining the steps to take to preserve multiple
versions of a webpage would have been a plus.
Up-to-date documentation on how to do incremental indexing would be nice
too, as this is something I'll be working on soon (I suppose the old FAQ
solution applies?).
Outside of documentation, most of my desires from Heritrix/NutchWAX/WERA
would be for automation and integration:
- I'm looking forward to the automatic recrawling that I've seen on the
roadmap of Heritrix.
- A non-manual way of importing new crawls from Heritrix to NutchWAX
would be desirable.
- It would have been really nice if WERA were Tomcat-friendly, so WERA,
NutchWAX, ArcRetriever, and even Heritrix could coexist on one server.
- It would have also been nice if ArcRetriever had the same args as the
wayback machine, so that either could be used with NutchWAX. (though
perhaps they are compatible and I missed it?)
But, as I said, I realize that most of these tools are pre-version-1.0,
and I'm happy that they're around to begin with.
jamesG