Thread: [Archive-access-discuss] [Fwd: Re: Search multiple versions of one URL - working!]

Forwarded by St.Ack's request.

Michael Stack wrote:
 > I'm glad it's working for you now.  Suggestions for improving doc. so
 > others don't fall into your little wormhole?
 > Thanks James,
 > St.Ack

The "Getting Started" document was great for initial testing of the system.

I had a problem with Hadoop early on, but I was using the Hadoop that
came packaged with Nutch 0.8.1, which turned out to be version 0.4. I had
incorrectly assumed that Nutch itself would be using a recent version of
Hadoop.

When I wanted to begin working with WERA and keep multiple versions of a 
page around, however, my resources were: St.Ack's response to someone 
else on this list, the bug report about keeping multiple versions of a 
webpage, and revisiting the "Getting Started" document (since it 
contained the listing of commands in order).

So I'd say a guide outlining the steps to take to preserve multiple 
versions of a webpage would have been a plus.

Current documentation about how to do incremental indexing would be nice 
too, as this is something I'll be working on soon (I suppose the old FAQ 
solution applies?).

Outside of documentation, most of my desires from Heritrix/NutchWAX/WERA
would be for automation and integration:
- I'm looking forward to the automatic recrawling that I've seen on the 
roadmap of Heritrix.
- A non-manual way of importing new crawls from Heritrix to NutchWAX 
would be desirable.
- It would have been really nice if WERA were Tomcat-friendly, so WERA, 
NutchWAX, ArcRetriever, and even Heritrix could coexist on one server.
- It would have also been nice if ArcRetriever had the same args as the 
wayback machine, so that either could be used with NutchWAX (though 
perhaps they are compatible and I missed it?).

But, as I said, I realize that most of these tools are pre-version-1.0, 
and I'm happy that they're around to begin with.

jamesG
