Re: [Archive-access-discuss] [ANN] wayback-1.2.0 released

Hello Brad,

I just started "playing" with this new version of Wayback, and there is one
thing that seems very strange to me.

On every page resource I visit, I always get the header information
plastered at the top of the page.
(i.e. HTTP/1.1 200 OK Server: Microsoft-IIS/5.0 Date: Tue, 03 Oct 2000
07:31:49 GMT Connection: Keep-Alive Content-Length: 13027 Content-Type:
text/html Set-Cookie:
GWBSiteCookie=header%5Ftype=Text&mode=false&browser=Default&browser%5Fchecked=true&browser%5Fwidth=0;
path=/ Cache-control: private)

These are the HTTP response headers that were recorded at the time of the
crawl (as you can see from the date). What I do not understand is why I am
seeing them when I access a page via Wayback.
They appear at the very top, even above the TimeLine section.

Any ideas on why this might be, or how to get rid of it?

Thanks.

On 2/29/08, Brad Tofel <br...@ar...> wrote:
>
> Hi Thomas,
>
> Thanks for the kind feedback.
>
> Couple of suggestions, and also some follow-up questions interspersed:
>
> Thomas Beekman wrote:
> > Hi all,
> >
> > At the KB we are thoroughly testing Wayback 1.2.0 at the moment. My first
> > impression is quite positive; many new functions are added, it is quite
> > easy to implement different modules for different access points and
> > several indexing threads can live side by side now.
> >
> > I have a few questions though. First of all, I'm experiencing errors
> > which did not occur in older versions; java.lang.OutOfMemoryError: GC
> > overhead limit exceeded. Does anyone know how to fix this?
> >
> >
> I haven't seen this before, and some quick google searches indicate it
> may be one of:
>
> A) a JVM problem (which JVM are you using?)
> B) too little heap space in the java startup arguments
> C) the wayback software doing lots of object creation+destruction.
>
> We have large installations in production at the IA, one using
> 700+ Collections and 1400+ AccessPoints, so I'm hoping that C is not the
> problem. Note that these all use CDX indexes, which are more resource
> efficient, and we haven't yet needed to do a heavy optimization pass over
> the code, so it could be Wayback itself. Are you using IBM's JVM? Have
> you tried increasing the heap? If that doesn't address the problem, can
> you please send me a copy of your wayback.xml Spring configuration?
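To help answer questions A and B above, a minimal sketch follows that reports which JVM is running and how much heap it is allowed to use. The class name HeapCheck is hypothetical and not part of Wayback; it only uses standard java.lang calls. The numbers are only meaningful if it is run with the same JVM and the same options as Tomcat (for example a -Xmx setting passed through CATALINA_OPTS, assuming that is how the heap is configured on your system).

```java
public class HeapCheck {
    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Heap currently in use = heap currently allocated minus the free part of it.
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);

        // Vendor and version answer "which JVM are you using?"
        System.out.println("java.vendor  = " + System.getProperty("java.vendor"));
        System.out.println("java.version = " + System.getProperty("java.version"));

        // Max heap answers "is there too little heap space?"
        System.out.println("max heap MB  = " + maxHeapMb());
        System.out.println("used heap MB = " + usedMb);
    }
}
```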
>
> > Second; when closing down Wayback in Tomcat, the lock file for the
> > localbdb is not erased. A restart is therefore not possible. Could this
> > be fixed so that if the webapp is closed down, the lock file is erased?
> >
> >
>
> On what platform (OS+JVM) are you running Wayback? Is the BDB index
> stored over NFS or another networked file system? I haven't experienced
> this problem on any of our systems -- the BDBJE just starts up, even
> with the lock file still existing. I haven't looked into this, but
> guessed that it was using the lock file via flock() type semantics,
> instead of using its existence to indicate a lock. BDBJE may determine
> that the DB is on a remote system, where flock() semantics don't work,
> in which case it may be falling back to using the existence of the lock
> file to indicate usage.
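The difference between the two locking styles described above can be sketched in plain Java. This is not BDB JE code, and I have not checked how JE actually locks its environment. It is only an analogy using java.nio's FileChannel.tryLock(), and the class and method names are hypothetical. With an OS-level lock, a lock file left behind by a stopped process is harmless, because the lock itself goes away when the process exits. With an existence-based lock, the leftover file alone would block a restart.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class LockFileDemo {
    /** Returns true if an OS-level lock on the file can be taken right now. */
    static boolean canAcquire(File lockFile) {
        try (RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
             FileChannel channel = raf.getChannel()) {
            // tryLock() returns null when another process holds the lock.
            FileLock lock = channel.tryLock();
            if (lock == null) {
                return false;
            }
            lock.release();
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Simulates a lock file left over from an earlier run that nobody holds. */
    static boolean staleFileIsLockable() {
        try {
            File stale = File.createTempFile("je-demo", ".lck");
            stale.deleteOnExit();
            // An existence-based check would refuse to start here;
            // an OS-level check only asks whether the lock is held.
            return stale.exists() && canAcquire(stale);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("stale lock file can be re-locked: " + staleFileIsLockable());
    }
}
```

On a networked file system such as NFS the tryLock() call may behave differently or fail, which is the situation the paragraph above speculates about.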
>
> In any case, I've just implemented the "clean shutdown" processing in my
> development environment, but will probably hold off to do more testing
> before including it in a release.
>
> We are preparing a 1.2.1 release which addresses a couple bugs
> discovered by folks in the field, but are holding this release for
> feedback from one more user having trouble reading some ARC files.
>
> > Third; with a few websites the timeline GUI is scrambled. I get a full
> > yellow screen with a mark on every line. After scrolling down that page,
> > the website is presented normally. This is not the case with every
> > website.
> >
> >
> Yes, the CSS implementation in the current timeline is prone to
> inheriting some styles from some web pages. Could you please send me a
> few example pages on the live web that demonstrate the problem you're
> seeing?
>
> > My fourth and last problem is in the configuration. I would like to do
> > some tests using the remote NutchWAX search, but there is not a clear
> > manual of how to implement this precisely, which beans to use for
> > example. Does anyone have a good example for me?
> >
> >
>
> Setting up a collection with this bean:
>
> <property name="resourceIndex">
>   <bean class="org.archive.wayback.resourceindex.NutchResourceIndex"
>         init-method="init">
>     <property name="searchUrlBase"
>               value="http://webteam-ws.us.archive.org:8080/katrina/opensearch" />
>     <property name="maxRecords" value="100" />
>   </bean>
> </property>
>
> Should do the trick. Note that if using Archival URL mode, you should be
> sure to set the maxRecords property on the RequestParser to the same
> value for maxRecords. This may be a bug -- would be more friendly to
> use the min() of both values.
>
> <property name="parser">
>   <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser"
>         init-method="init">
>     <property name="maxRecords" value="100" />
>     <property name="earliestTimestamp" value="1996" />
>   </bean>
> </property>
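The "min() of both values" behaviour Brad mentions could look roughly like the sketch below. This is an assumption about a possible fix, not what Wayback 1.2.0 does, and the class and method names are hypothetical. The two arguments stand for the maxRecords values configured on the RequestParser and on the NutchResourceIndex.

```java
public class MaxRecordsClamp {
    /**
     * Returns the number of records that can safely be requested when the
     * parser and the index are configured with different maxRecords values.
     */
    static int effectiveMaxRecords(int parserMaxRecords, int indexMaxRecords) {
        // Never ask for more than the stricter of the two limits allows.
        return Math.min(parserMaxRecords, indexMaxRecords);
    }

    public static void main(String[] args) {
        // With mismatched settings the smaller limit would win.
        System.out.println(effectiveMaxRecords(100, 40));
    }
}
```

Until something like this exists, keeping both values identical as in the two beans above avoids the mismatch.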
>
>
> Hopefully this works for you, and please let me know about the questions
> above.
>
> Brad
>
>