[Archive-access-discuss] LiveWeb.xml

Hi Brad,
In your reply to my earlier query (below), you say that there is "some complexity in implementing" LiveWeb, "which will
probably require some additional documentation."

We'd like to try this out for an onsite crawl project of a single, very large and complex LC web site on formats/digital preservation. We crawl this site once or twice a year, and we are interested to see whether LiveWeb's "backfilling" capability, as you describe below, might help with interim capture of new individual URLs on the seed.

When you have some time, could you provide that additional documentation?

To be honest, all I've done so far is import LiveWeb.xml into wayback.xml, which auto-created a set of directories, liveweb/arcs, under the basedir specified in wayback. I've also looked at LiveWeb.xml but am not sure how to proceed.

Thanks,
Laura Graham
Library of Congress

*****************
Hi Laura,

Wayback 1.6.0 contains code to run a special AccessPoint which acts as a
"modified" proxy server. When this AccessPoint receives a proxy request,
it requests the client's URL from the live web and records the response
into an ARC file on the spot. The single compressed ARC record is then
returned as the HTTP entity to the requesting client. Note this means
you cannot point a web browser directly at this service, since the
browser doesn't know how to unpack the enclosed ARC record. (There is
another "unwrapping" proxy AccessPoint which does this, allowing you to
experiment with recording a web browser session.) However, a client
which expects an ARC record in return can unpack it and use it to access
the entire HTTP response to a robots.txt request, for example.
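
As a rough illustration of the unpacking step described above, here is a
minimal Python sketch of what such a client might do with the returned
entity. It is not taken from the Wayback codebase. It assumes the entity
is one gzip member holding a version 1 ARC URL record, that is, a header
line of the form "URL IP-address archive-date content-type length"
followed by that many bytes of raw HTTP response. The names ArcRecord,
parse_compressed_arc_record and split_http_response are hypothetical,
and fetching the entity from the proxy AccessPoint is left out because
the host and port depend on your deployment.

```python
import gzip
from typing import NamedTuple, Tuple


class ArcRecord(NamedTuple):
    """Hypothetical container for one ARC v1 URL record."""
    url: str
    ip: str
    archive_date: str
    content_type: str
    length: int
    payload: bytes  # raw HTTP response: status line, headers, body


def parse_compressed_arc_record(data: bytes) -> ArcRecord:
    """Unpack a single gzip-compressed ARC record (assumed v1 layout)."""
    raw = gzip.decompress(data)
    header_line, newline, rest = raw.partition(b"\n")
    if not newline:
        raise ValueError("no ARC header line found")
    # Split from the right so a URL containing spaces stays in one piece.
    fields = header_line.decode("latin-1").rsplit(" ", 4)
    if len(fields) != 5:
        raise ValueError("expected 5 header fields, got %d" % len(fields))
    url, ip, archive_date, content_type, length_text = fields
    length = int(length_text.strip())
    # The declared length covers the HTTP response only, not the
    # newline that terminates the record.
    return ArcRecord(url, ip, archive_date, content_type, length, rest[:length])


def split_http_response(payload: bytes) -> Tuple[str, bytes]:
    """Return (status line, body) from the recorded HTTP response."""
    head, _, body = payload.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0].decode("latin-1")
    return status_line, body
```

A client checking robots.txt would pass the bytes it received from the
proxy AccessPoint to parse_compressed_arc_record, then hand the body
returned by split_http_response to its robots.txt parser.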

This service is used in Wayback 1.6.0 to request content from the live
web for both checking robots.txt files, and for "backfilling" content
requested via replay sessions, but which is not in the archive.

One of the driving factors behind returning a compressed ARC record,
instead of having the proxy return the actual response, is to simplify
inserting an HTTP cache between the Wayback service and the live web
proxy AccessPoint. We use Varnish, which handles caching of the returned
ARC record and coalescing of multiple concurrent requests into a single
request to the live web proxy AccessPoint.

We intend to make this service record WARC files in the near term -
porting the old Wayback ARC recording code was more expedient for 1.6.0.

Currently, there's some complexity in implementing this, which will
probably require some additional documentation.

If you're interested, please let me know, and we'll try to prioritize
this documentation.

Lastly, note that we've discovered some significant bugs in the 1.6.0
codebase specifically related to this live web proxy AccessPoint, mostly
in bad handling of connection errors and timeouts. These fixes are all
in SVN currently, but we have not scheduled a 1.6.1 release at the moment.

Brad

On 3/10/11 8:02 PM, Graham, Laura wrote:
> We were wondering here at the Library of Congress about the LiveWeb.xml in Wayback 1.6. The wayback.xml explains:
>
> " LiveWeb.xml contains the 'proxylivewebcache' bean that enable fetching
>    content from the live web, recording that content in ARC files.
>    To use the "excluder-factory-robot" bean as an exclusionFactory property of
>    AccessPoints, which will cause live robots.txt files to be consulted
>    retroactively before showing archived content, you'll need to import
>    LiveWeb.xml as well."
>
> We understand about consulting the robots.txt for display, of course, but can the Wayback actually write data to ARC (WARC?) files? What does "recording" mean?
>
> Thanks!
> Laura Graham
>