From: Graham, L. <lg...@lo...> - 2011-09-19 18:59:30
Hi Brad,

Below, from a previous query, you say that there is "some complexity in implementing" LiveWeb "which will probably require some additional documentation." We'd like to try this out for an onsite crawl project of a single but very large and complex LC web site on formats/digital preservation. We crawl this site once or twice a year, but we are interested to see if LiveWeb's "backfilling" possibilities, as you describe below, might help with interim capture of new single URLs on the seed.

When you have some time, could you provide that additional documentation? I have to be honest, all I've done thus far is import the LiveWeb.xml in wayback.xml, which auto-created a set of dirs, liveweb/arcs, off the basedir specified in wayback. And I've looked at the LiveWeb.xml but am not sure how to proceed.

Thanks,
Laura Graham
Library of Congress

*****************

Hi Laura,

Wayback 1.6.0 contains code to run a special AccessPoint which acts as a "modified" proxy server. When proxy requests are received by this AccessPoint, a request to the live web, for the URL requested by the client, is recorded into an ARC file on the spot. The single compressed ARC record is then returned as the HTTP entity to the requesting client.

Note this means you cannot point a web browser directly at this service, since the browser doesn't know how to unpack the enclosed ARC record. (There is another "unwrapping" proxy AccessPoint which does this, allowing experimenting with recording a web browser session.) However, a client which expects to be returned an ARC record can unpack it and use it to access the entire HTTP response, to a robots.txt request for example.

This service is used in Wayback 1.6.0 to request content from the live web both for checking robots.txt files and for "backfilling" content that is requested via replay sessions but is not in the archive.
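As a rough illustration of the client side described above, here is a minimal Python sketch of unpacking a single gzip-compressed ARC record and pulling out the recorded HTTP response. It assumes the ARC v1 URL-record layout (one header line with five space-separated fields: URL, IP address, archive date, content type, length, followed by that many bytes of payload). The function names `unpack_arc_record` and `split_http_response` are hypothetical and are not part of Wayback; the exact bytes returned by the live web proxy AccessPoint should be checked against a real response.

```python
import gzip
from typing import NamedTuple, Tuple


class ArcRecord(NamedTuple):
    url: str
    ip: str
    archive_date: str
    content_type: str
    length: int
    payload: bytes  # recorded HTTP response: status line, headers, body


def unpack_arc_record(compressed: bytes) -> ArcRecord:
    """Decompress one gzipped ARC record and parse its header line.

    Assumes the ARC v1 URL-record header:
    URL IP-address Archive-date Content-type Archive-length
    """
    raw = gzip.decompress(compressed)
    # Assumption: tolerate stray newlines before the header line.
    raw = raw.lstrip(b"\r\n")
    header_line, _, rest = raw.partition(b"\n")
    # Split from the right so an unusual URL cannot shift the other fields.
    fields = header_line.rsplit(b" ", 4)
    if len(fields) != 5:
        raise ValueError("unexpected ARC header line: %r" % header_line)
    url, ip, date, ctype, length = (f.decode("utf-8", "replace") for f in fields)
    size = int(length)
    payload = rest[:size]
    if len(payload) < size:
        raise ValueError("ARC record is shorter than its declared length")
    return ArcRecord(url, ip, date, ctype, size, payload)


def split_http_response(payload: bytes) -> Tuple[str, bytes]:
    """Return (status line, body) of the recorded HTTP response."""
    head, _, body = payload.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0].decode("latin-1")
    return status_line, body
```

A caller would pass the HTTP entity received from the proxy AccessPoint to `unpack_arc_record`, then hand the body from `split_http_response` to whatever consumes it, for example a robots.txt parser.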
One of the driving factors behind returning a compressed ARC record, instead of having the proxy return the actual response, is to simplify inserting an HTTP cache between the Wayback service and the live web proxy AccessPoint. We use Varnish, which handles caching of the returned ARC record and coalescing of multiple concurrent requests into a single request to the live web proxy AccessPoint.

We intend to make this service record WARC files in the near term; porting the old Wayback ARC recording code was more expedient for 1.6.0.

Currently, there's some complexity in implementing this, which will probably require some additional documentation. If you're interested, please let me know, and we'll try to prioritize this documentation.

Lastly, note that we've discovered some significant bugs in the 1.6.0 codebase specifically related to this live web proxy AccessPoint, mostly in bad handling of connection errors and timeouts. These fixes are all in SVN currently, but we have not scheduled a 1.6.1 release at the moment.

Brad

On 3/10/11 8:02 PM, Graham, Laura wrote:
> We were wondering here at the Library of Congress about the LiveWeb.xml in Wayback 1.6. The wayback.xml explains:
>
> " LiveWeb.xml contains the 'proxylivewebcache' bean that enable fetching
> content from the live web, recording that content in ARC files.
> To use the "excluder-factory-robot" bean as an exclusionFactory property of
> AccessPoints, which will cause live robots.txt files to be consulted
> retroactively before showing archived content, you'll need to import
> LiveWeb.xml as well."
>
> We understand about consulting the robots.txt for display, of course, but can the Wayback actually write data to ARC (WARC?) files? What does "recording" mean?
>
> Thanks!
> Laura Graham