From: Brad T. <br...@ar...> - 2006-04-03 21:49:54
Hello archive-access!

I wanted to take a few minutes to introduce myself and the new Wayback project, which has been mentioned on this list but never formally announced. This project is designed to eventually be the Internet Archive's standard tool for querying and replaying archived content.

The current production Wayback Machine (web.archive.org) software allows Internet users to view archived documents from the Internet Archive's web collection, which contains over 60 billion resources. The new Wayback project seeks to replace the classic Wayback Machine's functionality with an open-source, extensible and redistributable Java package.

There are dramatic variations in the ways that people want to use this software. At one end of the spectrum is the user who simply wants to look at content they've just crawled with the Heritrix web crawler on their personal workstation. At the other end is the Internet Archive, which needs to serve hundreds of requests per second against its collection of 20 million ARC files. In between is everyone from users experimenting with full-text search technologies to others trying out new methods of replaying archived content using browser extensions.

To address these varying requirements, a good deal of the project's focus is on modularity and extensibility, so that components can be swapped out and combined to satisfy diverse installation needs.

The very early (and unannounced) 0.2.0 release enabled two methods of replaying content in ARC format: the "standard" archival URL mode, and a new proxy mode, where a user configures their browser to proxy requests through a Wayback server. This proxy mode addresses many, if not most, of the problems reported with the production Wayback Machine's archival URL replay mechanism. The 0.2.0 version operated only in a standalone mode, requiring that all ARC files be located on the same machine running the Wayback software.
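For readers unfamiliar with the two replay modes, here is a minimal sketch of what an archival URL looks like. It assumes the layout used by web.archive.org, where a 14-digit capture timestamp and the original URL are appended to a replay prefix. The class name `ArchivalUrlSketch`, the `build` method, and the host `wayback.example.org` are hypothetical and are not taken from the Wayback code base.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration only; not part of the Wayback software.
class ArchivalUrlSketch {
    // Assumed timestamp layout: 14 digits, e.g. 20060403214954.
    private static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Archival URL mode: the capture time and the original URL are both
    // carried in the path of the request sent to the replay server.
    static String build(String replayPrefix, LocalDateTime captureTime, String originalUrl) {
        String prefix = replayPrefix.endsWith("/") ? replayPrefix : replayPrefix + "/";
        return prefix + captureTime.format(TIMESTAMP) + "/" + originalUrl;
    }

    public static void main(String[] args) {
        LocalDateTime when = LocalDateTime.of(2006, 4, 3, 21, 49, 54);
        // Placeholder host; substitute the prefix of an actual installation.
        System.out.println(build("http://wayback.example.org/wayback", when,
                "http://www.archive.org/"));
    }
}
```

In proxy mode, by contrast, the browser requests the original URL unchanged and the proxy setting routes it to the Wayback server, so no URL of this form has to be constructed.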
We have just released a new version, 0.4.0, of the Wayback software, which you can read about in more detail at the project's home page:

http://archive-access.sourceforge.net/projects/wayback/

This version has solidified some of the internal workings of the software, addressed the usual set of bugs found in new codebases, and added some major new capabilities. The first major feature is the ability to access documents from ARC files stored on remote servers, which has significant scaling ramifications. There have also been substantial improvements both in the query UI and in replaying documents. In addition, the Wayback software can now be queried using an OpenSearch API, and preliminary development has been completed to allow requests to be satisfied using a NutchWAX full-text index.

We plan to release 0.6.0 in the next couple of months. It will include better packaging and substantial UI improvements, to make the Wayback software feature-comparable with the WERA application. The major features currently present in WERA that have not yet been developed in Wayback are:

* clickable "timeline" view in replay mode
* very slick install application
* vastly better documentation
* better support/testing for international character sets

This is my first Java project, so I'm very appreciative of coaching and suggestions on coding style and things I'm doing wrong. Please let me know if you have problems, suggestions, or questions, and thanks in advance for the feedback!

Brad Tofel