Hello archive-access!
I wanted to take a few minutes to introduce myself and the new Wayback
project, which has been mentioned on this list, but never formally
announced. This project is designed to eventually be the Internet
Archive's standard tool for querying and replaying archived content. The
current production Wayback Machine (web.archive.org) software allows
Internet users to view archived documents from the Internet Archive's
web collection, which contains over 60 billion resources. This new
Wayback project seeks to replace the classic Wayback Machine's
functionality in an open-source, extensible and redistributable Java
package.
There are dramatic variations in the ways that people want to use this
software. At one end of the spectrum is the user who simply wants to
look at content they've just crawled with the Heritrix web crawler on
their personal workstation. At the other end is the Internet Archive,
needing to serve hundreds of requests per second against its collection
of 20 million ARC files. In between is everything from users
experimenting with full-text search technologies to others trying
out new methods of replaying archived content using browser extensions.
To address these varying requirements, a good deal of the project's focus
is to leverage modularity and extensibility, so various components can
be swapped out and combined to satisfy diverse installation needs.
The very early (and unannounced) 0.2.0 release enabled two methods of
replaying content in ARC format: the "standard" archival URL mode, and
also a new proxy mode, where a user configures their browser to proxy
requests through a Wayback server. This proxy mode addresses many, if
not most, of the problems reported with the production Wayback Machine's
archival URL replay mechanism. The 0.2.0 version operated only in a
standalone mode, requiring that all ARC files be located on the same
machine running the Wayback software.
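As a rough illustration of the archival URL mode described above, here
is a minimal Java sketch that splits a replay path into a capture
timestamp and the original URL, and builds such a path when rewriting a
link. The class name, the "/wayback/" prefix and the 14-digit timestamp
layout are assumptions made for this example; they are not taken from
the Wayback codebase. In proxy mode no such rewriting is involved, since
the browser requests the original URL directly through the proxy.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ArchivalUrlSketch {
    // Hypothetical layout: /wayback/<14-digit timestamp>/<original URL>
    private static final Pattern ARCHIVAL =
        Pattern.compile("^/wayback/(\\d{14})/(.+)$");

    /** Returns {timestamp, originalUrl}, or null if the path does not match. */
    public static String[] parse(String path) {
        Matcher m = ARCHIVAL.matcher(path);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    /** Builds an archival path so a link in a replayed page stays in the archive. */
    public static String rewrite(String timestamp, String originalUrl) {
        return "/wayback/" + timestamp + "/" + originalUrl;
    }

    public static void main(String[] args) {
        String[] parts =
            parse("/wayback/20060101000000/http://example.org/index.html");
        System.out.println(parts[0] + " " + parts[1]);
    }
}
```

Every link in a replayed document has to pass through something like
rewrite() in this mode, which is one reason a proxy mode can be simpler
for pages that build their URLs dynamically.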
We have just released a new version, 0.4.0, of the Wayback software,
which you can read about in more detail at the project's home page:
http://archive-access.sourceforge.net/projects/wayback/
This version has solidified some of the internal workings of the
software, addressed the usual set of bugs found in new codebases, and
also includes some major new capabilities. The first major feature is
the ability to access documents from ARC files stored on remote servers,
which has significant scaling ramifications. There have also been
substantial improvements in both the query UI and the replaying of
documents. Also, the Wayback software can now be queried using
an OpenSearch API, and preliminary development has been completed to
allow requests to be satisfied using a NutchWAX full-text index.
We plan to release 0.6.0 in the next couple of months, which will
include better packaging and substantial UI improvements, to make the
Wayback software feature-comparable with the WERA application. The
major features currently present in WERA that have not yet been
developed in Wayback are:
* clickable "timeline" view in replay mode
* very slick install application
* vastly better documentation
* better support/testing for international character-sets
This is my first Java project, so I'm very appreciative of coaching and
suggestions on coding style and things I'm doing wrong.
Please let me know if you have problems, suggestions, or questions, and
thanks in advance for the feedback!
Brad Tofel