From: Gordon M. <go...@ar...> - 2007-06-08 18:49:55
Jim Dixon wrote:
> Is there anywhere a description of how the archives are structured? I
> believe that there is some degree of replication (between San Francisco,
> the Netherlands, and Alexandria in Egypt) and then a multi-tiered
> indexing system.
>
> Apologies if I somehow overlooked this, but there doesn't seem to be any
> information on the subject in the email archives or anywhere else.

There is not a good public writeup, but the broad outlines of the web
archive can be described:

- Web captures are stored in ARC files: essentially verbatim transcripts
  of HTTP responses, each preceded by a single line of per-response
  metadata (including date of capture and server IP address), concatenated
  together into files of about 100MB. (A toy reader sketch appears after
  this message.)

- As ARCs are brought in from various crawls, they land on any of 1000+
  machines at IA's US facility, based on which machine has space. (So
  contemporaneous ARCs usually land on the same banks of machines, but
  there is no enforced mapping.) The machines are 4-hard-drive 1U
  commodity Linux machines, with plain independent disks and regular
  filesystems.

- Sometimes, as with data collected in partnership with Alexa's crawling,
  this material arrives 3-6 months after crawling. One master inventory
  database records where each ARC is placed at its initial copy-in; other
  inventory systems survey and verify actual machine contents at
  occasional intervals.

- At occasional intervals (but, again, sometimes months after ARC arrival)
  all new ARCs are scanned for the URL+date captures they contain, and
  their contents are merged into a master index of holdings, one line per
  capture, roughly:

      URL timestamp response-code tiny-checksum ARC-file offset-in-ARC

  This master index is a flat file, one line per URL+date capture, split
  into hundreds of shards across many machines. (It currently contains
  over 85 billion lines and will soon pass 100 billion.) New material
  appears in the Wayback Machine when this merge happens.

- Wayback Machine requests to list the holdings for a particular URL
  consult contiguous ranges of this master index.

- Wayback Machine requests to view an exact URL+date (or, most often, the
  capture nearest to a requested URL+date) seek the single best-matching
  line in this master index, find which machine(s) currently hold that
  ARC, then contact that machine for just that capture via an HTTP range
  request into the ARC. (Toy sketches of the index lookup and the range
  request also appear after this message.)

- In 2002, the Library of Alexandria received a complete mirror of the
  data through part of 2001. In 2006, they again received a complete
  mirror of the data through early 2006. At times, bi-directional patching
  of each side's collection has occurred, but it is not currently an
  automated process.

> Also, I understand that there are two versions of the wayback utility, a
> Java version in development, which is open source, and a Perl version,
> which is the one actually being used and which is closed source.
>
> Why is the Perl version closed source?

The legacy Wayback relies on a mix of Perl and C code, was co-developed
with Alexa, and depends on some Alexa code we don't have permission to put
under a proper open-source license. We could try to replace just those
parts, but there are other assumptions in the legacy Wayback that limit
its performance and extensibility. We wanted to leave those behind, and so
have been investing effort in the open-source Java Wayback project
instead. The new code will replace the legacy code on our public site this
year.

- Gordon @ IA
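
[Sketch 1] For illustration only, here is a minimal Python sketch of
walking the records of an uncompressed ARC file as described above. It
assumes each record begins with a single space-separated metadata line
whose last field is the record length in bytes; gzip-compressed ARCs and
the exact field list need more care, and this is not IA's actual code.

    # Sketch: iterate over records of an uncompressed ARC file.
    # Assumption: each record starts with one metadata line whose last
    # field is the body length in bytes; the body is the verbatim HTTP
    # response. Blank separator lines between records are tolerated.
    def iter_arc_records(path):
        with open(path, "rb") as f:
            while True:
                header = f.readline()
                if not header:
                    break                    # end of file
                fields = header.split()
                if not fields:
                    continue                 # skip blank separator lines
                length = int(fields[-1])     # last header field: body length
                body = f.read(length)        # verbatim response bytes
                yield fields, body

    # Hypothetical usage:
    # for fields, body in iter_arc_records("example.arc"):
    #     print(fields[0].decode(), len(body))   # captured URL and size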
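
[Sketch 2] A toy Python sketch of a lookup against the flat master index,
assuming the six whitespace-separated fields listed above. The field
names, the CaptureLine type, and the sample index lines are invented for
the example; the production index and nearest-date selection logic differ.

    # Sketch: parse flat-index lines of the form
    #   URL timestamp response-code tiny-checksum ARC-file offset-in-ARC
    # and pick the capture nearest to a requested 14-digit timestamp.
    from collections import namedtuple

    CaptureLine = namedtuple(
        "CaptureLine", "url timestamp response_code checksum arc_file offset")

    def parse_line(line):
        url, ts, code, checksum, arc_file, offset = line.split()
        return CaptureLine(url, ts, code, checksum, arc_file, int(offset))

    def nearest_capture(lines, url, wanted_timestamp):
        # Keep only lines for the exact URL, then choose the capture whose
        # YYYYMMDDhhmmss timestamp is numerically closest to the request.
        candidates = [parse_line(l) for l in lines
                      if l.split(" ", 1)[0] == url]
        if not candidates:
            return None
        return min(candidates,
                   key=lambda c: abs(int(c.timestamp) - int(wanted_timestamp)))

    # Hypothetical example data:
    index = [
        "http://example.com/ 20010203040506 200 abc123 IA-001234.arc 789012",
        "http://example.com/ 20060708091011 200 def456 IA-005678.arc 345678",
    ]
    best = nearest_capture(index, "http://example.com/", "20050101000000")
    print(best.arc_file, best.offset)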
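
[Sketch 3] Once the index yields an ARC file name and byte offset, the
storage machine can be asked for just that slice with a standard HTTP
Range request. In this sketch the host name, path layout, and record
length are made up, and the length is assumed to be known (for example
from the gap to the next record); only the Range mechanism is the point.

    # Sketch: fetch a single capture out of a large ARC file using an
    # HTTP Range request against the machine that holds it.
    import urllib.request

    def fetch_capture(host, arc_file, offset, length):
        url = "http://%s/%s" % (host, arc_file)
        req = urllib.request.Request(url)
        # Ask for exactly the bytes of this one ARC record.
        req.add_header("Range", "bytes=%d-%d" % (offset, offset + length - 1))
        with urllib.request.urlopen(req) as resp:
            # A server honoring Range replies 206 Partial Content.
            return resp.read()

    # Hypothetical usage:
    # record = fetch_capture("storage-node-17.example", "IA-005678.arc",
    #                        345678, 54321)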