From: Gordon M. <go...@ar...> - 2007-06-08 18:49:55
Jim Dixon wrote:
> Is there anywhere a description of how the archives are structured? I
> believe that there is some degree of replication (between San Francisco,
> the Netherlands, and Alexandria in Egypt) and then a multi-tiered
> indexing system.
>
> Apologies if I somehow overlooked this, but there doesn't seem to be any
> information on the subject in the email archives or anywhere else.

There is not a good public writeup, but the broad outlines of the web
archive can be described:

- Web captures are stored in ARC files: essentially verbatim transcripts
  of HTTP responses, each preceded by a single line of per-response
  metadata (including date of capture and server IP address), concatenated
  together into files of about 100MB. (A toy reader sketch appears after
  this message.)

- As ARCs are brought in from various crawls, they land on any of 1000+
  machines at IA's US facility, based on which machine has space. (So
  contemporaneous ARCs usually land on the same banks of machines, but
  there is no enforced mapping.) The machines are 4-hard-drive 1U
  commodity Linux machines, with plain independent disks and regular
  filesystems.

- Sometimes, as with data collected in partnership with Alexa's crawling,
  this material arrives 3-6 months after crawling. One master inventory
  database records where each ARC is placed at its initial copy-in; other
  inventory systems survey and verify actual machine contents at
  occasional intervals.

- At occasional intervals (but, again, sometimes months after ARC arrival)
  all new ARCs are scanned for the URL+date captures they contain, and
  their contents are merged into a master index of holdings, one line per
  capture, roughly:

      URL timestamp response-code tiny-checksum ARC-file offset-in-ARC

  This master index is a flat file, one line per URL+date capture, split
  into hundreds of shards across many machines. (It currently contains
  over 85 billion lines and will soon pass 100 billion.) New material
  appears in the Wayback Machine when this merge happens.

- Wayback Machine requests to list the holdings for a particular URL
  consult contiguous ranges of this master index.

- Wayback Machine requests to view an exact URL+date (or, most often, the
  capture nearest to a requested URL+date) seek the single best-matching
  line in this master index, find which machine(s) currently hold that
  ARC, then contact that machine for just that capture via an HTTP range
  request into the ARC. (Toy sketches of the index lookup and the range
  request also appear after this message.)

- In 2002, the Library of Alexandria received a complete mirror of the
  data through part of 2001. In 2006, they again received a complete
  mirror of the data through early 2006. At times, bi-directional patching
  of each side's collection has occurred, but it is not currently an
  automated process.

> Also, I understand that there are two versions of the wayback utility, a
> Java version in development, which is open source, and a Perl version,
> which is the one actually being used and which is closed source.
>
> Why is the Perl version closed source?

The legacy Wayback relies on a mix of Perl and C code, was co-developed
with Alexa, and depends on some Alexa code we don't have permission to put
under a proper open-source license. We could try to replace just those
parts, but there are other assumptions in the legacy Wayback that limit
its performance and extensibility. We wanted to leave those behind, and so
have been investing effort in the open-source Java Wayback project
instead. The new code will replace the legacy code on our public site this
year.

- Gordon @ IA
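
[Sketch 1] For illustration only, here is a minimal Python sketch of
walking the records of an uncompressed ARC file as described above. It
assumes each record begins with a single space-separated metadata line
whose last field is the record length in bytes; gzip-compressed ARCs and
the exact field list need more care, and this is not IA's actual code.

    # Sketch: iterate over records of an uncompressed ARC file.
    # Assumption: each record starts with one metadata line whose last
    # field is the body length in bytes; the body is the verbatim HTTP
    # response. Blank separator lines between records are tolerated.
    def iter_arc_records(path):
        with open(path, "rb") as f:
            while True:
                header = f.readline()
                if not header:
                    break                    # end of file
                fields = header.split()
                if not fields:
                    continue                 # skip blank separator lines
                length = int(fields[-1])     # last header field: body length
                body = f.read(length)        # verbatim response bytes
                yield fields, body

    # Hypothetical usage:
    # for fields, body in iter_arc_records("example.arc"):
    #     print(fields[0].decode(), len(body))   # captured URL and size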
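
[Sketch 2] A toy Python sketch of a lookup against the flat master index,
assuming the six whitespace-separated fields listed above. The field
names, the CaptureLine type, and the sample index lines are invented for
the example; the production index and nearest-date selection logic differ.

    # Sketch: parse flat-index lines of the form
    #   URL timestamp response-code tiny-checksum ARC-file offset-in-ARC
    # and pick the capture nearest to a requested 14-digit timestamp.
    from collections import namedtuple

    CaptureLine = namedtuple(
        "CaptureLine", "url timestamp response_code checksum arc_file offset")

    def parse_line(line):
        url, ts, code, checksum, arc_file, offset = line.split()
        return CaptureLine(url, ts, code, checksum, arc_file, int(offset))

    def nearest_capture(lines, url, wanted_timestamp):
        # Keep only lines for the exact URL, then choose the capture whose
        # YYYYMMDDhhmmss timestamp is numerically closest to the request.
        candidates = [parse_line(l) for l in lines
                      if l.split(" ", 1)[0] == url]
        if not candidates:
            return None
        return min(candidates,
                   key=lambda c: abs(int(c.timestamp) - int(wanted_timestamp)))

    # Hypothetical example data:
    index = [
        "http://example.com/ 20010203040506 200 abc123 IA-001234.arc 789012",
        "http://example.com/ 20060708091011 200 def456 IA-005678.arc 345678",
    ]
    best = nearest_capture(index, "http://example.com/", "20050101000000")
    print(best.arc_file, best.offset)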
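
[Sketch 3] Once the index yields an ARC file name and byte offset, the
storage machine can be asked for just that slice with a standard HTTP
Range request. In this sketch the host name, path layout, and record
length are made up, and the length is assumed to be known (for example
from the gap to the next record); only the Range mechanism is the point.

    # Sketch: fetch a single capture out of a large ARC file using an
    # HTTP Range request against the machine that holds it.
    import urllib.request

    def fetch_capture(host, arc_file, offset, length):
        url = "http://%s/%s" % (host, arc_file)
        req = urllib.request.Request(url)
        # Ask for exactly the bytes of this one ARC record.
        req.add_header("Range", "bytes=%d-%d" % (offset, offset + length - 1))
        with urllib.request.urlopen(req) as resp:
            # A server honoring Range replies 206 Partial Content.
            return resp.read()

    # Hypothetical usage:
    # record = fetch_capture("storage-node-17.example", "IA-005678.arc",
    #                        345678, 54321)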