From: Kristinn S. <kri...@la...> - 2013-07-30 15:49:53
We've been experimenting with crawling using RSS feeds. This has generally gone well, but has led to a concern over how well Wayback can handle a URL that has a LOT of snapshots. In our RSS experiment we've seen that the front pages (which are crawled each time an item is added to the feed) are crawled as often as 2,000 times a month (and, yes, those are all unique captures!).

Wayback has a default "maxRecords" of 10,000, a value we'll hit after just a few months of crawling. Interestingly, while I can lower that value in the wayback.xml config file, raising it causes all searches to return a "Bad Query Exception"; the 10,000 limit seems pretty hard-wired in.

Has anyone looked into how Wayback handles scaling along this axis?

- Kris

-------------------------------------------------------------------------
Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
Sími/Tel: +354 5255600 | www.landsbokasafn.is
-------------------------------------------------------------------------
fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is
From: <nic...@bn...> - 2013-07-30 16:20:34
Hi Kristinn,

The index lookup algorithm, as you probably know, boils down to:

1) Perform a binary search on the set of CDX files defined in the WaybackCollection
2) Sequentially iterate over the records starting at the first occurrence found, applying filters along the way
3) Stop the process after examining maxRecords (10,000 by default)

At BnF we have recently changed the way we merge CDX files so that a collection has a single CDX whenever possible, or at least as few CDX files as possible. This allowed us to raise maxRecords to 100,000 with not-stellar-but-acceptable search times.

However, we also have this problem with sites that are captured on a daily basis, and that's one of the motivations behind trying to use a search-engine framework like SOLR or ElasticSearch to index individual CDX records.

Best regards,

Nicolas Giraud
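[A minimal sketch of the lookup pattern described above, assuming one collection-wide list of CDX lines sorted by canonicalized URL key; the names (cdx_lookup, url_key, filters) are illustrative, not actual Wayback classes:]

```python
import bisect

def cdx_lookup(cdx_lines, url_key, filters, max_records=10000):
    """Binary-search to the first CDX line for url_key, then scan forward,
    applying filters, until max_records lines have been examined."""
    # cdx_lines: records sorted as "<url_key> <timestamp> ...", one big sorted list
    start = bisect.bisect_left(cdx_lines, url_key)    # step 1: binary search
    matches, examined = [], 0
    for line in cdx_lines[start:]:                    # step 2: sequential scan
        if examined >= max_records:                   # step 3: hard stop at maxRecords
            break
        examined += 1
        if not line.startswith(url_key + " "):        # past this URL's block of captures
            break
        if all(f(line) for f in filters):             # e.g. date range, MIME type, status
            matches.append(line)
    return matches
```

[With roughly 2,000 captures a month for a single front page, the 10,000-record scan budget in step 3 is exhausted after about five months, which is the wall described in the first message.]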
From: Ilya <il...@ar...> - 2013-07-31 01:59:54
Hi,

It should be possible to configure Wayback with more than 10,000 records. Note that both the LocalResourceIndex and the ArchivalUrlRequestParser have maxRecords properties; both should be updated to the same value.

For our main index, we use the "ZipNum" CDX format, which creates a secondary index for the CDX and allows compression. There's also an option to "collapse" results based on a portion of the timestamp (for example, don't show more than 1 snapshot per hour). This is the configuration seen here:
http://web.archive.org/web/*/google.com

We are working on releasing additional documentation on how to create the "ZipNum" index from plain CDX files.

In addition, we've been working on a separate, new CDX server API for Wayback, which allows for more control over querying. For example, the following query returns a first page of "uncollapsed" results (a page is configured at 150,000 max CDX lines on our end at the moment):
http://web.archive.org/cdx/search/cdx?url=google.com

The following returns far fewer results, collapsed to no more than one per hour (by ignoring records that duplicate the first 10 digits of the timestamp field):
http://web.archive.org/cdx/search/cdx?url=google.com&collapse=timestamp:10

The full documentation for this new API is available here:
https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server
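[A quick, self-contained sketch of the two CDX-server queries shown above, built only from the endpoint and the collapse=timestamp:10 parameter given in the message; everything else is illustrative:]

```python
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_API = "http://web.archive.org/cdx/search/cdx"

def cdx_query(url, **params):
    """Return the CDX lines the server sends back for one query."""
    qs = urlencode({"url": url, **params})
    with urlopen(f"{CDX_API}?{qs}") as resp:
        return resp.read().decode("utf-8").splitlines()

# One page of uncollapsed captures of google.com:
all_captures = cdx_query("google.com")

# At most one capture per hour, by collapsing records that share the
# first 10 digits (YYYYMMDDHH) of the 14-digit timestamp field:
hourly = cdx_query("google.com", collapse="timestamp:10")

print(len(all_captures), len(hourly))
```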