From: Ilya <il...@ar...> - 2013-07-31 01:59:54
Hi,

It should be possible to configure wayback with more than 10,000 records. Note that both the LocalResourceIndex and the ArchivalUrlRequestParser have maxRecords properties; both should be updated to the same value.

For our main index, we use the "ZipNum" CDX format, which creates a secondary index over the CDX and allows compression. There is also an option to "collapse" results based on a portion of the timestamp (for example, show no more than one snapshot per hour). This is the configuration seen here:
http://web.archive.org/web/*/google.com

We are working on releasing additional documentation on how to create the "ZipNum" index from plain CDX files.

In addition, we've been working on a separate, new CDX server API for wayback, which allows more control over querying. For example, the following query returns the first page of "uncollapsed" results (a page is configured at a maximum of 150,000 CDX lines on our end at the moment):
http://web.archive.org/cdx/search/cdx?url=google.com

The following returns far fewer results, collapsed to no more than one per hour (by ignoring duplicates of the first 10 digits of the timestamp field):
http://web.archive.org/cdx/search/cdx?url=google.com&collapse=timestamp:10

The full documentation for this new API is available here:
https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server

On 7/30/13 11:55 AM, nic...@bn... wrote:
>
> Hi Kristinn,
>
> The index lookup algorithm, as you probably know, boils down to:
>
> 1) Perform a binary search on the set of CDX files defined in the
> WaybackCollection
> 2) Sequentially iterate over the records starting at the first found
> occurrence, applying filters along the way
> 3) Stop the process after examining maxRecords (10,000 by default)
>
> At BnF we have recently changed the way we merge CDX files so that each
> collection has a single CDX, whenever possible, or at least as few CDX
> files as possible. This allowed us to raise maxRecords to 100,000 with
> not-stellar-but-acceptable search times.
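[Editor's note: the three-step lookup quoted above can be sketched in a few lines of Python. This is only an illustration of the idea, not Wayback's actual (Java) implementation; the function name, the in-memory sorted list standing in for the on-disk CDX files, and the filter callables are all hypothetical.]

```python
import bisect

def cdx_lookup(cdx_lines, url_key, filters=(), max_records=10000):
    """Illustrative sketch of the CDX lookup (not Wayback's code):
    binary-search to the first record for url_key, then scan forward,
    applying filters, stopping after examining max_records lines."""
    # 1) binary search for the first line >= the searched SURT key
    start = bisect.bisect_left(cdx_lines, url_key)
    results = []
    examined = 0
    # 2) sequential iteration from the first found occurrence
    for line in cdx_lines[start:]:
        if not line.startswith(url_key):
            break  # past the block of records for this key
        examined += 1
        # 3) stop after examining max_records lines
        if examined > max_records:
            break
        if all(f(line) for f in filters):
            results.append(line)
    return results
```

Raising maxRecords simply widens the budget in step 3, which is why fewer, larger merged CDX files (as at BnF) keep the scan sequential and the cost acceptable.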
> However, we do also have this problem with sites that are captured on
> a daily basis, and that is one of the motivations behind trying to use
> a search engine framework like SOLR or ElasticSearch to index
> individual CDX records.
>
> Best regards,
>
> Nicolas Giraud
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
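[Editor's note: the collapse=timestamp:10 behavior Ilya describes, i.e. dropping records whose first 10 timestamp digits (YYYYMMDDHH) duplicate the previous surviving record, can be reproduced client-side. A minimal sketch, assuming records arrive as (timestamp, url) pairs sorted by timestamp; the function name is illustrative, not part of the CDX server:]

```python
def collapse_by_timestamp(records, digits=10):
    """Keep at most one record per distinct timestamp prefix.
    With digits=10 (YYYYMMDDHH) that means at most one capture
    per hour, mirroring collapse=timestamp:10 on the server."""
    out = []
    last_prefix = None
    for ts, url in records:
        prefix = ts[:digits]
        if prefix != last_prefix:  # first record seen for this hour
            out.append((ts, url))
            last_prefix = prefix
    return out

# Server-side equivalent, per the thread:
#   http://web.archive.org/cdx/search/cdx?url=google.com&collapse=timestamp:10
```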