From: <nic...@bn...> - 2013-07-30 16:20:34
Hi Kristinn,

The index lookup algorithm, as you probably know, boils down to:

1) Perform a binary search on the set of CDX files defined in the WaybackCollection
2) Sequentially iterate over the records starting at the first occurrence found, applying filters along the way
3) Stop the process after examining maxRecords (10,000 by default)

At BnF we have recently changed the way we merge CDX files so that a collection has a single CDX whenever possible, or at least as few CDX files as possible. This allowed us to raise maxRecords to 100,000 with not-stellar-but-acceptable search times.

However, we also have this problem with sites that are captured on a daily basis, and that is one of the motivations behind trying to use a search engine framework like SOLR or ElasticSearch to index individual CDX.

Best regards,
Nicolas Giraud
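
For illustration, the three lookup steps described above can be sketched in Python roughly as follows. This is not Wayback's actual code: the function name `lookup`, the in-memory list of lines, and the simplified record layout (URL key, timestamp, original URL, MIME type, status code, separated by spaces) are all assumptions made for the example. It also shows why heavily captured sites are a problem: the scan gives up after `max_records` records have been examined, whether or not they passed the filters.

```python
import bisect
from typing import Callable, List, Sequence


def lookup(cdx_lines: List[str],
           url_key: str,
           filters: Sequence[Callable[[str], bool]] = (),
           max_records: int = 10_000) -> List[str]:
    """Return records for url_key from a sorted list of CDX lines.

    Hypothetical sketch: cdx_lines stands in for a single sorted CDX
    file, and each line is assumed to start with the URL key followed
    by a space.
    """
    prefix = url_key + " "

    # 1) Binary search for the first line that could match the key.
    index = bisect.bisect_left(cdx_lines, prefix)

    results: List[str] = []
    examined = 0

    # 2) Iterate sequentially from that position, applying the filters.
    while index < len(cdx_lines) and cdx_lines[index].startswith(prefix):
        # 3) Stop once max_records records have been examined, even if
        #    more matching records follow.
        if examined >= max_records:
            break
        line = cdx_lines[index]
        examined += 1
        if all(keep(line) for keep in filters):
            results.append(line)
        index += 1

    return results
```

With a site captured every day, the run of lines sharing one URL key can easily exceed `max_records`, so later captures are never reached unless the limit is raised.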