From: Kristinn S. <kri...@la...> - 2013-06-07 11:03:39
I believe a fundamental mistake is being made in how duplicates are treated by these tools. We should not treat URI and URI-agnostic duplicates any differently: both should have the URI and date of the original payload recorded in the warc/revisit record. This simplifies WARC reading because (ignoring legacy WARCs) the information necessary to replay the content is always available in the revisit record itself. The WARC writer also doesn't need to handle more complex cases; a record either is a revisit record or it is not. Additionally, this makes it easier to build deduplication indexes (to be consulted at crawl time), since all the necessary information is in the WARCs and you don't need to dereference the revisit records you encounter there.

As for WARC-Refers-To-Filename and WARC-Refers-To-File-Offset: these should not be used. It was very clear that many organizations do not view their WARC files as immutable objects. Any support for them should be implemented as optional and default to off.

- Kris

-------------------------------------------------------------------------
Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
Sími/Tel: +354 5255600 | www.landsbokasafn.is
-------------------------------------------------------------------------
fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is

> -----Original Message-----
> From: Noah Levitt [mailto:nl...@ar...]
> Sent: 6. júní 2013 19:23
> To: Ilya Kreymer
> Cc: arc...@li...
> Subject: Re: [Archive-access-discuss] indexing best practices Wayback 1.x.x Indexers
>
> Re WARC-Refers-To-Filename and WARC-Refers-To-File-Offset: as some of you know, there is a proposed spec, https://docs.google.com/document/d/1QyQBA7Ykgxie75V8Jziz_O7hbhwf7PF6_u9O6w6zgp0 , that discourages these two fields and instead "strongly recommends" WARC-Refers-To-Target-URI and WARC-Refers-To-Date.
>
> URL-agnostic revisit records written by Heritrix currently contain all four of those headers.
>
> The Wayback implementation does support replay using WARC-Refers-To-Target-URI + WARC-Refers-To-Date, but that code path hasn't been exercised much at IA yet.
>
> Of course we plan to update Heritrix and Wayback to be more in line with the new spec soon, perhaps dropping support for WARC-Refers-To-Filename + WARC-Refers-To-File-Offset. If someone else out there wants to work on that, that would also be welcome.
>
> (There seems to be a feeling that IA has too much control over the code. Should more people have commit privs, maybe? And/or maybe the repos under https://github.com/iipc should be canonical?)
>
> Noah
>
> On Thu, Jun 6, 2013 at 11:53 AM, Ilya Kreymer <il...@ar...> wrote:
>
> Hi,
>
> I wanted to clear up some confusion about how the revisit system works.
>
> When Wayback reads the CDX records for a given URL, it stores them by their digest hash in a cache (a map) for that request. If/when a record of "warc/revisit" type is encountered, Wayback looks up the digest in this map and resolves the lookup to the original. If the original cannot be found for that revisit digest, Wayback displays an error.
>
> The traditional implementation, going back several versions, was to play back the original WARC headers and content from the original. We realized that this was incorrect, because the digest only accounts for the response body and not the headers.
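To make that last point concrete: the payload digest (Heritrix's WARC-Payload-Digest, a base32-encoded SHA-1) is computed over the HTTP response body only, so two captures whose headers differ but whose bodies are byte-identical share the same digest. A minimal sketch of that calculation, with made-up capture bytes (this is not the Heritrix or Wayback code, just an illustration of the idea):

    import base64
    import hashlib

    def payload_digest(http_response: bytes) -> str:
        """Digest over the payload only: everything after the blank line
        that terminates the HTTP response headers."""
        header_end = http_response.find(b"\r\n\r\n")
        body = http_response[header_end + 4:] if header_end != -1 else http_response
        return "sha1:" + base64.b32encode(hashlib.sha1(body).digest()).decode("ascii")

    # Two hypothetical captures: different Date headers, identical body.
    capture_a = b"HTTP/1.1 200 OK\r\nDate: Mon, 03 Jun 2013 11:39:40 GMT\r\n\r\n<html>unchanged</html>"
    capture_b = b"HTTP/1.1 200 OK\r\nDate: Thu, 06 Jun 2013 19:23:00 GMT\r\n\r\n<html>unchanged</html>"

    assert payload_digest(capture_a) == payload_digest(capture_b)

Because the digest says nothing about the headers, a replay tool cannot simply reuse the original capture's headers; it needs the revisit capture's own headers plus the original's body, which is exactly the two-WARC read described next.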
> Since the WARC that produced the revisit record still has the latest captured headers, Wayback replays the headers from the latest capture with the content from the original, again because the digest guarantees only that the body is the same, not the headers. Thus, to handle a revisit record, Wayback reads from two WARCs: the one with the revisit record and the one with the original.
>
> Finally, we've recently added support for the url-agnostic features that were added to Heritrix, which support looking up the original based on annotations found in the WARC, such as WARC-Refers-To-Filename and WARC-Refers-To-File-Offset (https://webarchive.jira.com/browse/HER-2022). This allows Wayback to resolve the revisit against a CDX record from a different URL by pointing to the WARC name and offset directly. This feature is still somewhat experimental and is not yet in wide use.
>
> I hope this clears things up a bit; if not, feel free to respond and we'll try to elaborate further, as this is a potentially confusing area.
>
> Thanks,
> Ilya
> Internet Archive Engineer
>
> On 06/06/2013 09:24 AM, Kristinn Sigurðsson wrote:
>
> A question on the indexing of de-duplicated records ... are they of any use as Wayback is currently implemented?
>
> The warc/revisit record in the CDX file will point at the WARC that contains that revisit record. That record does not give any indication as to where the actual payload is found. That can only be inferred as: same URL, earliest date prior to this. An inference that may or may not be accurate.
>
> The crawl logs I have contain a bit more detail, and I was planning on mining them to generate 'deduplication' cdx files that would augment the ones generated from WARCs and ARCs (especially necessary for the ARCs, as they have no record of the duplicates).
>
> It seems to me that, for deduplicated content, CDX files really need to contain two file+offset values: one for the payload and another (optional one!) for the warc/revisit record.
>
> Or maybe I've completely missed something.
>
> - Kris
>
> -----Original Message-----
> From: Jackson, Andrew [mailto:And...@bl...]
> Sent: 6. júní 2013 15:17
> To: Jones, Gina; archive-access-di...@li...
> Subject: Re: [Archive-access-discuss] indexing best practices Wayback 1.x.x Indexers
>
> The latest versions of Wayback still seem to have major problems. The 1.7.1-SNAPSHOT line appears to ignore de-duplication records, although this is confused by the fact that H3/Wayback has recently been changed so that de-duplication records are not empty, but rather contain the headers of the response (in case only the payload of the resource itself was unchanged). However, recent Wayback versions *require* this header, which breaks playback in older (but WARC-spec compliant) WARC files with empty de-duplication records.
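For reference, an abridged URL-agnostic revisit record of the kind being discussed might look roughly like this (all values are hypothetical): the proposed WARC-Refers-To-Target-URI and WARC-Refers-To-Date headers identify the original capture, and the block after the WARC headers carries the revisit's own HTTP response headers, i.e. the non-empty form that newer Wayback expects.

    WARC/1.0
    WARC-Type: revisit
    WARC-Target-URI: http://example.com/logo.png
    WARC-Date: 2013-06-06T19:23:00Z
    WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
    WARC-Payload-Digest: sha1:LQ4RUW5RZFLTO4IRJTLF7SIABL2DWZJ4
    WARC-Refers-To-Target-URI: http://cdn.example.com/logo.png
    WARC-Refers-To-Date: 2013-05-01T12:00:00Z
    Content-Type: application/http; msgtype=response
    Content-Length: 79

    HTTP/1.1 200 OK
    Date: Thu, 06 Jun 2013 19:23:00 GMT
    Content-Type: image/png

An older-style revisit record carries similar WARC headers but an empty block (Content-Length: 0), which is the WARC-spec-compliant shape that the recent Wayback versions mentioned above reportedly fail to play back.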
> This appears to be the same in the 1.8.0-SNAPSHOT line, but other regressions mean I can't use that version (it has started refusing to accept as valid some particular compressed WARC files that the 1.7.1-SNAPSHOT line copes with just fine).
>
> Best wishes,
> Andy Jackson
>
> -----Original Message-----
> From: Jones, Gina [mailto:gj...@lo...]
> Sent: 04 June 2013 19:27
> To: archive-access-di...@li...
> Subject: [Archive-access-discuss] indexing best practices Wayback 1.x.x Indexers
>
> We have not found issues here at the Library as our collection has gotten bigger. In the past, we have had separate access points for each "collection", but we are in the process of combining our content into one access point for a more cohesive collection.
>
> However, we have found challenges in indexing and combining those indexes, specifically due to deduplicated content. We have content beginning in 2009 that has been deduplicated using the WARC/revisit field.
>
> This is what we think we have figured out. If anyone has any other information on these indexers, we would love to know about it. We posted a question to the listserv about 2 years ago and didn't get any comments back.
>
> Wayback 1.4.x indexers
>
> - The Wayback 1.4.2 indexer produces "warc/revisit" fields in the file content index that Wayback 1.4.2 cannot process and display.
> - When we re-indexed the same content with the Wayback 1.4.0 indexer, Wayback was able to handle the revisit entries. Since the "warc/revisit" field didn't exist at the time Wayback 1.4.0 was released, we suppose that Wayback 1.4.0 responds to those entries as it would to any date instance link where content is missing: by redirecting to the next most temporally proximate capture.
> - Wayback 1.6.0 can handle file content indexes with "warc/revisit" fields, as well as the older 1.4.0 file content indexes.
> - We have been unable to get the Wayback 1.6.0 indexer to run on an AIX server.
> - The Wayback 1.6.0 indexer writes an alpha key code to the top line of the file content index. If you are merging indexes and re-sorting manually, be sure to remove that line after the index is generated.
>
> Combining CDXs from multiple indexers
>
> - As for the issue of combining the indexes, it has to do with the number of fields that 1.4.0/1.4.2 and 1.6.x generate. The older version generates a different version of the index, with a different subset of fields.
> - Wayback 1.6.0 can handle both indexes, so it doesn't matter which of the two your content was indexed with. However, if you plan to combine the indexes into one big index, they need to match.
> - The specific problem we had was with sections of an ongoing crawl: 2009 content was indexed with 1.4.x, but 2009+2010 content was indexed with 1.6.x, so if we merge and sort, we get the 2009 entries twice, because they do not match exactly (different number of fields).
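To illustrate the shape mismatch with two hypothetical lines (field letters as defined in the CDX legend referenced just below), the same 2009 capture indexed by the two generations might come out as:

    com,example)/page 20090615083012 example.com text/html 200 LQ4RUW5RZFLTO4IRJTLF7SIABL2DWZJ4 - 1234 crawl-2009.warc.gz
    com,example)/page 20090615083012 http://example.com/page text/html 200 LQ4RUW5RZFLTO4IRJTLF7SIABL2DWZJ4 - - 1234 crawl-2009.warc.gz

A plain textual merge and sort (sort -u, for example) sees these as two distinct rows, so the same capture is listed twice; the older index has to be regenerated (or rewritten) into the newer field layout before the files can safely be combined.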
> - The field configurations for the two versions (as we have them) are:
>
>   1.4.2: CDX N b h m s k r V g
>   1.6.1: CDX N b a m s k r M V g
>
> For definitions of the fields, here is an old reference: http://archive.org/web/researcher/cdx_legend.php
>
> Gina Jones
> Ignacio Garcia del Campo
> Laura Graham
>
> -----Original Message-----
> From: archive-access-discuss-re...@li...
> Sent: Tuesday, June 04, 2013 8:03 AM
> To: archive-access-di...@li...
> Subject: Archive-access-discuss Digest, Vol 78, Issue 2
>
> Today's Topics:
>
>   1. Best practices for indexing a growing 2+ billion document collection (Kristinn Sigurðsson)
>   2. Re: Best practices for indexing a growing 2+ billion document collection (Erik Hetzner)
>   3. Re: Best practices for indexing a growing 2+ billion document collection (Colin Rosenthal)
>
> Message: 1
> Date: Mon, 3 Jun 2013 11:39:40 +0000
> From: Kristinn Sigurðsson <kri...@la...>
> Subject: [Archive-access-discuss] Best practices for indexing a growing 2+ billion document collection
>
> Dear all,
>
> We are planning on updating our Wayback installation, and I would like to poll your collective wisdom on the best approach for managing the Wayback index.
>
> Currently, our collection is about 2.2 billion items. It is also growing at a rate of approximately 350-400 million records per year.
>
> The obvious approach would be to use a sorted CDX file (or files) as the index. I am, however, concerned about its performance at this scale. Additionally, updating a CDX-based index can be troublesome, especially as we would like to update it continuously as new material is ingested.
>
> Any relevant experience and advice you could share on this topic would be greatly appreciated.
>
> Best regards,
> Mr. Kristinn Sigurðsson
> Head of IT
> National and University Library of Iceland
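For context on how a sorted CDX file serves as an index at all: a lookup is essentially a binary search over the sorted file for the canonicalized URL key, so read cost grows only logarithmically with collection size, and the operational cost is in keeping the file(s) sorted as new material arrives, which is what the replies below address. A rough sketch of such a lookup (hypothetical file and key; not the actual Wayback code):

    import os

    def cdx_lookup(path, key_prefix, block_size=8192):
        """Collect lines starting with key_prefix (e.g. b'com,example)/page 2009')
        from a CDX file sorted in byte order: binary-search over fixed-size
        blocks for a safe starting offset, then scan forward."""
        matches = []
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            lo, hi = 0, size // block_size
            while hi - lo > 1:
                mid = (lo + hi) // 2
                f.seek(mid * block_size)
                f.readline()                    # skip the partial line at the block edge
                line = f.readline()
                if line and line < key_prefix:
                    lo = mid                    # first match cannot lie before this block
                else:
                    hi = mid
            f.seek(lo * block_size)
            if lo > 0:
                f.readline()                    # align to the next full line
            for line in f:
                if line.startswith(key_prefix):
                    matches.append(line.rstrip(b"\n").decode("utf-8", "replace"))
                elif line > key_prefix:         # sorted file, so we are past the key range
                    break
        return matches

Even with a couple of billion lines this is only a few dozen seeks per query, so raw lookup performance tends to be less of a problem than the sorting and continuous-update workflow.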
> ------------------------------
>
> Message: 2
> Date: Mon, 03 Jun 2013 11:49:04 -0700
> From: Erik Hetzner <eri...@uc...>
> Subject: Re: [Archive-access-discuss] Best practices for indexing a growing 2+ billion document collection
>
> At Mon, 3 Jun 2013 11:39:40 +0000, Kristinn Sigurðsson wrote:
> > We are planning on updating our Wayback installation and I would like to poll your collective wisdom on the best approach for managing the Wayback index. [...]
>
> Hi Kristinn,
>
> We use 4 different CDX files. One is updated every ten minutes, one hourly, one daily, and one monthly. We use the unix sort command to sort. This has worked pretty well for us. We aren't doing it in the most efficient manner, and we will probably switch to sorting with hadoop at some point, but it works pretty well.
>
> best, Erik
>
> Sent from my free software system <http://fsf.org/>.
>
> ------------------------------
>
> Message: 3
> Date: Tue, 4 Jun 2013 12:17:18 +0200
> From: Colin Rosenthal <cs...@st...>
> Subject: Re: [Archive-access-discuss] Best practices for indexing a growing 2+ billion document collection
>
> On 06/03/2013 08:49 PM, Erik Hetzner wrote:
> > [...]
> Hi Kristinn,
>
> Our strategy for building cdx indexes is described at https://sbforge.org/display/NASDOC321/Wayback+Configuration#WaybackConfiguration-AggregatorApplication .
>
> Essentially, we have multiple threads creating unsorted cdx files for all new arc/warc files in the archive. These are then sorted and merged into an intermediate index file. When the intermediate file grows larger than 100MB, it is merged with the current main index file, and when that grows larger than 50GB we roll over to a new main index file. We currently have about 5TB of cdx index in total. This includes 16 older cdx files of 150GB-300GB each, built by hand-rolled scripts before we had a functional automatic indexing workflow.
>
> We would be fascinated to hear if anyone is using an entirely different strategy (e.g. bdb) for a large archive.
>
> One of our big issues at the moment is QA of our cdx files. How can we be sure that our indexes actually cover all the files and records in the archive?
>
> Colin Rosenthal
> IT-Developer
> Netarkivet, Denmark
>
> End of Archive-access-discuss Digest, Vol 78, Issue 2
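Both of these replies describe variants of the same pattern: keep the live index as a small number of sorted CDX files (tiers), sort newly generated lines as they arrive, and periodically fold the small tiers into the larger ones, taking care that every tier is sorted with the same byte-order collation (e.g. the behaviour of unix sort under LC_ALL=C) so that merges stay consistent. A rough, generic sketch of the merge step, with made-up file names (this is not Netarkivet's or anyone else's actual tooling):

    import heapq
    import os

    def merge_sorted_cdx(tier_paths, out_path):
        """Stream-merge several byte-order-sorted CDX files into one sorted file.
        heapq.merge keeps memory use small even for very large tiers.
        Assumes every line, including the last, ends with a newline."""
        files = [open(p, "rb") for p in tier_paths]
        try:
            with open(out_path, "wb") as out:
                for line in heapq.merge(*files):
                    out.write(line)
        finally:
            for f in files:
                f.close()

    def roll_up(intermediate, main, threshold=100 * 1024 * 1024):
        """Fold the intermediate tier into the main index once it passes
        roughly 100MB, mirroring the rollover rule described above."""
        if os.path.getsize(intermediate) >= threshold:
            merge_sorted_cdx([intermediate, main], main + ".new")
            os.replace(main + ".new", main)
            open(intermediate, "wb").close()    # truncate the intermediate tier

As Erik's setup shows, Wayback can consult several such files at once, so the tiers do not have to be collapsed into a single file on every update. And one way to approach the QA question Colin raises is to compare, per WARC file, the number of records in the WARC against the number of CDX lines that name that file.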