Re: [Archive-access-discuss] indexing best practices Wayback 1.x.xIndexers

The latest versions of Wayback still seem to have major problems. The
1.7.1-SNAPSHOT line appears to ignore de-duplication records. The picture
is complicated by a recent change in H3/Wayback: de-duplication records
are no longer empty, but instead contain the headers of the response (to
cover the case where only the payload of the resource itself was
unchanged). However, recent Wayback versions *require* these headers,
which breaks playback of older (but WARC-spec compliant) WARC files with
empty de-duplication records.
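
To make the two record styles concrete, here is a rough Python sketch
that tells them apart. It assumes you already have the content block of
a revisit record as bytes (reading it out of the WARC is not shown), and
the function name classify_revisit_block is made up for illustration; it
is not part of Wayback or Heritrix.

```python
def classify_revisit_block(block: bytes) -> str:
    """Classify the content block of a WARC revisit record.

    "empty"        -> older style, no content block at all
    "http-headers" -> newer style, block starts with an HTTP status line
    "other"        -> anything else
    """
    stripped = block.strip()
    if not stripped:
        return "empty"
    if stripped.startswith(b"HTTP/"):
        return "http-headers"
    return "other"
```

A player that wants to cope with both styles would have to treat
"empty" as valid rather than as an error.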

This appears to be the same in the 1.8.0-SNAPSHOT line, but other
regressions mean I can't use that version (it now rejects as invalid
some compressed WARC files that the 1.7.1-SNAPSHOT line copes with just
fine).

Best wishes,
Andy Jackson

> -----Original Message-----
> From: Jones, Gina [mailto:gj...@lo...]
> Sent: 04 June 2013 19:27
> To: arc...@li...
> Subject: [Archive-access-discuss] indexing best practices Wayback
> 1.x.xIndexers
> 
> We have not found issues here at the Library as our collection has
> gotten bigger. In the past, we have had separate access points to each
> "collection", but we are in the process of combining our content into
> one access point for a more cohesive collection.
> 
> However, we have found challenges in indexing and combining those
> indexes, specifically due to deduplicated content. We have content
> beginning in 2009 that has been deduplicated using the WARC/revisit
> field.
> 
> This is what we think we have figured out. If anyone has any other
> information on these indexers, we would love to know about it. We
> posted a question to the listserv about 2 years ago and didn't get any
> comments back:
> 
> Wayback 1.4.x Indexers
> -The Wayback 1.4.2 indexer produces "warc/revisit" fields in the file
> content index that Wayback 1.4.2 cannot process and display.
>
> -When we re-indexed the same content with the Wayback 1.4.0 indexer,
> Wayback was able to handle the revisit entries. Since the
> "warc/revisit" field didn't exist at the time that Wayback 1.4.0 was
> released, we suppose that Wayback 1.4.0 responds to those entries as it
> would to any date instance link where content was missing - by
> redirecting to the next most temporally-proximate capture.
>
> -Wayback 1.6.0 can handle file content indexes with "warc/revisit"
> fields, as well as the older 1.4.0 file content indexes.
>
> -We have been unable to get the Wayback 1.6.0 indexer to run on an AIX
> server.
>
> -The Wayback 1.6.0 indexer writes an alpha key code to the top line of
> the file content index. If you are merging indexes and resorting
> manually, be sure to remove that line after the index is generated.
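
As an illustration of the last point, here is a small Python sketch that
merges index files which are each already sorted, dropping the key line
as it goes. It assumes the key line looks like " CDX N b a m s k r M V g"
and that the inputs were sorted in plain byte order (e.g. with LC_ALL=C);
the function names are hypothetical.

```python
import heapq
from typing import Iterable, Iterator


def is_cdx_header(line: str) -> bool:
    # The key line starts with "CDX", usually preceded by a single space.
    return line.lstrip().startswith("CDX ")


def _body_lines(source: Iterable[str]) -> Iterator[str]:
    # Yield every line except the key line, without the trailing newline.
    for line in source:
        if not is_cdx_header(line):
            yield line.rstrip("\n")


def merge_cdx(*sources: Iterable[str]) -> Iterator[str]:
    """Merge already-sorted CDX line streams into one sorted stream."""
    return heapq.merge(*(_body_lines(s) for s in sources))
```

Because heapq.merge is lazy, the sources can be open file objects and
the result can be written out line by line without loading everything
into memory.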
> 
> Combining cdx's from multiple indexers
>
> -As for the issue on combining the indexes, it has to do with the
> number of fields that 1.4.0 / 1.4.2 and 1.6.X generate. The older
> version generates a different version of the index, with a different
> subset of fields.
>
> -Wayback 1.6.0 can handle both indexes, so it doesn't matter if you
> have your content indexed with either of the two. However, if you plan
> to combine the indexes into one big index, they need to match.
>
> -The specific problem we had was with sections of an ongoing crawl.
> 2009 content was indexed with 1.4.X, but 2009+2010 content was indexed
> with 1.6.X, so if we merge and sort, we would get the 2009 entries
> twice, because they do not match exactly (different number of fields).
> 
> -The field configurations for the two versions (as we have them) are:
> 
> 1.4.2: CDX N b h m s k r V g
> 1.6.1: CDX N b a m s k r M V g
> 
> For definitions of the fields here is an old reference:
> http://archive.org/web/researcher/cdx_legend.php
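
One way to spot the doubled entries is to compare captures on the fields
that both layouts share, rather than on the whole line. The Python
sketch below does that. Treating N (massaged url), b (date), g (file
name) and V (compressed offset) together as the identity of a capture is
an assumption, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple

# Fields present in both layouts; using these four as the identity of a
# capture is an assumption made for this sketch.
KEY_FIELDS = ("N", "b", "g", "V")


def parse_layout(header: str) -> List[str]:
    """Turn ' CDX N b h m s k r V g' into ['N', 'b', 'h', ...]."""
    parts = header.split()
    if not parts or parts[0] != "CDX":
        raise ValueError(f"not a CDX header: {header!r}")
    return parts[1:]


def capture_key(line: str, layout: List[str]) -> Tuple[str, ...]:
    """Extract the shared identity fields from one CDX line."""
    values = line.split()
    if len(values) != len(layout):
        raise ValueError(f"expected {len(layout)} fields: {line!r}")
    record = dict(zip(layout, values))
    return tuple(record[name] for name in KEY_FIELDS)


def old_only_lines(old_header: str, old_lines: Iterable[str],
                   new_header: str, new_lines: Iterable[str]) -> List[str]:
    """Return old-layout lines whose capture is absent from the new index."""
    old_layout = parse_layout(old_header)
    new_layout = parse_layout(new_header)
    seen = {capture_key(line, new_layout) for line in new_lines}
    return [ln for ln in old_lines if capture_key(ln, old_layout) not in seen]
```

If old_only_lines comes back empty, the older index is fully covered by
the newer one and can simply be left out of the merge. Anything it does
return would still need re-indexing with the newer indexer so that the
combined index has a single layout.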
> 
> 
> Gina Jones
> Ignacio Garcia del Campo
> Laura Graham
> 
> 
> -----Original Message-----
> From: arc...@li... [mailto:archive-acc...@li...]
> Sent: Tuesday, June 04, 2013 8:03 AM
> To: arc...@li...
> Subject: Archive-access-discuss Digest, Vol 78, Issue 2
> 
> Today's Topics:
> 
>    1. Best practices for indexing a growing 2+ billion document
>       collection (Kristinn Sigurðsson)
>    2. Re: Best practices for indexing a growing 2+ billion document
>       collection (Erik Hetzner)
>    3. Re: Best practices for indexing a growing 2+ billion document
>       collection (Colin Rosenthal)
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Mon, 3 Jun 2013 11:39:40 +0000
> From: Kristinn Sigurðsson <kri...@la...>
> Subject: [Archive-access-discuss] Best practices for indexing a
> 	growing 2+ billion document collection
> To: "arc...@li..."
> 	<arc...@li...>
> Message-ID:
> 	<E48...@bl...>
> Content-Type: text/plain; charset="utf-8"
> 
> Dear all,
> 
> We are planning on updating our Wayback installation and I would like
> to poll your collective wisdom on the best approach for managing the
> Wayback index.
>
> Currently, our collection is about 2.2 billion items. It is also
> growing at a rate of approximately 350-400 million records per year.
>
> The obvious approach would be to use a sorted CDX file (or files) as
> the index. I am, however, concerned about its performance at this
> scale. Additionally, updating a CDX-based index can be troublesome,
> especially as we would like to update it continuously as new material
> is ingested.
>
> Any relevant experience and advice you could share on this topic would
> be greatly appreciated.
> 
> 
> Best regards,
> Mr. Kristinn Sigurðsson
> Head of IT
> National and University Library of Iceland
> 
> ------------------------------
> 
> Message: 2
> Date: Mon, 03 Jun 2013 11:49:04 -0700
> From: Erik Hetzner <eri...@uc...>
> Subject: Re: [Archive-access-discuss] Best practices for indexing a
> 	growing 2+ billion document collection
> To: Kristinn Sigurðsson <kri...@la...>
> Cc: "arc...@li..."
> 	<arc...@li...>
> Message-ID: <201...@ma...>
> Content-Type: text/plain; charset="utf-8"
> 
> Hi Kristinn,
> 
> We use 4 different CDX files. One is updated every ten minutes, one
> hourly, one daily, and one monthly. We use the unix sort command to
> sort. This has worked pretty well for us. We aren't doing it in the
> most efficient manner, and we will probably switch to sorting with
> hadoop at some point, but it works pretty well.
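
A tiered setup like this means a lookup has to consult every tier and
combine the hits. The Python sketch below shows the idea using in-memory
sorted lists. A real deployment would binary-search the files on disk
instead, and the function names are hypothetical.

```python
import bisect
import heapq
from typing import List, Sequence


def lookup_tier(lines: Sequence[str], urlkey: str) -> List[str]:
    """Return all lines in one sorted tier whose first field equals urlkey."""
    prefix = urlkey + " "
    # Binary search for the first line that could start with the prefix.
    index = bisect.bisect_left(lines, prefix)
    hits = []
    while index < len(lines) and lines[index].startswith(prefix):
        hits.append(lines[index])
        index += 1
    return hits


def lookup(tiers: Sequence[Sequence[str]], urlkey: str) -> List[str]:
    """Query every tier and merge the (already sorted) hits."""
    return list(heapq.merge(*(lookup_tier(t, urlkey) for t in tiers)))
```

Each tier costs one binary search plus a short scan, so four tiers
remain cheap to query as long as each one is kept sorted.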
> 
> best, Erik
> 
> ------------------------------
> 
> Message: 3
> Date: Tue, 4 Jun 2013 12:17:18 +0200
> From: Colin Rosenthal <cs...@st...>
> Subject: Re: [Archive-access-discuss] Best practices for indexing a
> 	growing 2+ billion document collection
> To: arc...@li...
> Message-ID: <51A...@st...>
> Content-Type: text/plain; charset="UTF-8"; format=flowed
> 
> Hi Kristinn,
> 
> Our strategy for building cdx indexes is described at
> https://sbforge.org/display/NASDOC321/Wayback+Configuration#WaybackConfiguration-AggregatorApplication
> 
> Essentially we have multiple threads creating unsorted cdx files for
> all new arc/warc files in the archive. These are then sorted and merged
> into an intermediate index file. When the intermediate file grows
> larger than 100MB, it is merged with the current main index file, and
> when that grows larger than 50GB we roll over to a new main index file.
> We currently have about 5TB of cdx index in total. This includes 16
> older cdx files of size 150GB-300GB, built by handrolled scripts before
> we had a functional automatic indexing workflow.
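
The threshold logic described above can be sketched in a few lines of
Python. This version only tracks sizes and reports which actions a new
batch would trigger; it does no file I/O. The names and the choice of
1024-based MB/GB are assumptions for illustration, not the actual
NetarchiveSuite implementation.

```python
from dataclasses import dataclass, field
from typing import List

MB = 1024 ** 2
GB = 1024 ** 3


@dataclass
class IndexState:
    intermediate_bytes: int = 0
    main_bytes: int = 0
    closed_main_sizes: List[int] = field(default_factory=list)


def add_batch(state: IndexState, batch_bytes: int,
              merge_at: int = 100 * MB,
              rollover_at: int = 50 * GB) -> List[str]:
    """Account for one sorted batch and return the actions it triggers."""
    actions: List[str] = []
    state.intermediate_bytes += batch_bytes
    if state.intermediate_bytes > merge_at:
        # Intermediate file is too big: fold it into the main index.
        state.main_bytes += state.intermediate_bytes
        state.intermediate_bytes = 0
        actions.append("merge-intermediate-into-main")
        if state.main_bytes > rollover_at:
            # Main index is too big: close it and start a fresh one.
            state.closed_main_sizes.append(state.main_bytes)
            state.main_bytes = 0
            actions.append("rollover-main")
    return actions
```

The expensive merge into the main file only happens once per 100MB of
new index data, and closed main files are never rewritten again.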
> 
> We would be fascinated to hear if anyone is using an entirely different
> strategy (e.g. bdb) for a large archive.
> 
> One of our big issues at the moment is QA of our cdx files. How can we
> be sure that our indexes actually cover all the files and records in
> the archive?
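
A first, file-level check could look something like the Python sketch
below: collect the file names that the index refers to and compare them
with a listing of the archive. It assumes the file name is the last
field on each CDX line (true of both layouts quoted earlier in this
thread), and the function names are hypothetical. It says nothing about
whether every record inside a file is indexed.

```python
from typing import Iterable, Set


def indexed_filenames(cdx_lines: Iterable[str]) -> Set[str]:
    """Collect the file name field (assumed to be last) from CDX lines."""
    names: Set[str] = set()
    for line in cdx_lines:
        fields = line.split()
        if not fields or fields[0] == "CDX":
            continue  # skip blank lines and the key line
        names.add(fields[-1])
    return names


def unindexed_files(archive_files: Iterable[str],
                    cdx_lines: Iterable[str]) -> Set[str]:
    """Archive files that no CDX line refers to."""
    return set(archive_files) - indexed_filenames(cdx_lines)
```

Record-level coverage would need a further step, such as comparing the
number of index lines per file against a count of records in that file.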
> 
> Colin Rosenthal
> IT-Developer
> Netarkivet, Denmark
> 
> 
> 
> 
> ------------------------------
> 
> ------------------------------
> 
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
> 
> 
> End of Archive-access-discuss Digest, Vol 78, Issue 2
> *****************************************************
