From: Ilya K. <il...@ar...> - 2013-06-06 19:13:48
Hi Andy,

I totally agree with you regarding the need for additional integration tests. We have unfortunately not had the resources to devote to ensuring full stability of the snapshot distributions, but we are now focusing on creating a stable 1.8.0 release in the upcoming month(s). If you have any integration tests you would like to contribute or suggest, please let me know.

I am aware of this bug that was filed regarding url-agnostic dedup: https://webarchive.jira.com/browse/ACC-126 This is planned to be addressed before the 1.8.0 release. If there are other bug reports, feel free to file them under this JIRA. I believe the meeting in the fall is planned to better figure out how to ensure the stability of Wayback in the long term for the IIPC.

Thanks,
Ilya
Engineer, IA

On 06/06/2013 09:13 AM, Jackson, Andrew wrote:
> It's not just the indexer. The front-end logic and the coupling to H3 have all been problematic recently.
>
> We have suffered a range of problems deploying recent Wayback versions, due to unintended consequences of recent changes that break functionality that we require. As well as the de-duplication problems I mentioned in a separate email, we've also had issues with Memento access points (which don't return link-format timemaps as they should/used to) and the XML query endpoint failing under certain conditions (due to changes in URL handling/'cleaning').
>
> In my opinion, one of the critical jobs for the future Wayback OS project is to set up proper, automated integration tests that exercise all the functionality the IIPC partners need, and will therefore detect if changes to the source code have unintentionally altered critical behaviour. It is technically fairly straightforward to make an integration test that, say, indexes a few WARCs, fires up a Wayback instance, and checks the responses to some queries. It does, of course, require some investment of time and effort. However, that investment would enable future modifications to the code base to be carried out with far more confidence.
>
> I've started doing some work in this area, but would appreciate knowing if anyone else is willing to put some effort into building up the testing framework.
>
> Thanks,
> Andy
>
>> -----Original Message-----
>> From: Jones, Gina [mailto:gj...@lo...]
>> Sent: 06 June 2013 13:13
>> To: arc...@li...
>> Subject: [Archive-access-discuss] Wayback Indexer
>>
>> I believe that the wayback indexer is the weakest link to long-term access to our collections. And it isn't obvious sometimes what is going on when you index content until you actually access that content.
>>
>> One of the projects I want to do this year (or next) is to take the available indexers and index a set of content that we have (2000-now) and review the output.
>>
>> gina
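Andy's suggestion above (index a few WARCs, fire up a Wayback instance, and check the responses to some queries) can be illustrated with one check such a suite might run: verifying that a Memento TimeMap body really is link-format, since that is one of the regressions he mentions. This is a hypothetical sketch, not part of Wayback or any existing test suite; the parsing is deliberately minimal and assumes well-formed input.

```python
import re

def parse_timemap(timemap: str) -> list[dict]:
    """Parse an application/link-format TimeMap body into a list of
    {"url": ..., attr: value} dicts. Minimal sketch: assumes quoted
    attribute values and comma-separated <URI>-led entries."""
    links = []
    # Split only on commas that are followed by a new "<URI>" entry, so
    # commas inside datetime values ("Thu, 01 Jan ...") are untouched.
    for entry in re.split(r',\s*(?=<)', timemap.strip()):
        m = re.match(r'<([^>]+)>(.*)', entry)
        if not m:
            continue
        link = {"url": m.group(1)}
        for key, val in re.findall(r';\s*(\w+)="([^"]*)"', m.group(2)):
            link[key] = val
        links.append(link)
    return links

def check_timemap(timemap: str) -> bool:
    """One assertion an integration test might make: the TimeMap names
    the original resource and at least one memento, and every memento
    carries a datetime attribute."""
    links = parse_timemap(timemap)
    rels = [l.get("rel", "") for l in links]
    has_original = any("original" in r for r in rels)
    mementos = [l for l in links if "memento" in l.get("rel", "")]
    return has_original and bool(mementos) and all("datetime" in l for l in mementos)
```

In a real integration test, the TimeMap body would come from an HTTP request against a Wayback instance freshly loaded with a few known WARCs, rather than from a string literal.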
From: Ilya K. <il...@ar...> - 2013-06-06 18:54:12
Hi,

I wanted to clear up some confusion about how the revisit system works.

When Wayback reads CDX records for a given URL, it stores them by their digest hash in a cache (a map) for that request. If a record of "warc/revisit" type is encountered, Wayback will look up the digest in this map and resolve the lookup to the original. If the original cannot be found for that revisit digest, Wayback will display an error.

The traditional implementation, going back several versions, was to play back the original WARC headers and content from the original. We realized that this was incorrect, because the digest only accounts for the response body and not the headers. Since the WARC that produces the revisit record still has the latest captured headers, Wayback will replay the headers from the latest capture with the content from the original, again, since the digest guarantees only that the body is the same, not the headers. Thus, to handle the revisit record, Wayback will be reading from two WARCs: the one with the revisit record and the original.

Finally, we've recently added support for the url-agnostic features that were added to Heritrix, which support looking up the original based on annotations found in the WARC, such as WARC-Refers-To-Filename and WARC-Refers-To-File-Offset (https://webarchive.jira.com/browse/HER-2022). This allows Wayback to resolve the revisit against a CDX record from a different URL by pointing to the WARC name and offset directly. This feature is still somewhat experimental and is not yet in wide use.

I hope this clears things up a bit; if not, feel free to respond and we'll try to elaborate further, as this is a potentially confusing area.

Thanks,
Ilya
Internet Archive Engineer

On 06/06/2013 09:24 AM, Kristinn Sigurðsson wrote:
> A question on the indexing of de-duplicated records ... are they of any use as Wayback is currently implemented?
>
> The warc/revisit record in the CDX file will point at the WARC that contains that revisit record. That record does not give any indication as to where the actual payload is found. That can only be inferred as same URL, earliest date prior to this. An inference that may or may not be accurate.
>
> The crawl logs I have contain a bit more detail, and I was planning on mining them to generate 'deduplication' cdx files that would augment the ones generated from WARCs and ARCs (especially necessary for the ARCs, as they have no record of the duplicates).
>
> It seems to me that for deduplicated content, CDX files really need to contain two file+offset values. One for the payload and another (optional one!) for the warc/revisit record.
>
> Or maybe I've completely missed something.
>
> - Kris
>
> -------------------------------------------------------------------------
> Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
> Sími/Tel: +354 5255600 | www.landsbokasafn.is
> -------------------------------------------------------------------------
> fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is
>> -----Original Message-----
>> From: Jackson, Andrew [mailto:And...@bl...]
>> Sent: 6. júní 2013 15:17
>> To: Jones, Gina; arc...@li...
>> Subject: Re: [Archive-access-discuss] indexing best practices Wayback 1.x.xIndexers
>>
>> The latest versions of Wayback still seem to have major problems. The 1.7.1-SNAPSHOT line appears to ignore de-duplication records, although this is confused by the fact that H3/Wayback has recently been changed so that de-duplication records are not empty, but rather they contain the headers of the response (in case only the payload of the resource itself was unchanged). However, recent Wayback versions *require* this header, which breaks playback in older (but WARC-spec compliant) WARC files with empty de-duplication records.
>>
>> This appears to be the same in the 1.8.0-SNAPSHOT line, but other regressions mean I can't use that version (it has started refusing to accept as valid some particular compressed WARC files that the 1.7.1-SNAPSHOT line copes with just fine).
>>
>> Best wishes,
>> Andy Jackson
>>
>>> -----Original Message-----
>>> From: Jones, Gina [mailto:gj...@lo...]
>>> Sent: 04 June 2013 19:27
>>> To: arc...@li...
>>> Subject: [Archive-access-discuss] indexing best practices Wayback 1.x.xIndexers
>>>
>>> We have not found issues here at the Library as our collection has gotten bigger. In the past, we have had separate access points to each "collection" but are in the process of combining our content into one access point for a more cohesive collection.
>>>
>>> However, we have found challenges in indexing and combining those indexes, specifically due to deduplicated content. We have content beginning in 2009 that has been deduplicated using the WARC/revisit field.
>>>
>>> This is what we think we have figured out. If anyone has any other information on these indexers, we would love to know about it. We posted a question to the listserv about 2 years ago and didn't get any comments back:
>>>
>>> Wayback 1.4.x Indexers
>>>
>>> -The Wayback 1.4.2 indexer produces "warc/revisit" fields in the file content index that Wayback 1.4.2 cannot process and display.
>>>
>>> -When we re-indexed the same content with the Wayback 1.4.0 indexer, Wayback was able to handle the revisit entries. Since the "warc/revisit" field didn't exist at the time that Wayback 1.4.0 was released, we suppose that Wayback 1.4.0 responds to those entries as it would to any date instance link where content was missing - by redirecting to the next most temporally-proximate capture.
>>>
>>> -Wayback 1.6.0 can handle file content indexes with "warc/revisit" fields, as well as the older 1.4.0 file content indexes.
>>>
>>> -We have been unable to get the Wayback 1.6.0 indexer to run on an AIX server.
>>>
>>> -The Wayback 1.6.0 indexer writes an alpha key code to the top line of the file content index. If you are merging indexes and resorting manually, be sure to remove that line after the index is generated.
>>>
>>> Combining cdx's from multiple indexers
>>>
>>> -As for the issue on combining the indexes, it has to do with the number of fields that 1.4.0 / 1.4.2 and 1.6.X generate. The older version generates a different version of the index, with a different subset of fields.
>>>
>>> -Wayback 1.6.0 can handle both indexes, so it doesn't matter if you have your content indexed with either of the two. However, if you plan to combine the indexes into one big index, they need to match.
>>>
>>> -The specific problem we had was with sections of an ongoing crawl. 2009 content was indexed with 1.4.X, but 2009+2010 content was indexed with 1.6.X, so if we merge and sort, we would get the 2009 entries twice, because they do not match exactly (different number of fields).
>>>
>>> -The field configurations for the two versions (as we have them) are:
>>> 1.4.2: CDX N b h m s k r V g
>>> 1.6.1: CDX N b a m s k r M V g
>>>
>>> For definitions of the fields here is an old reference: http://archive.org/web/researcher/cdx_legend.php
>>>
>>> Gina Jones
>>> Ignacio Garcia del Campo
>>> Laura Graham
>>>
>>> -----Original Message-----
>>> From: arc...@li... [mailto:archive-acc...@li...]
>>> Sent: Tuesday, June 04, 2013 8:03 AM
>>> To: arc...@li...
>>> Subject: Archive-access-discuss Digest, Vol 78, Issue 2
>>>
>>> Send Archive-access-discuss mailing list submissions to arc...@li...
>>>
>>> To subscribe or unsubscribe via the World Wide Web, visit https://lists.sourceforge.net/lists/listinfo/archive-access-discuss or, via email, send a message with subject or body 'help' to arc...@li...
>>>
>>> You can reach the person managing the list at arc...@li...
>>>
>>> When replying, please edit your Subject line so it is more specific than "Re: Contents of Archive-access-discuss digest..."
>>>
>>> Today's Topics:
>>>
>>> 1. Best practices for indexing a growing 2+ billion document collection (Kristinn Sigurðsson)
>>> 2. Re: Best practices for indexing a growing 2+ billion document collection (Erik Hetzner)
>>> 3. Re: Best practices for indexing a growing 2+ billion document collection (Colin Rosenthal)
>>>
>>> ----------------------------------------------------------------------
>>>
>>> Message: 1
>>> Date: Mon, 3 Jun 2013 11:39:40 +0000
>>> From: Kristinn Sigurðsson <kri...@la...>
>>> Subject: [Archive-access-discuss] Best practices for indexing a growing 2+ billion document collection
>>> To: "arc...@li..." <arc...@li...>
>>> Message-ID: <E48...@bl...khlada.local>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> Dear all,
>>>
>>> We are planning on updating our Wayback installation and I would like to poll your collective wisdom on the best approach for managing the Wayback index.
>>>
>>> Currently, our collection is about 2.2 billion items. It is also growing at a rate of approximately 350-400 million records per year.
>>>
>>> The obvious approach would be to use a sorted CDX file (or files) as the index. I'm, however, concerned about its performance at this scale. Additionally, updating a CDX based index can be troublesome. Especially as we would like to update it continuously as new material is ingested.
>>>
>>> Any relevant experience and advice you could share on this topic would be greatly appreciated.
>>>
>>> Best regards,
>>> Mr. Kristinn Sigurðsson
>>> Head of IT
>>> National and University Library of Iceland
>>>
>>> --------------------------------------------------------------------------
>>> Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
>>> Sími/Tel: +354 5255600 | www.landsbokasafn.is
>>> --------------------------------------------------------------------------
>>> fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is
>>>
>>> ------------------------------
>>>
>>> Message: 2
>>> Date: Mon, 03 Jun 2013 11:49:04 -0700
>>> From: Erik Hetzner <eri...@uc...>
>>> Subject: Re: [Archive-access-discuss] Best practices for indexing a growing 2+ billion document collection
>>> To: Kristinn Sigurðsson <kri...@la...>
>>> Cc: "arc...@li..." <arc...@li...>
>>> Message-ID: <201...@ma...>
>>> Content-Type: text/plain; charset="utf-8"
>>>
>>> At Mon, 3 Jun 2013 11:39:40 +0000, Kristinn Sigurðsson wrote:
>>>> Dear all,
>>>>
>>>> We are planning on updating our Wayback installation and I would like to poll your collective wisdom on the best approach for managing the Wayback index.
>>>>
>>>> Currently, our collection is about 2.2 billion items. It is also growing at a rate of approximately 350-400 million records per year.
>>>>
>>>> The obvious approach would be to use a sorted CDX file (or files) as the index. I'm, however, concerned about its performance at this scale. Additionally, updating a CDX based index can be troublesome. Especially as we would like to update it continuously as new material is ingested.
>>>>
>>>> Any relevant experience and advice you could share on this topic would be greatly appreciated.
>>>
>>> Hi Kristinn,
>>>
>>> We use 4 different CDX files. One is updated every ten minutes, one hourly, one daily, and one monthly. We use the unix sort command to sort. This has worked pretty well for us.
>>> We aren't doing it in the most efficient manner, and we will probably switch to sorting with hadoop at some point, but it works pretty well.
>>>
>>> best, Erik
>>> -------------- next part --------------
>>> Sent from my free software system <http://fsf.org/>.
>>>
>>> ------------------------------
>>>
>>> Message: 3
>>> Date: Tue, 4 Jun 2013 12:17:18 +0200
>>> From: Colin Rosenthal <cs...@st...>
>>> Subject: Re: [Archive-access-discuss] Best practices for indexing a growing 2+ billion document collection
>>> To: arc...@li...
>>> Message-ID: <51A...@st...>
>>> Content-Type: text/plain; charset="UTF-8"; format=flowed
>>>
>>> On 06/03/2013 08:49 PM, Erik Hetzner wrote:
>>>> At Mon, 3 Jun 2013 11:39:40 +0000, Kristinn Sigurðsson wrote:
>>>>> Dear all,
>>>>>
>>>>> We are planning on updating our Wayback installation and I would like to poll your collective wisdom on the best approach for managing the Wayback index.
>>>>>
>>>>> Currently, our collection is about 2.2 billion items. It is also growing at a rate of approximately 350-400 million records per year.
>>>>>
>>>>> The obvious approach would be to use a sorted CDX file (or files) as the index. I'm, however, concerned about its performance at this scale. Additionally, updating a CDX based index can be troublesome. Especially as we would like to update it continuously as new material is ingested.
>>>>>
>>>>> Any relevant experience and advice you could share on this topic would be greatly appreciated.
>>>> Hi Kristinn,
>>>>
>>>> We use 4 different CDX files. One is updated every ten minutes, one hourly, one daily, and one monthly. We use the unix sort command to sort. This has worked pretty well for us.
>>>>
>>>> best, Erik
>>>
>>> Hi Kristinn,
>>>
>>> Our strategy for building cdx indexes is described at https://sbforge.org/display/NASDOC321/Wayback+Configuration#WaybackConfiguration-AggregatorApplication
>>>
>>> Essentially we have multiple threads creating unsorted cdx files for all new arc/warc files in the archive. These are then sorted and merged into an intermediate index file. When the intermediate file grows larger than 100MB, it is merged with the current main index file, and when that grows larger than 50GB we rollover to a new main index file. We currently have about 5TB total cdx index. This includes 16 older cdx files of size 150GB-300GB, built by handrolled scripts before we had a functional automatic indexing workflow.
>>>
>>> We would be fascinated to hear if anyone is using an entirely different strategy (e.g. bdb) for a large archive.
>>>
>>> One of our big issues at the moment is QA of our cdx files. How can we be sure that our indexes actually cover all the files and records in the archive?
>>>
>>> Colin Rosenthal
>>> IT-Developer
>>> Netarkivet, Denmark
>>>
>>> ------------------------------
>>>
>>> End of Archive-access-discuss Digest, Vol 78, Issue 2
>>> *****************************************************
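Ilya's digest-map scheme above can be sketched in a few lines. This is an illustrative model, not Wayback's actual Java implementation; the record fields ("mimetype", "digest", "warc", "offset") are a simplified stand-in for real CDX columns, and records are assumed to arrive sorted oldest-first, as they would when reading a sorted CDX for one URL.

```python
def resolve_revisits(records: list[dict]) -> list[dict]:
    """Resolve warc/revisit CDX records against earlier captures of the
    same URL. For each full capture, remember the first record seen per
    digest; for each revisit, look the digest up in that map."""
    by_digest: dict[str, dict] = {}  # digest -> first full capture seen
    resolved = []
    for rec in records:
        if rec["mimetype"] == "warc/revisit":
            original = by_digest.get(rec["digest"])
            if original is None:
                # No original for this digest: Wayback displays an error.
                resolved.append({**rec, "error": "original not found"})
                continue
            # Headers come from the revisit's own WARC (latest capture),
            # the body from the original's WARC: the digest covers only
            # the response body, so two WARCs are read for one replay.
            resolved.append({**rec,
                             "headers_from": (rec["warc"], rec["offset"]),
                             "body_from": (original["warc"], original["offset"])})
        else:
            by_digest.setdefault(rec["digest"], rec)
            resolved.append(rec)
    return resolved
```

The url-agnostic variant Ilya mentions would bypass the per-URL map entirely, reading the WARC name and offset straight from the WARC-Refers-To-Filename / WARC-Refers-To-File-Offset annotations instead.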
From: Kristinn S. <kri...@la...> - 2013-06-06 16:24:38
A question on the indexing of de-duplicated records ... are they of any use as Wayback is currently implemented?

The warc/revisit record in the CDX file will point at the WARC that contains that revisit record. That record does not give any indication as to where the actual payload is found. That can only be inferred as same URL, earliest date prior to this. An inference that may or may not be accurate.

The crawl logs I have contain a bit more detail, and I was planning on mining them to generate 'deduplication' cdx files that would augment the ones generated from WARCs and ARCs (especially necessary for the ARCs, as they have no record of the duplicates).

It seems to me that for deduplicated content, CDX files really need to contain two file+offset values. One for the payload and another (optional one!) for the warc/revisit record.

Or maybe I've completely missed something.

- Kris

-------------------------------------------------------------------------
Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
Sími/Tel: +354 5255600 | www.landsbokasafn.is
-------------------------------------------------------------------------
fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is
We aren?t doing it in > the > > > most efficient manner, and we will probably switch to sorting > with > > > hadoop at some point, but it works pretty well. > > > > > > best, Erik > > Hi Kristinn, > > > > Our strategy for building cdx indexes is described at > > > https://sbforge.org/display/NASDOC321/Wayback+Configuration#WaybackC > > onfiguration-AggregatorApplication > > . > > > > Essentially we have multiple threads creating unsorted cdx files > for > all new > > arc/warc files in the archive. These are then sorted and merged > into > an > > intermediate index file. When the intermediate file grows larger > than > 100MB, > > it is merged with the current main index file, and when that grows > larger than > > 50GB we rollover to a new main index file. We currently have about > 5TB > total > > cdx index. This includes 16 older cdx files of size 150GB-300GB, > built > by > > handrolled scripts before we had a functional automatic indexing > workflow. > > > > We would be fascinated to hear if anyone is using an entirely > different > > strategy (e.g. bdb) for a large archive. > > > > One of our big issues at the moment is QA of our cdx files. How can > we > be > > sure that our indexes actually cover all the files and records in > the > archive? > > > > Colin Rosenthal > > IT-Developer > > Netarkivet, Denmark > > > > > > > > > > ------------------------------ > > > > > --------------------------------------------------------------------- > --- > ------ > > How ServiceNow helps IT people transform IT departments: > > 1. A cloud service to automate IT design, transition and operations > 2. > > Dashboards that offer high-level views of enterprise services 3. A > single > > system of record for all IT processes > http://p.sf.net/sfu/servicenow-d2d-j > > > > ------------------------------ > > > > _______________________________________________ > > Archive-access-discuss mailing list > > Arc...@li... 
> > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > > > > > End of Archive-access-discuss Digest, Vol 78, Issue 2 > > ***************************************************** > > > > > --------------------------------------------------------------------- > --- > ------ > > How ServiceNow helps IT people transform IT departments: > > 1. A cloud service to automate IT design, transition and operations > 2. > > Dashboards that offer high-level views of enterprise services 3. A > single > > system of record for all IT processes > http://p.sf.net/sfu/servicenow-d2d-j > > _______________________________________________ > > Archive-access-discuss mailing list > > Arc...@li... > > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > ********************************************************************* > ***** > Experience the British Library online at http://www.bl.uk/ > > The British Library’s latest Annual Report and Accounts : > http://www.bl.uk/aboutus/annrep/index.html > > Help the British Library conserve the world's knowledge. Adopt a > Book. http://www.bl.uk/adoptabook > > The Library's St Pancras site is WiFi - enabled > > ********************************************************************* > **** > > The information contained in this e-mail is confidential and may be > legally privileged. It is intended for the addressee(s) only. If you > are not the intended recipient, please delete this e-mail and notify > the mailto:pos...@bl... : The contents of this e-mail must not be > disclosed or copied without the sender's consent. > > The statements and opinions expressed in this message are those of > the author and do not necessarily reflect those of the British > Library. The British Library does not take any responsibility for the > views of the author. 
> > ********************************************************************* > **** > Think before you print > > --------------------------------------------------------------------- > --------- > How ServiceNow helps IT people transform IT departments: > 1. A cloud service to automate IT design, transition and operations > 2. Dashboards that offer high-level views of enterprise services > 3. A single system of record for all IT processes > http://p.sf.net/sfu/servicenow-d2d-j > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
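The approaches described in the digest above both boil down to maintaining several independently sorted CDX files and merging them. Below is a minimal sketch of the k-way merge step, assuming hypothetical tier files along the lines of Erik's ten-minute/hourly/daily/monthly rotation (a Wayback deployment would normally just be pointed at the sorted files, but the same merge underlies any manual re-sort):

```python
import heapq

# Hypothetical tier files, each kept sorted (e.g. with LC_ALL=C sort),
# rebuilt on different schedules: ten-minute, hourly, daily, monthly.
TIERS = ["tenmin.cdx", "hourly.cdx", "daily.cdx", "monthly.cdx"]

def merged_cdx(paths):
    """Yield the lines of several sorted CDX files as one sorted stream.

    heapq.merge is lazy, so a multi-billion-line index never has to fit
    in memory: each input file is read strictly front to back.
    """
    files = [open(p, encoding="utf-8", errors="replace") for p in paths]
    try:
        for line in heapq.merge(*files):
            if line.startswith(" CDX"):  # per-file format header, if present
                continue
            yield line
    finally:
        for f in files:
            f.close()
```

From the shell, `LC_ALL=C sort -m tenmin.cdx hourly.cdx daily.cdx monthly.cdx` does the same job.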
From: Jackson, A. <And...@bl...> - 2013-06-06 16:13:25
|
It's not just the indexer. The front-end logic and the coupling to H3 have all been problematic recently. We have suffered a range of problems deploying recent Wayback versions, due to unintended consequences of recent changes that break functionality that we require. As well as the de-duplication problems I mentioned in a separate email, we've also had issues with Memento access points (which don't return link-format timemaps as they should/used to) and the XML query endpoint failing under certain conditions (due to changes in URL handling/'cleaning'). In my opinion, one of the critical jobs for the future Wayback OS project is to set up proper, automated integration tests that exercise all the functionality the IIPC partners need, and will therefore detect if changes to the source code have unintentionally altered critical behaviour. It is technically fairly straightforward to make an integration test that, say, indexes a few WARCs, fires up a Wayback instance, and checks the responses to some queries. It does, of course, require some investment of time and effort. However, that investment would enable future modifications to the code base to be carried out with far more confidence. I've started doing some work in this area, but would appreciate knowing if anyone else is willing to put some effort into building up the testing framework. Thanks, Andy > -----Original Message----- > From: Jones, Gina [mailto:gj...@lo...] > Sent: 06 June 2013 13:13 > To: arc...@li... > Subject: [Archive-access-discuss] Wayback Indexer > > I believe that the wayback indexer is the weakest link to longterm access to > our collections. And it isn't obvious sometimes what is going on when you > index content until you actually access that content. > > One of the projects I want to do this year (or next) is to take the available > indexers and index a set of content that we have (2000-now) and review the > output. 
> > gina > > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
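Once a test harness has indexed a few fixture WARCs, started a Wayback instance, and fetched a query response, the remaining piece of the integration test Andy describes is the assertion core. Here is a sketch of that core, assuming only the conventional CDX result layout (urlkey first, 14-digit timestamp second); the harness itself, and any endpoint URLs, are deliberately left out:

```python
def missing_captures(cdx_text, expected):
    """Return the expected (urlkey, timestamp) pairs absent from a CDX listing.

    cdx_text is the body of a Wayback URL-query response; expected is the
    list of captures that indexing the fixture WARCs should have produced.
    An empty return value means the behaviour under test is intact.
    """
    seen = set()
    for line in cdx_text.splitlines():
        if line.startswith(" CDX"):  # format header line, not a capture
            continue
        fields = line.split()
        if len(fields) >= 2:
            seen.add((fields[0], fields[1]))
    return [pair for pair in expected if pair not in seen]
```

Run against a live test instance, this turns each "does playback still cover WARC X" question into a plain assertion that `missing_captures(response, expected) == []`.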
From: Jackson, A. <And...@bl...> - 2013-06-06 15:17:19
|
The latest versions of Wayback still seem to have major problems. The 1.7.1-SNAPSHOT line appears to ignore de-duplication records, although this is confused by the fact that H3/Wayback has recently been changed so that de-duplication records are not empty, but rather they contain the headers of the response (in case only the payload of the resource itself was unchanged). However, recent Wayback versions *require* this header, which breaks playback in older (but WARC-spec compliant) WARC files with empty de-duplication records. This appears to be the same in the 1.8.0-SNAPSHOT line, but other regressions mean I can't use that version (it has started refusing to accept as valid some particular compressed WARC files that the 1.7.1-SNAPSHOT line copes with just fine). Best wishes, Andy Jackson > -----Original Message----- > From: Jones, Gina [mailto:gj...@lo...] > Sent: 04 June 2013 19:27 > To: arc...@li... > Subject: [Archive-access-discuss] indexing best practices Wayback > 1.x.xIndexers > > We have not found issues here at the Library as our collection has gotten > bigger. In the past, we have had separate access points to the each > "collection" but are in the process of combining our content into one access > point for a more cohesive collection. > > However, we have found challenges in indexing and combining those > indexes, specifically due to deduplicated content. We have content > beginning in 2009 that has been deduplicated using the WARC/revisit field. > > This is what we have think we have figured out. If anyone has any other > information on these indexers, we would love to know about it. We posted > a question to the listserv about 2 years ago and didn't get any comments > back: > > Wayback 1.4.x Indexers > -The Wayback 1.4.2 indexer produces "warc/revisit" fields in the file content > index that Wayback 1.4.2 cannot process and display. > > -When we re-indexed the same content with Wayback 1.4.0 indexer, > Wayback was able to handle the revisit entries. 
Since the "warc/revisit" field > didn't exist at the time that Wayback 1.4.0 was released, we suppose that > Wayback 1.4.0 responds to those entries as it would to any date instance link > where content was missing - by redirecting to the next most temporally- > proximate capture. > > -Wayback 1.6.0 can handle file content indexes with "warc/revisit" fields, as > well as the older 1.4.0 file content indexes > > -We have been unable to get Wayback 1.6.0 indexer to run on an AIX server. > > -Wayback 1.6.0 indexer writes an alpha key code to the top line of the file > content index. If you are merging indexes and resorting manually, be sure to > remove that line after the index is generated. > > Combining cdx's from multiple indexers > > -As for the issue on combining the indexes, it has to do with the number of > fields that 1.4.0 / 1.4.2 and 1.6.X generate. The older version generates a > different version of the index, with a different subset of fields. > > -Wayback 1.6.0 can handle both indexes, so it doesn't matter if you have > your content indexed with either of the two. However, if you plan to > combine the indexes into one big index, they need to match. > > -The specific problem we had was with sections of an ongoing crawl. 2009 > content was indexed with 1.4.X, but 2009+2010 content was indexed with > 1.6.X, so if we merge and sort, we would get the 2009 entries twice, because > they do not match exactly (different number of fields). > > -The field configurations for the two versions (as we have them are) > > 1.4.2: CDX N b h m s k r V g > 1.6.1: CDX N b a m s k r M V g > > For definitions of the fields here is an old reference: > http://archive.org/web/researcher/cdx_legend.php > > > Gina Jones > Ignacio Garcia del Campo > Laura Graham > > > -----Original Message----- > From: arc...@li... [mailto:archive- > acc...@li...] > Sent: Tuesday, June 04, 2013 8:03 AM > To: arc...@li... 
> Subject: Archive-access-discuss Digest, Vol 78, Issue 2 > > Send Archive-access-discuss mailing list submissions to > arc...@li... > > To subscribe or unsubscribe via the World Wide Web, visit > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > or, via email, send a message with subject or body 'help' to > arc...@li... > > You can reach the person managing the list at > arc...@li... > > When replying, please edit your Subject line so it is more specific than "Re: > Contents of Archive-access-discuss digest..." > > > Today's Topics: > > 1. Best practices for indexing a growing 2+ billion document > collection (Kristinn Sigur?sson) > 2. Re: Best practices for indexing a growing 2+ billion document > collection (Erik Hetzner) > 3. Re: Best practices for indexing a growing 2+ billion document > collection (Colin Rosenthal) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 3 Jun 2013 11:39:40 +0000 > From: Kristinn Sigur?sson <kri...@la...> > Subject: [Archive-access-discuss] Best practices for indexing a > growing 2+ billion document collection > To: "arc...@li..." > <arc...@li...> > Message-ID: > <E48...@bl...khlada.local> > Content-Type: text/plain; charset="utf-8" > > Dear all, > > We are planning on updating our Wayback installation and I would like to poll > your collective wisdom on the best approach for managing the Wayback > index. > > Currently, our collection is about 2.2 billion items. It is also growing at a rate of > approximately 350-400 million records per year. > > The obvious approach would be to use a sorted CDX file (or files) as the > index. I'm, however, concerned about its performance at this scale. > Additionally, updating a CDX based index can be troublesome. Especially as > we would like to update it continuously as new material is ingested. > > Any relevant experience and advice you could share on this topic would be > greatly appreciated. > > > Best regards, > Mr. 
Kristinn Sigurðsson > Head of IT > National and University Library of Iceland > > > > > ------------------------------------------------------------------------ - > Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík > Sími/Tel: +354 5255600 | www.landsbokasafn.is > ------------------------------------------------------------------------ - > fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is > > ------------------------------ > > Message: 2 > Date: Mon, 03 Jun 2013 11:49:04 -0700 > From: Erik Hetzner <eri...@uc...> > Subject: Re: [Archive-access-discuss] Best practices for indexing a > growing 2+ billion document collection > To: Kristinn Sigurðsson <kri...@la...> > Cc: "arc...@li..." > <arc...@li...> > Message-ID: <201...@ma...> > Content-Type: text/plain; charset="utf-8" > > At Mon, 3 Jun 2013 11:39:40 +0000, > Kristinn Sigurðsson wrote: > > > > Dear all, > > > > We are planning on updating our Wayback installation and I would like > > to poll your collective wisdom on the best approach for managing the > > Wayback index. > > > > Currently, our collection is about 2.2 billion items. It is also > > growing at a rate of approximately 350-400 million records per year. > > > > The obvious approach would be to use a sorted CDX file (or files) as > > the index. I'm, however, concerned about its performance at this > > scale. Additionally, updating a CDX based index can be troublesome. > > Especially as we would like to update it continuously as new material > > is ingested. > > > > Any relevant experience and advice you could share on this topic would > > be greatly appreciated. > > Hi Kristinn, > > We use 4 different CDX files. One is updated every ten minutes, one hourly, > one daily, and one monthly. We use the unix sort command to sort. This has > worked pretty well for us. We aren't doing it in the most efficient manner, > and we will probably switch to sorting with hadoop at some point, but it > works pretty well. 
> > best, Erik > -------------- next part -------------- > Sent from my free software system <http://fsf.org/>. > > ------------------------------ > > Message: 3 > Date: Tue, 4 Jun 2013 12:17:18 +0200 > From: Colin Rosenthal <cs...@st...> > Subject: Re: [Archive-access-discuss] Best practices for indexing a > growing 2+ billion document collection > To: arc...@li... > Message-ID: <51A...@st...> > Content-Type: text/plain; charset="UTF-8"; format=flowed > > On 06/03/2013 08:49 PM, Erik Hetzner wrote: > > At Mon, 3 Jun 2013 11:39:40 +0000, > > Kristinn Sigur?sson wrote: > >> Dear all, > >> > >> We are planning on updating our Wayback installation and I would like > >> to poll your collective wisdom on the best approach for managing the > >> Wayback index. > >> > >> Currently, our collection is about 2.2 billion items. It is also > >> growing at a rate of approximately 350-400 million records per year. > >> > >> The obvious approach would be to use a sorted CDX file (or files) as > >> the index. I'm, however, concerned about its performance at this > >> scale. Additionally, updating a CDX based index can be troublesome. > >> Especially as we would like to update it continuously as new material > >> is ingested. > >> > >> Any relevant experience and advice you could share on this topic > >> would be greatly appreciated. > > Hi Kristinn, > > > > We use 4 different CDX files. One is updated every ten minutes, one > > hourly, one daily, and one monthly. We use the unix sort command to > > sort. This has worked pretty well for us. We aren?t doing it in the > > most efficient manner, and we will probably switch to sorting with > > hadoop at some point, but it works pretty well. > > > > best, Erik > Hi Kristinn, > > Our strategy for building cdx indexes is described at > https://sbforge.org/display/NASDOC321/Wayback+Configuration#WaybackC > onfiguration-AggregatorApplication > . 
> > Essentially we have multiple threads creating unsorted cdx files for all new > arc/warc files in the archive. These are then sorted and merged into an > intermediate index file. When the intermediate file grows larger than 100MB, > it is merged with the current main index file, and when that grows larger than > 50GB we rollover to a new main index file. We currently have about 5TB total > cdx index. This includes 16 older cdx files of size 150GB-300GB, built by > handrolled scripts before we had a functional automatic indexing workflow. > > We would be fascinated to hear if anyone is using an entirely different > strategy (e.g. bdb) for a large archive. > > One of our big issues at the moment is QA of our cdx files. How can we be > sure that our indexes actually cover all the files and records in the archive? > > Colin Rosenthal > IT-Developer > Netarkivet, Denmark > > > > > ------------------------------ > > ------------------------------------------------------------------------ ------ > How ServiceNow helps IT people transform IT departments: > 1. A cloud service to automate IT design, transition and operations 2. > Dashboards that offer high-level views of enterprise services 3. A single > system of record for all IT processes http://p.sf.net/sfu/servicenow-d2d-j > > ------------------------------ > > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss > > > End of Archive-access-discuss Digest, Vol 78, Issue 2 > ***************************************************** > > ------------------------------------------------------------------------ ------ > How ServiceNow helps IT people transform IT departments: > 1. A cloud service to automate IT design, transition and operations 2. > Dashboards that offer high-level views of enterprise services 3. 
A single > system of record for all IT processes http://p.sf.net/sfu/servicenow-d2d-j > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
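The aggregation scheme quoted in the digest above (fresh per-file CDX batches folded into an intermediate index, which is folded into the main index once it passes 100 MB, with the main index rolled over at 50 GB) can be sketched as follows; paths and thresholds are illustrative, not the actual NetarchiveSuite implementation:

```python
import heapq
import os
import shutil

INTERMEDIATE_LIMIT = 100 * 1024 ** 2  # fold into the main index past 100 MB
MAIN_LIMIT = 50 * 1024 ** 3           # roll over to a new main file past 50 GB

def merge_sorted(dst, *srcs):
    """k-way merge of already-sorted CDX files into dst."""
    files = [open(s, encoding="utf-8") for s in srcs]
    try:
        with open(dst, "w", encoding="utf-8") as out:
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()

def aggregate(batch, intermediate, main_index):
    """Fold one sorted batch into the intermediate index; spill the
    intermediate into the main index when it crosses the size threshold."""
    tmp = intermediate + ".tmp"
    if os.path.exists(intermediate):
        merge_sorted(tmp, intermediate, batch)
    else:
        shutil.copy(batch, tmp)
    os.replace(tmp, intermediate)

    if os.path.getsize(intermediate) > INTERMEDIATE_LIMIT:
        tmp = main_index + ".tmp"
        if os.path.exists(main_index):
            merge_sorted(tmp, main_index, intermediate)
        else:
            shutil.copy(intermediate, tmp)
        os.replace(tmp, main_index)
        open(intermediate, "w").close()  # start a fresh intermediate
        # Rolling over to a brand-new main file past MAIN_LIMIT is omitted.
```

Writing merges to a `.tmp` file and renaming keeps a crash from leaving a half-written index in place.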
From: Jones, G. <gj...@lo...> - 2013-06-06 12:13:15
|
I believe that the Wayback indexer is the weakest link in long-term access to our collections. And it sometimes isn't obvious what is going on when you index content until you actually access that content. One of the projects I want to do this year (or next) is to take the available indexers, index a set of content that we have (2000-now), and review the output. gina |
From: Colin R. <cs...@st...> - 2013-06-06 07:18:04
|
On 06/04/2013 08:27 PM, Jones, Gina wrote: > > -Wayback 1.6.0 can handle both indexes, so it doesn't matter if you have your content indexed with either of the two. However, if you plan to combine the indexes into one big index, they need to match. > > -The specific problem we had was with sections of an ongoing crawl. 2009 content was indexed with 1.4.X, but 2009+2010 content was indexed with 1.6.X, so if we merge and sort, we would get the 2009 entries twice, because they do not match exactly (different number of fields). > > -The field configurations for the two versions (as we have them are) > > 1.4.2: CDX N b h m s k r V g > 1.6.1: CDX N b a m s k r M V g > > For definitions of the fields here is an old reference: http://archive.org/web/researcher/cdx_legend.php > Thank you, Gina, that is extremely interesting! Colin Rosenthal Netarkivet |
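Given the two field configurations quoted above, one way to merge 1.4.x and 1.6.x indexes without duplicated entries is to project both down to the columns they share before sorting. This is a sketch under the assumption that the extra columns (`h` in 1.4.2, `a` and `M` in 1.6.1) can simply be dropped rather than mapped onto each other:

```python
# Column layouts as reported by Gina:
#   1.4.2: CDX N b h m s k r V g
#   1.6.1: CDX N b a m s k r M V g
# Shared subset, in order:  N b m s k r V g
KEEP_142 = (0, 1, 3, 4, 5, 6, 7, 8)  # drop column 2 (h)
KEEP_161 = (0, 1, 3, 4, 5, 6, 8, 9)  # drop columns 2 (a) and 7 (M)

def normalize(line, keep):
    """Project one CDX data line onto the shared field subset."""
    fields = line.split()
    return " ".join(fields[i] for i in keep)

def normalized_lines(path, keep):
    """Stream a CDX file as normalized lines, skipping the 1.6.x header."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith(" CDX"):
                continue
            yield normalize(line, keep)
```

After normalizing both sets, a single `sort -u` over the combined output collapses the 2009 entries that were indexed by both versions.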
From: Jones, G. <gj...@lo...> - 2013-06-04 18:55:26
|
We have not found issues here at the Library as our collection has gotten bigger. In the past, we have had separate access points to each "collection" but are in the process of combining our content into one access point for a more cohesive collection. However, we have found challenges in indexing and combining those indexes, specifically due to deduplicated content. We have content beginning in 2009 that has been deduplicated using the WARC/revisit field. This is what we think we have figured out. If anyone has any other information on these indexers, we would love to know about it. We posted a question to the listserv about 2 years ago and didn't get any comments back: Wayback 1.4.x Indexers -The Wayback 1.4.2 indexer produces "warc/revisit" fields in the file content index that Wayback 1.4.2 cannot process and display. -When we re-indexed the same content with the Wayback 1.4.0 indexer, Wayback was able to handle the revisit entries. Since the "warc/revisit" field didn't exist at the time that Wayback 1.4.0 was released, we suppose that Wayback 1.4.0 responds to those entries as it would to any date instance link where content was missing - by redirecting to the next most temporally-proximate capture. -Wayback 1.6.0 can handle file content indexes with "warc/revisit" fields, as well as the older 1.4.0 file content indexes. -We have been unable to get the Wayback 1.6.0 indexer to run on an AIX server. -The Wayback 1.6.0 indexer writes an alpha key code to the top line of the file content index. If you are merging indexes and resorting manually, be sure to remove that line after the index is generated. Combining cdx's from multiple indexers -As for the issue on combining the indexes, it has to do with the number of fields that 1.4.0 / 1.4.2 and 1.6.X generate. The older version generates a different version of the index, with a different subset of fields. 
-Wayback 1.6.0 can handle both indexes, so it doesn't matter if you have your content indexed with either of the two. However, if you plan to combine the indexes into one big index, they need to match. -The specific problem we had was with sections of an ongoing crawl. 2009 content was indexed with 1.4.X, but 2009+2010 content was indexed with 1.6.X, so if we merge and sort, we would get the 2009 entries twice, because they do not match exactly (different number of fields). -The field configurations for the two versions (as we have them are) 1.4.2: CDX N b h m s k r V g 1.6.1: CDX N b a m s k r M V g For definitions of the fields here is an old reference: http://archive.org/web/researcher/cdx_legend.php Gina Jones Ignacio Garcia del Campo Laura Graham -----Original Message----- From: arc...@li... [mailto:arc...@li...] Sent: Tuesday, June 04, 2013 8:03 AM To: arc...@li... Subject: Archive-access-discuss Digest, Vol 78, Issue 2 Send Archive-access-discuss mailing list submissions to arc...@li... To subscribe or unsubscribe via the World Wide Web, visit https://lists.sourceforge.net/lists/listinfo/archive-access-discuss or, via email, send a message with subject or body 'help' to arc...@li... You can reach the person managing the list at arc...@li... When replying, please edit your Subject line so it is more specific than "Re: Contents of Archive-access-discuss digest..." Today's Topics: 1. Best practices for indexing a growing 2+ billion document collection (Kristinn Sigur?sson) 2. Re: Best practices for indexing a growing 2+ billion document collection (Erik Hetzner) 3. Re: Best practices for indexing a growing 2+ billion document collection (Colin Rosenthal) ---------------------------------------------------------------------- Message: 1 Date: Mon, 3 Jun 2013 11:39:40 +0000 From: Kristinn Sigur?sson <kri...@la...> Subject: [Archive-access-discuss] Best practices for indexing a growing 2+ billion document collection To: "arc...@li..." 
<arc...@li...> Message-ID: <E48...@bl...khlada.local> Content-Type: text/plain; charset="utf-8" Dear all, We are planning on updating our Wayback installation and I would like to poll your collective wisdom on the best approach for managing the Wayback index. Currently, our collection is about 2.2 billion items. It is also growing at a rate of approximately 350-400 million records per year. The obvious approach would be to use a sorted CDX file (or files) as the index. I'm, however, concerned about its performance at this scale. Additionally, updating a CDX based index can be troublesome. Especially as we would like to update it continuously as new material is ingested. Any relevant experience and advice you could share on this topic would be greatly appreciated. Best regards, Mr. Kristinn Sigurðsson Head of IT National and University Library of Iceland ------------------------------------------------------------------------- Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík Sími/Tel: +354 5255600 | www.landsbokasafn.is ------------------------------------------------------------------------- fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is ------------------------------ Message: 2 Date: Mon, 03 Jun 2013 11:49:04 -0700 From: Erik Hetzner <eri...@uc...> Subject: Re: [Archive-access-discuss] Best practices for indexing a growing 2+ billion document collection To: Kristinn Sigurðsson <kri...@la...> Cc: "arc...@li..." <arc...@li...> Message-ID: <201...@ma...> Content-Type: text/plain; charset="utf-8" At Mon, 3 Jun 2013 11:39:40 +0000, Kristinn Sigurðsson wrote: > > Dear all, > > We are planning on updating our Wayback installation and I would like > to poll your collective wisdom on the best approach for managing the > Wayback index. > > Currently, our collection is about 2.2 billion items. It is also > growing at a rate of approximately 350-400 million records per year. 
> > The obvious approach would be to use a sorted CDX file (or files) as > the index. I'm, however, concerned about its performance at this > scale. Additionally, updating a CDX based index can be troublesome. > Especially as we would like to update it continuously as new material > is ingested. > > Any relevant experience and advice you could share on this topic would > be greatly appreciated. Hi Kristinn, We use 4 different CDX files. One is updated every ten minutes, one hourly, one daily, and one monthly. We use the unix sort command to sort. This has worked pretty well for us. We aren?t doing it in the most efficient manner, and we will probably switch to sorting with hadoop at some point, but it works pretty well. best, Erik -------------- next part -------------- Sent from my free software system <http://fsf.org/>. ------------------------------ Message: 3 Date: Tue, 4 Jun 2013 12:17:18 +0200 From: Colin Rosenthal <cs...@st...> Subject: Re: [Archive-access-discuss] Best practices for indexing a growing 2+ billion document collection To: arc...@li... Message-ID: <51A...@st...> Content-Type: text/plain; charset="UTF-8"; format=flowed On 06/03/2013 08:49 PM, Erik Hetzner wrote: > At Mon, 3 Jun 2013 11:39:40 +0000, > Kristinn Sigur?sson wrote: >> Dear all, >> >> We are planning on updating our Wayback installation and I would like >> to poll your collective wisdom on the best approach for managing the >> Wayback index. >> >> Currently, our collection is about 2.2 billion items. It is also >> growing at a rate of approximately 350-400 million records per year. >> >> The obvious approach would be to use a sorted CDX file (or files) as >> the index. I'm, however, concerned about its performance at this >> scale. Additionally, updating a CDX based index can be troublesome. >> Especially as we would like to update it continuously as new material >> is ingested. >> >> Any relevant experience and advice you could share on this topic >> would be greatly appreciated. 
> Hi Kristinn, > > We use 4 different CDX files. One is updated every ten minutes, one > hourly, one daily, and one monthly. We use the unix sort command to > sort. This has worked pretty well for us. We aren?t doing it in the > most efficient manner, and we will probably switch to sorting with > hadoop at some point, but it works pretty well. > > best, Erik Hi Kristinn, Our strategy for building cdx indexes is described at https://sbforge.org/display/NASDOC321/Wayback+Configuration#WaybackConfiguration-AggregatorApplication . Essentially we have multiple threads creating unsorted cdx files for all new arc/warc files in the archive. These are then sorted and merged into an intermediate index file. When the intermediate file grows larger than 100MB, it is merged with the current main index file, and when that grows larger than 50GB we rollover to a new main index file. We currently have about 5TB total cdx index. This includes 16 older cdx files of size 150GB-300GB, built by handrolled scripts before we had a functional automatic indexing workflow. We would be fascinated to hear if anyone is using an entirely different strategy (e.g. bdb) for a large archive. One of our big issues at the moment is QA of our cdx files. How can we be sure that our indexes actually cover all the files and records in the archive? Colin Rosenthal IT-Developer Netarkivet, Denmark ------------------------------ ------------------------------------------------------------------------------ How ServiceNow helps IT people transform IT departments: 1. A cloud service to automate IT design, transition and operations 2. Dashboards that offer high-level views of enterprise services 3. A single system of record for all IT processes http://p.sf.net/sfu/servicenow-d2d-j ------------------------------ _______________________________________________ Archive-access-discuss mailing list Arc...@li... 
https://lists.sourceforge.net/lists/listinfo/archive-access-discuss End of Archive-access-discuss Digest, Vol 78, Issue 2 ***************************************************** |
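Since the mismatch described above comes down to line shape, the two index generations can be told apart by field count before any merge. A rough sketch of that check (the sample records and helper name are illustrative, not from the thread):

```python
# Sketch: distinguish 1.4.x-style (9-field) CDX lines from
# 1.6.x-style (10-field) lines, per the layouts quoted above:
#   1.4.2: CDX N b h m s k r V g    (9 fields)
#   1.6.1: CDX N b a m s k r M V g  (10 fields)
# The sample records below are made up for illustration.

def cdx_field_count(line):
    """Return the field count of a CDX data line, or None for headers/blanks."""
    line = line.strip()
    if not line or line.startswith("CDX "):
        return None  # format-declaration header, not a record
    return len(line.split(" "))

sample = [
    "CDX N b a m s k r M V g",
    "org,example)/ 20090101000000 example.org text/html 200 SHA1X - 123 f1.arc.gz",
    "org,example)/ 20100101000000 http://example.org/ text/html 200 SHA1X - - 456 f2.warc.gz",
]
counts = [cdx_field_count(l) for l in sample]
```

Routing lines on this count would let the 2009 duplicates be dropped (or regenerated) before sorting the combined index.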
From: Colin R. <cs...@st...> - 2013-06-04 10:17:28
|
On 06/03/2013 08:49 PM, Erik Hetzner wrote:
> At Mon, 3 Jun 2013 11:39:40 +0000, Kristinn Sigurðsson wrote:
>> Dear all,
>>
>> We are planning on updating our Wayback installation and I would like to poll your collective wisdom on the best approach for managing the Wayback index.
>>
>> Currently, our collection is about 2.2 billion items. It is also growing at a rate of approximately 350-400 million records per year.
>>
>> The obvious approach would be to use a sorted CDX file (or files) as the index. I'm, however, concerned about its performance at this scale. Additionally, updating a CDX based index can be troublesome, especially as we would like to update it continuously as new material is ingested.
>>
>> Any relevant experience and advice you could share on this topic would be greatly appreciated.
>
> Hi Kristinn,
>
> We use 4 different CDX files. One is updated every ten minutes, one hourly, one daily, and one monthly. We use the unix sort command to sort. This has worked pretty well for us. We aren’t doing it in the most efficient manner, and we will probably switch to sorting with hadoop at some point, but it works pretty well.
>
> best, Erik

Hi Kristinn,

Our strategy for building cdx indexes is described at https://sbforge.org/display/NASDOC321/Wayback+Configuration#WaybackConfiguration-AggregatorApplication . Essentially we have multiple threads creating unsorted cdx files for all new arc/warc files in the archive. These are then sorted and merged into an intermediate index file. When the intermediate file grows larger than 100MB, it is merged with the current main index file, and when that grows larger than 50GB we roll over to a new main index file.

We currently have about 5TB of cdx index in total. This includes 16 older cdx files of 150GB-300GB each, built by hand-rolled scripts before we had a functional automatic indexing workflow.

We would be fascinated to hear if anyone is using an entirely different strategy (e.g. bdb) for a large archive. One of our big issues at the moment is QA of our cdx files: how can we be sure that our indexes actually cover all the files and records in the archive?

Colin Rosenthal
IT-Developer
Netarkivet, Denmark |
From: Erik H. <eri...@uc...> - 2013-06-03 19:07:20
|
At Mon, 3 Jun 2013 11:39:40 +0000, Kristinn Sigurðsson wrote:
> Dear all,
>
> We are planning on updating our Wayback installation and I would like to poll your collective wisdom on the best approach for managing the Wayback index.
>
> Currently, our collection is about 2.2 billion items. It is also growing at a rate of approximately 350-400 million records per year.
>
> The obvious approach would be to use a sorted CDX file (or files) as the index. I'm, however, concerned about its performance at this scale. Additionally, updating a CDX based index can be troublesome, especially as we would like to update it continuously as new material is ingested.
>
> Any relevant experience and advice you could share on this topic would be greatly appreciated.

Hi Kristinn,

We use 4 different CDX files. One is updated every ten minutes, one hourly, one daily, and one monthly. We use the unix sort command to sort. This has worked pretty well for us. We aren’t doing it in the most efficient manner, and we will probably switch to sorting with hadoop at some point, but it works pretty well.

best, Erik |
From: Kristinn S. <kri...@la...> - 2013-06-03 12:49:51
|
Dear all,

We are planning on updating our Wayback installation and I would like to poll your collective wisdom on the best approach for managing the Wayback index.

Currently, our collection is about 2.2 billion items. It is also growing at a rate of approximately 350-400 million records per year.

The obvious approach would be to use a sorted CDX file (or files) as the index. I'm, however, concerned about its performance at this scale. Additionally, updating a CDX based index can be troublesome, especially as we would like to update it continuously as new material is ingested.

Any relevant experience and advice you could share on this topic would be greatly appreciated.

Best regards,

Mr. Kristinn Sigurðsson
Head of IT
National and University Library of Iceland
-------------------------------------------------------------------------
Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
Sími/Tel: +354 5255600 | www.landsbokasafn.is
-------------------------------------------------------------------------
fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is |
From: Steph <she...@ya...> - 2013-06-01 12:41:41
|
Hi everybody,

For some time I have had trouble with my crawled websites (which are CMS based). Whenever I replayed a crawled site I got no style information and, when directly accessing the css file, an error saying 1013 (content not available). After days of searching for the bug on the Heritrix side I finally found the solution: when I remove the toolbar (toolbar.jsp), which I added some time ago to ArchivalURLReplay.xml, the style information shows up perfectly.

Now I wonder why this happened. Does anybody have any ideas? Can I use the toolbar anyway?

Many thanks in advance,
Steph |
From: Steph <she...@ya...> - 2013-05-27 08:33:04
|
Hello everybody,

Noah Levitt kindly redirected me to this list. I'm trying to develop a good way to archive news sites on the web, so I tested crawling RSS feeds plus the next few links to get the main topics every day. That works fine; the Wayback Machine just doesn't seem to do the link rewriting in the RSS feed (which is an XML document). Instead of directing me to the archived versions of the articles it points to the live web. For "normal" HTML websites the rewriting works just fine.

Do I have to do some specific configuration to accomplish the link rewriting in the RSS feed? Any help would be very appreciated!

Many thanks in advance,
Steph |
From: Jackson, A. <And...@bl...> - 2013-05-17 10:04:09
|
Hi,

I'm trying to use a recent version of Wayback, and am hitting the same two issues with 1.7.1-SNAPSHOT and 1.8.0-SNAPSHOT.

The primary problem is that we expect to use the XML Query mode, like this:

http://localhost:8080/wayback/archive/xmlquery.jsp?url=http://bits.wikimedia.org/skins-1.18/common/images/poweredby_mediawiki_88x31.png

But instead of getting XML, we now get redirected to here:

http://localhost:8080/wayback/archive/*/http://bits.wikimedia.org/skins-1.18/common/images/poweredby_mediawiki_88x31.png

Note that this is our replay config:

---
<property name="query">
<bean class="org.archive.wayback.query.Renderer">
<property name="captureJsp" value="/WEB-INF/query/CalendarResults.jsp" />
---

I tracked this redirect down to this method:

org.archive.wayback.archivalurl.requestparser.ArchivalUrlFormRequestParser.parse(HttpServletRequest, AccessPoint)

throwing a BetterURI exception, as a way of forcing a redirect to a URI that is neater. However, in this case, the 'neat' URI is wrong, as I am expecting XML. What should I do?

Well, once I found the cause, I tried to disable the redirect, like this:

<bean name="${wayback.port}:archive" class="org.archive.wayback.webapp.AccessPoint">
<property name="serveStatic" value="true" />
<property name="bounceToReplayPrefix" value="false" />
<property name="bounceToQueryPrefix" value="false" />
<property name="forceCleanQueries" value="false" />

Which worked at first, but then came up against the second, different problem, which is that the XMLCaptureResults JSP depends on a method that is now private:

---
PWC6197: An error occurred at line: 38 in the jsp file: /WEB-INF/query/XMLCaptureResults.jsp
PWC6199: Generated servlet error: toCanonicalStringMap() has protected access in org.archive.wayback.core.SearchResult
PWC6199: Generated servlet error: /XMLCaptureResults_jsp.java uses or overrides a deprecated API.
---

OR

---
An error occurred at line: 40 in the jsp file: /WEB-INF/query/XMLCaptureResults.jsp
The method toCanonicalStringMap() from the type SearchResult is not visible
37: <result>
38: <%
39: CaptureSearchResult result = itr.next();
40: Map<String,String> p2 = result.toCanonicalStringMap();
41: kitr = p2.keySet().iterator();
42:
43: while(kitr.hasNext()) {
---

I'm happy to help fix these issues, but both seem to relate to systematic changes that I'm not fully aware of (introducing 'better' URIs, and API changes), so I'd like some advice from the current developers first.

Thanks,
Andy

--
Dr Andrew N Jackson
Web Archiving Technical Lead
The British Library
Tel: 01937 546602 Mobile: 07765 897948
Web: www.webarchive.org.uk
Twitter: @UKWebArchive |
From: Noah L. <nl...@ar...> - 2013-05-15 22:51:50
|
Hello Steph, this sounds like a wayback question, sending to wayback list. Does wayback rewrite links in rss, or does it require special configuration to do that? Noah

On Tue, May 14, 2013 at 2:03 AM, sherlock.h221b <she...@ya...> wrote:
> Hi there!
>
> I've been trying to crawl RSS feeds from different news sites. The crawl does work just fine and displays the feed perfectly. What obviously doesn't work is the link rewriting. When I select a link it points to the live web version of the article instead of the archived version. The articles are crawled though and when I insert the link into the Wayback Machine it is found.
>
> I wonder if I missed a configuration or if this is a shortcoming due to the xml-content?
>
> Any ideas?
>
> Thanks in advance,
>
> Steph |
From: Armin S. <sch...@gm...> - 2013-04-08 12:40:51
|
Hey Folks,

I was just wondering if some of you can tell me if this is normal or if I'm doing something wrong. I configured wayback (I only adapted the paths in a working configuration, so there should not be a problem) and provided a fair amount of archive data (around 500GB of .warc.gz files). When I look at the catalina.out of Tomcat, I see

08.04.2013 14:36:09 org.archive.wayback.resourcestore.resourcefile.ResourceFileSourceUpdater synchronizeSource
INFO: Synchronized DILIMAG Filestore

being pushed out every few seconds. The process has been running for about 5 weeks (!!). When I try to access links that should be found within the collection, all I get is "Not in archive". Am I doing something wrong here, or is this the way this is supposed to happen?

Thanks for your help, any hints are highly appreciated!

Cheers,
Armin |
From: Noah L. <nl...@ar...> - 2013-03-27 00:50:59
|
Hello Ferencz, This question seems about wayback replay? Ccing the wayback list. Maybe wayback doesn't do rewriting in comments, but that's just a guess. Maybe there's a setting for it. Noah

On Mar 25, 2013 5:17 AM, "Ferencz Marton" <fer...@gm...> wrote:
> Hi,
>
> Could anybody help me out with the IE conditional commenting. Why, in a page with conditional commenting, is the URL in <!-- --> not rewritten; what do I miss out?
>
> This is how it looks in a crawled page, and the HREF should be rewritten with: http://waybackmachine/wayback/static/project/verticalmenu_ie.css
>
> <!--[if IE]>
> <link rel="stylesheet" href="/static/project/verticalmenu_ie.css" type="text/css" />
> <![endif]-->
>
> I tried Heritrix 1.14.3+4 H3.1.1
>
> Do I need to specify something in the order.xml besides:
>
> <newObject name="ExtractorHTML" class="org.archive.crawler.extractor.ExtractorHTML">
> <boolean name="enabled">true</boolean>
> <newObject name="ExtractorHTML#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
> <map name="rules">
> </map>
> </newObject>
> <boolean name="extract-javascript">true</boolean>
> <boolean name="treat-frames-as-embed-links">true</boolean>
> <boolean name="ignore-form-action-urls">false</boolean>
> <boolean name="extract-only-form-gets">true</boolean>
> <boolean name="extract-value-attributes">true</boolean>
> <boolean name="ignore-unexpected-html">true</boolean>
> </newObject>
>
> Thank you.
>
> Regards,
> Ferencz Marton. |
From: Noah L. <nl...@ar...> - 2013-03-01 03:48:38
|
Hello Nicholas, We've had some success with the pages replaying in wayback archival mode, i.e. http://example.org:8080/wayback/https://facebook.com/whatever Presumably you're referring to the fact that wayback doesn't support https yet in proxy mode. We're planning to add that within the next couple of months. Unfortunately twitter and especially facebook continue to change the way their stuff works, and more variations present themselves. The settings on that wiki page may not work exactly right anymore, and I'm sure they won't handle every case. We've also had to add some custom canonicalization rules for playback to work in some cases. :-\ Some kind of crawling with real javascript support looks like it's the only feasible way for the future. Noah On Thu, 21 Feb 2013 14:13:35 +0000 Nicholas Clarke <ni...@kb...> wrote: > Hello people > > We have been experimenting with H3 settings based on the following article. > > https://webarchive.jira.com/wiki/display/Heritrix/Facebook+and+Twitter+Scroll-down > > But now our problem is how to access https content using wayback. > > Is there an established way of doing this? > > Best > Nicholas > > ------------------------------------------------------------------------------ > Nicholas Clarke, Software Developer > Department of Digital Preservation, Royal Library, Copenhagen, Denmark > tlf: (+45) 33 47 48 38 > email: ni...@kb...<mailto:sv...@kb...> > ------------------------------------------------------------------------------ > Building complex programs one state machine at a time. > -- |
From: Nicholas C. <ni...@kb...> - 2013-02-21 14:13:47
|
Hello people,

We have been experimenting with H3 settings based on the following article:

https://webarchive.jira.com/wiki/display/Heritrix/Facebook+and+Twitter+Scroll-down

But now our problem is how to access https content using wayback. Is there an established way of doing this?

Best
Nicholas

------------------------------------------------------------------------------
Nicholas Clarke, Software Developer
Department of Digital Preservation, Royal Library, Copenhagen, Denmark
tlf: (+45) 33 47 48 38
email: ni...@kb...
------------------------------------------------------------------------------
Building complex programs one state machine at a time. |
From: Noah L. <nl...@ar...> - 2013-02-06 02:51:18
|
Hello Søren, I committed a fix to ARCReaderFactory in Heritrix for the issue you raised. See https://webarchive.jira.com/browse/HER-2032 Not sure how long that will take to appear in a wayback build. Noah On 02/05/2013 05:53 AM, Søren Vejrup Carlsen wrote: > > Hi all. > > I have found the problem. It was in the wayback-core module in the > class > org.archive.wayback.resourcestore.resourcefile.ResourceFactory.getResource(File > > file, long offset) > > The method-call "ARCReaderFactory.get(path.getName(), is, false);" > > assumes, that the file is a gzipped ARC-file, even though the > getResource method should work for both compressed > > and uncompressed arc-files? > > The solution is to replace this call with ARCReaderFactory.get(file, > offset). > > This makes the method work for both compressed and uncompressed arc-files. > > /Søren V. Carlsen (Royal Library, Copenhagen) > > *Fra:*Søren Vejrup Carlsen [mailto:sv...@kb...] > *Sendt:* 1. februar 2013 12:32 > *Til:* arc...@li... > *Emne:* [Archive-access-discuss] Workaround for > locationDBResourceStore bug in 1.7.1-SNAPSHOT > > Hi all. > > I have installed wayback 1.7.1-SNAPSHOT, built myself directly from > the pom.xml after downloading the code from > https://github.com/internetarchive/wayback > > I'm using the locationDBResourceStore that the CDXCollection.xml uses, > and it can find the correct files from the CDX. 
> > However, it fails to extract the record, as it somehow assumes that > all files are GZIPPED, and when it is now, it fails miserably with the > following log-entries: > > Jan 31, 2013 6:49:18 PM > org.archive.wayback.resourcestore.resourcefile.ResourceFactory getResource > INFO: Fetching: /home/prod/wayback/arcs/83807-92-0000-1.arc : 39136770 > Jan 31, 2013 6:49:18 PM > org.archive.wayback.resourcestore.resourcefile.ResourceFactory getResource > WARNING: ResourceNotAvailable for > /home/prod/wayback/arcs/83807-92-0000-1.arc Not in GZIP format > Jan 31, 2013 6:49:18 PM > org.archive.wayback.resourcestore.LocationDBResourceStore retrieveResource > INFO: Unable to retrieve /home/prod/wayback/arcs/83807-92-0000-1.arc - > java.util.zip.ZipException: Not in GZIP format > Jan 31, 2013 6:49:18 PM org.archive.wayback.webapp.AccessPoint > handleReplay > WARNING: (1)LOADFAIL: /home/prod/wayback/arcs/83807-92-0000-1.arc - > java.util.zip.ZipException: Not in GZIP format > /20100107153228/http://www2.kb.dk/elib/mss/skatte/aeldre_danske/ln185.htm > > Can anyone help me here? > > /Søren > > --------------------------------------------------------------------------- > > Søren Vejrup Carlsen, Department of Digital Preservation, Royal > Library, Copenhagen, Denmark > > tlf: (+45) 33 47 48 41 > > email: sv...@kb... <mailto:sv...@kb...> > > ---------------------------------------------------------------------------- > > Non omnia possumus omnes > > --- Macrobius, Saturnalia, VI, 1, 35 ------- > > > > ------------------------------------------------------------------------------ > Free Next-Gen Firewall Hardware Offer > Buy your Sophos next-gen firewall before the end March 2013 > and get the hardware for free! Learn more. > http://p.sf.net/sfu/sophos-d2d-feb > > > _______________________________________________ > Archive-access-discuss mailing list > Arc...@li... > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
From: Søren V. C. <sv...@kb...> - 2013-02-05 13:54:05
|
Hi all.

I have found the problem. It was in the wayback-core module, in the class org.archive.wayback.resourcestore.resourcefile.ResourceFactory.getResource(File file, long offset).

The method call "ARCReaderFactory.get(path.getName(), is, false);" assumes that the file is a gzipped ARC file, even though the getResource method should work for both compressed and uncompressed arc-files.

The solution is to replace this call with ARCReaderFactory.get(file, offset). This makes the method work for both compressed and uncompressed arc-files.

/Søren V. Carlsen (Royal Library, Copenhagen)

Fra: Søren Vejrup Carlsen [mailto:sv...@kb...]
Sendt: 1. februar 2013 12:32
Til: arc...@li...
Emne: [Archive-access-discuss] Workaround for locationDBResourceStore bug in 1.7.1-SNAPSHOT

Hi all.

I have installed wayback 1.7.1-SNAPSHOT, built myself directly from the pom.xml after downloading the code from https://github.com/internetarchive/wayback

I'm using the locationDBResourceStore that the CDXCollection.xml uses, and it can find the correct files from the CDX. However, it fails to extract the record, as it somehow assumes that all files are GZIPPED, and when one is not, it fails miserably with the following log entries:

Jan 31, 2013 6:49:18 PM org.archive.wayback.resourcestore.resourcefile.ResourceFactory getResource
INFO: Fetching: /home/prod/wayback/arcs/83807-92-0000-1.arc : 39136770
Jan 31, 2013 6:49:18 PM org.archive.wayback.resourcestore.resourcefile.ResourceFactory getResource
WARNING: ResourceNotAvailable for /home/prod/wayback/arcs/83807-92-0000-1.arc Not in GZIP format
Jan 31, 2013 6:49:18 PM org.archive.wayback.resourcestore.LocationDBResourceStore retrieveResource
INFO: Unable to retrieve /home/prod/wayback/arcs/83807-92-0000-1.arc - java.util.zip.ZipException: Not in GZIP format
Jan 31, 2013 6:49:18 PM org.archive.wayback.webapp.AccessPoint handleReplay
WARNING: (1)LOADFAIL: /home/prod/wayback/arcs/83807-92-0000-1.arc - java.util.zip.ZipException: Not in GZIP format /20100107153228/http://www2.kb.dk/elib/mss/skatte/aeldre_danske/ln185.htm

Can anyone help me here?

/Søren

---------------------------------------------------------------------------
Søren Vejrup Carlsen, Department of Digital Preservation, Royal Library, Copenhagen, Denmark
tlf: (+45) 33 47 48 41
email: sv...@kb...
----------------------------------------------------------------------------
Non omnia possumus omnes
--- Macrobius, Saturnalia, VI, 1, 35 ------- |