From: Mat K. <mk...@cs...> - 2011-10-19 19:34:06
Hello,

I have fabricated a WARC file with the help I have thus far obtained on this forum, but am having difficulty getting Wayback to display the data contained within the record. I am able to add WARCs from other sources to my Wayback instance after adding my fabricated one and have them displayed, but the content within mine never displays.

I have used the tools from Hanzo Archives (http://code.hanzoarchives.com/), particularly warc-valid.py, to check that my WARC file has no trivial issues; warc-valid assures me that my WARC file is valid. How do I get this WARC file to be displayed in my Wayback instance? I have attached the WARC file.

Thank you,
Mat

_______________________________________________
Archive-access-discuss mailing list
Arc...@li...
https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
From: Bradley T. <br...@ar...> - 2011-10-05 16:23:19
HTTP headers are considered part of the response, and part of the archival record - if it's possible to save them within your system, I'd suggest grabbing them going forward, and also that you consider using Heritrix for your archiving. Once you have it running, you'll have standard formats available, tools that the rest of the community is using (a great resource for getting help), and a lot of features that would be cumbersome to replicate.

You could fabricate the HTTP headers yourself for previously archived materials - Wayback will need them to replay content.

As to the question about getting your new content indexed with Wayback, you'll need to either rename the file, so Wayback notices it as new content, or reset your indexing directory state:

* stop Tomcat
* delete all files under .../wayback/{index,index-data,file-db}
* place new W/ARC files under .../wayback/files{1,2}
* start Tomcat

Hope this helps,

Brad
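Brad's reset procedure can be sketched as a shell session. This is a hedged illustration: the basedir and the W/ARC filename are placeholders, and the scaffolding lines exist only so the sketch is self-contained - substitute your real Wayback basedir and files.

```shell
# WAYBACK and my-new-file.warc.gz are hypothetical stand-ins for this demo.
WAYBACK="${WAYBACK:-/tmp/wayback-demo}"
mkdir -p "$WAYBACK"/{index,index-data,file-db,files1}   # demo scaffolding only
touch "$WAYBACK/index/stale.cdx" my-new-file.warc.gz    # demo scaffolding only

# 1. stop Tomcat (e.g. $CATALINA_HOME/bin/shutdown.sh)
# 2. clear all indexing state so the W/ARC directories are re-scanned
rm -rf "$WAYBACK"/index/* "$WAYBACK"/index-data/* "$WAYBACK"/file-db/*
# 3. place the new or renamed W/ARC files where Wayback looks for them
mv my-new-file.warc.gz "$WAYBACK/files1/"
# 4. start Tomcat; Wayback rebuilds its index on startup
```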
From: Mat K. <mk...@cs...> - 2011-10-05 13:20:43
Brad,

I did not realize Wayback would consider uncompressed WARCs. That information will be useful. I was also considering the ARC format to get around my WARC issues but have only recently begun to explore that.

Regarding your questions, I do not currently collect HTTP headers for my data. I have created a tool that essentially saves a certain type of webpage and all associated media to a local directory and retains information such as time of archiving and original URI as metadata. Are HTTP headers critical for the format? Could they be artificially created to comply with the standard? I do know Java and was also looking into the three projects that Erik (thanks!) suggested, to extract some of the code for my use or at least get a basis for porting the code, but the WARC format seems pretty coupled with the rest of each package.

From the truncating scheme I described in a past message, why should it not work if it is simply truncating off records? Should something else be adjusted in the resulting file to account for the difference in length and/or record count?

Thanks,
Mat
From: Bradley T. <br...@ar...> - 2011-10-05 01:28:09
Hi Mat,

Another solution to side-step the compression complexities while you work on the WARC format issues would be to use uncompressed WARC files - just skip the compress step altogether (and be sure to remove the ".gz" suffix).

Wayback should handle those fine - note you do still need to create WARC records to encapsulate the archived content, but this may lower the bar to some iterative testing.

A couple of questions to help steer you in the right direction:

1) do you have HTTP response headers for your archived content?
2) do you know Java?

Brad
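The "fabricate the HTTP headers and wrap the content in a WARC record" idea that Brad describes can be sketched in Python. This is a minimal, hedged illustration of an uncompressed WARC/1.0 response record, not the exact output of any particular tool: real files normally start with a warcinfo record and carry extra headers such as WARC-Payload-Digest, and the URI and HTML body below are made up.

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(uri, http_response_bytes):
    """Build one uncompressed WARC/1.0 'response' record as bytes.

    Content-Length counts only the record block (the HTTP response bytes);
    two CRLFs terminate every WARC record.
    """
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(http_response_bytes)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return headers + http_response_bytes + b"\r\n\r\n"

# Fabricated HTTP headers for content archived without them; the body is a
# stand-in for the bytes of a locally saved file.
body = b"<html><body>hello</body></html>"
http = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: " + str(len(body)).encode() + b"\r\n\r\n"
) + body

with open("fabricated.warc", "wb") as f:   # note: no .gz suffix
    f.write(warc_response_record("http://example.com/", http))
```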
From: Erik H. <eri...@uc...> - 2011-10-05 00:09:48
At Tue, 4 Oct 2011 20:02:01 -0400, Mat Kelly wrote:
> Erik,
> Thank you for the reply. Please do send your script, as it might be
> helpful. From the procedure above, I was hoping to create a base case
> WARC and if I am not doing so properly, is there a bare bones template
> to create a WARC file? Once I am familiar enough with the
> procedure/structure, I plan to write a script to do the work but
> wanted first to understand how I go about constructing a WARC. Please
> supply any insight you can, as I am just learning about this system.

Hi Mat,

Attached.

As far as I know there is no template to create a WARC file.

You might want to have a look at the warc-tools project [1], the it.unimi tools [2], or the heritrix-commons tools [3].

best, Erik

1. http://code.hanzoarchives.com/
2. http://law.dsi.unimi.it/software/docs/it/unimi/dsi/law/warc/io/package-summary.html
3. http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix-commons/
From: Mat K. <mk...@cs...> - 2011-10-05 00:02:08
Erik,

Thank you for the reply. Please do send your script, as it might be helpful. From the procedure above, I was hoping to create a base case WARC, and if I am not doing so properly, is there a bare-bones template to create a WARC file? Once I am familiar enough with the procedure/structure, I plan to write a script to do the work, but wanted first to understand how I go about constructing a WARC. Please supply any insight you can, as I am just learning about this system.

-Mat
From: Erik H. <eri...@uc...> - 2011-10-04 23:47:48
At Tue, 4 Oct 2011 19:16:28 -0400, Mat Kelly wrote:
> I have decompressed this WARC file:
> gzip -d IAH-20080430204825-00000-blackbook.warc.gz
>
> Truncated and/or made a subtle change to the file:
> truncate -s 1500 IAH-20080430204825-00000-blackbook.warc
>
> Re-gzipped the file:
> gzip IAH-20080430204825-00000-blackbook.warc

This won't work. You need to compress each WARC record & concatenate the results. See [1]. Unfortunately this will probably be some effort. I have a perl script which can compress ARC files, but not WARC files, which I can send to you.

best, Erik

1. http://crawler.archive.org/articles/developer_manual/arcs.html

Sent from my free software system <http://fsf.org/>.
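Erik's point - that a valid .warc.gz is a concatenation of one gzip member per record, not a single gzip stream over the whole file - can be sketched in Python. The record splitter below is deliberately naive (it assumes well-formed, uncompressed WARC/1.0 input with CRLF headers and a Content-Length in every record); it is an illustration, not a production parser.

```python
import gzip

def split_warc_records(data: bytes):
    """Naively split uncompressed WARC data into whole-record byte strings.

    Each record is: header lines, blank line, Content-Length bytes of block,
    then a closing CRLF CRLF.
    """
    pos, records = 0, []
    while pos < len(data):
        head_end = data.index(b"\r\n\r\n", pos) + 4
        headers = data[pos:head_end].decode("utf-8")
        length = int(next(l.split(":", 1)[1]
                          for l in headers.splitlines()
                          if l.lower().startswith("content-length:")))
        rec_end = head_end + length + 4          # block + closing CRLFCRLF
        records.append(data[pos:rec_end])
        pos = rec_end
    return records

def compress_per_record(in_path: str, out_path: str):
    """Write each record as its own gzip member; concatenated members are
    what replay tools expect a .warc.gz to contain."""
    data = open(in_path, "rb").read()
    with open(out_path, "wb") as out:
        for rec in split_warc_records(data):
            out.write(gzip.compress(rec))
```

Truncating such a file at a record boundary leaves a smaller but still well-formed file, which is why truncation mid-stream (as with `truncate -s 1500` on the decompressed file) breaks the final record instead.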
From: Mat K. <mk...@cs...> - 2011-10-04 23:28:42
Hello,

I have successfully installed an instance of Wayback and am able to successfully add the file from http://www.archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.warc.gz to my WARCs folder, see the listing appear, and access the content of the archive. I am investigating how to create WARCs from scratch (e.g. without using Heritrix), so I wanted to modify this WARC file and see the change reflected in my local Wayback instance after allowing some time for re-indexing.

I have decompressed this WARC file:

gzip -d IAH-20080430204825-00000-blackbook.warc.gz

Truncated and/or made a subtle change to the file:

truncate -s 1500 IAH-20080430204825-00000-blackbook.warc

Re-gzipped the file:

gzip IAH-20080430204825-00000-blackbook.warc

again producing IAH-20080430204825-00000-blackbook.warc.gz in my WARC directory. But even after restarting Tomcat, and even the server, the new file's contents never become accessible. When I click on the date link supposedly associated with the modified WARC (I am guessing it is this WARC and not a stale link from the old one), I am simply told: Resource Not Available.

My ultimate goal is to create a WARC file using a collection of webpages and images that I have manually archived. Is there something wrong with my procedure above that would prevent the truncated data from showing up in Wayback? Does a resource exist that would allow me to accomplish my ultimate goal of manually creating a WARC file from a very small collection of data currently represented as files on a file system? Any advice or direction provided would be very helpful.

Thank you,
Mat
From: Colin R. <cs...@st...> - 2011-10-04 13:19:31
Hi,

Is there any way to configure Wayback to replay URLs harvested with the ftp:// scheme? We are using our own custom implementation of ResourceStore. Do the standard implementations support ftp:// ?

Colin Rosenthal
IT Developer
State and University Library, Aarhus
From: Graham, L. <lg...@lo...> - 2011-09-19 18:59:30
Hi Brad,

Below, from a previous query, you say that there is "some complexity in implementing" LiveWeb "which will probably require some additional documentation." We'd like to try this out for an onsite crawl project of a single but very large, complex LC web site on formats/digital preservation. We crawl this site once or twice a year, but we are interested to see if LiveWeb's "backfilling" possibilities, as you describe below, might help with interim capture of new single URLs on the seed. When you have some time, could you provide that additional documentation?

I have to be honest: all I've done thus far is import the LiveWeb.xml in wayback.xml, which auto-created a set of dirs, liveweb/arcs, off the basedir specified in wayback. And I've looked at the LiveWeb.xml but am not sure how to proceed.

Thanks,
Laura Graham
Library of Congress

*****************

Hi Laura,

Wayback 1.6.0 contains code to run a special AccessPoint which acts as a "modified" proxy server. When proxy requests are received by this AccessPoint, a request to the live web, for the URL requested by the client, is recorded into an ARC file on the spot. The single compressed ARC record is then returned as the HTTP entity to the requesting client. Note this means you cannot point a web browser directly at this service, since the browser doesn't know how to unpack the enclosed ARC record (there is another "unwrapping" proxy AccessPoint which does this, allowing experimenting with recording a web browser session). However, a client which expects to be returned an ARC record can then unpack it and use it - to access the entire HTTP response to a robots.txt request, for example.

This service is used in Wayback 1.6.0 to request content from the live web both for checking robots.txt files and for "backfilling" content requested via replay sessions but not in the archive.

One of the driving factors behind returning a compressed ARC record, instead of proxying the actual response, is to simplify inserting an HTTP cache between the Wayback service and the live web proxy AccessPoint. We use Varnish, which handles caching of the returned ARC record, and coalescing of multiple concurrent requests into a single request to the live web proxy AccessPoint.

We intend to make this service record WARC files in the near term - porting the old Wayback ARC recording code was more expedient for 1.6.0. Currently, there's some complexity in implementing this, which will probably require some additional documentation. If you're interested, please let me know, and we'll try to prioritize this documentation.

Lastly, note that we've discovered some significant bugs in the 1.6.0 codebase specifically related to this live web proxy AccessPoint, mostly in bad handling of connection errors and timeouts. These fixes are all in SVN currently, but we have not scheduled a 1.6.1 release at the moment.

Brad

On 3/10/11 8:02 PM, Graham, Laura wrote:
> We were wondering here at the Library of Congress about the LiveWeb.xml in Wayback 1.6. The wayback.xml explains:
>
> "LiveWeb.xml contains the 'proxylivewebcache' bean that enables fetching
> content from the live web, recording that content in ARC files.
> To use the "excluder-factory-robot" bean as an exclusionFactory property of
> AccessPoints, which will cause live robots.txt files to be consulted
> retroactively before showing archived content, you'll need to import
> LiveWeb.xml as well."
>
> We understand about consulting the robots.txt for display, of course, but can the Wayback actually write data to ARC (WARC?) files? What does "recording" mean?
>
> Thanks!
> Laura Graham
From: Mohamed E. <moh...@bi...> - 2011-09-12 14:25:06
The Wayback Machine and the resource index are working on one machine. ARC files are located on other machines. I can access these ARC files by using a local static lookup file (path index) with org.archive.wayback.resourcestore.LocationDBResourceStore. What I need right now is how to access a remote resource index from the Wayback Machine.

--
Mohamed Elsayed
Bibliotheca Alexandrina
From: Finn, B. L <bra...@ed...> - 2011-09-08 03:18:31
Can you send me your BDBCollection.xml for Wayback and your wct-core.properties file for WCT?

-----------------------------------------------------------------------------
CONFIDENTIALITY NOTICE AND DISCLAIMER
Information in this transmission is intended only for the person(s) to whom it is addressed and may contain privileged and/or confidential information. If you are not the intended recipient, any disclosure, copying or dissemination of the information is unauthorised and you should delete/destroy all copies and notify the sender. No liability is accepted for any unauthorised use of the information contained in this transmission. This disclaimer has been automatically added.
From: Allen S. <all...@gm...> - 2011-09-08 02:41:56
Hi there,

I'm wondering if anyone else has experienced this problem and whether there's a simple solution we can apply. I have been using WCT 1.5.1 and Wayback to harvest and archive websites for two months. I use Wayback to replay the content and everything works okay. But yesterday morning, an error message popped up saying that "the connection was reset", so I shut down Tomcat and restarted it.

When I go to the Wayback page to search for my previously harvested websites, Wayback shows "Resource Not In Archive". My directory is at "/tmp/wayback/files1" and all my ARC/WARC files are still in "/tmp/arcstore". I wonder why the Wayback page shows "Resource Not In Archive". Is it related to a re-indexing problem? How do I keep the files in place? Can you please guide me so that I can search/replay all the harvested websites on Wayback again?

Appreciate your guidance and thanks in advance.

Warmest Regards,
Allen
From: Jones, G. <gj...@lo...> - 2011-08-30 19:51:53
|
Hi Brad, we tested 1.6.1 for the XML query problem and it works fine. Thanks, Gina |
From: Bradley T. <br...@ar...> - 2011-08-29 06:47:36
|
Hi Mohamed, I think you're talking about the UDP broadcast location service? For those not familiar with IA internal systems, the www.archive.org website locates content on the cluster by sending a UDP broadcast packet to all hosts on the network. A special UDP listening server runs on each host and is aware of what content is local. When the UDP server receives a broadcast packet for content that is local, it sends a packet back to the originating server:port, indicating "that content is here!".

The classic Wayback used this service to locate ARC content up until about 4 years ago. It was discontinued in favor of a static lookup file that mapped W/ARC filenames to one or more URLs. The reason for the change was the inherent unreliability of UDP. We would see constant low-level failures in the location service, which often provoked IA admins and end users to "just try refreshing a few times". The failure levels escalated sharply when internal network usage was near peak, and also increased steadily as our data centers became more separated.

So, the current Wayback has two implementations:

1) a local static lookup file (path index) running with org.archive.wayback.resourcestore.LocationDBResourceStore

2) a remote HTTP 1.1 directory. This is in fact likely one of:

2a) a normal HTTP server fronting a single directory of W/ARC files

2b) a custom HTTP server fronting some more complex storage network, with site-specific logic to make all W/ARC files appear to be in the top-level directory

2c) an org.archive.wayback.resourcestore.locationdb.FileProxyServlet instance, backed by either a static path index (flat file) or a BDB.

All of our production Wayback installations at IA use option #1 - it's fast and simple, and rebuilding a path index, even one with 40M entries, only takes a few minutes. In the mid to long term, we are exploring option 2b. 
So, my short answer would be to advise you also to go with option #1: W/ARC files don't move around that much, it will definitely meet your scale needs, and it is the most robust choice.

Brad

On 8/28/11 4:40 PM, Mohamed Elsayed wrote: > I now have the new Wayback working on a single host. I am currently > trying to set up something like the "Item Location Server" that used to > exist in the old system. I guess this should also be possible with the > new Wayback. Can you provide any pointers for getting started on this? Is > it a fileproxy? > > Thanks in advance. > |
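As a rough illustration of Brad's option #1, a path index can be generated with a short script. This is a minimal sketch, not IA's tooling: it assumes the commonly used tab-separated "filename&lt;TAB&gt;location" layout, sorted by filename, and the directory names passed in are hypothetical.

```python
import os

def build_path_index(warc_dirs, index_path):
    """Write a flat path index: one 'filename<TAB>location' line per
    W/ARC file, sorted by filename, in the style of Wayback's static
    lookup file. Directory layout here is hypothetical."""
    entries = []
    for d in warc_dirs:
        for name in os.listdir(d):
            if name.endswith((".arc", ".arc.gz", ".warc", ".warc.gz")):
                entries.append(name + "\t" + os.path.join(d, name))
    entries.sort()  # Wayback expects the index sorted for binary search
    with open(index_path, "w") as out:
        out.write("\n".join(entries) + "\n")
```

Regenerating the whole file from scratch on each run, as above, matches Brad's observation that rebuilding a path index is cheap even at tens of millions of entries.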
From: Aaron B. <aa...@ar...> - 2011-08-28 21:50:44
|
Jon Walton <jon...@gm...> writes: > I use NutchWAX to index WARC file content for analysis. I need to fix > it to get around the JDK u23 gzip problem, but I noticed that > development seems to have died. Is everyone using other solutions now > such as Solr? If so, care to share any details?

It's not quite entirely dead, but pretty close to it. I can't speak for everyone, but many (former) users of NutchWAX are in some state of migration to a Solr-based implementation. IMO, the main challenge in moving to Solr is replacing the NutchWAX 'import' step -- reading the documents from (W)ARC files.

There is a branch on the public NutchWAX subversion tree that has a fix to handle the JDK u23 gzip change. This branch also contains a few customizations specific to the way NutchWAX is still used in a few deployments with particular needs. YMMV. The branch is: http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_13-JIRA-WAX-75/archive

As for the future of NutchWAX, well, consider that NutchWAX has three essential pieces:

1. Import: read (w)arc files, get the documents, parse them and extract the text, metadata, links, etc.

2. Index: read the output of step 1, perform text analysis and manipulation, and index with Solr/Lucene/etc.

3. Search/query: the live search service.

For us at the Archive, we still use NutchWAX for step 1, albeit with modifications that are particular to our deployments. As for step 2, we are now using some custom MapReduce code I wrote which can index documents either directly with Lucene, or push them over the wire into a Solr server; that project can be found at: https://github.com/aaronbinns/jbs

And as for step 3, some folks have moved on to Solr; at the Archive we use a custom Lucene-based Java web application: https://github.com/aaronbinns/tnh which is purpose-built for search of archival web pages. My plan is to replace NutchWAX for step 1. 
At the Archive, we have an in-development "access" library which can read (w)arcs and has lots of goodies for doing things with (w)arcs at scale in Hadoop. The idea is that both Wayback and full-text search and other web access projects will all use that core library. It's just a ways off still. Hope that helps, Aaron -- Aaron Binns Senior Software Engineer, Web Group, Internet Archive Program Officer, IIPC aa...@ar... |
From: Mohamed E. <moh...@bi...> - 2011-08-28 09:53:32
|
I now have the new Wayback working on a single host. I am currently trying to set up something like the "Item Location Server" that used to exist in the old system. I guess this should also be possible with the new Wayback. Can you provide any pointers for getting started on this? Is it a fileproxy? Thanks in advance. -- Mohamed Elsayed Bibliotheca Alexandrina |
From: Jon W. <jon...@gm...> - 2011-08-24 23:29:44
|
Greetings, I use NutchWAX to index WARC file content for analysis. I need to fix it to get around the JDK u23 gzip problem, but I noticed that development seems to have died. Is everyone using other solutions now such as Solr? If so, care to share any details? Thanks, Jon |
From: Bradley T. <br...@ar...> - 2011-08-17 11:14:17
|
Hi Gina, I've made a new Wayback 1.6.1 release candidate. I'd greatly appreciate it if you can verify that this new version addresses the problem. http://home.us.archive.org/~brad/wayback-1.6.1RC3.tar.gz The only other changes between this and 1.6.0 are three NullPointerExceptions being handled more gracefully when confronted with very strange W/ARC content.

Brad

On 8/15/11 4:23 PM, Bradley Tofel wrote: > Hi Gina, > > I've looked at the source code, and it looks like a bug.. > > For the global Wayback release we implemented a forced redirect to the > standard ArchivalURL for form queries (with GET CGI arguments.) > > Unfortunately, this was put into the mainline codebase, without a > configuration option. > > I've added the AccessPoint option - this behavior will be disabled by > default - and will post a release 1.6.2 in a couple days after some > testing. > > Brad > > On 8/3/11 9:05 PM, Jones, Gina wrote: >> >> We are running Wayback 1.6. A programmer at LC was using an API that >> returned XML, (see >> https://webarchive.jira.com/wiki/display/wayback/OS+Wayback+API+Documentation >> ) >> >> to get capture dates. I also tested this against the LC archive >> hosted at IA, and also got the same non-XML results. The query >> http://webarchive.loc.gov/lcwa0002/xmlquery?type=urlquery&url=http://house.gov/aderholt/ >> >> used to provide an XML result of all of the capture dates, but >> now gets redirected (302) to LC's landing page for that archived >> site: http://webarchive.loc.gov/lcwa0002/*/http://house.gov/aderholt/ >> >> Is this API still valid and was there a change to the query structure? >> >> Thanks, Gina >> >> Gina Jones >> >> Library of Congress >> >> Web Preservation Engineering Team. 
>> _______________________________________________ >> Archive-access-discuss mailing list >> Arc...@li... >> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
From: Bradley T. <br...@ar...> - 2011-08-15 14:56:04
|
Right on all counts:

Wayback 1.6.0 has an under-documented option to cdx-indexer, "-format", which specifies the fields you want produced in your index. The default value is " CDX N b a m s k r M V g" (note the leading space - it is part of the format specification Erik referenced via hyperlink.)

The extra "mystery guest" 10th field (actually field #8) is the META tag robot instructions, found within HTML resources. Values are "-" for none, or a combination of "I", "A", "F" (No-Index, No-Archive, and No-Follow, respectively.)

Starting with 1.6.0, the normal CDX index implementation, org.archive.wayback.resourceindex.CDXIndex, will handle CDX lines with either 9 or 10 columns, assuming the extra 8th column, if 10 are present, is the robot instructions field. In anticipation of potentially wanting to use the META tag robot instructions later, we opted to push the field into the default index format and make the tools handle either format, hoping to eliminate/reduce any future need to reindex content from scratch.

1.6.0 also includes another CDX implementation, org.archive.wayback.resourceindex.CDXFormatIndex, which allows for arbitrary index fields, reading the first line in the file and assuming it contains the CDX header line (for example, " CDX N b a m s k r M V g").

These are somewhat advanced features and unused at the moment, so probably not of much concern. Unless, of course, you're using the 1.6.0 indexer with a 1.4.X Wayback.. in which case there's a compatibility issue. So, Kaisa, you can either:

1) strip the 8th field (perhaps better done with 'awk' or 'perl -ane' to ensure you strip the correct field?) as you're doing

2) add the option -format " CDX N b a m s k r V g" (note the lack of "M", and again note the leading SPACE before the CDX) to the cdx-indexer tool arguments

3) upgrade your access Wayback to 1.6.X

Hope this clarifies more than confuses! 
Brad

On 8/9/11 5:46 PM, Kaisa Kaunonen wrote: > Quoting "Erik Hetzner"<eri...@uc...>: >> At Fri, 5 Aug 2011 12:11:59 +0300, >> Kaisa Kaunonen wrote: >>> Hello >>> >>> we have a newer java installation which forced us to index arc files >>> with Wayback 1.6.0 instead of 1.4.2 >>> >>> The Wayback TOMCAT application is still from 1.4.2 but it doesn't seem >>> to understand the new CDX file. >>> >>> For example, there are lines 'CDX N b a m s k r M V g' here and there >>> sprinkled around. >>> >>> Are these lines meaningful in some way? What if I remove them with a >>> script. In any case they are reduced to one single line after doing >>> sort -u newFile.cdx > sorted.cdx >>> >>> Does Wayback 1.6.0 TOMCAT application understand old & new CDX files >>> out-of-the-box? >> Hi Kaisa, >> >> This line should be at the beginning of the CDX file. >> >> http://www.archive.org/web/researcher/cdx_file_format.php >> >> I don’t believe that wayback 1.4 actually uses these lines, however, >> so you can remove them. >> >> If they are scattered around your CDX files, this is presumably >> because you are merging CDX files & sorting? >> >> best, Erik >> > > Yes, that's right. A script feeds ARC files to the CDX indexer and > those 'CDX N b a …' lines seem to be at file boundaries. > > There's also another slight difference between CDX produced by Wayback > 1.4.2 and 1.6.0 > > 1.4.2 version has > …… 200 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - 11461303 … > > 1.6.0 has > …… 200 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 1461303 …… > > > After I changed every instance of ' - - ' to ' - ' with sed, it was > possible to use the new CDX with 1.4.2. > > Kaisa |
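Brad's option #1 above (strip the 8th field rather than sed-replacing every ' - - ') can be sketched in a few lines. This is an illustrative sketch, not an official tool: it assumes the 10-column 1.6.0 default layout described in Brad's message, and the sample line in the comment is fabricated.

```python
def strip_robots_field(cdx_line):
    """Drop the META-robots column so a 10-column Wayback 1.6.0 CDX line
    becomes the 9-column layout that Wayback 1.4.x expects.
    Header lines lose the 'M' token; 10-column data lines lose field #8.
    Other lines pass through unchanged."""
    fields = cdx_line.split(" ")
    if cdx_line.startswith(" CDX"):
        # header, e.g. " CDX N b a m s k r M V g" -> drop the 'M' token
        return " ".join(f for f in fields if f != "M")
    if len(fields) == 10:
        del fields[7]  # field #8 (0-indexed 7) is the robots instructions
    return " ".join(fields)
```

Unlike a blanket sed of ' - - ', this only ever touches the one column Brad identifies, so a literal "- -" elsewhere in a URL or filename cannot be mangled.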
From: Bradley T. <br...@ar...> - 2011-08-15 09:23:55
|
Hi Gina, I've looked at the source code, and it looks like a bug.. For the global Wayback release we implemented a forced redirect to the standard ArchivalURL for form queries (with GET CGI arguments). Unfortunately, this was put into the mainline codebase without a configuration option. I've added the AccessPoint option - this behavior will be disabled by default - and will post a release 1.6.2 in a couple of days after some testing.

Brad

On 8/3/11 9:05 PM, Jones, Gina wrote: > > We are running Wayback 1.6. A programmer at LC was using an API that > returned XML, (see > https://webarchive.jira.com/wiki/display/wayback/OS+Wayback+API+Documentation > ) > > to get capture dates. I also tested this against the LC archive > hosted at IA, and also got the same non-XML results. The query > http://webarchive.loc.gov/lcwa0002/xmlquery?type=urlquery&url=http://house.gov/aderholt/ > > used to provide an XML result of all of the capture dates, but > now gets redirected (302) to LC's landing page for that archived site: > http://webarchive.loc.gov/lcwa0002/*/http://house.gov/aderholt/ > > Is this API still valid and was there a change to the query structure? > > Thanks, Gina > > Gina Jones > > Library of Congress > > Web Preservation Engineering Team. |
From: Kaisa K. <kai...@he...> - 2011-08-09 10:46:30
|
Quoting "Erik Hetzner" <eri...@uc...>: > At Fri, 5 Aug 2011 12:11:59 +0300, > Kaisa Kaunonen wrote: >> >> Hello >> >> we have a newer java installation which forced us to index arc files >> with Wayback 1.6.0 instead of 1.4.2 >> >> The Wayback TOMCAT application is still from 1.4.2 but it doesn't seem >> to understand the new CDX file. >> >> For example, there are lines 'CDX N b a m s k r M V g' here and there >> sprinkled around. >> >> Are these lines meaningful in some way? What if I remove them with a >> script. In any case they are reduced to one single line after doing >> sort -u newFile.cdx > sorted.cdx >> >> Does Wayback 1.6.0 TOMCAT application understand old & new CDX files >> out-of-the-box? > > Hi Kaisa, > > This line should be at the beginning of the CDX file. > > http://www.archive.org/web/researcher/cdx_file_format.php > > I don’t believe that wayback 1.4 actually uses these lines, however, > so you can remove them. > > If they are scattered around your CDX files, this is presumably > because you are merging CDX files & sorting? > > best, Erik >

Yes, that's right. A script feeds ARC files to the CDX indexer, and those 'CDX N b a …' lines seem to be at file boundaries.

There's also another slight difference between the CDX produced by Wayback 1.4.2 and 1.6.0. The 1.4.2 version has …… 200 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - 11461303 … while 1.6.0 has …… 200 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 1461303 …… After I changed every instance of ' - - ' to ' - ' with sed, it was possible to use the new CDX with 1.4.2. Kaisa |
From: Erik H. <eri...@uc...> - 2011-08-05 17:55:08
|
At Fri, 5 Aug 2011 12:11:59 +0300, Kaisa Kaunonen wrote: > > Hello > > we have a newer java installation which forced us to index arc files > with Wayback 1.6.0 instead of 1.4.2 > > The Wayback TOMCAT application is still from 1.4.2 but it doesn't seem > to understand the new CDX file. > > For example, there are lines 'CDX N b a m s k r M V g' here and there > sprinkled around. > > Are these lines meaningful in some way? What if I remove them with a > script. In any case they are reduced to one single line after doing > sort -u newFile.cdx > sorted.cdx > > Does Wayback 1.6.0 TOMCAT application understand old & new CDX files > out-of-the-box? Hi Kaisa, This line should be at the beginning of the CDX file. http://www.archive.org/web/researcher/cdx_file_format.php I don’t believe that wayback 1.4 actually uses these lines, however, so you can remove them. If they are scattered around your CDX files, this is presumably because you are merging CDX files & sorting? best, Erik |
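The scattered header lines Erik describes come from concatenating per-file cdx-indexer output and then sorting. A minimal sketch of a merge that emits a single header instead, assuming the 1.6.0 default header line and plain-text, already-generated CDX inputs:

```python
def merge_cdx(paths, out_path, header=" CDX N b a m s k r M V g"):
    """Combine several CDX files into one sorted file with exactly one
    header line. Per-file header lines (which start with ' CDX') are
    dropped so they don't end up scattered through the merged output.
    Header layout shown is the Wayback 1.6.0 default."""
    lines = []
    for p in paths:
        with open(p) as f:
            lines.extend(l.rstrip("\n") for l in f if not l.startswith(" CDX"))
    lines.sort()
    with open(out_path, "w") as out:
        out.write(header + "\n")
        out.write("\n".join(lines) + "\n")
```

Note that the header's leading space makes it sort before any data line anyway, which is why a plain sort -u also collapses the strays to a single first line, as Kaisa observed.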
From: Kaisa K. <kai...@he...> - 2011-08-05 10:03:05
|
Hello, we have a newer Java installation which forced us to index ARC files with Wayback 1.6.0 instead of 1.4.2. The Wayback Tomcat application is still from 1.4.2, but it doesn't seem to understand the new CDX file. For example, there are lines 'CDX N b a m s k r M V g' here and there, sprinkled around. Are these lines meaningful in some way? What if I remove them with a script? In any case they are reduced to one single line after doing sort -u newFile.cdx > sorted.cdx Does the Wayback 1.6.0 Tomcat application understand old & new CDX files out-of-the-box? Best Kaisa Kaunonen --- Nat. Lib. Finland --- |
From: Jones, G. <gj...@lo...> - 2011-08-03 14:43:00
|
We are running Wayback 1.6. A programmer at LC was using an API that returned XML (see https://webarchive.jira.com/wiki/display/wayback/OS+Wayback+API+Documentation ) to get capture dates. I also tested this against the LC archive hosted at IA, and got the same non-XML results. The query http://webarchive.loc.gov/lcwa0002/xmlquery?type=urlquery&url=http://house.gov/aderholt/ used to provide an XML result of all of the capture dates, but now gets redirected (302) to LC's landing page for that archived site: http://webarchive.loc.gov/lcwa0002/*/http://house.gov/aderholt/ Is this API still valid and was there a change to the query structure? Thanks, Gina Gina Jones Library of Congress Web Preservation Engineering Team. |