From: Ko, L. <Lau...@un...> - 2011-10-19 22:51:36
Hi Mat,

I don't think warcvalid.py actually does a very thorough check of what is in your WARC at validation, so you shouldn't rely on it.

Looking at your WARC, I think one problem may be the Content-Length you are giving in your WARC record headers. Are you calculating that manually? According to the WARC specification, "A WARC record shall consist of a record header followed by a record content block and two new lines." The Content-Length should give the size of the entire block following the WARC headers. Looking at your response record that gives "Content-Length: 39", I see that this size only accounts for the payload of the record, that is "<html><body>Hello World!</body></html>". It should also include the HTTP headers that you added, as they are part of the block.

If you create your record using the Hanzo tools, if I remember correctly, they will calculate and add the Content-Length field automatically if you don't supply it. Whether they do or not, though, you should look at that Content-Length field and what you are putting there.

Lauren Ko
Web Archiving Programmer
UNT Libraries

________________________________________
From: Mat Kelly [mk...@cs...]
Sent: Wednesday, October 19, 2011 2:33 PM
To: arc...@li...
Subject: Re: [Archive-access-discuss] WARC Manipulation and manually creating WARCs: Need guidance

Hello,

I have fabricated a WARC file with the help I have obtained so far on this forum, but I am having difficulty getting Wayback to display the data contained within the record. I am able to add WARCs from other sources to my Wayback instance after adding my fabricated one and have them displayed, but the content within mine never displays.

I have used the tools from Hanzo Archives (http://code.hanzoarchives.com/), particularly warc-valid.py, to ensure that my WARC file has no trivial issues. warc-valid assures me that my WARC file is valid.

How do I get this WARC file to be displayed in my Wayback instance? I have attached the WARC file.
Thank you,
Mat

On Wed, Oct 5, 2011 at 12:23 PM, Bradley Tofel <br...@ar...> wrote:
> HTTP headers are considered part of the response, and part of the archival
> record - if it's possible to save them within your system, I'd suggest
> grabbing them going forward, and also that you consider using Heritrix
> for your archiving. Once you have it running, you'll have standard formats
> available, tools that the rest of the community is using (a great resource
> for getting help), and a lot of features that will be cumbersome to
> replicate.
>
> You could fabricate the HTTP headers yourself for previously archived
> materials - Wayback will need them to replay content.
>
> As to the question about getting your new content indexed with Wayback,
> you'll need to either rename the file, so Wayback notices it as new content,
> or reset your indexing directory state:
>
> * stop Tomcat
> * delete all files under .../wayback/{index,index-data,file-db}
> * place new W/ARC files under .../wayback/files{1,2}
> * start Tomcat
>
> Hope this helps,
>
> Brad
>
> On 10/5/11 6:20 AM, Mat Kelly wrote:
>
> Brad,
> I did not realize Wayback would consider uncompressed WARCs. That
> information will be useful. I was also considering the ARC format to
> get around my WARC issues but have only recently begun to explore
> that.
>
> Regarding your questions, I do not currently collect HTTP headers for
> my data. I have created a tool that essentially saves a certain type
> of webpage and all associated media to a local directory and retains
> information such as time of archiving and original URI as metadata.
> Are HTTP headers critical for the format? Could they be artificially
> created to comply with the standard? I do know Java and was also
> looking into the three projects that Erik (thanks!) suggested to
> extract some of the code for my use or at least get a basis for
> porting the code but the WARC format seems pretty coupled with the
> rest of each package.
>
> From the truncating scheme I described in a past message, why should
> it not work if it is simply truncating off records? Should something
> else be adjusted in the resulting file to account for the difference
> in length and/or record count?
>
> Thanks,
> Mat
>
>
> ---------- Forwarded message ----------
> From: Bradley Tofel <br...@ar...>
> Date: Tue, Oct 4, 2011 at 9:28 PM
> Subject: Re: [Archive-access-discuss] WARC Manipulation and manually
> creating WARCs: Need guidance
> To: "me...@ma..." <me...@ma...>
>
>
> Hi Mat,
>
> Another solution to side-step the compression complexities while you
> work on the WARC format issues would be using uncompressed WARC files
> - just skip the compress step altogether (be sure to remove the ".gz"
> suffix).
>
> Wayback should handle those fine - note you do still need to create
> WARC records to encapsulate the archived content, but this may lower
> the bar to some iterative testing.
>
> A couple of questions to help steer you in the right direction:
>
> 1) do you have HTTP response headers for your archived content?
> 2) do you know Java?
>
> Brad
>
> On 10/4/11 5:09 PM, Erik Hetzner wrote:
>
> At Tue, 4 Oct 2011 20:02:01 -0400,
> Mat Kelly wrote:
>
> Erik,
> Thank you for the reply. Please do send your script, as it might be
> helpful. From the procedure above, I was hoping to create a base case
> WARC and if I am not doing so properly, is there a bare bones template
> to create a WARC file? Once I am familiar enough with the
> procedure/structure, I plan to write a script to do the work but
> wanted first to understand how I go about constructing a WARC. Please
> supply any insight you can, as I am just learning about this system.
>
> Hi Mat,
>
> Attached.
>
> As far as I know there is no template to create a WARC file.
>
> You might want to have a look at the warc-tools project [1] or the
> it.unimi tools [2] as well as the heritrix commons tools [3].
>
> best, Erik
>
> 1. http://code.hanzoarchives.com/
> 2. http://law.dsi.unimi.it/software/docs/it/unimi/dsi/law/warc/io/package-summary.html
> 3. http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix-commons/
>
>
>
> Sent from my free software system <http://fsf.org/>.
>
>
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure contains a
> definitive record of customers, application performance, security
> threats, fraudulent activity and more. Splunk takes this data and makes
> sense of it. Business sense. IT sense. Common sense.
> http://p.sf.net/sfu/splunk-d2dcopy1
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
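
To illustrate the Content-Length point raised at the top of this thread, here is a minimal Python sketch of how the value for a response record could be computed so that it covers the whole content block (HTTP headers, blank line, and payload) rather than the payload alone. This is not the Hanzo tools' code: the function name build_response_record, the URI, the date, and the record ID are placeholders made up for illustration, and the trailing newline on the payload is an assumption about the original file.

```python
CRLF = b"\r\n"


def build_response_record(target_uri, warc_date, record_id, http_head, payload):
    """Return one uncompressed WARC response record as bytes (hypothetical helper)."""
    # The content block is the complete HTTP response:
    # status line and headers, a blank line, then the payload.
    block = http_head + CRLF + CRLF + payload

    warc_headers = [
        b"WARC/1.0",
        b"WARC-Type: response",
        b"WARC-Target-URI: " + target_uri.encode("utf-8"),
        b"WARC-Date: " + warc_date.encode("ascii"),
        b"WARC-Record-ID: <" + record_id.encode("ascii") + b">",
        b"Content-Type: application/http; msgtype=response",
        # Content-Length counts every byte of the block, not just the payload.
        b"Content-Length: " + str(len(block)).encode("ascii"),
    ]

    # Record header, blank line, content block, then the two trailing newlines.
    return CRLF.join(warc_headers) + CRLF + CRLF + block + CRLF + CRLF


# Placeholder data; the trailing "\n" on the payload is an assumption.
payload = b"<html><body>Hello World!</body></html>\n"
http_head = CRLF.join([
    b"HTTP/1.1 200 OK",
    b"Content-Type: text/html",
    # The HTTP-level Content-Length covers the payload only.
    b"Content-Length: " + str(len(payload)).encode("ascii"),
])

record = build_response_record(
    "http://example.com/",
    "2011-10-19T18:33:00Z",
    "urn:uuid:00000000-0000-0000-0000-000000000000",
    http_head,
    payload,
)
print(record.decode("ascii"))
```

Note that two different Content-Length fields appear in the output: the one inside the HTTP headers describes only the HTML payload, while the one in the WARC headers is larger because it also counts the HTTP headers and the blank line that separates them from the payload.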