Re: [Archive-access-discuss] WARC Manipulation and manually creating WARCs: Need guidance

Hi Mat,
I don't think warcvalid.py does a very thorough check of what is in your WARC when it validates, so you shouldn't rely on it alone.

Looking at your WARC, I think one problem may be the Content-Length you are giving in your WARC record headers. Are you calculating that manually? According to the WARC specification, "A WARC record shall consist of a record header followed by a record content block and two newlines." The Content-Length should give the size of the entire block following the WARC headers. Your response record gives "Content-Length: 39", but that size covers only the payload of the record, that is, "<html><body>Hello World!</body></html>". It should also include the HTTP headers that you added, as they are part of the block. If I remember correctly, when you create your record using the Hanzo tools, they will calculate and add the Content-Length field automatically if you don't supply it. Whether they do or not, you should check that Content-Length field and the value you are putting there.
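
To make the Content-Length point concrete, here is a minimal Python sketch of building one response record by hand. It is not how the Hanzo tools or Heritrix do it. The function names, the fabricated HTTP headers, and the example URI are all assumptions made for illustration. The point it shows is that the WARC-level Content-Length is taken from the whole block (HTTP status line, HTTP headers, blank line, and payload), not from the payload alone.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_http_block(payload: bytes, content_type: str = "text/html") -> bytes:
    """Fabricated HTTP response: status line, headers, blank line, payload."""
    http_headers = CRLF.join([
        b"HTTP/1.1 200 OK",
        b"Content-Type: " + content_type.encode("ascii"),
        # HTTP-level length: the payload only.
        b"Content-Length: " + str(len(payload)).encode("ascii"),
    ])
    return http_headers + CRLF + CRLF + payload


def build_response_record(target_uri: str, payload: bytes) -> bytes:
    """Build one WARC response record (hypothetical helper, not a library API)."""
    block = build_http_block(payload)
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    record_id = "<urn:uuid:%s>" % uuid.uuid4()
    warc_headers = CRLF.join([
        b"WARC/1.0",
        b"WARC-Type: response",
        b"WARC-Target-URI: " + target_uri.encode("utf-8"),
        b"WARC-Date: " + warc_date.encode("ascii"),
        b"WARC-Record-ID: " + record_id.encode("ascii"),
        b"Content-Type: application/http; msgtype=response",
        # WARC-level length: the entire block, HTTP headers included.
        b"Content-Length: " + str(len(block)).encode("ascii"),
    ])
    # Record header, blank line, content block, then the two trailing newlines.
    return warc_headers + CRLF + CRLF + block + CRLF + CRLF


if __name__ == "__main__":
    body = b"<html><body>Hello World!</body></html>\n"
    record = build_response_record("http://example.com/", body)
    print(record.decode("utf-8"))
```

Running it prints a record whose WARC Content-Length is larger than the HTTP Content-Length, because the former also counts the HTTP status line, headers, and separating blank line.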

Lauren Ko
Web Archiving Programmer
UNT Libraries
________________________________________
From: Mat Kelly [mk...@cs...]
Sent: Wednesday, October 19, 2011 2:33 PM
To: arc...@li...
Subject: Re: [Archive-access-discuss] WARC Manipulation and manually creating WARCs: Need guidance

Hello,
I have fabricated a WARC file with the help I have thus far obtained
on this forum but am having difficulty getting Wayback to display the
data contained within the record. I am able to add WARCs from other
sources to my Wayback instance after adding my fabricated one and have
them displayed, but the content within mine never displays. I have
used the tools from Hanzo Archives (http://code.hanzoarchives.com/),
particularly warc-valid.py, to ensure that my WARC file has no
trivial issues. warc-valid assures me that my WARC file is valid. How
do I get this WARC file to be displayed in my Wayback instance? I have
attached the WARC file.

Thank you,
Mat

On Wed, Oct 5, 2011 at 12:23 PM, Bradley Tofel <br...@ar...> wrote:
> HTTP headers are considered part of the response, and part of the archival
> record - if it's possible to save them within your system, I'd suggest
> grabbing them going forward, and also that you consider using Heritrix
> for your archiving. Once you have it running, you'll have standard formats
> available, tools that the rest of the community is using (a great resource
> for getting help), and a lot of features that will be cumbersome to
> replicate.
>
> You could fabricate the HTTP headers yourself for previously archived
> materials - Wayback will need them to replay content.
>
> As to the question about getting your new content indexed with Wayback,
> you'll need to either rename the file, so Wayback notices it as new content,
> or reset your indexing directory state:
>
> * stop Tomcat
> * delete all files under .../wayback/{index,index-data,file-db}
> * place new W/ARC files under .../wayback/files{1,2}
> * start Tomcat
>
> Hope this helps,
>
> Brad
>
> On 10/5/11 6:20 AM, Mat Kelly wrote:
>
> Brad,
> I did not realize Wayback would consider uncompressed WARCs. That
> information will be useful. I was also considering the ARC format to
> get around my WARC issues but have only recently begun to explore
> that.
>
> Regarding your questions, I do not currently collect HTTP headers for
> my data. I have created a tool that essentially saves a certain type
> of webpage and all associated media to a local directory and retains
> information such as time of archiving and original URI as metadata.
> Are HTTP headers critical for the format? Could they be artificially
> created to comply with the standard? I do know Java and was also
> looking into the three projects that Erik (thanks!) suggested to
> extract some of the code for my use or at least get a basis for
> porting the code but the WARC format seems pretty coupled with the
> rest of each package.
>
> From the truncating scheme I described in a past message, why should
> it not work if it is simply truncating off records? Should something
> else be adjusted in the resulting file to account for the difference
> in length and/or record count?
>
> Thanks,
> Mat
>
>
> ---------- Forwarded message ----------
> From: Bradley Tofel <br...@ar...>
> Date: Tue, Oct 4, 2011 at 9:28 PM
> Subject: Re: [Archive-access-discuss] WARC Manipulation and manually
> creating WARCs: Need guidance
> To: "me...@ma..." <me...@ma...>
>
>
> Hi Mat,
>
> Another solution to side-step the compression complexities while you
> work on the WARC format issues, would be using uncompressed WARC files
> - just skip the compress step altogether (be sure to remove the ".gz"
> suffix)
>
> Wayback should handle those fine - note you do still need to create
> WARC records to encapsulate the archived content, but this may lower
> the bar to some iterative testing.
>
> A couple questions to help steer you in the right direction:
>
> 1) do you have HTTP response headers for your archived content?
> 2) do you know Java?
>
> Brad
>
> On 10/4/11 5:09 PM, Erik Hetzner wrote:
>
> At Tue, 4 Oct 2011 20:02:01 -0400,
> Mat Kelly wrote:
>
> Erik,
> Thank you for the reply. Please do send your script, as it might be
> helpful. From the procedure above, I was hoping to create a base case
> WARC and if I am not doing so properly, is there a bare bones template
> to create a WARC file? Once I am familiar enough with the
> procedure/structure, I plan to write a script to do the work but
> wanted first to understand how I go about constructing a WARC. Please
> supply any insight you can, as I am just learning about this system.
>
> Hi Mat,
>
> Attached.
>
> As far as I know there is no template to create a WARC file.
>
> You might want to have a look at the warc-tools project [1], the
> it.unimi tools [2], or the heritrix commons tools [3].
>
> best, Erik
>
> 1. http://code.hanzoarchives.com/
> 2. http://law.dsi.unimi.it/software/docs/it/unimi/dsi/law/warc/io/package-summary.html
> 3. http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix-commons/
>
>
>
> Sent from my free software system <http://fsf.org/>.
>
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss