From: Bradley T. <br...@ar...> - 2011-10-05 16:23:19
HTTP headers are considered part of the response, and part of the archival record - if it's possible to save them within your system, I'd suggest grabbing them going forward, and also that you consider using Heritrix for your archiving. Once you have it running, you'll have standard formats available, tools that the rest of the community is using (a great resource for getting help), and a lot of features that would be cumbersome to replicate. You could fabricate the HTTP headers yourself for previously archived materials - Wayback will need them to replay content.

As to the question about getting your new content indexed with Wayback, you'll need to either rename the file, so Wayback notices it as new content, or reset your indexing directory state:

* stop Tomcat
* delete all files under .../wayback/{index,index-data,file-db}
* place new W/ARC files under .../wayback/files{1,2}
* start Tomcat

Hope this helps,

Brad

On 10/5/11 6:20 AM, Mat Kelly wrote:
> Brad,
> I did not realize Wayback would consider uncompressed WARCs. That
> information will be useful. I was also considering the ARC format to
> get around my WARC issues but have only recently begun to explore
> that.
>
> Regarding your questions, I do not currently collect HTTP headers for
> my data. I have created a tool that essentially saves a certain type
> of webpage and all associated media to a local directory and retains
> information such as time of archiving and original URI as metadata.
> Are HTTP headers critical for the format? Could they be artificially
> created to comply with the standard? I do know Java and was also
> looking into the three projects that Erik (thanks!) suggested, to
> extract some of the code for my use or at least get a basis for
> porting the code, but the WARC format seems pretty coupled with the
> rest of each package.
>
> Regarding the truncating scheme I described in a past message, why
> should it not work if it is simply truncating off records? Should
> something else be adjusted in the resulting file to account for the
> difference in length and/or record count?
>
> Thanks,
> Mat
>
> ---------- Forwarded message ----------
> From: Bradley Tofel <br...@ar...>
> Date: Tue, Oct 4, 2011 at 9:28 PM
> Subject: Re: [Archive-access-discuss] WARC Manipulation and manually
> creating WARCs: Need guidance
> To: "me...@ma..." <me...@ma...>
>
> Hi Mat,
>
> Another solution to side-step the compression complexities while you
> work on the WARC format issues would be to use uncompressed WARC files
> - just skip the compression step altogether (be sure to remove the
> ".gz" suffix).
>
> Wayback should handle those fine - note you do still need to create
> WARC records to encapsulate the archived content, but this may lower
> the bar to some iterative testing.
>
> A couple of questions to help steer you in the right direction:
>
> 1) do you have HTTP response headers for your archived content?
> 2) do you know Java?
>
> Brad
>
> On 10/4/11 5:09 PM, Erik Hetzner wrote:
> At Tue, 4 Oct 2011 20:02:01 -0400, Mat Kelly wrote:
> Erik,
> Thank you for the reply. Please do send your script, as it might be
> helpful. From the procedure above, I was hoping to create a base-case
> WARC, and if I am not doing so properly, is there a bare-bones
> template to create a WARC file? Once I am familiar enough with the
> procedure/structure, I plan to write a script to do the work but
> wanted first to understand how to go about constructing a WARC.
> Please supply any insight you can, as I am just learning about this
> system.
>
> Hi Mat,
>
> Attached.
>
> As far as I know there is no template to create a WARC file.
>
> You might want to have a look at the warc-tools project [1] or the
> it.unimi tools [2] as well as the heritrix commons tools [3].
>
> best, Erik
>
> 1. http://code.hanzoarchives.com/
> 2. http://law.dsi.unimi.it/software/docs/it/unimi/dsi/law/warc/io/package-summary.html
> 3. http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix-commons/
>
> Sent from my free software system <http://fsf.org/>.
>
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure contains a
> definitive record of customers, application performance, security
> threats, fraudulent activity and more. Splunk takes this data and makes
> sense of it. Business sense. IT sense. Common sense.
> http://p.sf.net/sfu/splunk-d2dcopy1
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
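[Editor's sketch] Brad's point that HTTP headers can be fabricated for previously archived materials can be made concrete. The following is a minimal illustration in Python, not code from Wayback, Heritrix, or any of the tools cited above; the status line, header values, and `example.org` URI are assumptions, and a real record should carry whatever honest metadata (original URI, capture time, MIME type) your tool saved:

```python
import uuid

def fabricate_http_block(body, mime="text/html"):
    """Fabricate a minimal HTTP response block for content that was
    saved without its headers. The status line and headers here are
    invented; only the body bytes are real archived content."""
    headers = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: " + mime.encode("ascii") + b"\r\n"
        b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
        b"\r\n"
    )
    return headers + body

def warc_response_record(uri, date, body):
    """Build one uncompressed WARC/1.0 response record as bytes.
    Content-Length counts the HTTP block (headers + body), and the
    record is terminated by two CRLFs."""
    http_block = fabricate_http_block(body)
    warc_headers = (
        b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"WARC-Record-ID: <urn:uuid:"
        + str(uuid.uuid4()).encode("ascii") + b">\r\n"
        b"WARC-Date: " + date.encode("ascii") + b"\r\n"
        b"WARC-Target-URI: " + uri.encode("ascii") + b"\r\n"
        b"Content-Type: application/http; msgtype=response\r\n"
        b"Content-Length: "
        + str(len(http_block)).encode("ascii") + b"\r\n"
        b"\r\n"
    )
    return warc_headers + http_block + b"\r\n\r\n"

record = warc_response_record(
    "http://example.org/", "2011-10-05T16:23:19Z", b"<html>hello</html>")
```

Writing such records back to back into a file with no ".gz" suffix gives the uncompressed WARC that Brad suggests Wayback can index directly.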
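[Editor's sketch] On Mat's truncation question: an uncompressed WARC file carries no file-level length or record count, so dropping whole records from the end leaves a valid file, provided the cut lands exactly on a record boundary. A rough boundary-aware sketch (assumes uncompressed input with CRLF line endings; `split_records` and `truncate_warc` are illustrative names, not from any tool in this thread):

```python
def split_records(data):
    """Split uncompressed WARC bytes into whole records by reading each
    record's header block and its Content-Length field. There is no
    global index to update, so records can simply be dropped."""
    records, pos = [], 0
    while pos < len(data):
        # The WARC header block ends at the first blank line.
        header_end = data.index(b"\r\n\r\n", pos) + 4
        length = 0
        for line in data[pos:header_end].decode("iso-8859-1").split("\r\n"):
            if line.lower().startswith("content-length:"):
                length = int(line.split(":", 1)[1])
        # Content block plus the two CRLFs terminating the record.
        rec_end = header_end + length + 4
        records.append(data[pos:rec_end])
        pos = rec_end
    return records

def truncate_warc(data, keep):
    """Keep the first `keep` records; nothing else needs adjusting."""
    return b"".join(split_records(data)[:keep])

# Tiny synthetic example: two identical resource records.
rec = (b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\n"
       b"hello\r\n\r\n")
two = rec + rec
```

Note that after editing a file this way, Wayback must be made to reindex it - by renaming the file or by resetting the index state as Brad describes above.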