Re: [Archive-access-discuss] WARC Manipulation and manually creating WARCs: Need guidance

Brad,
I did not realize Wayback would consider uncompressed WARCs. That
information will be useful. I was also considering the ARC format to
get around my WARC issues but have only recently begun to explore
that.

Regarding your questions, I do not currently collect HTTP headers for
my data. I have created a tool that essentially saves a certain type
of webpage and all associated media to a local directory and retains
information such as time of archiving and original URI as metadata.
Are HTTP headers critical for the format? Could they be artificially
created to comply with the standard? I do know Java and was also
looking into the three projects that Erik (thanks!) suggested to
extract some of the code for my use or at least get a basis for
porting the code but the WARC format seems pretty coupled with the
rest of each package.
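
For illustration, here is a rough Python sketch of what artificially created headers might look like when a saved page is wrapped in a WARC/1.0 "response" record. The function name build_response_record and its parameters are hypothetical, the "200 OK" status line and the two HTTP headers are synthesized rather than captured, and whether Wayback replays such records correctly is an assumption that needs testing. The record layout (version line, named fields, blank line, block, two closing CRLFs) follows the WARC 1.0 format.

```python
import uuid
from datetime import datetime, timezone


def build_response_record(uri: str, body: bytes, content_type: str,
                          archived_at: datetime) -> bytes:
    """Wrap one saved resource in a WARC/1.0 'response' record (hypothetical helper)."""
    # Synthesized HTTP response: the status line and headers are made up
    # here, because the original server headers were never captured.
    http_block = (
        b"HTTP/1.1 200 OK\r\n"
        + f"Content-Type: {content_type}\r\n".encode("ascii")
        + f"Content-Length: {len(body)}\r\n".encode("ascii")
        + b"\r\n"
        + body
    )
    # WARC-Date must be a UTC timestamp; a naive datetime is treated as local time.
    warc_date = archived_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    warc_headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {uri}",
        "Content-Type: application/http; msgtype=response",
        # Content-Length counts the whole block: HTTP headers plus payload.
        f"Content-Length: {len(http_block)}",
    ]
    header_bytes = "\r\n".join(warc_headers).encode("utf-8") + b"\r\n\r\n"
    # Every record is closed by two CRLFs after the block.
    return header_bytes + http_block + b"\r\n\r\n"
```

Writing the returned bytes to a file opened in binary mode, with a plain ".warc" name, would give an uncompressed single-record file to test against. The format also defines a "resource" record type that carries a payload without any HTTP headers; whether that suits replay here is a separate question.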

From the truncating scheme I described in a past message, why should
it not work if it is simply truncating off records? Should something
else be adjusted in the resulting file to account for the difference
in length and/or record count?
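
As a point of comparison, here is a rough Python sketch of truncating on record boundaries. It assumes an uncompressed WARC held in memory, and the names split_records and truncate_warc are hypothetical. It walks the file using each record's Content-Length, so a cut can only land between whole records. It is not the scheme from the earlier message.

```python
def split_records(data: bytes) -> list:
    """Split the bytes of an uncompressed WARC file into whole records."""
    records = []
    pos = 0
    while pos < len(data):
        # The WARC header ends at the first blank line after the record start.
        header_end = data.find(b"\r\n\r\n", pos)
        if header_end == -1:
            raise ValueError("incomplete WARC header at offset %d" % pos)
        length = None
        for line in data[pos:header_end].split(b"\r\n"):
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record at offset %d has no Content-Length" % pos)
        # header + blank line (4 bytes) + block + the two CRLFs closing the record
        end = header_end + 4 + length + 4
        if end > len(data):
            raise ValueError("record at offset %d is cut short" % pos)
        records.append(data[pos:end])
        pos = end
    return records


def truncate_warc(data: bytes, keep: int) -> bytes:
    """Return a WARC containing only the first `keep` records."""
    return b"".join(split_records(data)[:keep])
```

As far as I can tell, the file itself stores no total length or record count, so nothing inside it should need adjusting after a cut like this. Any external index built from the old file (a CDX, for example) would be stale. For ".warc.gz" files the usual convention is one gzip member per record, so a cut there has to fall on a member boundary rather than at an arbitrary byte offset.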

Thanks,
Mat

---------- Forwarded message ----------
From: Bradley Tofel <br...@ar...>
Date: Tue, Oct 4, 2011 at 9:28 PM
Subject: Re: [Archive-access-discuss] WARC Manipulation and manually
creating WARCs: Need guidance
To: "me...@ma..." <me...@ma...>

Hi Mat,

Another way to sidestep the compression complexities while you work
on the WARC format issues would be to use uncompressed WARC files:
just skip the compression step altogether (and be sure to remove the
".gz" suffix).

Wayback should handle those fine. Note that you still need to create
WARC records to encapsulate the archived content, but this may lower
the bar for some iterative testing.

A couple questions to help steer you in the right direction:

1) do you have HTTP response headers for your archived content?
2) do you know Java?

Brad

On 10/4/11 5:09 PM, Erik Hetzner wrote:

At Tue, 4 Oct 2011 20:02:01 -0400,
Mat Kelly wrote:

Erik,
Thank you for the reply. Please do send your script, as it might be
helpful. From the procedure above, I was hoping to create a base case
WARC and if I am not doing so properly, is there a bare bones template
to create a WARC file? Once I am familiar enough with the
procedure/structure, I plan to write a script to do the work but
wanted first to understand how I go about constructing a WARC. Please
supply any insight you can, as I am just learning about this system.

Hi Mat,

Attached.

As far as I know there is no template to create a WARC file.

You might want to have a look at the warc-tools project [1], the
it.unimi tools [2], or the heritrix commons tools [3].

best, Erik

1. http://code.hanzoarchives.com/
2. http://law.dsi.unimi.it/software/docs/it/unimi/dsi/law/warc/io/package-summary.html
3. http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix-commons/

Sent from my free software system <http://fsf.org/>.

_______________________________________________
Archive-access-discuss mailing list
Arc...@li...
https://lists.sourceforge.net/lists/listinfo/archive-access-discuss