From: Mat K. <mk...@cs...> - 2011-10-05 13:20:43
Brad,

I did not realize Wayback would consider uncompressed WARCs. That information will be useful. I was also considering the ARC format to get around my WARC issues but have only recently begun to explore that.

Regarding your questions, I do not currently collect HTTP headers for my data. I have created a tool that essentially saves a certain type of webpage and all associated media to a local directory and retains information such as time of archiving and original URI as metadata. Are HTTP headers critical for the format? Could they be artificially created to comply with the standard?

I do know Java and was also looking into the three projects that Erik (thanks!) suggested to extract some of the code for my use or at least get a basis for porting the code, but the WARC format seems pretty coupled with the rest of each package.

From the truncating scheme I described in a past message, why should it not work if it is simply truncating off records? Should something else be adjusted in the resulting file to account for the difference in length and/or record count?

Thanks,
Mat

---------- Forwarded message ----------
From: Bradley Tofel <br...@ar...>
Date: Tue, Oct 4, 2011 at 9:28 PM
Subject: Re: [Archive-access-discuss] WARC Manipulation and manually creating WARCs: Need guidance
To: "me...@ma..." <me...@ma...>

Hi Mat,

Another solution to side-step the compression complexities while you work on the WARC format issues would be using uncompressed WARC files - just skip the compress step altogether (be sure to remove the ".gz" suffix). Wayback should handle those fine - note you do still need to create WARC records to encapsulate the archived content, but this may lower the bar to some iterative testing.

A couple questions to help steer you in the right direction:

1) do you have HTTP response headers for your archived content?
2) do you know Java?
Brad

On 10/4/11 5:09 PM, Erik Hetzner wrote:

At Tue, 4 Oct 2011 20:02:01 -0400, Mat Kelly wrote:

Erik,

Thank you for the reply. Please do send your script, as it might be helpful. From the procedure above, I was hoping to create a base case WARC and if I am not doing so properly, is there a bare bones template to create a WARC file?

Once I am familiar enough with the procedure/structure, I plan to write a script to do the work but wanted first to understand how I go about constructing a WARC. Please supply any insight you can, as I am just learning about this system.

Hi Mat,

Attached.

As far as I know there is no template to create a WARC file. You might want to have a look at the warc-tools project [1] or the it.unimi tools [2] as well as the heritrix commons tools [3].

best,
Erik

1. http://code.hanzoarchives.com/
2. http://law.dsi.unimi.it/software/docs/it/unimi/dsi/law/warc/io/package-summary.html
3. http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix-commons/

Sent from my free software system <http://fsf.org/>.

_______________________________________________
Archive-access-discuss mailing list
Arc...@li...
https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
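
For anyone following this thread, the two open questions (wrapping saved pages that have no captured HTTP headers, and truncating a file record by record) can be illustrated with a minimal Python sketch. It is not taken from Erik's attached script or from any of the projects listed above. The function names `build_response_record` and `split_records` are hypothetical, and the `HTTP/1.1 200 OK` status line and headers are synthesized placeholders, on the assumption that made-up headers are acceptable. Nothing in this thread confirms that Wayback accepts such records, so treat it as a starting point for iterative testing only. The WARC header field names are the mandatory ones from the WARC 1.0 format.

```python
import uuid
from datetime import datetime, timezone
from typing import List

CRLF = b"\r\n"


def build_response_record(uri: str, payload: bytes, archived_at: datetime,
                          content_type: str = "text/html") -> bytes:
    """Wrap one saved page in an uncompressed WARC/1.0 'response' record.

    archived_at should be timezone-aware; a naive datetime would be
    interpreted as local time by astimezone().
    """
    # ASSUMPTION: the real HTTP headers were never captured, so this
    # status line and these headers are invented for illustration.
    http_block = CRLF.join([
        b"HTTP/1.1 200 OK",
        b"Content-Type: " + content_type.encode("ascii"),
        b"Content-Length: " + str(len(payload)).encode("ascii"),
        b"",
        b"",
    ]) + payload

    warc_date = archived_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    warc_headers = CRLF.join([
        b"WARC/1.0",
        b"WARC-Type: response",
        b"WARC-Record-ID: <urn:uuid:" + str(uuid.uuid4()).encode("ascii") + b">",
        b"WARC-Date: " + warc_date.encode("ascii"),
        b"WARC-Target-URI: " + uri.encode("utf-8"),
        b"Content-Type: application/http; msgtype=response",
        # Content-Length counts the whole block: HTTP headers plus payload.
        b"Content-Length: " + str(len(http_block)).encode("ascii"),
        b"",
        b"",
    ])
    # Each record is terminated by two CRLFs after the block.
    return warc_headers + http_block + CRLF + CRLF


def split_records(data: bytes) -> List[bytes]:
    """Split concatenated uncompressed WARC records on record boundaries."""
    records = []
    pos = 0
    while pos < len(data):
        # The WARC header section ends at the first blank line.
        header_end = data.index(CRLF + CRLF, pos) + 4
        length = None
        for line in data[pos:header_end].split(CRLF):
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("record at offset %d has no Content-Length" % pos)
        end = header_end + length + 4  # block plus the trailing CRLF CRLF
        records.append(data[pos:end])
        pos = end
    return records
```

Under these assumptions, truncating an uncompressed file to its first n records is `b"".join(split_records(data)[:n])`: each record carries its own Content-Length and the format has no file-level record count, so nothing else needs adjusting as long as the cut falls exactly on a record boundary. Following Brad's advice, the output would be written to a name without the ".gz" suffix, for example `example.warc` (a placeholder filename).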