From: Mat K. <mk...@cs...> - 2011-10-19 19:34:06
Hello,

I have fabricated a WARC file with the help I have thus far obtained on this forum, but am having difficulty getting Wayback to display the data contained within the record. I am able to add WARCs from other sources to my Wayback instance after adding my fabricated one and have them displayed, but the content within mine never displays.

I have used the tools from Hanzo Archives (http://code.hanzoarchives.com/), particularly warc-valid.py, to check that my WARC file has no trivial issues; warc-valid assures me that my WARC file is valid. How do I get this WARC file to be displayed in my Wayback instance? I have attached the WARC file.

Thank you,
Mat

_______________________________________________
Archive-access-discuss mailing list
Arc...@li...
https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
From: Bradley T. <br...@ar...> - 2011-10-05 16:23:19
HTTP headers are considered part of the response, and part of the archival record - if it's possible to save them within your system, I'd suggest grabbing them going forward, and also that you consider using Heritrix for your archiving. Once you have it running, you'll have standard formats available, tools that the rest of the community is using (a great resource for getting help), and a lot of features that would be cumbersome to replicate.

You could fabricate the HTTP headers yourself for previously archived materials - Wayback will need them to replay content.

As to the question about getting your new content indexed with Wayback, you'll need to either rename the file, so Wayback notices it as new content, or reset your indexing directory state:

* stop Tomcat
* delete all files under .../wayback/{index,index-data,file-db}
* place new W/ARC files under .../wayback/files{1,2}
* start Tomcat

Hope this helps,

Brad
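Brad's reset procedure can be sketched as a shell session. This is a hedged illustration: the basedir and the W/ARC filename are placeholders, and the scaffolding lines exist only so the sketch is self-contained - substitute your real Wayback basedir and files.

```shell
# WAYBACK and my-new-file.warc.gz are hypothetical stand-ins for this demo.
WAYBACK="${WAYBACK:-/tmp/wayback-demo}"
mkdir -p "$WAYBACK"/{index,index-data,file-db,files1}   # demo scaffolding only
touch "$WAYBACK/index/stale.cdx" my-new-file.warc.gz    # demo scaffolding only

# 1. stop Tomcat (e.g. $CATALINA_HOME/bin/shutdown.sh)
# 2. clear all indexing state so the W/ARC directories are re-scanned
rm -rf "$WAYBACK"/index/* "$WAYBACK"/index-data/* "$WAYBACK"/file-db/*
# 3. place the new or renamed W/ARC files where Wayback looks for them
mv my-new-file.warc.gz "$WAYBACK/files1/"
# 4. start Tomcat; Wayback rebuilds its index on startup
```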
From: Mat K. <mk...@cs...> - 2011-10-05 13:20:43
Brad,

I did not realize Wayback would consider uncompressed WARCs. That information will be useful. I was also considering the ARC format to get around my WARC issues but have only recently begun to explore that.

Regarding your questions, I do not currently collect HTTP headers for my data. I have created a tool that essentially saves a certain type of webpage and all associated media to a local directory and retains information such as time of archiving and original URI as metadata. Are HTTP headers critical for the format? Could they be artificially created to comply with the standard? I do know Java and was also looking into the three projects that Erik (thanks!) suggested, to extract some of the code for my use or at least get a basis for porting the code, but the WARC format seems pretty coupled with the rest of each package.

From the truncating scheme I described in a past message, why should it not work if it is simply truncating off records? Should something else be adjusted in the resulting file to account for the difference in length and/or record count?

Thanks,
Mat
From: Bradley T. <br...@ar...> - 2011-10-05 01:28:09
Hi Mat,

Another solution to side-step the compression complexities while you work on the WARC format issues would be to use uncompressed WARC files - just skip the compress step altogether (and be sure to remove the ".gz" suffix).

Wayback should handle those fine - note you do still need to create WARC records to encapsulate the archived content, but this may lower the bar to some iterative testing.

A couple of questions to help steer you in the right direction:

1) do you have HTTP response headers for your archived content?
2) do you know Java?

Brad
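The "fabricate the HTTP headers and wrap the content in a WARC record" idea that Brad describes can be sketched in Python. This is a minimal, hedged illustration of an uncompressed WARC/1.0 response record, not the exact output of any particular tool: real files normally start with a warcinfo record and carry extra headers such as WARC-Payload-Digest, and the URI and HTML body below are made up.

```python
import uuid
from datetime import datetime, timezone

def warc_response_record(uri, http_response_bytes):
    """Build one uncompressed WARC/1.0 'response' record as bytes.

    Content-Length counts only the record block (the HTTP response bytes);
    two CRLFs terminate every WARC record.
    """
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(http_response_bytes)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return headers + http_response_bytes + b"\r\n\r\n"

# Fabricated HTTP headers for content archived without them; the body is a
# stand-in for the bytes of a locally saved file.
body = b"<html><body>hello</body></html>"
http = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: " + str(len(body)).encode() + b"\r\n\r\n"
) + body

with open("fabricated.warc", "wb") as f:   # note: no .gz suffix
    f.write(warc_response_record("http://example.com/", http))
```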
From: Erik H. <eri...@uc...> - 2011-10-05 00:09:48
At Tue, 4 Oct 2011 20:02:01 -0400, Mat Kelly wrote:
> Erik,
> Thank you for the reply. Please do send your script, as it might be
> helpful. From the procedure above, I was hoping to create a base case
> WARC and if I am not doing so properly, is there a bare bones template
> to create a WARC file? Once I am familiar enough with the
> procedure/structure, I plan to write a script to do the work but
> wanted first to understand how I go about constructing a WARC. Please
> supply any insight you can, as I am just learning about this system.

Hi Mat,

Attached.

As far as I know there is no template to create a WARC file.

You might want to have a look at the warc-tools project [1], the it.unimi tools [2], or the heritrix-commons tools [3].

best, Erik

1. http://code.hanzoarchives.com/
2. http://law.dsi.unimi.it/software/docs/it/unimi/dsi/law/warc/io/package-summary.html
3. http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix-commons/
From: Mat K. <mk...@cs...> - 2011-10-05 00:02:08
Erik,

Thank you for the reply. Please do send your script, as it might be helpful. From the procedure above, I was hoping to create a base case WARC, and if I am not doing so properly, is there a bare-bones template to create a WARC file? Once I am familiar enough with the procedure/structure, I plan to write a script to do the work, but wanted first to understand how I go about constructing a WARC. Please supply any insight you can, as I am just learning about this system.

-Mat
From: Erik H. <eri...@uc...> - 2011-10-04 23:47:48
At Tue, 4 Oct 2011 19:16:28 -0400, Mat Kelly wrote:
> I have decompressed this WARC file:
> gzip -d IAH-20080430204825-00000-blackbook.warc.gz
>
> Truncated and/or made a subtle change to the file:
> truncate -s 1500 IAH-20080430204825-00000-blackbook.warc
>
> Re-gzipped the file:
> gzip IAH-20080430204825-00000-blackbook.warc

This won't work. You need to compress each WARC record & concatenate the results. See [1]. Unfortunately this will probably be some effort. I have a perl script which can compress ARC files, but not WARC files, which I can send to you.

best, Erik

1. http://crawler.archive.org/articles/developer_manual/arcs.html

Sent from my free software system <http://fsf.org/>.
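Erik's point - that a valid .warc.gz is a concatenation of one gzip member per record, not a single gzip stream over the whole file - can be sketched in Python. The record splitter below is deliberately naive (it assumes well-formed, uncompressed WARC/1.0 input with CRLF headers and a Content-Length in every record); it is an illustration, not a production parser.

```python
import gzip

def split_warc_records(data: bytes):
    """Naively split uncompressed WARC data into whole-record byte strings.

    Each record is: header lines, blank line, Content-Length bytes of block,
    then a closing CRLF CRLF.
    """
    pos, records = 0, []
    while pos < len(data):
        head_end = data.index(b"\r\n\r\n", pos) + 4
        headers = data[pos:head_end].decode("utf-8")
        length = int(next(l.split(":", 1)[1]
                          for l in headers.splitlines()
                          if l.lower().startswith("content-length:")))
        rec_end = head_end + length + 4          # block + closing CRLFCRLF
        records.append(data[pos:rec_end])
        pos = rec_end
    return records

def compress_per_record(in_path: str, out_path: str):
    """Write each record as its own gzip member; concatenated members are
    what replay tools expect a .warc.gz to contain."""
    data = open(in_path, "rb").read()
    with open(out_path, "wb") as out:
        for rec in split_warc_records(data):
            out.write(gzip.compress(rec))
```

Truncating such a file at a record boundary leaves a smaller but still well-formed file, which is why truncation mid-stream (as with `truncate -s 1500` on the decompressed file) breaks the final record instead.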
From: Mat K. <mk...@cs...> - 2011-10-04 23:28:42
Hello,

I have successfully installed an instance of Wayback and am able to successfully add the file from http://www.archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.warc.gz to my WARCs folder, see the listing appear, and access the content of the archive. I am investigating how to create WARCs from scratch (e.g. without using Heritrix), so I wanted to modify this WARC file and see the change reflected in my local Wayback instance after allowing some time for re-indexing.

I have decompressed this WARC file:

gzip -d IAH-20080430204825-00000-blackbook.warc.gz

Truncated and/or made a subtle change to the file:

truncate -s 1500 IAH-20080430204825-00000-blackbook.warc

Re-gzipped the file:

gzip IAH-20080430204825-00000-blackbook.warc

again producing IAH-20080430204825-00000-blackbook.warc.gz in my WARC directory. But even after restarting Tomcat, and even the server, the new file's contents never become accessible. When I click on the date link supposedly associated with the modified WARC (I am guessing it is this WARC and not a stale link from the old one), I am simply told: Resource Not Available.

My ultimate goal is to create a WARC file using a collection of webpages and images that I have manually archived. Is there something wrong with my procedure above that would prevent the truncated data from showing up in Wayback? Does a resource exist that would allow me to accomplish my ultimate goal of manually creating a WARC file from a very small collection of data currently represented as files on a file system? Any advice or direction provided would be very helpful.

Thank you,
Mat
From: Colin R. <cs...@st...> - 2011-10-04 13:19:31
Hi,

Is there any way to configure Wayback to replay URLs harvested with the ftp:// scheme? We are using our own custom implementation of ResourceStore. Do the standard implementations support ftp:// ?

Colin Rosenthal
IT Developer
State and University Library, Aarhus
From: Graham, L. <lg...@lo...> - 2011-09-19 18:59:30
Hi Brad,

Below, from a previous query, you say that there is "some complexity in implementing" LiveWeb "which will probably require some additional documentation." We'd like to try this out for an onsite crawl project of a single but very large, complex LC web site on formats/digital preservation. We crawl this site once or twice a year, but we are interested to see if LiveWeb's "backfilling" possibilities, as you describe below, might help with interim capture of new single URLs on the seed. When you have some time, could you provide that additional documentation?

I have to be honest: all I've done thus far is import the LiveWeb.xml in wayback.xml, which auto-created a set of dirs, liveweb/arcs, off the basedir specified in wayback. And I've looked at the LiveWeb.xml but am not sure how to proceed.

Thanks,
Laura Graham
Library of Congress

*****************

Hi Laura,

Wayback 1.6.0 contains code to run a special AccessPoint which acts as a "modified" proxy server. When proxy requests are received by this AccessPoint, a request to the live web, for the URL requested by the client, is recorded into an ARC file on the spot. The single compressed ARC record is then returned as the HTTP entity to the requesting client. Note this means you cannot point a web browser directly at this service, since the browser doesn't know how to unpack the enclosed ARC record (there is another "unwrapping" proxy AccessPoint which does this, allowing experimenting with recording a web browser session). However, a client which expects to be returned an ARC record can then unpack it and use it - to access the entire HTTP response to a robots.txt request, for example.

This service is used in Wayback 1.6.0 to request content from the live web both for checking robots.txt files and for "backfilling" content requested via replay sessions but not in the archive.

One of the driving factors behind returning a compressed ARC record, instead of proxying the actual response, is to simplify inserting an HTTP cache between the Wayback service and the live web proxy AccessPoint. We use Varnish, which handles caching of the returned ARC record, and coalescing of multiple concurrent requests into a single request to the live web proxy AccessPoint.

We intend to make this service record WARC files in the near term - porting the old Wayback ARC recording code was more expedient for 1.6.0. Currently, there's some complexity in implementing this, which will probably require some additional documentation. If you're interested, please let me know, and we'll try to prioritize this documentation.

Lastly, note that we've discovered some significant bugs in the 1.6.0 codebase specifically related to this live web proxy AccessPoint, mostly in bad handling of connection errors and timeouts. These fixes are all in SVN currently, but we have not scheduled a 1.6.1 release at the moment.

Brad

On 3/10/11 8:02 PM, Graham, Laura wrote:
> We were wondering here at the Library of Congress about the LiveWeb.xml in Wayback 1.6. The wayback.xml explains:
>
> "LiveWeb.xml contains the 'proxylivewebcache' bean that enables fetching
> content from the live web, recording that content in ARC files.
> To use the "excluder-factory-robot" bean as an exclusionFactory property of
> AccessPoints, which will cause live robots.txt files to be consulted
> retroactively before showing archived content, you'll need to import
> LiveWeb.xml as well."
>
> We understand about consulting the robots.txt for display, of course, but can the Wayback actually write data to ARC (WARC?) files? What does "recording" mean?
>
> Thanks!
> Laura Graham
From: Mohamed E. <moh...@bi...> - 2011-09-12 14:25:06
The Wayback Machine and the resource index are working on one machine. ARC files are located on other machines. I can access these ARC files by using a local static lookup file (path index) with org.archive.wayback.resourcestore.LocationDBResourceStore. What I need right now is how to access a remote resource index from the Wayback Machine.

--
Mohamed Elsayed
Bibliotheca Alexandrina
From: Finn, B. L <bra...@ed...> - 2011-09-08 03:18:31
Can you send me your BDBCollection.xml for Wayback and your wct-core.properties file for WCT?

-----------------------------------------------------------------------------
CONFIDENTIALITY NOTICE AND DISCLAIMER
Information in this transmission is intended only for the person(s) to whom it is addressed and may contain privileged and/or confidential information. If you are not the intended recipient, any disclosure, copying or dissemination of the information is unauthorised and you should delete/destroy all copies and notify the sender. No liability is accepted for any unauthorised use of the information contained in this transmission. This disclaimer has been automatically added.
From: Allen S. <all...@gm...> - 2011-09-08 02:41:56
Hi there,

I'm wondering if anyone else has experienced this problem and whether there's a simple solution we can apply. I have been using WCT 1.5.1 and Wayback to harvest and archive websites for two months. I use Wayback to replay the content and everything works okay. But yesterday morning, an error message popped up saying that "the connection was reset", so I shut down Tomcat and restarted it.

When I go to the Wayback page to search for my previously harvested websites, Wayback shows "Resource Not In Archive". My directory is at "/tmp/wayback/files1" and all my ARC/WARC files are still in "/tmp/arcstore". I wonder why the Wayback page shows "Resource Not In Archive". Is it related to a re-indexing problem? How do I keep the files in place? Can you please guide me so that I can search/replay all the harvested websites on Wayback again?

Appreciate your guidance and thanks in advance.

Warmest Regards,
Allen
From: Jones, G. <gj...@lo...> - 2011-08-30 19:51:53
|
Hi Brad, we tested 1.6.1 for the XML query problem and it works fine. Thanks, Gina |
From: Bradley T. <br...@ar...> - 2011-08-29 06:47:36
|
Hi Mohamed, I think you're talking about the UDP broadcast location service? For those not familiar with IA internal systems, the www.archive.org website locates content on the cluster by sending a UDP broadcast packet to all hosts on the network. A special UDP listening server runs on each host and is aware of what content is local. When the UDP server receives a broadcast packet for content that is local, it sends a packet back to the originating server:port, indicating "that content is here!".

The classic Wayback used this service to locate ARC content up until about 4 years ago. It was discontinued in favor of a static lookup file that mapped W/ARC filenames to one or more URLs. The reason for the change was the inherent unreliability of UDP. We would see constant low-level failures in the location service, which often provoked IA admins and end users to "just try refreshing a few times". The failure levels escalated sharply when internal network usage was near peak, and also increased steadily as our data centers became more separated.

So, the current Wayback has two implementations:

1) a local static lookup file (path index) running with org.archive.wayback.resourcestore.LocationDBResourceStore

2) a remote HTTP 1.1 directory. This is in fact likely one of:

2a) a normal HTTP server fronting a single directory of W/ARC files

2b) a custom HTTP server fronting some more complex storage network, with site-specific logic to make all W/ARC files appear to be in the top-level directory

2c) an org.archive.wayback.resourcestore.locationdb.FileProxyServlet instance, backed by either a static path index (flat file) or a BDB.

All of our production Wayback installations at IA use option #1 - it's fast and simple, and rebuilding a path index, even one with 40M entries, only takes a few minutes. In the mid to long term, we are exploring option 2b. 
So, my short answer would be to advise you also to go with option #1: W/ARC files don't move around that much, it will definitely meet your scale needs, and it is the most robust choice.

Brad

On 8/28/11 4:40 PM, Mohamed Elsayed wrote: > I now have the new Wayback working on a single host. I am currently > trying to set up something like the "Item Location Server" that used to > exist in the old system. I guess this should also be possible with the > new Wayback. Can you provide any pointers for getting started on this? Is > it a fileproxy? > > Thanks in advance. > |
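As a rough illustration of Brad's option #1, a path index can be generated with a short script. This is a minimal sketch, not IA's tooling: it assumes the commonly used tab-separated "filename&lt;TAB&gt;location" layout, sorted by filename, and the directory names passed in are hypothetical.

```python
import os

def build_path_index(warc_dirs, index_path):
    """Write a flat path index: one 'filename<TAB>location' line per
    W/ARC file, sorted by filename, in the style of Wayback's static
    lookup file. Directory layout here is hypothetical."""
    entries = []
    for d in warc_dirs:
        for name in os.listdir(d):
            if name.endswith((".arc", ".arc.gz", ".warc", ".warc.gz")):
                entries.append(name + "\t" + os.path.join(d, name))
    entries.sort()  # Wayback expects the index sorted for binary search
    with open(index_path, "w") as out:
        out.write("\n".join(entries) + "\n")
```

Regenerating the whole file from scratch on each run, as above, matches Brad's observation that rebuilding a path index is cheap even at tens of millions of entries.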
From: Aaron B. <aa...@ar...> - 2011-08-28 21:50:44
|
Jon Walton <jon...@gm...> writes: > I use NutchWAX to index WARC file content for analysis. I need to fix > it to get around the JDK u23 gzip problem, but I noticed that > development seems to have died. Is everyone using other solutions now > such as Solr? If so, care to share any details?

It's not quite entirely dead, but pretty close to it. I can't speak for everyone, but many (former) users of NutchWAX are in some state of migration to a Solr-based implementation. IMO, the main challenge in moving to Solr is replacing the NutchWAX 'import' step -- reading the documents from (W)ARC files.

There is a branch on the public NutchWAX subversion tree that has a fix to handle the JDK u23 gzip change. This branch also contains a few customizations specific to the way NutchWAX is still used in a few deployments with particular needs. YMMV. The branch is: http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_13-JIRA-WAX-75/archive

As for the future of NutchWAX, well, consider that NutchWAX has three essential pieces:

1. Import: read (w)arc files, get the documents, parse them and extract the text, metadata, links, etc.

2. Index: read the output of step 1, perform text analysis and manipulation, and index with Solr/Lucene/etc.

3. Search/query: the live search service.

For us at the Archive, we still use NutchWAX for step 1, albeit with modifications that are particular to our deployments. As for step 2, we are now using some custom MapReduce code I wrote which can index documents either directly with Lucene, or push them over the wire into a Solr server; that project can be found at: https://github.com/aaronbinns/jbs

And as for step 3, some folks have moved on to Solr; at the Archive we use a custom Lucene-based Java web application: https://github.com/aaronbinns/tnh which is purpose-built for search of archival web pages. My plan is to replace NutchWAX for step 1. 
At the Archive, we have an in-development "access" library which can read (w)arcs and has lots of goodies for doing things with (w)arcs at scale in Hadoop. The idea is that both Wayback and full-text search and other web access projects will all use that core library. It's just a ways off still. Hope that helps, Aaron -- Aaron Binns Senior Software Engineer, Web Group, Internet Archive Program Officer, IIPC aa...@ar... |
From: Mohamed E. <moh...@bi...> - 2011-08-28 09:53:32
|
I now have the new Wayback working on a single host. I am currently trying to set up something like the "Item Location Server" that used to exist in the old system. I guess this should also be possible with the new Wayback. Can you provide any pointers for getting started on this? Is it a fileproxy? Thanks in advance. -- Mohamed Elsayed Bibliotheca Alexandrina |
From: Jon W. <jon...@gm...> - 2011-08-24 23:29:44
|
Greetings, I use NutchWAX to index WARC file content for analysis. I need to fix it to get around the JDK u23 gzip problem, but I noticed that development seems to have died. Is everyone using other solutions now such as Solr? If so, care to share any details? Thanks, Jon |
From: Bradley T. <br...@ar...> - 2011-08-17 11:14:17
|
Hi Gina, I've made a new Wayback 1.6.1 release candidate. I'd greatly appreciate it if you can verify that this new version addresses the problem. http://home.us.archive.org/~brad/wayback-1.6.1RC3.tar.gz The only other changes between this and 1.6.0 are three NullPointerExceptions being handled more gracefully when confronted with very strange W/ARC content.

Brad

On 8/15/11 4:23 PM, Bradley Tofel wrote: > Hi Gina, > > I've looked at the source code, and it looks like a bug.. > > For the global Wayback release we implemented a forced redirect to the > standard ArchivalURL for form queries (with GET CGI arguments.) > > Unfortunately, this was put into the mainline codebase, without a > configuration option. > > I've added the AccessPoint option - this behavior will be disabled by > default - and will post a release 1.6.2 in a couple days after some > testing. > > Brad > > On 8/3/11 9:05 PM, Jones, Gina wrote: >> >> We are running Wayback 1.6. A programmer at LC was using an API that >> returned XML, (see >> https://webarchive.jira.com/wiki/display/wayback/OS+Wayback+API+Documentation >> ) >> >> to get capture dates. I also tested this against the LC archive >> hosted at IA, and also got the same non-XML results. The query >> http://webarchive.loc.gov/lcwa0002/xmlquery?type=urlquery&url=http://house.gov/aderholt/ >> >> used to provide an XML result of all of the capture dates, but >> now gets redirected (302) to LC's landing page for that archived >> site: http://webarchive.loc.gov/lcwa0002/*/http://house.gov/aderholt/ >> >> Is this API still valid and was there a change to the query structure? >> >> Thanks, Gina >> >> Gina Jones >> >> Library of Congress >> >> Web Preservation Engineering Team. 
>> _______________________________________________ >> Archive-access-discuss mailing list >> Arc...@li... >> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss |
From: Bradley T. <br...@ar...> - 2011-08-15 14:56:04
|
Right on all counts:

Wayback 1.6.0 has an under-documented option to cdx-indexer, "-format", which specifies the fields you want produced in your index. The default value is " CDX N b a m s k r M V g" (note the leading space - it is part of the format specification Erik referenced via hyperlink.)

The extra "mystery guest" 10th field (actually field #8) is the META tag robot instructions, found within HTML resources. Values are "-" for none, or a combination of "I", "A", "F" (No-Index, No-Archive, and No-Follow, respectively.)

Starting with 1.6.0, the normal CDX index implementation, org.archive.wayback.resourceindex.CDXIndex, will handle CDX lines with either 9 or 10 columns, assuming the extra 8th column, if 10 are present, is the robot instructions field. In anticipation of potentially wanting to use the META tag robot instructions later, we opted to push the field into the default index format and make the tools handle either format, hoping to eliminate/reduce any future need to reindex content from scratch.

1.6.0 also includes another CDX implementation, org.archive.wayback.resourceindex.CDXFormatIndex, which allows for arbitrary index fields, reading the first line in the file and assuming it contains the CDX header line (for example, " CDX N b a m s k r M V g").

These are somewhat advanced features and unused at the moment, so probably not of much concern. Unless, of course, you're using the 1.6.0 indexer with a 1.4.X Wayback.. in which case there's a compatibility issue. So, Kaisa, you can either:

1) strip the 8th field (perhaps better done with 'awk' or 'perl -ane' to ensure you strip the correct field?) as you're doing

2) add the option -format " CDX N b a m s k r V g" (note the lack of "M", and again note the leading SPACE before the CDX) to the cdx-indexer tool arguments

3) upgrade your access Wayback to 1.6.X

Hope this clarifies more than confuses! 
Brad

On 8/9/11 5:46 PM, Kaisa Kaunonen wrote: > Quoting "Erik Hetzner"<eri...@uc...>: >> At Fri, 5 Aug 2011 12:11:59 +0300, >> Kaisa Kaunonen wrote: >>> Hello >>> >>> we have a newer java installation which forced us to index arc files >>> with Wayback 1.6.0 instead of 1.4.2 >>> >>> The Wayback TOMCAT application is still from 1.4.2 but it doesn't seem >>> to understand the new CDX file. >>> >>> For example, there are lines 'CDX N b a m s k r M V g' here and there >>> sprinkled around. >>> >>> Are these lines meaningful in some way? What if I remove them with a >>> script. In any case they are reduced to one single line after doing >>> sort -u newFile.cdx > sorted.cdx >>> >>> Does Wayback 1.6.0 TOMCAT application understand old & new CDX files >>> out-of-the-box? >> Hi Kaisa, >> >> This line should be at the beginning of the CDX file. >> >> http://www.archive.org/web/researcher/cdx_file_format.php >> >> I don’t believe that wayback 1.4 actually uses these lines, however, >> so you can remove them. >> >> If they are scattered around your CDX files, this is presumably >> because you are merging CDX files & sorting? >> >> best, Erik >> > > Yes, that's right. A script feeds ARC files to the CDX indexer and > those 'CDX N b a …' lines seem to be at file boundaries. > > There's also another slight difference between CDX produced by Wayback > 1.4.2 and 1.6.0 > > 1.4.2 version has > …… 200 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - 11461303 … > > 1.6.0 has > …… 200 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 1461303 …… > > > After I changed every instance of ' - - ' to ' - ' with sed, it was > possible to use the new CDX with 1.4.2. > > Kaisa |
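Brad's option #1 above (strip the 8th field rather than sed-replacing every ' - - ') can be sketched in a few lines. This is an illustrative sketch, not an official tool: it assumes the 10-column 1.6.0 default layout described in Brad's message, and the sample line in the comment is fabricated.

```python
def strip_robots_field(cdx_line):
    """Drop the META-robots column so a 10-column Wayback 1.6.0 CDX line
    becomes the 9-column layout that Wayback 1.4.x expects.
    Header lines lose the 'M' token; 10-column data lines lose field #8.
    Other lines pass through unchanged."""
    fields = cdx_line.split(" ")
    if cdx_line.startswith(" CDX"):
        # header, e.g. " CDX N b a m s k r M V g" -> drop the 'M' token
        return " ".join(f for f in fields if f != "M")
    if len(fields) == 10:
        del fields[7]  # field #8 (0-indexed 7) is the robots instructions
    return " ".join(fields)
```

Unlike a blanket sed of ' - - ', this only ever touches the one column Brad identifies, so a literal "- -" elsewhere in a URL or filename cannot be mangled.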
From: Bradley T. <br...@ar...> - 2011-08-15 09:23:55
|
Hi Gina, I've looked at the source code, and it looks like a bug.. For the global Wayback release we implemented a forced redirect to the standard ArchivalURL for form queries (with GET CGI arguments). Unfortunately, this was put into the mainline codebase without a configuration option. I've added the AccessPoint option - this behavior will be disabled by default - and will post a release 1.6.2 in a couple of days after some testing.

Brad

On 8/3/11 9:05 PM, Jones, Gina wrote: > > We are running Wayback 1.6. A programmer at LC was using an API that > returned XML, (see > https://webarchive.jira.com/wiki/display/wayback/OS+Wayback+API+Documentation > ) > > to get capture dates. I also tested this against the LC archive > hosted at IA, and also got the same non-XML results. The query > http://webarchive.loc.gov/lcwa0002/xmlquery?type=urlquery&url=http://house.gov/aderholt/ > > used to provide an XML result of all of the capture dates, but > now gets redirected (302) to LC's landing page for that archived site: > http://webarchive.loc.gov/lcwa0002/*/http://house.gov/aderholt/ > > Is this API still valid and was there a change to the query structure? > > Thanks, Gina > > Gina Jones > > Library of Congress > > Web Preservation Engineering Team. |
From: Kaisa K. <kai...@he...> - 2011-08-09 10:46:30
|
Quoting "Erik Hetzner" <eri...@uc...>: > At Fri, 5 Aug 2011 12:11:59 +0300, > Kaisa Kaunonen wrote: >> >> Hello >> >> we have a newer java installation which forced us to index arc files >> with Wayback 1.6.0 instead of 1.4.2 >> >> The Wayback TOMCAT application is still from 1.4.2 but it doesn't seem >> to understand the new CDX file. >> >> For example, there are lines 'CDX N b a m s k r M V g' here and there >> sprinkled around. >> >> Are these lines meaningful in some way? What if I remove them with a >> script. In any case they are reduced to one single line after doing >> sort -u newFile.cdx > sorted.cdx >> >> Does Wayback 1.6.0 TOMCAT application understand old & new CDX files >> out-of-the-box? > > Hi Kaisa, > > This line should be at the beginning of the CDX file. > > http://www.archive.org/web/researcher/cdx_file_format.php > > I don’t believe that wayback 1.4 actually uses these lines, however, > so you can remove them. > > If they are scattered around your CDX files, this is presumably > because you are merging CDX files & sorting? > > best, Erik >

Yes, that's right. A script feeds ARC files to the CDX indexer, and those 'CDX N b a …' lines seem to be at file boundaries.

There's also another slight difference between the CDX produced by Wayback 1.4.2 and 1.6.0. The 1.4.2 version has …… 200 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - 11461303 … while 1.6.0 has …… 200 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 1461303 …… After I changed every instance of ' - - ' to ' - ' with sed, it was possible to use the new CDX with 1.4.2. Kaisa |
From: Erik H. <eri...@uc...> - 2011-08-05 17:55:08
|
At Fri, 5 Aug 2011 12:11:59 +0300, Kaisa Kaunonen wrote: > > Hello > > we have a newer java installation which forced us to index arc files > with Wayback 1.6.0 instead of 1.4.2 > > The Wayback TOMCAT application is still from 1.4.2 but it doesn't seem > to understand the new CDX file. > > For example, there are lines 'CDX N b a m s k r M V g' here and there > sprinkled around. > > Are these lines meaningful in some way? What if I remove them with a > script. In any case they are reduced to one single line after doing > sort -u newFile.cdx > sorted.cdx > > Does Wayback 1.6.0 TOMCAT application understand old & new CDX files > out-of-the-box? Hi Kaisa, This line should be at the beginning of the CDX file. http://www.archive.org/web/researcher/cdx_file_format.php I don’t believe that wayback 1.4 actually uses these lines, however, so you can remove them. If they are scattered around your CDX files, this is presumably because you are merging CDX files & sorting? best, Erik |
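The scattered header lines Erik describes come from concatenating per-file cdx-indexer output and then sorting. A minimal sketch of a merge that emits a single header instead, assuming the 1.6.0 default header line and plain-text, already-generated CDX inputs:

```python
def merge_cdx(paths, out_path, header=" CDX N b a m s k r M V g"):
    """Combine several CDX files into one sorted file with exactly one
    header line. Per-file header lines (which start with ' CDX') are
    dropped so they don't end up scattered through the merged output.
    Header layout shown is the Wayback 1.6.0 default."""
    lines = []
    for p in paths:
        with open(p) as f:
            lines.extend(l.rstrip("\n") for l in f if not l.startswith(" CDX"))
    lines.sort()
    with open(out_path, "w") as out:
        out.write(header + "\n")
        out.write("\n".join(lines) + "\n")
```

Note that the header's leading space makes it sort before any data line anyway, which is why a plain sort -u also collapses the strays to a single first line, as Kaisa observed.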
From: Kaisa K. <kai...@he...> - 2011-08-05 10:03:05
|
Hello, we have a newer Java installation which forced us to index ARC files with Wayback 1.6.0 instead of 1.4.2. The Wayback Tomcat application is still from 1.4.2, but it doesn't seem to understand the new CDX file. For example, there are lines 'CDX N b a m s k r M V g' here and there, sprinkled around. Are these lines meaningful in some way? What if I remove them with a script? In any case they are reduced to one single line after doing sort -u newFile.cdx > sorted.cdx Does the Wayback 1.6.0 Tomcat application understand old & new CDX files out-of-the-box? Best Kaisa Kaunonen --- Nat. Lib. Finland --- |
From: Jones, G. <gj...@lo...> - 2011-08-03 14:43:00
|
We are running Wayback 1.6. A programmer at LC was using an API that returned XML (see https://webarchive.jira.com/wiki/display/wayback/OS+Wayback+API+Documentation ) to get capture dates. I also tested this against the LC archive hosted at IA, and got the same non-XML results. The query http://webarchive.loc.gov/lcwa0002/xmlquery?type=urlquery&url=http://house.gov/aderholt/ used to provide an XML result of all of the capture dates, but now gets redirected (302) to LC's landing page for that archived site: http://webarchive.loc.gov/lcwa0002/*/http://house.gov/aderholt/ Is this API still valid and was there a change to the query structure? Thanks, Gina Gina Jones Library of Congress Web Preservation Engineering Team. |