From: Bjarne A. <bj...@st...> - 2012-05-24 15:10:15
Thanks Roger.

I got a Java program from IA, but by default it required all your content to be stored on HDFS and then used Hadoop to extract content. I don't have that setup, so I gave the hanzo warc-tools a shot. I tried their Python code last fall with little luck, but they have actually been working on the project and it worked out of the box this time.

They have (among several tools):
- arc2warc.py to convert to WARC
- warcfilter.py to filter a WARC file by e.g. URL (regexp)

So using those two it is quite easy to extract material from one or more domains.

A tricky situation is still embedded content from other domains that you want to include. The IA/Hadoop approach supported that by analysing crawl logs to find URIs of embedded things found at crawl time. But for this specific case the warc-tools were actually quite helpful.

Best
Bjarne

Sent from my iPhone

On 24/05/2012 at 16:53, "Coram, Roger" <Rog...@bl...> wrote:

> Hi Bjarne,
>
> Only just saw your message. I'm not sure if you've had better responses
> so far but here's a bash script I've used in the past:
>
> https://gist.github.com/2781979
>
> It should work via, for example:
> arc2warc -a INPUT_ARC.arc.gz -w OUTPUT_WARC.warc.gz -r "http://www\\.bl\\.uk"
>
> It does have one dependency, a Python script for stripping HTTP headers
> (in order to calculate the digest of the payload):
>
> https://gist.github.com/2781967
>
> However, you can probably remove that and include a WARC-Block-Digest or
> remove it altogether.
>
> Roger G. Coram
> Web Archiving Engineer
> The British Library
> E: rog...@bl...
>
>
> -----Original Message-----
> From: Bjarne Andersen [mailto:bj...@st...]
> Sent: 11 May 2012 22:04
> To: arc...@li...
> Subject: [Archive-access-discuss] Extracting records from ARC files into
> new (W)ARC files
>
> Hi.
> A website owner is asking for an extract of material from a specific
> domain.
>
> Is anybody aware of a tool that, given either complete URLs or a URL regexp,
> would run through an ARC file and write all matching records into a new
> (W)ARC file?
>
> Best
> Bjarne Andersen
>
> Sent from my iPhone
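
For readers who want to see the idea behind filtering by URL regexp, here is a minimal sketch in plain Python. It is not how warcfilter.py or Roger's script works internally. The function names iter_warc_records and filter_warc are made up for illustration. It assumes an uncompressed WARC stream whose records carry Content-Length and (where applicable) WARC-Target-URI headers, and it ignores ARC input entirely.

```python
import re


def iter_warc_records(stream):
    """Yield (headers, raw_bytes) for each record in an uncompressed WARC stream.

    Hypothetical helper: headers is a dict with lower-cased field names,
    raw_bytes is the record exactly as read (version line, headers, body).
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if not line.strip():
            continue  # skip the blank lines that separate records
        raw = [line]  # version line, e.g. b"WARC/1.0\r\n"
        headers = {}
        while True:
            line = stream.readline()
            raw.append(line)
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            # Split on the first colon only, so URLs in the value stay intact.
            name, _, value = line.partition(b":")
            key = name.strip().lower().decode("ascii")
            headers[key] = value.strip().decode("utf-8")
        # The body length comes from the header, so it may contain newlines.
        raw.append(stream.read(int(headers.get("content-length", "0"))))
        yield headers, b"".join(raw)


def filter_warc(src, dst, url_pattern):
    """Copy records whose WARC-Target-URI matches url_pattern from src to dst.

    Returns the number of records written. Records without a target URI
    (for example warcinfo records) never match and are dropped.
    """
    regex = re.compile(url_pattern)
    kept = 0
    for headers, raw in iter_warc_records(src):
        if regex.search(headers.get("warc-target-uri", "")):
            dst.write(raw + b"\r\n\r\n")  # records end with two CRLFs
            kept += 1
    return kept
```

Called with two binary file objects and a pattern such as r"http://www\.bl\.uk", this would copy only the matching records. It does nothing about the embedded-content problem mentioned above: resources served from other domains would still need to be identified separately, for example from crawl logs.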