Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

8 A 'dat' maker; A script to dump links - ID: 1058302
Last Update: Comment added ( karl-ia )

Need a script to run against ARCs and generate Alexa
dat-like files. It would output detail like the following
per page:


http://www.mikeformayor.org/News/0827200187.shtml
0.0.0.0 20011004115320 alexa/dat 1644
m text/html
s 200
c b2ae6bfe7c7edda1ddf007d1025875e9
k 48f74f7a55a6c2eb090547768de65850
v 168
V 169
n 24580
t mike bloomberg [press room]
y www.mikeformayor.org/Helper/JavaScript/SiteFunctions_v2.js
x www.mikeformayor.org/Helper/CSS/Style_v2.css
i www.mikeformayor.org/images/English_v2/header01.gif
i www.mikeformayor.org/images/English_v2/signup.gif
y www.mikeformayor.org/images/English_v2/go.gif
i www.mikeformayor.org/images/English_v2/navdiv.gif
i www.mikeformayor.org/images/English_v2/secmidspc.gif
i www.mikeformayor.org/images/English_v2/icn_print.gif
i www.mikeformayor.org/images/English_v2/icn_email.gif
i www.mikeformayor.org/images/English_v2/pixline.gif
l www.mikeformayor.org/Biography.shtml
l www.mikeformayor.org/Issues/Educationhighlights.shtml
l www.mikeformayor.org/Issues/Traffichighlights.shtml
l www.mikeformayor.org/Issues/Housinghighlights.shtml
l www.mikeformayor.org/Issues/Parkshighlights.shtml
l www.mikeformayor.org/Issues/seniorshighlights.shtml
i www.mikeformayor.org/images/English_v2/arrow.gif
l www.mikeformayor.org/Issues/publicsafety.shtml
l www.mikeformayor.org/News/security.shtml
l www.mikeformayor.org/News/contact.shtml
i www.mikeformayor.org/images/English_v2/spacer.gif
i www.mikeformayor.org/images/English_v2/navmidspc.gif
l www.mikeformayor.org/biography.shtml
l www.mikeformayor.org/issues.shtml
l www.mikeformayor.org/news.shtml
l www.mikeformayor.org/volunteer.shtml
l www.mikeformayor.org/gallery.asp
l www.mikeformayor.org/multimedia.asp
l www.mikeformayor.org/askmike.asp
l www.mikeformayor.org/pressroom.shtml
l www.mikeformayor.org/index.shtml
l www.mikeformayor.org/spanish/index.shtml
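
To make the layout of the sample above concrete, here is a minimal sketch of reading one such record back in. It assumes the record starts with the URL on one line, followed by a line holding IP, 14-digit timestamp, type and length, and that every later line is a one-letter key followed by a value. The function name `parse_dat_record` and the dictionary keys are made up for illustration; the sketch does not try to interpret what each letter means.

```python
from collections import defaultdict

# A shortened copy of the sample record from this request.
SAMPLE = """http://www.mikeformayor.org/News/0827200187.shtml
0.0.0.0 20011004115320 alexa/dat 1644
m text/html
s 200
t mike bloomberg [press room]
i www.mikeformayor.org/images/English_v2/go.gif
l www.mikeformayor.org/index.shtml
l www.mikeformayor.org/news.shtml
"""


def parse_dat_record(text):
    """Parse one dat-like record into (header, fields).

    Assumption: line 1 is the URL, line 2 is 'ip timestamp type length',
    and each remaining line is '<key> <value>'.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    url = lines[0]
    ip, timestamp, content_type, length = lines[1].split()
    header = {
        "url": url,
        "ip": ip,
        "timestamp": timestamp,
        "type": content_type,
        "length": int(length),
    }
    # Keys such as 'i' and 'l' repeat, so collect values in lists.
    fields = defaultdict(list)
    for ln in lines[2:]:
        key, _, value = ln.partition(" ")
        fields[key].append(value)
    return header, dict(fields)


header, fields = parse_dat_record(SAMPLE)
print(header["url"], header["length"])
print(fields["l"])
```

Splitting on the first space only keeps multi-word values such as the `t` line intact.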

Would be grand if we could reuse the heritrix
extractors; would also be sweet if we could reuse our
settings system to specify which extractors run and in
what order.

On using the extractors, they take a CrawlURI with at
least an HttpRecorder prepopulated. As it currently
stands, HttpRecorder is made by reading down a stream
into a file on disk. Would like to avoid this step and
give the extractor a stream that went against an ARC
directly. Need to make HttpRecorder into an interface
that a version of ARCReader can implement.
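
The idea of putting the recorder behind an interface can be sketched roughly as follows. This is only an illustration in Python, not Heritrix code: `ReplaySource`, `TempFileSource`, `InMemorySource` and `count_hrefs` are hypothetical names. One implementation copies the stream to a file on disk first (the current behaviour described above), the other serves the record bytes without the temporary file, and the consumer only sees the shared interface.

```python
import io
import os
import tempfile
from abc import ABC, abstractmethod


class ReplaySource(ABC):
    """Hypothetical interface: anything that can replay a record's bytes."""

    @abstractmethod
    def open(self):
        """Return a binary file-like object positioned at the record start."""


class TempFileSource(ReplaySource):
    """Copies the incoming stream to a temporary file before replaying it."""

    def __init__(self, stream):
        fd, self.path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as out:
            out.write(stream.read())

    def open(self):
        return open(self.path, "rb")

    def close(self):
        # Explicit cleanup of the backing file.
        os.remove(self.path)


class InMemorySource(ReplaySource):
    """Replays the record bytes without touching the disk."""

    def __init__(self, stream):
        self._data = stream.read()

    def open(self):
        return io.BytesIO(self._data)


def count_hrefs(source):
    """Stand-in for an extractor: it depends only on the interface."""
    with source.open() as replay:
        return replay.read().count(b"href=")


page = b'<a href="a.html">a</a> <a href="b.html">b</a>'
print(count_hrefs(InMemorySource(io.BytesIO(page))))
```

Because `count_hrefs` only calls `open()`, either source can be handed to it unchanged, which is the property the extractors would need.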

On reusing our settings system, it's not possible
without extensive retrofitting; it has crawl-order
hardcodings that expect everything to be wrapped in a
crawl controller, with the crawl controller's character
stipulated in code rather than out in the xsd file.


Submitted: Michael Stack ( stack-sf ) - 2004-11-01 17:59
Priority: 8
Status: Closed
Resolution: None
Assigned: Michael Stack
Category: scripts
Group: None
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 01:35
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-848 -- please add further
comments at that location.


Date: 2005-03-16 03:28
Sender: stack-sf (Project Admin)

Added an ExtractorTool with a wrapper script 'extractor'.
It allows running any set of listed processors in the order
in which they are listed. Extraction runs very slowly because
each ARC Record is written to a temporary file before being
given to the extractor chain. This will do for a first cut.
We would need to implement random seek over a compressed stream
with support for multibyte CharSequences before we can do
away with the copy of the ARCRecord.
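
Running listed processors in the order given can be pictured with a small sketch like the one below. It is not the ExtractorTool code: the processor names, the `PROCESSORS` table and `run_chain` are invented for illustration, and the regular expressions are deliberately naive.

```python
import re


def extract_links(record):
    # Naive href extraction, for illustration only.
    record["links"] = re.findall(r'href="([^"]+)"', record["body"])


def extract_images(record):
    # Naive img src extraction, for illustration only.
    record["images"] = re.findall(r'<img[^>]*\bsrc="([^"]+)"', record["body"])


# Hypothetical name-to-processor table.
PROCESSORS = {"links": extract_links, "images": extract_images}


def run_chain(names, record):
    """Apply each named processor to the record, in the order listed."""
    for name in names:
        PROCESSORS[name](record)
        record.setdefault("ran", []).append(name)
    return record


result = run_chain(
    ["images", "links"],
    {"body": '<a href="/a.html"><img src="/x.gif"></a>'},
)
print(result["ran"], result["links"], result["images"])
```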


Closing.

Here is commit.

Implementation of '[ 1058302 ] A 'dat' maker; A script to dump links'.
* src/java/org/archive/crawler/extractor/ExtractorHTML.java
    Formatting and allow for controller being null, which is the
    case when running this extractor out of ExtractorTool.
    Removed useless javadoc.
* src/java/org/archive/crawler/extractor/ExtractorHTTP.java
    Formatting. Removed useless javadoc.
* src/java/org/archive/io/RecordingOutputStream.java
    Formatting.
* src/java/org/archive/io/ReplayCharSequenceFactory.java
    Formatting. Changed level of recentering of buffer logging and
    put it inside of a test for whether it's loggable or not.
* src/java/org/archive/io/arc/ARCConstants.java
    Added some ARCConstants.
* src/java/org/archive/util/FileUtils.java
    Formatting.
* src/java/org/archive/util/HttpRecorder.java
    Formatting. Removed the finalize method. It was being run at
    random times, removing the backing file on a processor.
* src/java/org/archive/util/TmpDirTestCase.java
    Formatting.



Changes ( 4 )

Field       Old Value   Date               By
status_id   Open        2005-03-16 03:28   stack-sf
close_date  -           2005-03-16 03:28   stack-sf
priority    7           2005-03-02 20:07   gojomo
priority    5           2005-02-10 00:41   stack-sf