Need a script that reads ARCs and generates Alexa
DAT-like files. It would output detail like the
following per page:
http://www.mikeformayor.org/News/0827200187.shtml
0.0.0.0 20011004115320 alexa/dat 1644
m text/html
s 200
c b2ae6bfe7c7edda1ddf007d1025875e9
k 48f74f7a55a6c2eb090547768de65850
v 168
V 169
n 24580
t mike bloomberg [press room]
y www.mikeformayor.org/Helper/JavaScript/SiteFunctions_v2.js
x www.mikeformayor.org/Helper/CSS/Style_v2.css
i www.mikeformayor.org/images/English_v2/header01.gif
i www.mikeformayor.org/images/English_v2/signup.gif
y www.mikeformayor.org/images/English_v2/go.gif
i www.mikeformayor.org/images/English_v2/navdiv.gif
i www.mikeformayor.org/images/English_v2/secmidspc.gif
i www.mikeformayor.org/images/English_v2/icn_print.gif
i www.mikeformayor.org/images/English_v2/icn_email.gif
i www.mikeformayor.org/images/English_v2/pixline.gif
l www.mikeformayor.org/Biography.shtml
l www.mikeformayor.org/Issues/Educationhighlights.shtml
l www.mikeformayor.org/Issues/Traffichighlights.shtml
l www.mikeformayor.org/Issues/Housinghighlights.shtml
l www.mikeformayor.org/Issues/Parkshighlights.shtml
l www.mikeformayor.org/Issues/seniorshighlights.shtml
i www.mikeformayor.org/images/English_v2/arrow.gif
l www.mikeformayor.org/Issues/publicsafety.shtml
l www.mikeformayor.org/News/security.shtml
l www.mikeformayor.org/News/contact.shtml
i www.mikeformayor.org/images/English_v2/spacer.gif
i www.mikeformayor.org/images/English_v2/navmidspc.gif
l www.mikeformayor.org/biography.shtml
l www.mikeformayor.org/issues.shtml
l www.mikeformayor.org/news.shtml
l www.mikeformayor.org/volunteer.shtml
l www.mikeformayor.org/gallery.asp
l www.mikeformayor.org/multimedia.asp
l www.mikeformayor.org/askmike.asp
l www.mikeformayor.org/pressroom.shtml
l www.mikeformayor.org/index.shtml
l www.mikeformayor.org/spanish/index.shtml
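A minimal sketch of emitting a record shaped like the sample above: a URL line, a header line, then one single-letter key and its value per line. The class and method names are hypothetical (nothing here is Heritrix code), and the sketch only reproduces the layout; it makes no claim about what each key letter means.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical builder for one DAT-like record; not part of Heritrix. */
final class DatRecordSketch {
    private final String url;
    private final String headerLine;
    private final List<String> fields = new ArrayList<>();

    DatRecordSketch(String url, String headerLine) {
        this.url = url;
        this.headerLine = headerLine;
    }

    /** Appends one "key value" line, for example key 's' with value "200". */
    DatRecordSketch add(char key, String value) {
        fields.add(key + " " + value);
        return this;
    }

    /** Renders the record: URL line, header line, then one line per field. */
    String render() {
        StringBuilder sb = new StringBuilder();
        sb.append(url).append('\n');
        sb.append(headerLine).append('\n');
        for (String field : fields) {
            sb.append(field).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values copied from the sample record in this report.
        DatRecordSketch record = new DatRecordSketch(
                "http://www.mikeformayor.org/News/0827200187.shtml",
                "0.0.0.0 20011004115320 alexa/dat 1644")
            .add('m', "text/html")
            .add('s', "200")
            .add('l', "www.mikeformayor.org/Biography.shtml");
        System.out.print(record.render());
    }
}
```

Fields are kept in insertion order so that links and images come out in the order the page yielded them, as in the sample.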
Would be grand if we could reuse the Heritrix
extractors; would also be sweet if we could reuse our
settings system to specify which extractors run and in
what order.
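As a rough illustration of "which extractors, in what order", here is a sketch that runs named extractors in a caller-supplied order. It does not use the Heritrix settings system; `PageExtractor`, `register` and `run` are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical ordered extractor chain; not the Heritrix settings system. */
final class ExtractorChainSketch {
    /** Stand-in for an extractor: maps page text to the items it found. */
    public interface PageExtractor {
        List<String> extract(String pageText);
    }

    private final Map<String, PageExtractor> registry = new LinkedHashMap<>();

    /** Makes an extractor available under a name. */
    void register(String name, PageExtractor extractor) {
        registry.put(name, extractor);
    }

    /** Runs the named extractors in the given order and concatenates results. */
    List<String> run(List<String> order, String pageText) {
        List<String> out = new ArrayList<>();
        for (String name : order) {
            PageExtractor extractor = registry.get(name);
            if (extractor == null) {
                throw new IllegalArgumentException("Unknown extractor: " + name);
            }
            out.addAll(extractor.extract(pageText));
        }
        return out;
    }
}
```

The order list could come from a command-line argument or a small properties file; that choice is left open here.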
On using the extractors: they take a CrawlURI with at
least an HttpRecorder prepopulated. As it currently
stands, an HttpRecorder is made by reading down a stream
into a file on disk. Would like to avoid this step and
give the extractor a stream that reads from an ARC
directly. Need to make HttpRecorder into an interface
that a version of ARCReader can implement.
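The interface idea could look roughly like the sketch below. All names (`ReplaySource`, `InMemorySource`, `countHrefs`) are hypothetical and do not reflect the actual HttpRecorder or ARCReader APIs; the in-memory class merely stands in for an ARC-backed implementation, and the `href=` counter stands in for an extractor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; not the real HttpRecorder or ARCReader API. */
final class ReplaySketch {
    /** What an extractor needs: a re-openable stream over recorded content. */
    public interface ReplaySource {
        InputStream openContent() throws IOException;
        long contentLength();
    }

    /** Stand-in for an ARC-backed source; keeps the record body in memory. */
    public static final class InMemorySource implements ReplaySource {
        private final byte[] body;

        public InMemorySource(byte[] body) {
            this.body = body.clone();
        }

        @Override
        public InputStream openContent() {
            return new ByteArrayInputStream(body);
        }

        @Override
        public long contentLength() {
            return body.length;
        }
    }

    /** Reads the whole stream; no temporary file on disk is involved. */
    static byte[] readAll(ReplaySource source) {
        try (InputStream in = source.openContent()) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Toy consumer standing in for an extractor: counts "href=" occurrences. */
    static int countHrefs(ReplaySource source) {
        String text = new String(readAll(source), StandardCharsets.ISO_8859_1);
        int count = 0;
        int idx = text.indexOf("href=");
        while (idx >= 0) {
            count++;
            idx = text.indexOf("href=", idx + "href=".length());
        }
        return count;
    }
}
```

The consumer depends only on the interface, so a disk-backed recorder and an ARC-backed reader could both be passed to it unchanged.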
On reusing our settings system: it's not possible
without extensive retrofitting. It has crawl-order
hardcodings that expect everything to be wrapped in a
crawl controller, with the crawl controller's character
stipulated in code rather than out in the xsd file.
Michael Stack
scripts
None
Public
Date: 2007-03-14 01:35
Date: 2005-03-16 03:28
| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2005-03-16 03:28 | stack-sf |
| close_date | - | 2005-03-16 03:28 | stack-sf |
| priority | 7 | 2005-03-02 20:07 | gojomo |
| priority | 5 | 2005-02-10 00:41 | stack-sf |