From: Rudolf K. <wes...@gm...> - 2012-04-16 12:26:15
Hi, I already sent this message to the Heritrix group ( http://tech.groups.yahoo.com/group/archive-crawler/message/7653 ), but it probably makes more sense here.

We are looking for the most efficient way (in terms of speed and precision) to count the number of harvested documents in our archives. We don't have crawl-report.logs or crawl.logs from the whole history of our harvests, and we don't have a Hadoop infrastructure for fast map-reduce operations. So our current approach is to work with the arcreader utility:

find /archives/ -name '*arc.gz*' -exec ./arcreader -d false '{}' \; > arcs.cdx

then get rid of the "CDX b e a m s c v n g" and "filedesc://" patterns, and finally count the remaining lines with wc -l, which gives the number of documents in our archive. Is there any faster or more precise approach?

Occasionally arcreader does not like entries in ARC files because of corrupt zips (invalid stored block lengths, corrupt GZIP trailers, etc.) or because the whole ARC is corrupted ("*.arc.gz is not an Internet Archive ARC file"). Such statistics depend on arcreader's ability to read ARC files properly, and there is a chance that the numbers will change with a new version of arcreader.

Thank you very much,
rudolf
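For reference, the filter-and-count step described above can be sketched as a single grep invocation. The sample CDX lines below (and the exact header text) are made-up placeholders, not real arcreader output; on a real archive the input would be the concatenated arcreader output instead of a sample file.

```shell
#!/bin/sh
# Sketch of the filter-and-count step on a small, hypothetical CDX dump.
# In practice the input would be the arcreader output over all ARCs.
cat > /tmp/arcs.cdx <<'EOF'
 CDX b e a m s c v n g
filedesc://IA-000001.arc 0.0.0.0 20120101000000 text/plain - 76
http://example.com/ 1.2.3.4 20120101000001 text/html 200 1024
http://example.com/a.html 1.2.3.4 20120101000002 text/html 200 512
EOF

# Drop the "CDX ..." header line and the filedesc:// self-describing
# record, then count what is left: one line per harvested document.
grep -cv -e '^ *CDX ' -e 'filedesc://' /tmp/arcs.cdx
```

Combining `-c` with `-v` makes grep count the non-matching lines directly, so the intermediate arcs.cdx file and the separate wc -l pass can be skipped by piping the find/arcreader output straight into grep.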