From: Rudolf K. <wes...@gm...> - 2012-04-16 12:26:15
Hi, I already sent this message to the Heritrix group ( http://tech.groups.yahoo.com/group/archive-crawler/message/7653 ), but it probably makes more sense here.

We are looking for the most efficient way (in terms of speed and precision) to count the number of harvested documents in our archives. We don't have crawl-report.logs or crawl.logs from the whole history of our harvests, and we don't have a Hadoop infrastructure for fast map-reduce operations. So our current approach is to work with the arcreader utility:

find /archives/ -name '*arc.gz*' -exec ./arcreader -d false '{}' \; > arcs.cdx

then get rid of the "CDX b e a m s c v n g" and "filedesc://" patterns, and finally count the remaining lines with wc -l, which gives the number of documents in our archive. Is there any faster or more precise approach?

Occasionally arcreader does not like entries in ARC files because of corrupt zips (invalid stored block lengths, corrupt GZIP trailers, etc.) or because the whole ARC is corrupted ("*.arc.gz is not an Internet Archive ARC file"). Such statistics depend on arcreader's ability to read ARC files properly, and there is a chance that the numbers will change with a new version of arcreader.

Thank you very much,
rudolf
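For reference, the filter-and-count step described above can be sketched as a single grep invocation. The sample CDX lines below (and the exact header text) are made-up placeholders, not real arcreader output; on a real archive the input would be the concatenated arcreader output instead of a sample file.

```shell
#!/bin/sh
# Sketch of the filter-and-count step on a small, hypothetical CDX dump.
# In practice the input would be the arcreader output over all ARCs.
cat > /tmp/arcs.cdx <<'EOF'
 CDX b e a m s c v n g
filedesc://IA-000001.arc 0.0.0.0 20120101000000 text/plain - 76
http://example.com/ 1.2.3.4 20120101000001 text/html 200 1024
http://example.com/a.html 1.2.3.4 20120101000002 text/html 200 512
EOF

# Drop the "CDX ..." header line and the filedesc:// self-describing
# record, then count what is left: one line per harvested document.
grep -cv -e '^ *CDX ' -e 'filedesc://' /tmp/arcs.cdx
```

Combining `-c` with `-v` makes grep count the non-matching lines directly, so the intermediate arcs.cdx file and the separate wc -l pass can be skipped by piping the find/arcreader output straight into grep.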