Heritrix: Internet Archive Web Crawler

Tracker: Bugs

9 make_reports.pl outdated, broken - ID: 1465369
Last Update: Comment added ( karl-ia )

Bjarne (DK) reported on the list that make_reports was not
working, due to the addition of the 'source' column in
crawl.log.

Additionally, the version of make_reports in CVS does
not reflect the changes to report column ordering made a
release or two ago.

The script should work with current crawl.logs and
closely match the usual reports.
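
A log parser can tolerate appended columns such as 'source' by reading the leading fields by position and ignoring anything that follows. The sketch below is in Python rather than the script's Perl and is not the actual fix. The field positions, the `tally_mimetypes` name, and the sample lines are all assumptions made for illustration, so check them against a real crawl.log before relying on them.

```python
from collections import defaultdict

# Assumed 0-based positions of the leading crawl.log fields (illustrative only).
SIZE, MIME = 2, 6
MIN_FIELDS = 7  # fewest fields needed to reach the mime-type column


def tally_mimetypes(lines):
    """Return {mime_type: [url_count, byte_count]} for crawl.log-style lines."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        fields = line.split()
        if len(fields) < MIN_FIELDS:
            continue  # skip blank or truncated lines
        # A size of "-" (or anything non-numeric) counts as zero bytes.
        size = int(fields[SIZE]) if fields[SIZE].isdigit() else 0
        entry = totals[fields[MIME]]
        entry[0] += 1
        entry[1] += size
    return dict(totals)


# Made-up sample lines; the second carries extra trailing columns.
sample = [
    "2006-04-05T22:08:00.000Z 200 1200 http://example.org/ - - text/html #001",
    "2006-04-05T22:08:01.000Z 200 300 http://example.org/a L http://example.org/ "
    "text/html #002 20060405220801000+12 sha1:ABC seed-tag -",
]
result = tally_mimetypes(sample)
print(result)  # {'text/html': [2, 1500]}
```

Because nothing is read from the end of the line, a column added after the mime type does not shift any of the fields the tally depends on.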


Submitted: Gordon Mohr ( gojomo ) - 2006-04-05 22:08
Priority: 9
Status: Closed
Resolution: Fixed
Assigned To: Gordon Mohr
Category: General
Group: 1.8.0
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 01:05
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-547 -- please add further
comments at that location.


Date: 2006-04-12 21:34
Sender: gojomo (Project Admin)

Except for the addition of the crawl.log 'source' field, all
changes were from the most-recently-used version Dan (and
possibly Igor) had been using locally. (It hasn't been used
in a while.)

Changing back to "#!/usr/bin/env perl" at the suggestion that
it's more portable.

Also splitting long assignment lines to be shorter, and
making the right-hand sides match the old version.

Unsure of the reason for the split declarations and the added
'T' and 'Z', but will retain them as they are representative of
the last version of this script regularly used.

Any discrepancies with native Heritrix reports I'm just
chalking up to limitations of this fallback utility -- only
if they prove problematic for some user will we address them.

Committing changes to restore script to operation. Commit
comment:

[ 1465369 ] make_reports.pl outdated, broken
* make_reports.pl
adapt for new longer crawl.log lines; also integrate
internal changes improving (but not perfecting) congruence
with native Heritrix reports; changes reviewed for commit
through freeze by stack



Date: 2006-04-11 23:22
Sender: stack-sf (Project Admin)

+ Why change to /usr/bin/perl? The former was more
portable. The latter is not.

Above should be changed before commit.

Otherwise, I tested it against a log
(homeserver:~stack/report/crawl.log) and compared the output
to the reports made by the crawler. Are they supposed to be the
same? E.g.:

For crawl-report.txt from crawler:

Crawl Status: Finished - Ended by operator
Duration Time: 8m4s236ms
Total Seeds Crawled: 2
Total Seeds not Crawled: 0
Total Hosts Crawled: 248
Total Documents Crawled: 51525
Processed docs/sec: 106.45
Bandwidth in Kbytes/sec: 180
Total Raw Data Size in Bytes: 89472839 (85 MB)

From script:

bash-2.05$ more crawl-report.txt.new
Duration Time: 0h8m4s
Total Hosts Crawled: 248
Total Documents Crawled: 51029
Total Raw Data Size in Bytes: 80839366 (0.08 GB)

Comparing mimetypes, for crawler:

-bash-3.00$ more mimetype-report.txt
[#urls] [#bytes] [mime-types]
51276 89456821 text/html
248 15826 text/dns
1 192 no-type

For script:

bash-2.05$ more mimetype-report.txt.new
[#urls] [#bytes] [mime-types]
51276 80839366 text/html
1 0 no-type
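
The figures quoted in the two sets of reports above can be cross-checked with simple arithmetic. The sketch below, written in Python for convenience, only sums the numbers as they appear in this comment. It does not establish why the script's output differs, and the variable and function names are made up for the example.

```python
# (url_count, byte_count) per mime type, copied from the reports quoted above.
crawler_mime = {
    "text/html": (51276, 89456821),
    "text/dns": (248, 15826),
    "no-type": (1, 192),
}
script_mime = {
    "text/html": (51276, 80839366),
    "no-type": (1, 0),
}


def totals(report):
    """Sum URL counts and byte counts over every mime-type row."""
    urls = sum(u for u, _ in report.values())
    size = sum(b for _, b in report.values())
    return urls, size


crawler_urls, crawler_bytes = totals(crawler_mime)
script_urls, script_bytes = totals(script_mime)

print(crawler_urls, crawler_bytes)   # 51525 89472839
print(script_urls, script_bytes)     # 51277 80839366
print(crawler_bytes // 2**20)        # 85
print(script_urls - 51029)           # 248
print(crawler_bytes - script_bytes)  # 8633473
```

The crawler's mimetype rows add up exactly to its own crawl-report.txt (51525 documents, 89472839 bytes, and 85 MB when divided by 2**20). The script's mimetype rows add up to 51277 URLs, which is 248 more than the 51029 documents in its crawl-report.txt.new. 248 is also the text/dns count and the host count in the crawler's reports, though nothing here shows that this is the cause.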



Below are some other comments:

+ Is the addition of the 'Z' and 'T' intentional on this
line?

    +( $syear, $smonth, $sday, $shour, $smin, $ssec, $sms ) = ( $starttime =~ /(....)-(..)-(..)T(..):(..):(..).(...)Z/ );

It also looks like the line was changed so that it is > 80
characters in length.

+ Why make a separate line just above declaring the variables
with the 'my' keyword? Why not leave it as it was, where the
variables were declared and assigned on the same line? Does the
'defined' test work differently if you conflate declaration
and assignment? (I wouldn't be surprised. It's Perl.)
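
The Perl pattern quoted in this comment uses `.` for every character position, including the separator before the milliseconds, so it accepts any character there. As an illustration only, here is a stricter equivalent sketched in Python rather than Perl. The `parse_timestamp` name and the sample strings are made up, and returning `None` on a failed match stands in for the script's 'defined' test.

```python
import re

# Literal 'T' and 'Z', digits only, and an escaped dot before the milliseconds.
TIMESTAMP = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})\.(\d{3})Z"
)


def parse_timestamp(text):
    """Return (year, month, day, hour, minute, second, ms) or None if no match."""
    match = TIMESTAMP.fullmatch(text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


print(parse_timestamp("2006-04-05T22:08:00.236Z"))  # (2006, 4, 5, 22, 8, 0, 236)
print(parse_timestamp("20060405220800"))            # None
```

With the 'T' and 'Z' spelled out, a timestamp in the compact 14-digit form fails to match instead of being silently misparsed.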


Date: 2006-04-05 23:39
Sender: gojomo (Project Admin)

Patch for review attached. The patch is based on the version of
make_reports at /home/webcrawl/tools/make_reports.pl --
which is the most recent version Igor and Dan have used.

The patch integrates those changes, plus accommodates the extra
crawl.log column.



Attached File ( 1 )

Filename: 1465369-make_reports.patch

Changes ( 6 )

Field          Old Value                           Date              By
status_id      Open                                2006-04-12 21:34  gojomo
resolution_id  None                                2006-04-12 21:34  gojomo
close_date     -                                   2006-04-12 21:34  gojomo
assigned_to    stack-sf                            2006-04-11 23:22  stack-sf
File Added     173566: 1465369-make_reports.patch  2006-04-05 23:39  gojomo
assigned_to    gojomo                              2006-04-05 23:39  gojomo