
Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

New "seed source report" of # of URLs per host per source - ID: 1445970
Last Update: Comment added ( karl-ia )

Need a "seed source report" giving the number of URLs per
host per source, with hosts sorted within each source by
descending number of URLs. The request comes from the
archive-it.org team.


Submitted: Michael Stack ( stack-sf ) - 2006-03-08 22:08
Priority: 8
Status: Closed
Resolution: None
Assigned: Karl Thiessen
Category: logging
Group: 1.8.0
Visibility: Public


Comments ( 3 )

Date: 2007-03-14 01:46
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-999 -- please add further
comments at that location.


Date: 2006-04-12 22:53
Sender: karl-ia

Logged In: YES
user_id=1269624

Verified in Archive-It testing. Closing.


Date: 2006-03-08 22:24
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Implemented. Commit message below. Assigning to Karl with upped
priority so it gets into the pending release:

TO TEST:

Run a crawl with 'source-tag-seeds' == FALSE. Ensure all works
as before and that no 'source' report is written when the crawl is done.

Then run a new crawl with 'source-tag-seeds' set to TRUE. Let
it run a while. Check that the new report is present at crawl
termination. Check the content to see that it makes sense.


Implement "[ 1445970 ] New "seed source report" of # of URLs
per host per source"
Patch from Dan Avery (minor changes by St.Ack. Removed custom
serialization/deserialization of Hashtable. Let bdbje do
this for us).
If 'source-tag-seeds' is true, keep a (costly looking) count
per host per seed and, at crawl end, emit a new sources-report.

* src/java/org/archive/crawler/admin/StatisticsTracker.java
(sourceHostDistribution): Added new BigMap.
(saveSourceStats, writeSourceReportTo): Added.
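
The counting and report described in the commit message can be sketched roughly as follows. This is a hypothetical illustration, not the code from the patch: the class name SourceReportSketch, its methods, and the one-line-per-host output format are all assumptions, and it uses plain in-memory maps and modern Java syntax where the patch keeps its counts in a BigMap inside StatisticsTracker.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a per-source, per-host URL tally; not Heritrix code. */
public class SourceReportSketch {
    // seed source -> (host -> number of URLs attributed to that source)
    private final Map<String, Map<String, Long>> counts = new TreeMap<>();

    /** Count one crawled URL on the given host against the seed it descends from. */
    public void record(String source, String host) {
        counts.computeIfAbsent(source, k -> new HashMap<>())
              .merge(host, 1L, Long::sum);
    }

    /** Render one line per (source, host), hosts by descending count within each source. */
    public String report() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Map<String, Long>> bySource : counts.entrySet()) {
            List<Map.Entry<String, Long>> hosts =
                    new ArrayList<>(bySource.getValue().entrySet());
            hosts.sort((a, b) -> {
                int byCount = Long.compare(b.getValue(), a.getValue()); // descending
                // Ties are broken by host name so the output is deterministic.
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            });
            for (Map.Entry<String, Long> h : hosts) {
                sb.append(bySource.getKey()).append(' ')
                  .append(h.getKey()).append(' ')
                  .append(h.getValue()).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SourceReportSketch tally = new SourceReportSketch();
        // Example data only; the URLs and hosts are made up.
        tally.record("http://seed-a.example/", "a.example");
        tally.record("http://seed-a.example/", "cdn.example");
        tally.record("http://seed-a.example/", "cdn.example");
        tally.record("http://seed-b.example/", "b.example");
        System.out.print(tally.report());
    }
}
```

A nested map keyed first by source keeps each source's hosts together, so the descending sort requested in the feature description only has to run once per source when the report is written.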



Changes ( 7 )

Field | Old Value | Date | By
summary | New \"seed source report\" of # of URLs per host per source | 2006-09-11 22:32 | karl-ia
summary | New "seed source report" of # of URLs per host per source | 2006-09-11 22:31 | karl-ia
close_date | - | 2006-04-12 22:53 | karl-ia
status_id | Open | 2006-04-12 22:53 | karl-ia
artifact_group_id | None | 2006-03-17 21:05 | gojomo
priority | 5 | 2006-03-08 22:24 | stack-sf
assigned_to | stack-sf | 2006-03-08 22:24 | stack-sf