Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

Hash content-bodies, show in logs (and future ARCs) - ID: 869584
Last Update: Comment added ( karl-ia )

As crawled resource content-bodies are retrieved, they
should be hashed (for example, SHA1), with the value
stored in the CrawlURI and displayed in the crawl logs.

This will allow after-the-fact duplicate analysis, and
potentially special handling of duplicates during the
crawl in the future.
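A minimal sketch of what after-the-fact duplicate analysis could look like once each logged URI carries a content digest. The (URI, digest) record shape and the names DuplicateAnalysisSketch and findDuplicates are assumptions for illustration only; they do not reflect the actual crawl log format or any Heritrix class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DuplicateAnalysisSketch {

    /**
     * Groups URIs by content digest and keeps only digests that were
     * seen for more than one URI. Each record is a hypothetical
     * two-element array: {uri, digest}.
     */
    static Map<String, List<String>> findDuplicates(List<String[]> records) {
        Map<String, List<String>> byDigest = new LinkedHashMap<>();
        for (String[] record : records) {
            String uri = record[0];
            String digest = record[1];
            byDigest.computeIfAbsent(digest, k -> new ArrayList<>()).add(uri);
        }
        // Drop digests that belong to a single URI; what remains are duplicates.
        byDigest.values().removeIf(uris -> uris.size() < 2);
        return byDigest;
    }

    public static void main(String[] args) {
        // Made-up records; the digest strings are placeholders, not real hashes.
        List<String[]> records = Arrays.asList(
            new String[] {"http://a.example/index.html", "DIGEST-1"},
            new String[] {"http://b.example/other.html", "DIGEST-2"},
            new String[] {"http://a.example/copy.html", "DIGEST-1"});
        findDuplicates(records).forEach((digest, uris) ->
            System.out.println(digest + " -> " + uris));
    }
}
```

Grouping by digest only finds byte-identical bodies; near-duplicates would need a different technique.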

Also, when the ARC format is extended to capture
extensible per-resource metadata, or to store duplicates
efficiently, the content-body hashes will be important
there.


Gordon Mohr ( gojomo ) - 2004-01-02 21:55

8

Closed

None

Gordon Mohr

None

None

Public


Comments ( 2 )

Date: 2007-03-14 01:23
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-679 -- please add further
comments at that location.


Date: 2004-07-22 21:30
Sender: gojomo (Project Admin)

Implementation of [ 869584 ] Hash content-bodies, show in
logs (and future ARCs):

* CrawlURI.java
  Added byte[] contentDigest field & accessors.

* FetchHTTP.java
  New attribute: 'sha1-content'; if true (the default),
  HTTP content-bodies will be SHA1-hashed on the fly and the
  raw byte value added to the CrawlURI's contentDigest field.

* UriProcessingFormatter.java
  Display content SHA1, when available, in Base32 (to
  match urn:sha1: convention).

* RecordingOutputStream.java
  When a MessageDigest has been set, start it at the
  beginning of the HTTP content-body. (Completes earlier work
  to add optional on-the-fly digesting.)
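The changes above can be sketched roughly as follows: a recording stream that starts digesting only at the content-body, plus Base32 rendering of the raw 20-byte SHA1. This is not the Heritrix code. DigestingRecorder, markContentBegin, and sha1Base32OfBody are hypothetical names for this illustration, and only the JDK's MessageDigest is a real API here. The Base32 alphabet is the RFC 4648 one; a SHA1 digest is 160 bits, so it encodes to exactly 32 characters with no padding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ContentDigestSketch {

    /** Hypothetical recorder: copies every byte to a sink, digests only the content-body. */
    static class DigestingRecorder extends OutputStream {
        private final OutputStream sink;
        private final MessageDigest digest;
        private boolean inContentBody = false;

        DigestingRecorder(OutputStream sink, MessageDigest digest) {
            this.sink = sink;
            this.digest = digest;
        }

        /** Call once the HTTP headers have been written; digesting starts here. */
        void markContentBegin() {
            digest.reset();
            inContentBody = true;
        }

        @Override
        public void write(int b) throws IOException {
            sink.write(b);
            if (inContentBody) digest.update((byte) b);
        }

        @Override
        public void write(byte[] buf, int off, int len) throws IOException {
            sink.write(buf, off, len);
            if (inContentBody) digest.update(buf, off, len);
        }

        /** Raw digest bytes, as would be stored in a byte[] field. */
        byte[] contentDigest() {
            return digest.digest();
        }
    }

    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    /** Unpadded RFC 4648 Base32 encoding. */
    static String base32(byte[] data) {
        StringBuilder out = new StringBuilder();
        int buffer = 0;
        int bits = 0;
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 5) {
                out.append(ALPHABET.charAt((buffer >> (bits - 5)) & 0x1F));
                bits -= 5;
            }
            buffer &= (1 << bits) - 1; // keep only the bits not yet emitted
        }
        if (bits > 0) {
            out.append(ALPHABET.charAt((buffer << (5 - bits)) & 0x1F));
        }
        return out.toString();
    }

    /** Writes headers then body through the recorder; returns the body's SHA1 in Base32. */
    static String sha1Base32OfBody(byte[] headers, byte[] body) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            DigestingRecorder recorder =
                new DigestingRecorder(new ByteArrayOutputStream(), sha1);
            recorder.write(headers, 0, headers.length);
            recorder.markContentBegin();
            recorder.write(body, 0, body.length);
            return base32(recorder.contentDigest());
        } catch (NoSuchAlgorithmException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] headers = "HTTP/1.1 200 OK\r\n\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] body = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println("sha1:" + sha1Base32OfBody(headers, body));
    }
}
```

Because digesting begins only at markContentBegin(), two responses with different headers but identical bodies yield the same digest, which is what duplicate analysis needs.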

Important notes:

- May have a noticeable performance impact, slowing the crawl.
- Does not take into account any transfer encodings (like
'chunked') or mime-enveloping in the content-body; it just
treats all raw data after the HTTP headers as one
content-body.
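To illustrate the second note: if a body arrives with 'chunked' framing, hashing the raw bytes after the headers gives a different digest than hashing the decoded payload. The sketch below uses a deliberately minimal de-chunker (no chunk extensions or trailers) and hypothetical names (ChunkedDigestSketch, dechunk, sha1Hex); it only shows the effect and is not how Heritrix handles it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ChunkedDigestSketch {

    /** Hex-encoded SHA1 of the given bytes. */
    static String sha1Hex(byte[] data) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-1").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : d) {
                sb.append(String.format("%02x", b & 0xFF));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Index of the next CRLF at or after 'from'. */
    private static int indexOfCrlf(byte[] raw, int from) {
        for (int i = from; i + 1 < raw.length; i++) {
            if (raw[i] == '\r' && raw[i + 1] == '\n') return i;
        }
        throw new IllegalArgumentException("malformed chunked body");
    }

    /** Minimal chunked decoder: size line, data, CRLF, repeated until a 0-size chunk. */
    static byte[] dechunk(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pos = 0;
        while (true) {
            int lineEnd = indexOfCrlf(raw, pos);
            String sizeLine =
                new String(raw, pos, lineEnd - pos, StandardCharsets.US_ASCII).trim();
            int size = Integer.parseInt(sizeLine, 16);
            pos = lineEnd + 2;
            if (size == 0) break;
            out.write(raw, pos, size);
            pos += size + 2; // skip the chunk data and its trailing CRLF
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = "5\r\nhello\r\n0\r\n\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("raw bytes:     " + sha1Hex(raw));
        System.out.println("decoded bytes: " + sha1Hex(dechunk(raw)));
    }
}
```

The two printed digests differ, so the same resource served once chunked and once unchunked would not be recognized as a duplicate when only raw bytes are hashed.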

Integration into ARCs will wait until ARC format has room.



Changes ( 4 )

Field        Old Value  Date              By
status_id    Open       2004-07-22 21:30  gojomo
close_date   -          2004-07-22 21:30  gojomo
priority     5          2004-07-07 22:06  gojomo
assigned_to  nobody     2004-01-08 00:11  gojomo