As the content bodies of crawled resources are retrieved, they
should be hashed (for example, with SHA-1), with the value
stored in the CrawlURI and written to the crawl logs.
This will allow after-the-fact duplicate analysis and,
potentially, special handling of duplicates during the
crawl in the future.
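
A minimal sketch of the idea, assuming nothing about the crawler's actual classes: `ContentBodyHasher`, `copyAndHash`, `sha1Hex`, and `groupByDigest` are hypothetical names used only for illustration. It hashes a body while it is being copied (so no second pass over the data is needed) and then groups URIs by digest, which is roughly what an after-the-fact duplicate analysis over logged hashes would do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of any crawler API.
final class ContentBodyHasher {

    /** Copies body to sink while hashing it; returns the SHA-1 as lowercase hex. */
    static String copyAndHash(InputStream body, OutputStream sink) throws IOException {
        MessageDigest sha1;
        try {
            sha1 = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
        try (DigestInputStream in = new DigestInputStream(body, sha1)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                sink.write(buf, 0, n); // the digest is updated as bytes are read
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sha1.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Convenience wrapper for bodies that are already in memory. */
    static String sha1Hex(byte[] body) {
        try {
            return copyAndHash(new ByteArrayInputStream(body), new ByteArrayOutputStream());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Groups URIs by content-body digest; groups with more than one URI are duplicates. */
    static Map<String, List<String>> groupByDigest(Map<String, byte[]> bodiesByUri) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, byte[]> e : bodiesByUri.entrySet()) {
            groups.computeIfAbsent(sha1Hex(e.getValue()), k -> new ArrayList<>())
                  .add(e.getKey());
        }
        return groups;
    }
}
```

In a real crawl only the digest string would need to be kept per URI, so the grouping step could run over the crawl log rather than over the bodies themselves.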
Also, when the ARC format is extended to capture
extensible per-resource metadata, or to store
duplicates efficiently, the content-body hashes will be
important there.
Gordon Mohr
Date: 2007-03-14 01:23
Date: 2004-07-22 21:30
| Field | Old Value | Date | By |
|---|---|---|---|
| status_id | Open | 2004-07-22 21:30 | gojomo |
| close_date | - | 2004-07-22 21:30 | gojomo |
| priority | 5 | 2004-07-07 22:06 | gojomo |
| assigned_to | nobody | 2004-01-08 00:11 | gojomo |