
Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

7 annotate what robots.txt would have precluded - ID: 983051
Last Update: Comment added ( karl-ia )

When running in a mode that fully or partially ignores
robots.txt, it would be helpful in analyzing what's
being gained or what robots rules would be best to
follow (without losing key content) if crawl.log
entries were annotated with an indicator of whether
robots.txt rules, if applied, would have precluded a fetch.

This could use the CrawlURI addAnnotation() facility.

This might be most easily configured on
PreconditionEnforcer as two separate robots policies:
one to honor, one to check and annotate without honoring.
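
The two policies proposed above can be sketched roughly as follows. This is a minimal illustration, not Heritrix code: `RobotsAnnotationSketch`, `RobotsMode`, `Uri`, and the prefix-based `excluded()` check are hypothetical stand-ins, and the rule matching is deliberately simplified. The 'robotExcluded' annotation name and the -9998 status code are taken from the implementation comment further down this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; none of these are Heritrix classes.
final class RobotsAnnotationSketch {

    // Status code the tracker comment gives for a robots-precluded fetch.
    static final int S_ROBOTS_PRECLUDED = -9998;

    // HONOR cancels excluded fetches; ANNOTATE_ONLY checks and annotates only.
    enum RobotsMode { HONOR, ANNOTATE_ONLY }

    // Minimal stand-in for a CrawlURI: a path, a status, and annotations.
    static final class Uri {
        final String path;
        int status = 0; // 0 means no decision has been recorded yet
        final List<String> annotations = new ArrayList<>();

        Uri(String path) { this.path = path; }

        void addAnnotation(String annotation) { annotations.add(annotation); }
    }

    // Simplified rule check (assumption): a path is excluded if it starts
    // with any non-empty Disallow prefix.
    static boolean excluded(String path, List<String> disallowPrefixes) {
        for (String prefix : disallowPrefixes) {
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Returns true if the fetch should proceed.
    static boolean check(Uri uri, List<String> disallow, RobotsMode mode) {
        if (!excluded(uri.path, disallow)) {
            return true; // robots rules do not apply to this URI
        }
        if (mode == RobotsMode.ANNOTATE_ONLY) {
            uri.addAnnotation("robotExcluded"); // record, but still fetch
            return true;
        }
        uri.status = S_ROBOTS_PRECLUDED; // honor the rule: cancel the fetch
        return false;
    }

    public static void main(String[] args) {
        List<String> disallow = Arrays.asList("/private/");

        Uri annotated = new Uri("/private/report.html");
        boolean fetchA = check(annotated, disallow, RobotsMode.ANNOTATE_ONLY);
        System.out.println(fetchA + " " + annotated.annotations);

        Uri honored = new Uri("/private/report.html");
        boolean fetchB = check(honored, disallow, RobotsMode.HONOR);
        System.out.println(fetchB + " " + honored.status);
    }
}
```

Keeping the rule evaluation in one place and branching only on what happens after a match means both modes see identical exclusion decisions, which is what makes the annotations usable for comparison.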


Submitted: Gordon Mohr ( gojomo ) - 2004-06-30 22:13
Priority: 7
Status: Closed
Resolution: None
Assigned: Karl Thiessen
Category: None
Group: 1.6.0
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 01:31
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-793 -- please add further
comments at that location.


Date: 2005-11-03 02:27
Sender: gojomo (Project Admin)

Implemented. PreconditionEnforcer has a new expert setting,
'calculate-robots-only', which defaults to 'false'. When set to
'true', the robots exclusion that would have applied is still
calculated, but an excluded CrawlURI is only annotated with
'robotExcluded' instead of having its fetch cancelled with a
-9998 status code. Commit comment:

Implementation of [ 983051 ] annotate what robots.txt would
have precluded
* PreconditionEnforcer.java
add setting 'calculate-robots-only'; if true, when an
exclusion applies, only annotate URI rather than cancelling
fetch with -9998

Assigning to Karl for verification/close.
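
With the setting enabled, the analysis described in the original request amounts to tallying crawl.log entries that carry the 'robotExcluded' annotation. The sketch below is one hedged way to do that. It assumes that annotations are the last whitespace-separated field of a log line and are comma-separated; the class name `RobotExcludedTally`, the shortened line layout, and the sample lines (including 'someOtherAnnotation') are invented for illustration and are not real crawl output.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Heritrix.
final class RobotExcludedTally {

    // Assumption: the annotations field is the last whitespace-separated
    // field of the line, holding comma-separated annotation names.
    static boolean wouldHaveBeenExcluded(String logLine) {
        String[] fields = logLine.trim().split("\\s+");
        String annotations = fields[fields.length - 1];
        return Arrays.asList(annotations.split(",")).contains("robotExcluded");
    }

    // Counts the lines whose annotations include 'robotExcluded'.
    static long countExcluded(List<String> lines) {
        return lines.stream()
                .filter(RobotExcludedTally::wouldHaveBeenExcluded)
                .count();
    }

    public static void main(String[] args) {
        // Invented, shortened sample lines: status, size, URI, annotations.
        List<String> sample = Arrays.asList(
                "200 1024 http://example.com/private/a.html robotExcluded",
                "200 2048 http://example.com/index.html -",
                "200 512 http://example.com/private/b.html someOtherAnnotation,robotExcluded");

        System.out.println(countExcluded(sample) + " of " + sample.size()
                + " fetched URIs would have been precluded by robots.txt");
    }
}
```

Splitting the annotations field on commas and matching whole names avoids false positives from a URI that merely contains the string 'robotExcluded'.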



Changes ( 7 )

Field              Old Value  Date              By
status_id          Open       2005-12-02 17:29  stack-sf
close_date         -          2005-12-02 17:29  stack-sf
artifact_group_id  None       2005-11-03 02:27  gojomo
assigned_to        gojomo     2005-11-03 02:27  gojomo
priority           6          2005-11-01 00:59  gojomo
assigned_to        nobody     2005-11-01 00:59  gojomo
priority           5          2004-09-01 21:51  gojomo