Heritrix: Internet Archive Web Crawler

Tracker: Bugs

6 uri-errors.log: old timestamps, too many errors - ID: 1204643
Last Update: Comment added ( karl-ia )

Looking at a 1.4 uri-errors.log:

- it still uses old timestamps (non-ISO8601)
- "unsupported scheme:mailto", "clsid", and perhaps
others don't need to be logged here; they're expected
and ignored by design


Submitted: Gordon Mohr ( gojomo ) - 2005-05-19 00:20
Priority: 6
Status: Closed
Resolution: Fixed
Assigned: Gordon Mohr
Category: None
Group: 1.6.0
Visibility: Public


Comments ( 7 )

Date: 2007-03-14 00:52
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-411 -- please add further
comments at that location.


Date: 2005-09-13 23:49
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Datestamp fixed. Commit comment:

Fix for [ 1204643 ] uri-errors.log: old timestamps, too many
errors
* UriErrorFormatter.java
use ArchiveUtils.getLog17Date(), as with crawl.log

Closing.
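
For readers unfamiliar with the format change, here is a minimal sketch of producing an ISO 8601 UTC timestamp with millisecond precision using only the JDK. It is an assumption that this resembles the format now shared with crawl.log; the exact pattern used by ArchiveUtils.getLog17Date() is not shown in this report, and the class name LogTimestampSketch is hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for illustration; not part of Heritrix. */
final class LogTimestampSketch {
    // Assumed target format: ISO 8601 in UTC with milliseconds,
    // e.g. 2005-09-13T23:49:00.000Z
    private static final DateTimeFormatter ISO_MILLIS_UTC =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSS'Z'")
                    .withZone(ZoneOffset.UTC);

    /** Formats an instant as a fixed-width, lexically sortable log timestamp. */
    static String format(Instant when) {
        return ISO_MILLIS_UTC.format(when);
    }

    public static void main(String[] args) {
        System.out.println(format(Instant.now()));
    }
}
```

A fixed-width UTC timestamp like this sorts lexically in time order, which makes it easier to correlate entries across log files.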



Date: 2005-08-04 20:41
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Still need to fix timestamp format here.


Date: 2005-06-12 19:41
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Added new UURIFactory property, 'ignored-schemes'.
URIExceptions caused by such schemes have a special
cause-code, and are thus ignored by the uri-error logging
mechanism. Initial ignored-schemes are:

org.archive.crawler.datamodel.UURIFactory.ignored-schemes =
mailto, clsid, res, ftp, file, rtsp

Commit comment:

Partial fix for [ 1204643 ] uri-errors.log: old timestamps,
too many errors
* heritrix.properties
add new property further configuring UURIFactory with a
known set of ignored-schemes -- the URIException thrown by
these is marked different than unsupported and unknown schemes
* UURIFactory.java
throw slightly different URIException for
intentionally-ignored schemes
* CrawlController.java
have uri-errors ignore exceptions for intentionally
ignored schemes
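
The mechanism described in this comment can be sketched roughly as follows. This is not the committed code: SchemeException stands in for the URIException with its special cause-code, and every class, enum, and method name here is hypothetical. Only the property value and the supported schemes (dns, http, https) come from this report.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Illustrative sketch only; names are hypothetical, not Heritrix classes. */
final class IgnoredSchemesSketch {
    /** Why a scheme was rejected; stands in for the special cause-code. */
    enum Reason { IGNORED_SCHEME, UNSUPPORTED_SCHEME }

    /** Stand-in for a URIException that carries a reason code. */
    static final class SchemeException extends Exception {
        final Reason reason;
        SchemeException(Reason reason, String message) {
            super(message);
            this.reason = reason;
        }
    }

    /** Parses a comma-separated property value such as "mailto, clsid, res". */
    static Set<String> parseSchemes(String propertyValue) {
        return Arrays.stream(propertyValue.split(","))
                .map(s -> s.trim().toLowerCase(Locale.ROOT))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
    }

    // Schemes the crawler handles, per the discussion in this report.
    static final Set<String> SUPPORTED = parseSchemes("dns, http, https");

    /** Throws a differently-marked exception for ignored vs. unsupported schemes. */
    static void checkScheme(String scheme, Set<String> ignored) throws SchemeException {
        String s = scheme.toLowerCase(Locale.ROOT);
        if (SUPPORTED.contains(s)) {
            return;
        }
        if (ignored.contains(s)) {
            throw new SchemeException(Reason.IGNORED_SCHEME, "Ignored scheme: " + s);
        }
        throw new SchemeException(Reason.UNSUPPORTED_SCHEME, "Unsupported scheme: " + s);
    }

    /** The error log skips exceptions for intentionally ignored schemes. */
    static boolean shouldLog(SchemeException e) {
        return e.reason != Reason.IGNORED_SCHEME;
    }
}
```

With the initial property value above, a mailto: link would still be rejected, but shouldLog would return false for it, while an unknown scheme would still be logged.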



Date: 2005-05-19 16:59
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

We may want to check for a supported scheme before even
trying to create a UURI.
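
One way such a pre-check might look, as a rough sketch: pull the scheme off the candidate string using the generic URI scheme syntax (a letter followed by letters, digits, '+', '-', or '.') and test it before any UURI is constructed. The class and method names are hypothetical, and treating scheme-less strings as relative links to be resolved later is an assumption, not something stated in this report.

```java
import java.util.Locale;

/** Illustrative sketch only; not Heritrix code. */
final class SchemePrecheckSketch {
    /** Returns the lowercased scheme, or null if there is no syntactically valid one. */
    static String schemeOf(String candidate) {
        int colon = candidate.indexOf(':');
        if (colon <= 0) {
            return null; // no colon, or nothing before it
        }
        String s = candidate.substring(0, colon);
        if (!isAsciiLetter(s.charAt(0))) {
            return null; // a scheme must start with a letter
        }
        for (int i = 1; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean ok = isAsciiLetter(c) || (c >= '0' && c <= '9')
                    || c == '+' || c == '-' || c == '.';
            if (!ok) {
                return null; // e.g. "path/with:colon" is not a scheme
            }
        }
        return s.toLowerCase(Locale.ROOT);
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    /** True if the link is worth handing to the URI factory at all. */
    static boolean worthBuilding(String candidate) {
        String scheme = schemeOf(candidate);
        // Assumption: scheme-less strings are relative links resolved elsewhere.
        return scheme == null
                || scheme.equals("dns")
                || scheme.equals("http")
                || scheme.equals("https");
    }
}
```

Here worthBuilding("mailto:someone@example.org") is false, so no exception would be thrown or logged for it, while "images/a.png" and "https://archive.org/" pass through.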


Date: 2005-05-19 16:07
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

The 'unsupported scheme' case is awkward to address. Currently,
UURIFactory.getInstance either produces a UURI or throws an
exception. Returning null instead would mean adding a null check
everywhere it is used; the alternative is to check for an
unsupported-scheme exception, which seems like the way to go.

Upping priority.


Date: 2005-05-19 15:49
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Good point on 'mailto', etc. A note from the list this
morning shows that users actually find these log entries disturbing:

Unsupported scheme means that the crawler can't handle the
relevant protocol (currently only DNS, HTTP and HTTPS are
supported).

Heritrix's link extraction is a bit overly aggressive. The
errors you are seeing mean that links such as mailto: have
been extracted. In fact, I believe that the errors relating
to 'error', 'res', and 'clsid' come from completely bogus link
extraction. Any link that doesn't start with dns:, http:, or
https: will cause an error like that one.

Generally, you can ignore these errors. In fact, you can
(usually) ignore all the errors in the local-errors.log and
the uri-errors.log unless you are looking for indications
why a specific URI was not crawled etc. The errors in them
do not indicate any major problems with the ongoing crawl,
only that some issues were encountered (that the crawler was
able to handle).

- Kris

-----Original Message-----
From: archive-crawler@yahoogroups.com
[mailto:archive-crawler@yahoogroups.com] On Behalf Of cash_05
Sent: 19 May 2005 08:40
To: archive-crawler@yahoogroups.com
Subject: [archive-crawler] Help! What is "Unsupported
scheme:"means?

Dear all,

I noticed that all of the uri-errors.log files contain many
"Unsupported scheme" errors. For example,
"Unsupported scheme: mailto"
"Unsupported scheme: res"
"Unsupported scheme: clsid"

May I know what this error message really means? I tried
searching this group but found nothing related to it.

I would appreciate it if someone could explain what caused this
error and why, and how to configure the crawler to avoid or fix it.

Thank you.


Changes ( 7 )

Field              Old Value  Date              By
artifact_group_id  None       2005-09-23 18:01  gojomo
status_id          Open       2005-09-13 23:49  gojomo
resolution_id      None       2005-09-13 23:49  gojomo
close_date         -          2005-09-13 23:49  gojomo
assigned_to        nobody     2005-08-04 20:42  gojomo
priority           7          2005-08-04 20:41  gojomo
priority           6          2005-05-19 16:07  stack-sf