Via the list, Bjarne Andersen reports:
"a seed like: www.pølse.dk is rejected"
My investigation so far reveals:
The seed triggers a URIException thrown by Heritrix's
UURIFactory in checkDomainLabel(), where we enforce legal
classic domain names (as if IDN punycoding had already
been applied), even though no earlier step has applied
such encoding where it is needed.
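
To illustrate the ordering problem, here is a minimal sketch (not Heritrix code) that applies punycoding before a classic label check. It assumes java.net.IDN, which arrived in Java 6 and so postdates this report; the class and method names are hypothetical, and the regex is only my approximation of a "classic" letters-digits-hyphen label, not what checkDomainLabel() actually does.

```java
import java.net.IDN;
import java.util.regex.Pattern;

final class IdnLabelCheck {
    // Approximation of a classic label: letters, digits and hyphens,
    // 1 to 63 characters, no leading or trailing hyphen.
    private static final Pattern LDH_LABEL =
        Pattern.compile("[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?");

    // True only if every dot-separated label is a classic label.
    static boolean isClassicHost(String host) {
        for (String label : host.split("\\.")) {
            if (!LDH_LABEL.matcher(label).matches()) {
                return false;
            }
        }
        return true;
    }

    // Converts an internationalized host to its ASCII (punycoded) form.
    static String toAsciiHost(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        String seedHost = "www.p\u00f8lse.dk"; // \u00f8 is the o-with-stroke in the seed
        System.out.println("raw host passes check: " + isClassicHost(seedHost));
        String ascii = toAsciiHost(seedHost);
        System.out.println(ascii + " passes check: " + isClassicHost(ascii));
    }
}
```

The raw host fails the check because of the non-ASCII label, while the converted host passes, which is why the check only makes sense after the encoding step.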
If this check is disabled, the URI will be scheduled... but
then it appears that URI-escaping, rather than punycoding,
is applied to the domain name -- even though there is some
indication that the HttpClient URI class knows about
punycoding.
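
The difference between the two encodings can be seen in a small sketch. It uses the JDK's URLEncoder and java.net.IDN as stand-ins, which is an assumption on my part: neither is the HttpClient URI class, and the class and method names below are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.IDN;
import java.net.URLEncoder;

final class EscapeVersusPunycode {
    // Percent-escapes the UTF-8 bytes of the label (wrong for host names).
    static String percentEscape(String label) throws UnsupportedEncodingException {
        return URLEncoder.encode(label, "UTF-8");
    }

    // Punycodes the label (the IDN form that DNS expects).
    static String punycode(String label) {
        return IDN.toASCII(label);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String label = "p\u00f8lse";
        System.out.println(percentEscape(label)); // p%C3%B8lse
        System.out.println(punycode(label));      // an "xn--" prefixed label
    }
}
```

The escaped form contains "%" characters, which are not legal in a DNS label, so any later lookup or comparison against the punycoded or Unicode form cannot succeed.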
In the case of a narrow crawl, I then saw the URL ruled
out: upon scope rechecking, it apparently didn't match the
domain allowed as part of scope initialization. (The SURT
prefix deduced still had the original Unicode character;
the URI as tested had URI-escaping, so there was no match.)
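
A rough sketch of why the scope recheck fails follows. The prefix format is a simplified, hypothetical imitation of a SURT prefix (host labels reversed and comma-separated), not Heritrix's actual SURT code, and the class and method names are my own.

```java
final class ScopePrefixMismatch {
    // Builds a simplified SURT-like prefix: "www.example.dk" -> "http://(dk,example,www,"
    static String simplifiedSurtPrefix(String host) {
        String[] labels = host.split("\\.");
        StringBuilder sb = new StringBuilder("http://(");
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]).append(',');
        }
        return sb.toString();
    }

    // A candidate host is in scope if its prefix form starts with the allowed prefix.
    static boolean inScope(String allowedPrefix, String candidateHost) {
        return simplifiedSurtPrefix(candidateHost).startsWith(allowedPrefix);
    }

    public static void main(String[] args) {
        // Prefix deduced from the seed keeps the original Unicode character.
        String allowed = simplifiedSurtPrefix("www.p\u00f8lse.dk");
        // The URI tested later carries the URI-escaped host instead.
        System.out.println(inScope(allowed, "www.p%C3%B8lse.dk")); // false
        System.out.println(inScope(allowed, "www.p\u00f8lse.dk")); // true
    }
}
```

Whatever encoding is chosen, the prefix and the tested URI need to go through the same normalization for the comparison to match.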
In the case of a broad crawl, the prerequisite DNS
lookup was hanging. The DNSJava library we're using
claims current IDN support, but it's unclear when it
was added. (We're currently bundling version 1.6.2 from
March 2004; it's up to 2.0 as of a few weeks ago.)
Some potential next steps:
- evaluate if UURIFactory.checkDomainLabel() is
required at all -- or if we can just rely on URI's checks
- check if our version of DNSJava should be upgraded to
get IDN support
- check if HttpClient can usually fetch such URLs with
IDN domains. If it's just our usage that's failing,
fix; if it's an HttpClient limitation, report as bug
and patch around.
Karl Thiessen
Protocols
1.6.0
Public
Date: 2007-03-14 00:55
Date: 2005-08-09 01:02
Date: 2005-07-14 19:35
Date: 2005-07-01 10:39
Date: 2005-07-01 10:38