
Heritrix: Internet Archive Web Crawler

Tracker: Bugs

8 unicode/idn domain names fail (seeds and more?)- punycode - ID: 1222229
Last Update: Comment added ( karl-ia )

Via the list, Bjarne Andersen reports:

"a seed like: www.pølse.dk is rejected"

My investigation so far reveals:

This hits a URIException thrown by Heritrix's
UURIFactory in checkDomainLabel(), where we enforce legal
classic domain names (as if IDN punycoding had already
been applied), even though no step has yet applied such
encoding where it is needed.
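
To illustrate the ordering problem described above, here is a minimal sketch. It assumes a simplified letters-digits-hyphen label check as a stand-in for checkDomainLabel() (the real Heritrix logic may differ), and it uses the JDK's java.net.IDN class, which was not available to Heritrix at the time. The class and method names are hypothetical.

```java
import java.net.IDN;
import java.util.regex.Pattern;

public class DomainLabelSketch {
    // Classic "LDH" label: ASCII letters, digits and hyphens, 1-63 chars,
    // no leading or trailing hyphen. A stand-in for the real check.
    private static final Pattern LDH_LABEL =
        Pattern.compile("[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?");

    // Returns true only if every dot-separated label is a classic LDH label.
    public static boolean isClassicHost(String host) {
        for (String label : host.split("\\.")) {
            if (!LDH_LABEL.matcher(label).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String seedHost = "www.p\u00f8lse.dk";   // www.pølse.dk
        String ace = IDN.toASCII(seedHost);      // punycode ("xn--") form

        System.out.println(isClassicHost(seedHost)); // false: raw Unicode label
        System.out.println(ace);                     // www.xn--plse-gra.dk
        System.out.println(isClassicHost(ace));      // true: check passes after encoding
    }
}
```

The same check that rejects the raw seed accepts it once the punycode step has run first, which is the ordering the report says is missing.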

If this check is disabled, the URI will be scheduled... but
then it appears that URI-escaping, rather than punycoding, is
applied to the domain-name -- even though there is some
indication that the HttpClient URI class knows about
punycoding.

In the case of a narrow crawl, I then saw the URL ruled out --
upon scope rechecking, apparently it didn't match the domain
allowed as part of scope initialization. (The surt
prefix deduced still had the original unicode
character; the URI as tested had URI-escaping: no match.)
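
The scope mismatch can be sketched as follows. The surtPrefix() helper is a deliberately simplified, hypothetical approximation of a SURT-style prefix (reversed host labels). It is not Heritrix's SurtPrefixSet code. URLEncoder stands in for whatever URI-escaping was actually applied to the host.

```java
import java.net.URLEncoder;

public class SurtMismatchSketch {
    // Simplified SURT-style prefix: host labels reversed, comma-separated.
    // Hypothetical helper for illustration only.
    public static String surtPrefix(String host) {
        String[] labels = host.split("\\.");
        StringBuilder sb = new StringBuilder("http://(");
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]).append(',');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String unicodeHost = "www.p\u00f8lse.dk";                      // as given in the seed
        String escapedHost = URLEncoder.encode(unicodeHost, "UTF-8"); // www.p%C3%B8lse.dk

        String scopePrefix = surtPrefix(unicodeHost);       // deduced from the seed
        String candidate = surtPrefix(escapedHost) + ")/";  // URI as later tested

        System.out.println(escapedHost);
        System.out.println(candidate.startsWith(scopePrefix)); // false: no match
    }
}
```

Because the prefix and the tested URI carry two different encodings of the same label, a plain prefix comparison fails even though both refer to the same host.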

In the case of a broad crawl, the prerequisite DNS
lookup was hanging. The DNSJava library we're using
claims current IDN support, but it's unclear when it
was added. (We're currently bundling version 1.6.2 from
March 2004; it's up to 2.0 as of a few weeks ago.)

Some potential next steps:
- evaluate if UURIFactory.checkDomainLabel() is
required at all -- or if we can just rely on URI's checks
- check if our version of DNSJava should be upgraded to
get IDN support
- check if HttpClient can usually fetch such URLs with
IDN domains. If it's just our usage that's failing,
fix; if it's an HttpClient limitation, report as bug
and patch around.




Gordon Mohr ( gojomo ) - 2005-06-16 22:03

8

Closed

None

Karl Thiessen

Protocols

1.6.0

Public


Comments ( 5 )

Date: 2007-03-14 00:55
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-444 -- please add further
comments at that location.


Date: 2005-08-09 01:02
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

LGPL libidn has been integrated to the UURIFactory 'fixup'
process, to change any Unicode domain names into their
IDN/punycoded form. A unit test has been added to
UURIFactoryTest. The domain Bjarne lists, www.pølse.dk, can
now be crawled successfully.
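
A rough sketch of what such a fixup step might look like is below. It is not the committed code: the actual fix uses the LGPL libidn library inside UURIFactory, whereas this sketch substitutes the JDK's java.net.IDN. The fixupHost() helper is hypothetical and only handles simple scheme://host[:port]/path URIs.

```java
import java.net.IDN;
import java.util.Locale;

public class IdnFixupSketch {
    // Hypothetical fixup: rewrite only the host of a simple
    // scheme://host[:port]/path URI into its ASCII (punycode) form.
    // Userinfo and IPv6 literals are deliberately not handled.
    public static String fixupHost(String uri) {
        int start = uri.indexOf("://");
        if (start < 0) {
            return uri; // no authority component to fix up
        }
        start += 3;
        int end = start;
        // The host ends at the first port, path, query or fragment delimiter.
        while (end < uri.length() && "/:?#".indexOf(uri.charAt(end)) < 0) {
            end++;
        }
        String host = uri.substring(start, end);
        String ace = IDN.toASCII(host).toLowerCase(Locale.ROOT);
        return uri.substring(0, start) + ace + uri.substring(end);
    }

    public static void main(String[] args) {
        // prints http://www.xn--plse-gra.dk/index.html
        System.out.println(fixupHost("http://www.p\u00f8lse.dk/index.html"));
    }
}
```

A check in the spirit of testIdn() would assert that the fixed-up URI equals the expected xn-- form, and that an already-ASCII URI passes through unchanged.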

Commit comment:

Fix for [ 1213095 ] UURI handling of inconsistent escaping
makes broken instance
Fix for [ 1222229 ] unicode/idn domain names fail (seeds and
more?)- punycode
Fix for [ 1242747 ] over-escaping (of '%', etc) compared to
browsers
Fix for [ 1212377 ] URIException in deserialization, post
CrawlURI slimming
* lib/libidn-0.5.9.jar, project.properties, project.xml
    integrate LGPL libidn library for IDN-encoding Unicode
    domain names
* LaxURI.java
    Specialization of HttpClient URI to tolerate the same
    sort of partial/inconsistent encoding as browsers do
* LaxURLCodec.java
    Specialization of Apache URLCodec to allow additional
    characters to skip encoding
* UURI.java
    derive from LaxURI; eliminate custom local fix that's
    been integrated into HttpClient 3.0 RC3
* UURIFactory.java
    change to do all needed/desired escaping ourself (no
    isEscaped test; fixup always results in 'escaped' URI)
    factor authority/domain fixup to helper methods; apply
    IDN encoding to Unicode domain names
* UURIFactoryTest.java
    updated unit tests to match new desired behavior
    testFailedGetPath() disabled; desired behavior
    unclear/undefined
    converted many assertTrue()s to assertEquals() so that
    contrast between expected and actual is clearer
    added testEscapingNotNecessary() verifying characters
    passed by Firefox aren't escaped
    added testIdn() for IDN-encoding of unicode domain name
* SurtPrefixSet.java
    ensure fixup (IDN-encoding) occurs on seeds before they
    are used as SURT prefixes

Assigning to Karl for verification/closing.


Date: 2005-07-14 19:35
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Christian's suggestion to IDN-punycode URIs at creation time
sounds like the right way to go. The UURI fixup process is
the perfect place for this -- making a URI encountered in
the wild into a 'usable URI' (U URI) for our purposes.
Working on that approach now.


Date: 2005-07-01 10:39
Sender: ck-heritrix

Logged In: YES
user_id=1220421

(forgot to login).

Cheers,
Christian



Date: 2005-07-01 10:38
Sender: nobody

Logged In: NO

Another possibility would be to punycode URLs (i.e.
convert hostnames to their xn-- ASCII representation) at
extraction/Link creation level.

This would considerably reduce the amount of required changes,
and you would not even need to have all Heritrix components
explicitly support punycode (for example, you do not have to
worry about possible ARC file encoding problems).

Moreover, showing punycode hostnames as a sequence of
Unicode glyphs is somewhat problematic regarding security
(see the homograph attack: http://www.shmoo.com/idn/).
Recent Firefox versions, for example, will show the escaped
ASCII representation by default.
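
As a small illustration of the homograph concern, the sketch below compares a plain Latin hostname with a look-alike that contains one Cyrillic letter. The hostnames are made-up examples, displayForm() is a hypothetical helper, and java.net.IDN is used here only for demonstration.

```java
import java.net.IDN;

public class HomographSketch {
    // Hypothetical helper: the form a cautious UI or log might show,
    // i.e. the ASCII (xn--) encoding rather than the Unicode glyphs.
    public static String displayForm(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        String latin = "paypal.com";
        String lookalike = "p\u0430ypal.com"; // second letter is Cyrillic U+0430

        // The two strings may render almost identically but are not equal.
        System.out.println(latin.equals(lookalike));

        // In ASCII form the difference is obvious: only the look-alike
        // picks up an xn-- prefix.
        System.out.println(displayForm(latin));
        System.out.println(displayForm(lookalike));
    }
}
```

Keeping hostnames in their ASCII form internally means the two names cannot be confused in logs or stored records.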




Changes ( 5 )

Field              Old Value  Date              By
status_id          Open       2005-12-02 17:14  stack-sf
close_date         -          2005-12-02 17:14  stack-sf
artifact_group_id  None       2005-09-23 18:29  gojomo
assigned_to        gojomo     2005-08-09 01:03  gojomo
assigned_to        nobody     2005-07-14 19:35  gojomo