Heritrix: Internet Archive Web Crawler

Tracker: Bugs

9 UURI handling of inconsistent escaping makes broken instance - ID: 1213095
Last Update: Comment added ( karl-ia )

This initially exhibited itself in the report on...

[ 1212377 ] URIException in (de)serialization, post CrawlURI slimming
http://sourceforge.net/tracker/index.php?func=detail&aid=1212377&group_id=73833&atid=539099

...but can be independently reproduced as a problem in
the way UURI deals with inconsistently-escaped URIs.

Specifically, try...

UURIFactory.getInstance("http://www.example.com/test%20path?foo=bar%").getURI();

... which causes a URIException only on the getURI(), or...

UURIFactory.getInstance("http://www.example.com/test%20path%");

...which causes a URIException on instance-creation.
That's a little better, in that it doesn't give you a
broken UURI instance, but ideally, we should tolerate
and recover a usable URI from this half-escapedness,
because browsers do.

The fix might include changing the
UURIFactory.isEscaped() test somehow, and/or improving
the UURIFactory.fixup() method.
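
For readers without a Heritrix checkout, the strictness described above can be illustrated with the JDK alone. This is only an analogy and an assumption on my part: it uses java.net.URI rather than UURIFactory, so it shows that a strict parser rejects a stray '%', not the exact point at which Heritrix fails. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustration only: java.net.URI stands in for a strict URI parser.
// It is NOT the Heritrix UURIFactory discussed in this report.
class StrayPercentDemo {
    /** Returns true if java.net.URI accepts the string, false if it rejects it. */
    static boolean parses(String candidate) {
        try {
            new URI(candidate);
            return true;
        } catch (URISyntaxException e) {
            // A '%' not followed by two hex digits is a malformed escape pair.
            return false;
        }
    }

    public static void main(String[] args) {
        // Consistently escaped URI.
        System.out.println(parses("http://www.example.com/test%20path"));
        // The two inconsistently escaped URIs from this report.
        System.out.println(parses("http://www.example.com/test%20path%"));
        System.out.println(parses("http://www.example.com/test%20path?foo=bar%"));
    }
}
```

A browser would still attempt to fetch both of the trailing-'%' URIs, which is the tolerance this report asks for.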






Gordon Mohr ( gojomo ) - 2005-06-01 22:05

9

Closed

None

Karl Thiessen

None

1.6.0

Public


Comments ( 4 )

Date: 2007-03-14 00:53
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-424 -- please add further
comments at that location.


Date: 2005-08-09 00:52
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Believed fixed with latest commit of significant changes to
make encoding/decoding of URIs more 'lax' in the ways
browsers are tolerant. Commit comment:

Fix for [ 1213095 ] UURI handling of inconsistent escaping makes broken instance
Fix for [ 1222229 ] unicode/idn domain names fail (seeds and more?)- punycode
Fix for [ 1242747 ] over-escaping (of '%', etc) compared to browsers
Fix for [ 1212377 ] URIException in deserialization, post CrawlURI slimming
* lib/libidn-0.5.9.jar, project.properties, project.xml
    integrate LGPL libidn library for IDN-encoding Unicode domain names
* LaxURI.java
    Specialization of HttpClient URI to tolerate the same sort of
    partial/inconsistent encoding as browsers do
* LaxURLCodec.java
    Specialization of Apache URLCodec to allow additional characters
    to skip encoding
* UURI.java
    derive from LaxURI; eliminate custom local fix that's been
    integrated into HttpClient 3.0 RC3
* UURIFactory.java
    change to do all needed/desired escaping ourself (no isEscaped
    test; fixup always results in 'escaped' URI)
    factor authority/domain fixup to helper methods; apply IDN
    encoding to Unicode domain names
* UURIFactoryTest.java
    updated unit tests to match new desired behavior
    testFailedGetPath() disabled; desired behavior unclear/undefined
    converted many assertTrue()s to assertEquals() so that contrast
    between expected and actual is clearer
    added testEscapingNotNecessary() verifying characters passed by
    Firefox aren't escaped
    added testIdn() for IDN-encoding of unicode domain name
* SurtPrefixSet.java
    ensure fixup (IDN-encoding) occurs on seeds before they are used
    as SURT prefixes
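
The "allow additional characters to skip encoding" idea in the LaxURLCodec.java entry can be sketched roughly as follows. This is a self-contained illustration, not the LaxURLCodec source: the class name, the method names, and the choice of '|' as an extra pass-through character are all assumptions made for the example, and the commit comment does not say which characters Heritrix actually lets through.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical sketch of percent-encoding with a configurable "safe" set.
class LaxEncoderSketch {
    /** RFC 3986 unreserved characters plus any caller-supplied extras. */
    static BitSet safeSet(String extras) {
        BitSet safe = new BitSet(128);
        safe.set('a', 'z' + 1);
        safe.set('A', 'Z' + 1);
        safe.set('0', '9' + 1);
        for (char c : "-._~".toCharArray()) {
            safe.set(c);
        }
        for (char c : extras.toCharArray()) {
            safe.set(c); // assumption: extras are ASCII characters
        }
        return safe;
    }

    /** Percent-encodes every UTF-8 byte that is not in the safe set. */
    static String encode(String input, BitSet safe) {
        StringBuilder out = new StringBuilder();
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 128 && safe.get(v)) {
                out.append((char) v);
            } else {
                out.append(String.format("%%%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("a b|c", safeSet("")));  // strict: '|' is escaped
        System.out.println(encode("a b|c", safeSet("|"))); // lax: '|' passes through
    }
}
```

Passing the safe set in as a parameter keeps the strict and lax behaviours side by side, which makes the over-escaping described in [ 1242747 ] easy to compare against what a browser sends.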

Assigning to Karl for verification/closing. There's a fair
chance these changes will generate other small bugs in the
handling of idiosyncratic URIs, but the issue listed here
should be fixed. Any new issues that require a different
recipe to reproduce should be filed as new bugs.
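
As a rough illustration of the IDN-encoding step mentioned in the commit comment: the commit itself integrates the libidn jar, but the same kind of conversion can be sketched with java.net.IDN from the JDK (Java 6 and later). The substitution of java.net.IDN for libidn, and the class and method names here, are assumptions made for the example.

```java
import java.net.IDN;

// Sketch only: uses the JDK's java.net.IDN, not the libidn library
// that the commit above integrates.
class IdnHostDemo {
    /** Converts a possibly-Unicode host name to its ASCII (punycode) form. */
    static String toAsciiHost(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        // "b\u00fccher" is "bücher"; written as an escape to avoid source-encoding issues.
        System.out.println(toAsciiHost("b\u00fccher.example")); // xn--bcher-kva.example
        // Plain ASCII hosts come back unchanged.
        System.out.println(toAsciiHost("www.example.com"));
    }
}
```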


Date: 2005-07-12 01:18
Sender: karl-ia

Logged In: YES
user_id=1269624

This bug is related to Bug #1212377 -- URIException post crawl slimming:
http://sourceforge.net/tracker/index.php?func=detail&aid=1212377&group_id=73833&atid=539099



Test in harness under bug ID 1212377 -- this bug will close
when that one does. Only trailing percent issue remains;
leading colon is fixed.

Assigning to Gordon for further investigation.


Date: 2005-06-02 20:28
Sender: gojomo (Project Admin)

Logged In: YES
user_id=144912

Added step which cleans up stray '%' characters in
UURIFactory.fixup(). Commit comment:

Fix for [ 1213095 ] UURI handling of inconsistent escaping makes broken instance
* UURIFactory.java
    Add step which 'ensures minimal escaping' of any stray '%'
    characters that aren't already part of a valid URI escape
* UURIFactoryTest.java
    Added test for leading ':' issue (see notes for [1212377])
    Added test for trailing '%' issues
    Removed test that was insisting on a URIException that we no
    longer generate (or should generate, because it is plausible
    that we should try to create and fetch the flawed URI in the test)

--
Believed fixed. Assigned to Karl for final disposition.
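
For illustration, the "ensures minimal escaping" step described above could look something like the following. This is a minimal sketch under stated assumptions, not the code committed to UURIFactory.fixup(): the class name, method name, and the regex-based approach are mine.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of escaping stray '%' characters; not the Heritrix source.
class StrayPercentFixer {
    // Matches a '%' that is NOT followed by two hex digits,
    // i.e. one that is not already part of a valid URI escape.
    private static final Pattern STRAY_PERCENT =
            Pattern.compile("%(?![0-9A-Fa-f]{2})");

    /** Rewrites each stray '%' as "%25" and leaves valid escapes untouched. */
    static String escapeStrayPercents(String uri) {
        return STRAY_PERCENT.matcher(uri).replaceAll("%25");
    }

    public static void main(String[] args) {
        // The two URIs from the original report.
        System.out.println(escapeStrayPercents("http://www.example.com/test%20path%"));
        System.out.println(escapeStrayPercents("http://www.example.com/test%20path?foo=bar%"));
    }
}
```

With this approach "%20" survives as-is while the trailing '%' becomes "%25", so the result is consistently escaped and can be handed to a strict parser.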


Attached File

No Files Currently Attached

Changes ( 7 )

Field              Old Value  Date              By
close_date         -          2005-12-02 17:14  stack-sf
status_id          Open       2005-12-02 17:14  stack-sf
artifact_group_id  None       2005-09-23 18:29  gojomo
assigned_to        gojomo     2005-08-09 00:52  gojomo
priority           6          2005-08-03 01:16  gojomo
assigned_to        karl-ia    2005-07-12 01:18  karl-ia
assigned_to        nobody     2005-06-02 20:28  gojomo