Heritrix: Internet Archive Web Crawler

Tracker: Bugs

'@' in URI path confuses SURT (bad queues, scope probs, etc) - ID: 1224531
Last Update: Comment added ( karl-ia )

SURT.fromURI("http://www.example.com/foo@bar");
gives

http://(bar,@www.example.com/foo)

when it should give

http://(com,example,www,)/foo@bar

The URI_SPLITTER regex in SURT is flawed.

First noticed when URIs in the AU crawl were placed in
the wrong (SURT-authority-based) queue.

It would also cause problems for scoping based on SURT
prefixes: every URI with such a miscalculated SURT would
be wrongly ruled out of scope.


Submitted: Gordon Mohr ( gojomo ) - 2005-06-21 00:45
Priority: 8
Status: Closed
Resolution: None
Assigned To: Karl Thiessen
Category: Protocols
Group: 1.6.0
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 00:55
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-447 -- please add further
comments at that location.


Date: 2005-07-09 00:54
Sender: gojomo (Project Admin)

Fix for [ 1224531 ] '@' in URI path confuses SURT (bad
queues, scope probs, etc)
* SURT.java
  Tighten URI_SPLITTER so that 'userinfo@' (if present) may
  contain only RFC 2396-allowed characters. This prevents the
  host and path from being mistakenly gobbled into userinfo
  when an '@' appears late in the URI.
* SURTTest.java
  atSymbolInPath test case: exhibits bug 1224531 pre-fix,
  passes post-fix.

Assigning to Karl for verification/close.
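
For readers without the source at hand, the tightening described above can be sketched roughly as follows. This is an illustration only, not the actual URI_SPLITTER from SURT.java: the class name SurtSketch, the two patterns LOOSE and TIGHT, and the helper toSurt are all hypothetical, and the regexes are simplified assumptions (no port handling, no case folding). The loose pattern lets userinfo be "anything before the last '@'", while the tight one restricts it to the characters RFC 2396 permits in userinfo, none of which is '/'.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the Heritrix implementation.
class SurtSketch {
    // Loose splitter: greedy ".*" before '@' can swallow the host and path.
    static final Pattern LOOSE =
        Pattern.compile("^(\\w+://)(?:(.*)@)?([^/:]+)(.*)$");

    // Tight splitter: userinfo limited to RFC 2396 userinfo characters
    // (unreserved, '%' for escapes, and ; : & = + $ ,), so it cannot contain '/'.
    static final Pattern TIGHT =
        Pattern.compile("^(\\w+://)(?:([-\\w.!~*'();:&=+$,%]*)@)?([^/:@]+)(.*)$");

    // Reverse the host labels and wrap them in parentheses, SURT-style.
    static String toSurt(Pattern splitter, String uri) {
        Matcher m = splitter.matcher(uri);
        if (!m.matches()) {
            return uri; // leave anything unrecognized untouched
        }
        String[] labels = m.group(3).split("\\.");
        StringBuilder sb = new StringBuilder(m.group(1)).append('(');
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]).append(',');
        }
        if (m.group(2) != null) {
            sb.append('@').append(m.group(2)); // userinfo, when present
        }
        return sb.append(')').append(m.group(4)).toString();
    }

    public static void main(String[] args) {
        String uri = "http://www.example.com/foo@bar";
        System.out.println(toSurt(LOOSE, uri)); // http://(bar,@www.example.com/foo)
        System.out.println(toSurt(TIGHT, uri)); // http://(com,example,www,)/foo@bar
    }
}
```

With the loose pattern the sketch reproduces the bad SURT from the original report. With the tight pattern the '@' in the path is no longer treated as a userinfo delimiter, because the '/' before it cannot be part of userinfo.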


Attached File

No Files Currently Attached

Changes ( 4 )

Field              Old Value  Date              By
status_id          Open       2005-12-02 17:14  stack-sf
close_date         -          2005-12-02 17:14  stack-sf
artifact_group_id  None       2005-09-23 18:29  gojomo
assigned_to        gojomo     2005-07-09 00:54  gojomo