Heritrix: Internet Archive Web Crawler

Tracker: Bugs

9 crawl.log has URIs with spaces in them - ID: 1010966
Last Update: Comment added ( karl-ia )

Spaces in logged URIs make processing awkward. There shouldn't be
spaces in logged URIs (figure out why there are spaces at all, and then
at least escape them in the crawl.log logging formatter).


Submitted By: Michael Stack ( stack-sf ) - 2004-08-17 19:40
Priority: 9
Status: Closed
Resolution: Fixed
Assigned To: Michael Stack
Category: Logging
Group: 1.0.1
Visibility: Public


Comments ( 7 )

Date: 2007-03-14 00:15
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-233 -- please add further
comments at that location.


Date: 2004-09-23 02:17
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Committed fix. Backported to heritrix_1_0. Will make a
release so boys can do their regular old crawls.

Closing again.


Date: 2004-09-23 02:03
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

And Igor found URLs with tabs. This happens because the URLs
are already escaped, and because they're already escaped, we
assume the author knew what they were doing and don't touch
them. Previously I at least went through and replaced spaces.
I've now beefed up this routine so that it replaces anything
Java considers whitespace (as opposed to Java regex, which has
a smaller set of whitespace characters) with hex encoding.
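
The committed routine is not shown in this comment, so the following is only a minimal sketch of the idea, assuming a per-character scan with `Character.isWhitespace` and percent-encoding of each match. The class and method names (`WhitespaceEscaper`, `escapeWhitespace`) are hypothetical and are not taken from Heritrix.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the Heritrix implementation.
class WhitespaceEscaper {

    // Percent-encode every character that Character.isWhitespace() accepts.
    // That set is wider than the default regex \s class; for example it
    // includes the control characters U+001C through U+001F.
    static String escapeWhitespace(String uri) {
        StringBuilder sb = null; // allocated lazily, only if something needs escaping
        for (int i = 0; i < uri.length(); i++) {
            char c = uri.charAt(i);
            if (Character.isWhitespace(c)) {
                if (sb == null) {
                    sb = new StringBuilder(uri.length() + 8);
                    sb.append(uri, 0, i); // copy the clean prefix
                }
                // Encode the character's UTF-8 bytes as %XX (assumption: UTF-8).
                for (byte b : String.valueOf(c).getBytes(StandardCharsets.UTF_8)) {
                    sb.append('%').append(String.format("%02X", b & 0xFF));
                }
            } else if (sb != null) {
                sb.append(c);
            }
        }
        return sb == null ? uri : sb.toString();
    }

    public static void main(String[] args) {
        // A space and a tab inside a URI that already carries a %20 escape.
        System.out.println(escapeWhitespace("http://example.com/a b\tc,%20d"));
        // prints http://example.com/a%20b%09c,%20d
    }
}
```

Existing escapes such as `%20` are left untouched, so an already-escaped URI is not double-encoded; only raw whitespace is rewritten.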


Date: 2004-09-23 00:06
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Reopening. Igor found that mimetypes can have spaces in
them. Upping priority to maximum. Means that we've been
writing crawl.logs AND ARC metadata lines with spaces in
them. Bad. Readying patch for heritrix_1_0 branch.


Date: 2004-08-25 22:18
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Fixed.

Here is commit and main part of patch:

Fix for "[ 1010966 ] crawl.log has URIs with spaces in them"
* src/java/org/archive/crawler/datamodel/UURIFactory.java
(create): If a URI has been judged already-escaped, run it through a check
for spaces that will escape any found for case where the string has
been incorrectly escaped. Doing this, we'll avoid spaces in logs and arcs.
(escapeSpaces): Added.
* src/java/org/archive/crawler/datamodel/UURIFactoryTest.java
(testSpaceDoubleEncoding): Added.


Index: src/java/org/archive/crawler/datamodel/UURIFactory.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/datamodel/UURIFactory.java,v
retrieving revision 1.5
diff -u -r1.5 UURIFactory.java
--- src/java/org/archive/crawler/datamodel/UURIFactory.java 25 Aug 2004 19:27:05 -0000 1.5
+++ src/java/org/archive/crawler/datamodel/UURIFactory.java 25 Aug 2004 22:14:49 -0000
@@ -252,7 +252,8 @@
      */
     private UURI create(String uri, String charset) throws URIException {
         UURI uuri = isEscaped(uri)?
-            new UURIImpl(fixup(uri, null).toCharArray(), charset):
+            new UURIImpl(escapeSpaces(fixup(uri, null)).toCharArray(),
+                charset):
             new UURIImpl(fixup(uri, null), charset);
         if (logger.isLoggable(Level.FINE)) {
             logger.fine("URI " + uri +
@@ -321,7 +322,7 @@
             // after all the fixup and normalization has been done.
             throw new URIException("URI length > " + MAX_URL_LENGTH);
         }
-
+
         // Replace nbsp with normal spaces (so that they get stripped if at
         // ends, or encoded if in middle)
         uri = TextUtils.replaceAll(NBSP, uri, SPACE);
@@ -422,6 +423,27 @@
     }

     /**
+     * Escape any spaces found.
+     *
+     * The parent class takes care of the bulk of escaping. But if any
+     * instance of escaping is found in the URI, then we ask for parent
+     * to do NO escaping. Here we escape any spaces found irrespective
+     * of whether the uri has already been escaped. We do this for
+     * case where uri has been judged already-escaped only, its been
+     * incompletly done and spaces remain. Spaces in the URI are
+     * a real pain. They're presence will break log file and
+     * ARC parsing.
+     * @param uri URI string to check.
+     * @return uri with spaces escaped if any found.
+     */
+    protected String escapeSpaces(String uri) {
+        if (uri.indexOf(" ") >= 0) {
+            uri = TextUtils.replaceAll(SPACE, uri, "%20");
+        }
+        return uri;
+    }
+
+    /**
      * Check the domain label part of the authority.
      *
      * We're more lax than the spec. in that we allow underscores.
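
To see the effect of the patch outside of Heritrix, here is a small standalone sketch. It assumes that "already escaped" means the string contains at least one %XX sequence; the real isEscaped test is not shown in this report, so looksEscaped below is a hypothetical stand-in, and the example URI is made up.

```java
import java.util.regex.Pattern;

// Hypothetical standalone sketch of the patched behaviour; not Heritrix code.
class EscapeSpacesSketch {

    // Assumed stand-in for isEscaped: any %XX sequence marks the URI as escaped.
    private static final Pattern PERCENT_ESCAPE =
        Pattern.compile("%[0-9A-Fa-f]{2}");

    static boolean looksEscaped(String uri) {
        return PERCENT_ESCAPE.matcher(uri).find();
    }

    // Same logic as the escapeSpaces method in the patch, using plain
    // String.replace in place of the TextUtils helper.
    static String escapeSpaces(String uri) {
        if (uri.indexOf(' ') >= 0) {
            uri = uri.replace(" ", "%20");
        }
        return uri;
    }

    public static void main(String[] args) {
        // Partially escaped: one space is encoded, another is still raw.
        String uri = "http://example.com/q?a=b c,%20d";
        if (looksEscaped(uri)) {
            // Before the patch such a URI was passed through untouched,
            // so the raw space ended up in crawl.log.
            uri = escapeSpaces(uri);
        }
        System.out.println(uri); // prints http://example.com/q?a=b%20c,%20d
    }
}
```

The existing `%20` is not re-encoded to `%2520`, which is presumably the double-encoding case the added unit test guards against.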


Date: 2004-08-25 21:10
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Here is a URI from Dan that was in crawl.log with a space:

http://www.brook.edu/data/brookings_taxonomy.xml?
taxonomy=Politics,%20Global

This URI no longer seems to exist on the site; at least, I can
crawl the site and not uncover it. The spaces seem to have
been removed. When I crawled, I found:
http://www.brook.edu/data/brookings_taxonomy.xml?taxonomy=Politics,%20Global

But we fail on the first type of URI. Will make fixes for
that.


Date: 2004-08-17 19:47
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Assigned to the new 1.0.2 group, to be fixed in the next point release.




Changes ( 12 )

Field              Old Value         Date              By
close_date         2004-08-25 22:18  2004-09-23 02:17  stack-sf
status_id          Open              2004-09-23 02:17  stack-sf
status_id          Closed            2004-09-23 00:06  stack-sf
priority           7                 2004-09-23 00:06  stack-sf
resolution_id      None              2004-08-25 22:18  stack-sf
status_id          Open              2004-08-25 22:18  stack-sf
close_date         -                 2004-08-25 22:18  stack-sf
priority           6                 2004-08-23 23:52  gojomo
priority           5                 2004-08-23 20:30  gojomo
artifact_group_id  1.0.2             2004-08-19 21:23  stack-sf
assigned_to        nobody            2004-08-17 19:47  stack-sf
artifact_group_id  None              2004-08-17 19:47  stack-sf