Documents with a content charset of "ISO-2022-CN" cause
Heritrix to report an uncaught
UnsupportedOperationException under Java 5.0
(see http://www.nthuajia.com/robots.txt, for example).
The reason is that "ISO-2022-CN" itself can only be
decoded, not encoded, since it is an alias for "either
ISO-2022-CN-CNS or ISO-2022-CN-GB".
The corresponding Charset class simply throws an
exception if an attempt is made to encode a string.
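The condition described above can be tested up front with the standard java.nio.charset API, since Charset.canEncode() reports whether a charset supports encoding at all. The following is only a minimal sketch of that idea, not the attached patch: the class name EncodableCharsetCheck, the method encodableCharsetName, and the UTF-16BE fallback are assumptions made for this example.

```java
import java.nio.charset.Charset;

public class EncodableCharsetCheck {

    // Hypothetical fallback used only in this sketch; UTF-16BE is a
    // charset every Java platform is required to support.
    static final String FALLBACK = "UTF-16BE";

    /**
     * Returns the canonical name of the declared charset if it can be
     * used for encoding, otherwise the fallback name.
     */
    static String encodableCharsetName(String declared) {
        try {
            Charset cs = Charset.forName(declared);
            // Decode-only charsets such as "ISO-2022-CN" return false
            // here, so newEncoder() is never called on them.
            if (cs.canEncode()) {
                return cs.name();
            }
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException, UnsupportedCharsetException
            // and a null name; fall through to the fallback.
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        String[] names = { "UTF-8", "ISO-2022-CN", "no-such-charset-xyz" };
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + " -> " + encodableCharsetName(names[i]));
        }
    }
}
```

Asking the Charset object directly avoids keeping a hard-coded list of charset names that cannot be encoded.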
I have rewritten the faulty method, since it also
contained some hard-coded charset information, which
can be avoided if the appropriate Java 1.4 NIO Charset
methods are called.
The corresponding patch is attached.
Christian
Michael Stack
Extraction
None
Public
Date: 2007-03-14 00:21
Date: 2005-03-07 22:56
Date: 2005-03-07 18:13
Date: 2005-03-07 13:13
| Filename | Description |
|---|---|
| replaycharseq-UnsupportedOperationException.patch | ReplayCharSequenceFactory patch |
| Field | Old Value | Date | By |
|---|---|---|---|
| close_date | - | 2005-03-07 22:56 | stack-sf |
| resolution_id | None | 2005-03-07 22:56 | stack-sf |
| status_id | Open | 2005-03-07 22:56 | stack-sf |
| priority | 5 | 2005-03-07 18:13 | stack-sf |
| assigned_to | nobody | 2005-03-07 18:13 | stack-sf |
| File Added | 124477: replaycharseq-UnsupportedOperationException.patch | 2005-03-07 12:48 | ck-heritrix |