Heritrix: Internet Archive Web Crawler

Tracker: Bugs

isMultibyteEncoding: Uncaught UnsupportedOperationException - ID: 1158270
Last Update: Comment added ( karl-ia )

Documents with a content charset of "ISO-2022-CN"
cause Heritrix to report an uncaught
UnsupportedOperationException under Java 5.0
(see http://www.nthuajia.com/robots.txt, for example).

The reason is that "ISO-2022-CN" itself can only be
decoded, not encoded, since it is an alias for "either
ISO-2022-CN-CNS or ISO-2022-CN-GB".
The corresponding Charset class simply throws an
exception if an attempt is made to encode a string.
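
To illustrate the failure mode described above, here is a minimal sketch that asks a charset whether it supports encoding before an encoder is requested. The class and method names (EncodeGuard, supportsEncoding) are made up for this example and are not taken from Heritrix; whether "ISO-2022-CN" is installed at all depends on the JRE in use.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of Heritrix.
final class EncodeGuard {

    /** Returns true only if the named charset is available and supports encoding. */
    static boolean supportsEncoding(String name) {
        try {
            // Charset.canEncode() is false for decode-only charsets.
            return Charset.forName(name).canEncode();
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and
            // UnsupportedCharsetException (illegal or unavailable name).
            return false;
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-8", "ISO-2022-CN"}) {
            System.out.println(name + " can encode: " + supportsEncoding(name));
        }
    }
}
```

Calling newEncoder() only after such a check avoids the UnsupportedOperationException for decode-only charsets.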

I have rewritten the faulty method, since it also
contained some hard-coded charset information, which
can be avoided if the appropriate Java 1.4 NIO Charset
methods are called.
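
The attached patch itself is not reproduced here. As a rough sketch of the idea only (querying the NIO Charset API rather than a hard-coded list), something along these lines could work. The class name MultibyteProbe is hypothetical, and treating decode-only or unknown charsets the way shown below is an assumption of this sketch, not necessarily what the patch does.

```java
import java.nio.charset.Charset;

// Hypothetical sketch; the real method lives in ReplayCharSequenceFactory
// and may differ in detail.
final class MultibyteProbe {

    /** Best-effort guess at whether the named charset uses more than one byte per char. */
    static boolean isMultibyteEncoding(String name) {
        Charset cs;
        try {
            cs = Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Assumption: an illegal or unavailable charset name is
            // treated as single-byte.
            return false;
        }
        if (!cs.canEncode()) {
            // Assumption: decode-only charsets such as "ISO-2022-CN" are
            // treated as multibyte, since newEncoder() would throw
            // UnsupportedOperationException for them.
            return true;
        }
        // maxBytesPerChar() is the worst-case number of bytes per character.
        return cs.newEncoder().maxBytesPerChar() > 1.0f;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"ISO-8859-1", "UTF-8", "ISO-2022-CN"}) {
            System.out.println(name + " multibyte: " + isMultibyteEncoding(name));
        }
    }
}
```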

The corresponding patch is attached.


Christian


Christian Kohlschütter ( ck-heritrix ) - 2005-03-07 12:48

Priority: 7
Status: Closed
Resolution: Fixed
Assigned: Michael Stack
Category: Extraction
Group: None
Visibility: Public


Comments ( 4 )

Date: 2007-03-14 00:21
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-376 -- please add further
comments at that location.


Date: 2005-03-07 22:56
Sender: stack-sf (Project Admin)

Closing as fixed. Below is commit:

Applied '[ 1158270 ] isMultibyteEncoding: Uncaught UnsupportedOperationException'
Patch contributed by Christian Kohlschutter
(ck-heritrix at users dot sourceforge dot net).
* src/java/org/archive/io/ReplayCharSequenceFactory.java
  Ask Charset if charset is multibyte in place of testing against
  list of likely multibyte charsets.



Date: 2005-03-07 18:13
Sender: stack-sf (Project Admin)

Thank you for the patch, Christian.

Upping the priority so the patch gets into the upcoming 1.4 release.


Date: 2005-03-07 13:13
Sender: ck-heritrix

Btw, this also makes the constant
HYPHENS_UNDERSCORES (in the same class) redundant.



Attached File ( 1 )

Filename: replaycharseq-UnsupportedOperationException.patch
Description: ReplayCharSequenceFactory patch

Changes ( 6 )

Field          Old Value                                                  Date              By
close_date     -                                                          2005-03-07 22:56  stack-sf
resolution_id  None                                                       2005-03-07 22:56  stack-sf
status_id      Open                                                       2005-03-07 22:56  stack-sf
priority       5                                                          2005-03-07 18:13  stack-sf
assigned_to    nobody                                                     2005-03-07 18:13  stack-sf
File Added     124477: replaycharseq-UnsupportedOperationException.patch  2005-03-07 12:48  ck-heritrix