Re: [Htmlparser-developer] JIS encoding problem
From: Matthew B. <mat...@ou...> - 2006-04-19 10:03:37
Yuta Okamoto wrote:
> But it's one thing after another. When HTML Parser finds a "Content-Type"
> META tag, it corrects the current charset and re-reads the text before the
> META tag to compare it with the buffer already read using the default
> encoding, in org.htmlparser.lexer.InputStreamSource.setEncoding(). In this
> case, HTML Parser throws ParserException(EncodingChangeException) because
> it compares "[ESC]", the first character of the old buffer, with a
> double-byte character, the first character of the new buffer.
>
> I'm overwhelmed by that. What should I do? In the meantime, I have attached
> the revised code to this mail. Please see below.
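To make the mismatch described above concrete, here is a small Java sketch. It assumes the page is ISO-2022-JP (the subject line only says "JIS") and that the default encoding is ISO-8859-1. The class name JisEscapeDemo and the sample bytes are made up for illustration and are not part of HTML Parser. It decodes the same bytes both ways and looks at the first character of each result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of HTML Parser.
class JisEscapeDemo {
    // "日本" encoded as ISO-2022-JP.
    static final byte[] JIS_BYTES = {
        0x1B, 0x24, 0x42,        // ESC $ B : switch to JIS X 0208
        0x46, 0x7C, 0x4B, 0x5C,  // 日 本 (two bytes each)
        0x1B, 0x28, 0x42         // ESC ( B : switch back to ASCII
    };

    // Decode the bytes with the given charset and return the first character.
    static char firstChar(byte[] bytes, Charset charset) {
        return new String(bytes, charset).charAt(0);
    }

    public static void main(String[] args) {
        char latin1 = firstChar(JIS_BYTES, StandardCharsets.ISO_8859_1);
        System.out.printf("ISO-8859-1 first char: U+%04X%n", (int) latin1);

        // ISO-2022-JP is an extended charset, so check that this runtime has it.
        if (Charset.isSupported("ISO-2022-JP")) {
            char jis = firstChar(JIS_BYTES, Charset.forName("ISO-2022-JP"));
            System.out.printf("ISO-2022-JP first char: U+%04X%n", (int) jis);
        }
    }
}
```

Under ISO-8859-1 the first character is the raw ESC (U+001B) that starts the JIS escape sequence. Under ISO-2022-JP the escape sequence is consumed by the decoder and the first character is 日, so the two buffers already differ at index 0.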
Throwing an exception is the sensible thing to do, as otherwise you may
have mishandled the content because of the incorrect encoding.
I changed EncodingChangeException so that you can find the original and
replacement encodings. Then you can reset the parser and attempt to
reparse the whole document using the new encoding. E.g.:
try
{
    parser.visitAllNodesWith(visitor);
}
catch (EncodingChangeException ece)
{
    // The META tag declared a different charset: start again with it.
    log.debug("Switch from " + ece.getOrginalEncoding() + " to "
        + ece.getReplacementEncoding());
    String encoding = ece.getReplacementEncoding();
    parser.reset();
    parser.setEncoding(encoding);
    // Use a fresh visitor for the second pass.
    visitor = getUserFilter(type);
    parser.visitAllNodesWith(visitor);
}
I don't believe I ever sent the patch for EncodingChangeException back
to the list. Unfortunately my hacked copy of HTMLParser is on my work
computer at the moment, but I can dig it out when I'm back at work.
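For readers without the patch, the retry pattern can be sketched with the JDK alone. This is not the actual patch and does not use HTML Parser. EncodingMismatchException, EncodingRetrySketch, and the crude charset regex are all hypothetical stand-ins, and the exception is unchecked here only to keep the sketch short.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the patched exception: it records both encodings.
class EncodingMismatchException extends RuntimeException {
    private final String originalEncoding;
    private final String replacementEncoding;

    EncodingMismatchException(String originalEncoding, String replacementEncoding) {
        super("Encoding changed from " + originalEncoding + " to " + replacementEncoding);
        this.originalEncoding = originalEncoding;
        this.replacementEncoding = replacementEncoding;
    }

    String getOriginalEncoding() { return originalEncoding; }
    String getReplacementEncoding() { return replacementEncoding; }
}

class EncodingRetrySketch {
    // Very rough match for "charset=..." as found in a Content-Type META tag.
    private static final Pattern CHARSET =
        Pattern.compile("charset=([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Decode with the given encoding; fail if the page declares a different one.
    static String decode(byte[] page, String encoding) {
        String text = new String(page, Charset.forName(encoding));
        Matcher m = CHARSET.matcher(text);
        if (m.find() && !Charset.forName(m.group(1)).equals(Charset.forName(encoding))) {
            throw new EncodingMismatchException(encoding, m.group(1));
        }
        return text;
    }

    // First pass with the default encoding, one retry with the declared one.
    static String decodeWithRetry(byte[] page, String defaultEncoding) {
        try {
            return decode(page, defaultEncoding);
        } catch (EncodingMismatchException e) {
            return decode(page, e.getReplacementEncoding());
        }
    }

    public static void main(String[] args) {
        byte[] page = ("<meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=UTF-8\">caf\u00e9")
            .getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeWithRetry(page, "ISO-8859-1"));
    }
}
```

The point of carrying both encodings on the exception is that the caller needs nothing else to decide how to restart: it reads the replacement encoding and runs the whole pass again from the beginning, rather than trying to patch up text that was already decoded incorrectly.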
--
-- Matthew Buckett, VLE Developer
-- Learning Technologies Group, Oxford University Computing Services
-- Tel: +44 (0)1865 283660 http://www.oucs.ox.ac.uk/ltg/