Re: [Htmlparser-developer] JIS encoding problem

Yuta Okamoto wrote:

> But it's one thing after another. When HTML parser finds a "Content-Type"
> META tag, it corrects the current charset and reads the string before the
> META tag once again, to compare it with the buffer already read using the
> default encoding, in org.htmlparser.lexer.InputStreamSource.setEncoding().
> In this case, HTML parser throws ParserException (EncodingChangeException)
> because it compares "[ESC]", the first character of the old buffer, with a
> double-byte character, the first character of the new buffer.
> 
> I'm overwhelmed by that. What should I do? In the meantime, I attach the
> revised code to this mail. Please see below.

Throwing an exception is the sensible thing to do, as otherwise you may
have mishandled the content due to the incorrect encoding.
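
To make the mismatch concrete, here is a small sketch that uses only the
JDK, not HTMLParser itself. It assumes the default encoding is ISO-8859-1
and the document is ISO-2022-JP; the class and method names are made up for
illustration. The same bytes give "[ESC]" as the first character under one
charset and a kanji under the other, so the two buffers cannot match.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // Decode the bytes with the given charset and return the first character.
    static char firstChar(byte[] bytes, Charset charset) {
        return new String(bytes, charset).charAt(0);
    }

    public static void main(String[] args) {
        Charset jis = Charset.forName("ISO-2022-JP");
        // Three kanji (U+65E5 U+672C U+8A9E), written as escapes so the
        // example does not depend on the source file's own encoding.
        byte[] bytes = "\u65E5\u672C\u8A9E".getBytes(jis);

        // ISO-2022-JP switches to the kanji set with an escape sequence,
        // so the raw byte stream starts with ESC (0x1B).
        char asLatin1 = firstChar(bytes, StandardCharsets.ISO_8859_1);
        char asJis = firstChar(bytes, jis);

        System.out.println(String.format("ISO-8859-1:  U+%04X", (int) asLatin1));
        System.out.println(String.format("ISO-2022-JP: U+%04X", (int) asJis));
        System.out.println("first characters match: " + (asLatin1 == asJis));
    }
}
```

Anything the parser handed out before the META tag was produced from the
first decoding, which is why carrying on silently would be unsafe.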

I changed EncodingChangeException so that you can find the original and
replacement encodings. Then you can reset the parser and attempt to
reparse the whole document using the new encoding. E.g.:

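            try
            {
                // First pass, using the default encoding.
                parser.visitAllNodesWith(visitor);
            }
            catch (EncodingChangeException ece)
            {
                log.debug("Switch from " + ece.getOrginalEncoding() + " to "
                    + ece.getReplacementEncoding());
                String encoding = ece.getReplacementEncoding();
                // Start again from the top of the document with the
                // encoding declared in the META tag.
                parser.reset();
                parser.setEncoding(encoding);
                // Get a fresh visitor for the second pass.
                visitor = getUserFilter(type);
                parser.visitAllNodesWith(visitor);
            }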
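
The snippet above retries exactly once. If a document could trigger a
second switch, the same idea can be written as a bounded loop. The sketch
below is self-contained and does not use HTMLParser: EncodingChange and
ParseAttempt are hypothetical stand-ins for the patched
EncodingChangeException and for "reset, set encoding, visit all nodes",
and the retry limit is an assumption, not something the library provides.

```java
public class ReparseSketch {
    // Hypothetical stand-in for the patched EncodingChangeException.
    static class EncodingChange extends Exception {
        final String replacement;

        EncodingChange(String replacement) {
            super("switch to " + replacement);
            this.replacement = replacement;
        }
    }

    // Hypothetical stand-in for one full parse of the document
    // (reset the parser, set the encoding, visit all nodes).
    interface ParseAttempt {
        void run(String encoding) throws EncodingChange;
    }

    // Parse, switching encoding at most maxSwitches times.
    // Returns the encoding under which the parse completed.
    static String parseWithRetry(ParseAttempt attempt, String initial,
                                 int maxSwitches) throws EncodingChange {
        String encoding = initial;
        for (int switches = 0; ; switches++) {
            try {
                attempt.run(encoding);
                return encoding;
            } catch (EncodingChange e) {
                if (switches >= maxSwitches) {
                    throw e; // give up rather than loop forever
                }
                encoding = e.replacement;
            }
        }
    }

    public static void main(String[] args) throws EncodingChange {
        // Fake document whose META tag declares ISO-2022-JP.
        ParseAttempt fake = enc -> {
            if (!enc.equals("ISO-2022-JP")) {
                throw new EncodingChange("ISO-2022-JP");
            }
        };
        System.out.println(parseWithRetry(fake, "ISO-8859-1", 1));
    }
}
```

With maxSwitches set to 1 this behaves like the try/catch above: one
switch is followed, and a second one is rethrown to the caller.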

I don't believe I ever sent the patch for EncodingChangeException back
to the list. Unfortunately my hacked copy of HTMLParser is on my work
computer at the moment, but I can dig it out when I'm back at work.

-- 
 -- Matthew Buckett, VLE Developer
 -- Learning Technologies Group, Oxford University Computing Services
 -- Tel: +44 (0)1865 283660 http://www.oucs.ox.ac.uk/ltg/