Re: [Htmlparser-developer] JIS encoding problem

Yuta,

Thanks for the updated JIS handling. I will incorporate it into the Lexer.

As Matthew has indicated, the EncodingChangeException is thrown to let
the user know that some nodes already handed out by the parser were
decoded with the wrong encoding. This is really the fault of the HTTP
server, which should have sent the correct encoding as part of the
Content-Type header. But given that you have no control over the
server, the exception is the only solution.
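
To make the mismatch concrete, here is a minimal sketch using only JDK
classes. It is not the code in InputStreamSource.setEncoding(); the class
and method names are made up for illustration, and it assumes the JDK at
hand ships the ISO-2022-JP charset and that the default encoding was
ISO-8859-1. It decodes the same bytes under both charsets and compares
the results, which is roughly the kind of check that fails for JIS text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingPrefixCheck {
    /** Returns true when both charsets decode the bytes to the same text. */
    static boolean sameText(byte[] bytes, Charset before, Charset after) {
        return new String(bytes, before).equals(new String(bytes, after));
    }

    /** First character of the bytes when decoded with the given charset. */
    static char firstChar(byte[] bytes, Charset charset) {
        return new String(bytes, charset).charAt(0);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1; // assumed default encoding
        Charset jis = Charset.forName("ISO-2022-JP"); // assumed to be available

        // Pure ASCII markup is byte-identical in both encodings.
        byte[] markup = "<html><head>".getBytes(jis);
        // Japanese text is wrapped in escape sequences by the JIS encoder.
        byte[] japanese = "\u65E5\u672C\u8A9E".getBytes(jis);

        System.out.println("markup matches:   " + sameText(markup, latin1, jis));
        System.out.println("japanese matches: " + sameText(japanese, latin1, jis));
        System.out.println("first char under default encoding: "
            + (int) firstChar(japanese, latin1));
    }
}
```

Under the default encoding the first character is the escape byte itself,
while under ISO-2022-JP it is a double-byte character, so the two buffers
cannot agree.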

After the exception is thrown, the parser has set its encoding to the
new value, so you should be able to just reset and reparse. See, for
example, the handling in StringBean:

    catch (EncodingChangeException ece)
    {
        mIsPre = false;
        mIsScript = false;
        mIsStyle = false;
        try
        {   // try again with the encoding now in force
            mParser.reset ();
            mBuffer = new StringBuffer (4096);
            mParser.visitAllNodesWith (this);
            updateStrings (mBuffer.toString ());
        }
        catch (ParserException pe)
        {
            updateStrings (pe.toString ());
        }
        finally
        {
            mBuffer = new StringBuffer (4096);
        }
    }

You'll notice that it is up to the user code (StringBean, for example) to
reset its own state so that the reparse doesn't start from an arbitrary
state.
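
The same pattern, reduced to a self-contained sketch that runs without
the library: FakeParser, EncodingChanged and Collector are hypothetical
stand-ins (for the parser, for EncodingChangeException, and for user code
such as StringBean), and the exception is unchecked here only to keep the
sketch short. The point is the order of operations in the catch block:
clear your own state, reset the parser, then reparse.

```java
import java.util.ArrayList;
import java.util.List;

public class ReparseSketch {
    /** Hypothetical stand-in for the library's EncodingChangeException. */
    static class EncodingChanged extends RuntimeException {
        EncodingChanged(String message) { super(message); }
    }

    /** Hypothetical stand-in for a parser that learns the real charset mid-document. */
    static class FakeParser {
        private boolean encodingKnown = false;

        void reset() {
            // Rewind to the start; the new encoding stays in force.
        }

        void visitAll(Collector collector) {
            if (!encodingKnown) {
                collector.add("garbled title"); // handed out under the wrong charset
                encodingKnown = true;           // the parser switches encoding itself
                throw new EncodingChanged("charset changed by META tag");
            }
            collector.add("title");
            collector.add("body");
        }
    }

    /** User code with its own state, in the role StringBean plays above. */
    static class Collector {
        final List<String> texts = new ArrayList<>();

        void add(String text) { texts.add(text); }

        List<String> collect(FakeParser parser) {
            try {
                parser.visitAll(this);
            } catch (EncodingChanged changed) {
                texts.clear();   // discard output produced under the old encoding
                parser.reset();
                parser.visitAll(this);
            }
            return texts;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Collector().collect(new FakeParser()));
    }
}
```

Without the texts.clear() call, the text collected before the exception
would remain in the result alongside the correctly decoded text.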

Derrick

Matthew Buckett wrote:

>Yuta Okamoto wrote:
>
>>But it's one thing after another. When the HTML parser finds a
>>"Content-Type" META tag, it corrects the current charset and, in
>>org.htmlparser.lexer.InputStreamSource.setEncoding(), rereads the text
>>before the META tag to compare it with the buffer already read using the
>>default encoding. In this case, the parser throws a ParserException
>>(EncodingChangeException) because it compares "[ESC]", the first
>>character of the old buffer, with a double-byte character, the first
>>character of the new buffer.
>>
>>I'm overwhelmed by that. What should I do? In the meantime, I attach the
>>revised code to this mail. Please see below.
>
>Throwing an exception is the sensible thing to do, as otherwise you may
>have mishandled the content due to the incorrect encoding.
>
>I changed EncodingChangeException so that you can find the original and
>replacement encodings. Then you can reset the parser and attempt to
>reparse the whole document using the new encoding. E.g.:
>
>            try
>            {
>                parser.visitAllNodesWith(visitor);
>            }
>            catch (EncodingChangeException ece)
>            {
>                log.debug("Switch from " + ece.getOrginalEncoding() + " to "
>                    + ece.getReplacementEncoding());
>                String encoding = ece.getReplacementEncoding();
>                parser.reset();
>                parser.setEncoding(encoding);
>                visitor = getUserFilter(type);
>                parser.visitAllNodesWith(visitor);
>            }
>
>I don't believe I ever sent the patch for EncodingChangeException back
>to the list. Unfortunately my hacked copy of HTMLParser is on my work
>computer at the moment, but I can dig it out when I'm back at work.