Re: [Htmlparser-developer] JIS encoding problem
From: Derrick O. <Der...@Ro...> - 2006-04-19 12:21:42
Yuta,

Thanks for the updated JIS handling. I will incorporate it into the Lexer.

As Matthew has indicated, the EncodingChangeException is thrown to let the user know that some nodes already handed out by the parser are incorrect according to the new encoding. This is really the fault of the HTTP server, which should have sent the correct encoding as part of the Content-Type header string. But, given that you have no control over the server, the exception is the only solution.

By the time the exception is thrown, the parser has already set its encoding to the new value, so you should be able to just reset and reparse. See, for example, the handling in StringBean:

    catch (EncodingChangeException ece)
    {
        mIsPre = false;
        mIsScript = false;
        mIsStyle = false;
        try
        {   // try again with the encoding now in force
            mParser.reset ();
            mBuffer = new StringBuffer (4096);
            mParser.visitAllNodesWith (this);
            updateStrings (mBuffer.toString ());
        }
        catch (ParserException pe)
        {
            updateStrings (pe.toString ());
        }
        finally
        {
            mBuffer = new StringBuffer (4096);
        }
    }

You'll notice that it is up to the user code (StringBean, for example) to reset its own state so that the reparse doesn't start from an arbitrary state.

Derrick

Matthew Buckett wrote:

> Yuta Okamoto wrote:
>
>> But it's one thing after another. When the HTML parser finds a
>> "Content-Type" META tag, it corrects the current charset and reads the
>> string before the META tag once again, to compare it with the buffer
>> already read using the default encoding, in
>> org.htmlparser.lexer.InputStreamSource.setEncoding(). In this case, the
>> HTML parser throws a ParserException (EncodingChangeException) because it
>> compares "[ESC]", the first character of the old buffer, with a
>> double-byte character from the new buffer.
>>
>> I'm overwhelmed by that. What should I do? In the meantime, I attach the
>> revised code to this mail. Please see below.
>
> Throwing an Exception is the sensible thing to do, as otherwise you may
> have mishandled the content due to the incorrect encoding.
>
> I changed EncodingChangeException so that you could find the original and
> replacement encodings. Then you can reset the parser and attempt to
> reparse the whole document using the new encoding. E.g.:
>
>     try
>     {
>         parser.visitAllNodesWith(visitor);
>     }
>     catch (EncodingChangeException ece)
>     {
>         log.debug("Switch from " + ece.getOrginalEncoding() + " to "
>             + ece.getReplacementEncoding());
>         String encoding = ece.getReplacementEncoding();
>         parser.reset();
>         parser.setEncoding(encoding);
>         visitor = getUserFilter(type);
>         parser.visitAllNodesWith(visitor);
>     }
>
> I don't believe I ever sent the patch for EncodingChangeException back
> to the list. Unfortunately my hacked copy of HTMLParser is on my work
> computer at the moment, but I can dig it out when I'm back at work.
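
For anyone reading this thread later, the reset-and-reparse pattern discussed above can be sketched without the library. This is only an illustration under stated assumptions: FakeParser, EncodingChanged, and Collector are hypothetical stand-ins invented for this example, not HTMLParser classes. The fake parser simply "discovers" the declared encoding at its second node, switches to it, and throws, which mimics the behaviour described in this thread. The point it shows is the one Derrick makes: the caller has to discard its own partial state and reset the parser before the second pass.

```java
class ReparseSketch {

    /** Hypothetical stand-in for the parser's encoding-change signal. */
    static class EncodingChanged extends Exception {
        final String replacement;

        EncodingChanged(String replacement) {
            super("encoding changed to " + replacement);
            this.replacement = replacement;
        }
    }

    /** Hypothetical stand-in for user code that accumulates text while visiting nodes. */
    static class Collector {
        StringBuilder buffer = new StringBuilder();

        void visit(String text) {
            buffer.append(text).append(' ');
        }

        /** Throw away everything collected so far. */
        void resetState() {
            buffer = new StringBuilder();
        }
    }

    /** Hypothetical stand-in for a parser that learns the real encoding partway through. */
    static class FakeParser {
        private String encoding;          // encoding currently in force
        private final String declared;    // encoding the document itself declares
        private final String[] nodes;
        private int position = 0;

        FakeParser(String initial, String declared, String... nodes) {
            this.encoding = initial;
            this.declared = declared;
            this.nodes = nodes;
        }

        /** Rewind to the start of the input; the encoding in force is kept. */
        void reset() {
            position = 0;
        }

        void visitAll(Collector collector) throws EncodingChanged {
            while (position < nodes.length) {
                // At the second node the "META tag" reveals the declared encoding.
                if (position == 1 && !encoding.equals(declared)) {
                    encoding = declared;  // switch first, then notify the caller
                    throw new EncodingChanged(declared);
                }
                collector.visit(nodes[position] + "[" + encoding + "]");
                position++;
            }
        }
    }

    /** Visit all nodes, retrying once from the start if the encoding changes. */
    static String extract(FakeParser parser) {
        Collector collector = new Collector();
        try {
            parser.visitAll(collector);
        } catch (EncodingChanged first) {
            collector.resetState();       // drop text decoded with the wrong encoding
            parser.reset();               // reparse from the beginning
            try {
                parser.visitAll(collector);
            } catch (EncodingChanged second) {
                return "failed: " + second.getMessage();
            }
        }
        return collector.buffer.toString().trim();
    }

    public static void main(String[] args) {
        FakeParser parser = new FakeParser("ISO-8859-1", "ISO-2022-JP", "title", "body");
        System.out.println(extract(parser));
    }
}
```

If resetState() is skipped in this sketch, the node collected during the first pass stays in the buffer alongside the reparsed output; if reset() is skipped, the second pass resumes mid-document. Both are forms of the "arbitrary state" problem mentioned above.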