Re: [Htmlparser-developer] JIS encoding problem
From: Matthew B. <mat...@ou...> - 2006-04-19 10:03:37
Yuta Okamoto wrote:
> But it's one thing after another. When HTML parser find a "Content-Type"
> META tag, correct the current charset and read string before META tag once
> again to compare with the buffer already read by default encoding in
> org.htmlparser.lexer.InputStreamSource.setEncoding(). In this case, HTML
> parser throws ParserException(EncodingChangeException) because of comparing
> "[ESC]" from first character of old buffer with double byte character from
> that of new buffer.
>
> I'm overwhelmed by that. What should I do? In the meantime, I attach the
> revised code to this mail. please see the below.

Throwing an exception is the sensible thing to do, as otherwise you may
have mishandled the content already read with the incorrect encoding. I
changed EncodingChangeException so that you can find out the original and
replacement encodings. You can then reset the parser and attempt to reparse
the whole document using the new encoding.

E.g.:

    try {
        parser.visitAllNodesWith(visitor);
    } catch (EncodingChangeException ece) {
        // These two getters come from my patched EncodingChangeException.
        log.debug("Switch from " + ece.getOrginalEncoding() + " to "
                + ece.getReplacementEncoding());
        String encoding = ece.getReplacementEncoding();
        // Start over from the beginning with the replacement encoding.
        parser.reset();
        parser.setEncoding(encoding);
        visitor = getUserFilter(type);
        parser.visitAllNodesWith(visitor);
    }

I don't believe I ever sent the patch for EncodingChangeException back to
the list. Unfortunately my hacked copy of HTMLParser is on my work computer
at the moment, but I can dig it out when I'm back at work.

-- 
-- Matthew Buckett, VLE Developer
-- Learning Technologies Group, Oxford University Computing Services
-- Tel: +44 (0)1865 283660 http://www.oucs.ox.ac.uk/ltg/
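
To illustrate the mismatch described in the quoted message, here is a minimal JDK-only sketch. It does not use HTMLParser at all. The class and method names are made up for illustration, and ISO-8859-1 is only an assumption about which encoding the stream was first read with. It decodes the same ISO-2022-JP bytes twice and compares the first character of each result, which is roughly the situation the comparison in setEncoding() runs into.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of HTMLParser.
class EncodingMismatchDemo {

    // Decode the same bytes under the initially assumed charset and under the
    // charset declared later, and return the first character of each result.
    static char[] firstChars(byte[] bytes, Charset assumed, Charset declared) {
        String before = new String(bytes, assumed);
        String after = new String(bytes, declared);
        return new char[] { before.charAt(0), after.charAt(0) };
    }

    public static void main(String[] args) {
        Charset jis = Charset.forName("ISO-2022-JP");
        // Three kanji written as Unicode escapes so the source file stays ASCII.
        byte[] bytes = "\u65e5\u672c\u8a9e".getBytes(jis);

        // Assumption: the stream was first read as ISO-8859-1.
        char[] c = firstChars(bytes, StandardCharsets.ISO_8859_1, jis);

        System.out.println("assumed charset:  U+" + Integer.toHexString(c[0]));
        System.out.println("declared charset: U+" + Integer.toHexString(c[1]));
        System.out.println("first characters match: " + (c[0] == c[1]));
    }
}
```

ISO-2022-JP opens a run of kanji with an escape sequence, so under the assumed single-byte encoding the first character is ESC (0x1B), while under ISO-2022-JP it is the first kanji. The characters already handed out under the old encoding therefore cannot be reconciled with the re-read ones, which is why reparsing from the start is the safer recovery.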