Thread: [Htmlparser-developer] JIS encoding problem
From: Yuta O. <ok...@ar...> - 2006-04-19 09:49:18
Dear All,

I'm Yuta Okamoto, a part-time employee of Ariel Networks, Inc. I'm writing to ask about problems with HTML documents that include "JIS encoding" (ISO-2022-JP) strings.

In Japan, there are many types and versions of character sets. JIS encoding, one of the popular Japanese charsets, is defined as a subset of ISO-2022. We're developing an application using the HTML parser library, and have run into some problems. For example, consider an HTML document that includes JIS-encoded strings, as below:

    <HTML>
    <HEAD>
    <TITLE>[JIS encoding strings]</TITLE>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-2022-jp">
    ...
    </HEAD>
    <BODY>
    ...
    </BODY>
    </HTML>

In this case, HTML parser can't recognize "</TITLE>" and treats the following tags and strings as content of "TITLE".

To find the reason, I got the source of HTML parser and traced its processing. As a result, I found the causes in org.htmlparser.lexer.Lexer.parseString() and scanJIS().

Within JIS-encoded strings, several kinds of "escape sequence" defined by ISO-2022 are used to switch character sets. For example,

    [ESC] $ B [double byte characters] [ESC] ( B

where "[ESC] $ B" means "switch to JIS X 0208-1983 (new JIS) charset" and "[ESC] ( B" means "switch to US-ASCII charset". For more detail, please see ISO-2022, RFC1468 or RFC1554.

HTML parser recognizes a string enclosed by ISO-2022 escape sequences. However, it only recognizes a string beginning with "[ESC] $ B" and ending with "[ESC] ( J", which means "switch to JIS X 0201-1976 ("Roman" set)". In the above example, HTML parser can't find the end of the JIS-encoded string before the end of the document.

In order to resolve it, I revised "org.htmlparser.lexer.Lexer.java", and this problem is improved.

But it's one thing after another. When HTML parser finds a "Content-Type" META tag, it corrects the current charset and, in org.htmlparser.lexer.InputStreamSource.setEncoding(), reads the string before the META tag once again to compare it with the buffer already read using the default encoding. In this case, HTML parser throws a ParserException (EncodingChangeException), because it ends up comparing "[ESC]" from the old buffer with a double-byte character from the new buffer.

I'm overwhelmed by that. What should I do? In the meantime, I attach the revised code to this mail. Please see below.

Regards,
Okamoto

----------

    /**
     * Advance the cursor through a JIS escape sequence.<p>
     *
     * NOTE:<br>
     * A list of ISO-2022 escape sequences for charset switching.<br>
     * For more detail, see ISO-2022, RFC1468 or RFC1554.<p>
     *
     * [ double byte characters ]
     * <ul>
     * <li>(*) JIS X 0208-1978(old JIS): [ESC] $ @
     * <li>(*) JIS X 0208-1983(new JIS): [ESC] $ B
     * <li>JIS X 0208-1990: [ESC] & @ [ESC] $ B
     * <li>JIS X 0212-1990: [ESC] $ ( D
     * <li>1st plane of JIS X 0213:2000: [ESC] $ ( O
     * <li>1st plane of JIS X 0213:2004: [ESC] $ ( Q
     * <li>2nd plane of JIS X 0213:2000: [ESC] $ ( P
     * </ul>
     *
     * <p>[ single byte characters ]
     * <ul>
     * <li>(*) ISO/IEC 646 IRV(US-ASCII): [ESC] ( B
     * <li>(*) JIS X 0201-1976 ("Roman" set)
     *   <ul>
     *   <li>[ESC] ( J
     *   <li>[ESC] ( H (NOT RECOMMENDED, rarely used)
     *   </ul>
     * <li>JIS X 0201-1976 ("Kana" set): [ESC] ( I (NOT RECOMMENDED, rarely used)
     * </ul>
     *
     * <p>(*): commonly used
     *
     * @param cursor A cursor positioned within the escape sequence.
     * @exception ParserException If a problem occurs reading from the source.
     */
    protected void scanJIS (Cursor cursor)
        throws
            ParserException
    {
        boolean done;
        char ch;
        int state;

        done = false;
        state = 0;
        while (!done)
        {
            ch = mPage.getCharacter (cursor);
            if (Page.EOF == ch)
                done = true;
            else
                switch (state)
                {
                    case 0:
                        if (0x1b == ch) // escape
                            state = 1;
                        break;
                    case 1:
                        if ('(' == ch)
                            state = 2;
                        else
                            state = 0;
                        break;
                    case 2:
                        if ('B' == ch || 'J' == ch || 'H' == ch || 'I' == ch)
                            done = true;
                        else
                            state = 0;
                        break;
                    default:
                        throw new IllegalStateException ("state " + state);
                }
        }
    }

    /**
     * Parse a string node.
     * Scan characters until "</", "<%", "<!" or < followed by a
     * letter is encountered, or the input stream is exhausted, in which
     * case <code>null</code> is returned.
     * @param start The position at which to start scanning.
     * @param quotesmart If <code>true</code>, strings ignore quoted contents.
     * @return The parsed node.
     * @exception ParserException If a problem occurs reading from the source.
     */
    protected Node parseString (int start, boolean quotesmart)
        throws
            ParserException
    {
        boolean done;
        char ch;
        char quote;

        done = false;
        quote = 0;
        while (!done)
        {
            ch = mPage.getCharacter (mCursor);
            if (Page.EOF == ch)
                done = true;
            else if (0x1b == ch) // escape
            {
                ch = mPage.getCharacter (mCursor);
                if (Page.EOF == ch)
                    done = true;
                else if ('$' == ch)
                {
                    ch = mPage.getCharacter (mCursor);
                    if (Page.EOF == ch)
                        done = true;
                    // JIS X 0208-1978 and JIS X 0208-1983
                    else if ('@' == ch || 'B' == ch)
                        scanJIS (mCursor);
                    /*
                    // JIS X 0212-1990
                    else if ('(' == ch)
                    {
                        ch = mPage.getCharacter (mCursor);
                        if (Page.EOF == ch)
                            done = true;
                        else if ('D' == ch)
                            scanJIS (mCursor);
                        else
                        {
                            mCursor.retreat ();
                            mCursor.retreat ();
                            mCursor.retreat ();
                        }
                    }
                    */
                    else
                    {
                        mCursor.retreat ();
                        mCursor.retreat ();
                    }
                }
                else
                    mCursor.retreat ();
            }
            else if ( ...
        }
    }
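The buffer mismatch described in the message above can be reproduced without the parser. The following is only a rough sketch, not HTML parser code: it assumes the parser's initial encoding is ISO-8859-1 and that the JDK in use ships the ISO-2022-JP charset; the class EncodingMismatchDemo and the method firstMismatch are hypothetical names made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class; it is not part of org.htmlparser. */
final class EncodingMismatchDemo {

    /** Returns the index of the first differing character, or -1 if the strings are equal. */
    static int firstMismatch(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : n;
    }

    public static void main(String[] args) {
        // Assumption: this JDK provides the ISO-2022-JP charset.
        Charset jis = Charset.forName("ISO-2022-JP");

        // A title holding two kanji (U+65E5, U+672C), encoded as the server would send it.
        byte[] raw = "<TITLE>\u65E5\u672C</TITLE>".getBytes(jis);

        // The same bytes decoded twice: first with the assumed default encoding,
        // then with the charset named in the META tag.
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        String asJis = new String(raw, jis);

        int i = firstMismatch(asLatin1, asJis);
        System.out.printf("first mismatch at index %d: U+%04X vs U+%04X%n",
                i, (int) asLatin1.charAt(i), (int) asJis.charAt(i));
    }
}
```

Under those assumptions the first difference should fall right after "<TITLE>": the single-byte decoding sees the raw [ESC] character there, while the ISO-2022-JP decoding has already consumed the escape sequence and yields the first kanji.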
From: Matthew B. <mat...@ou...> - 2006-04-19 10:03:37
Yuta Okamoto wrote:
> But it's one thing after another. When HTML parser find a "Content-Type"
> META tag, correct the current charset and read string before META tag once
> again to compare with the buffer already read by default encoding in
> org.htmlparser.lexer.InputStreamSource.setEncoding(). In this case, HTML
> parser throws ParserException(EncodingChangeException) because of comparing
> "[ESC]" from first character of old buffer with double byte character from
> that of new buffer.
>
> I'm overwhelmed by that. What should I do? In the meantime, I attach the
> revised code to this mail. please see the below.

Throwing an Exception is the sensible thing to do, as otherwise you may have mishandled the content due to the incorrect encoding.

I changed EncodingChangeException so that you could find the original and replacement encodings. Then you can reset the parser and attempt to reparse the whole document using the new encoding. E.g.:

    try
    {
        parser.visitAllNodesWith(visitor);
    }
    catch (EncodingChangeException ece)
    {
        log.debug("Switch from " + ece.getOrginalEncoding() + " to "
            + ece.getReplacementEncoding());
        String encoding = ece.getReplacementEncoding();
        parser.reset();
        parser.setEncoding(encoding);
        visitor = getUserFilter(type);
        parser.visitAllNodesWith(visitor);
    }

I don't believe I ever sent the patch for EncodingChangeException back to the list. Unfortunately my hacked copy of HTMLParser is on my work computer at the moment, but I can dig it out when I'm back at work.

--
-- Matthew Buckett, VLE Developer
-- Learning Technologies Group, Oxford University Computing Services
-- Tel: +44 (0)1865 283660 http://www.oucs.ox.ac.uk/ltg/
From: Derrick O. <Der...@Ro...> - 2006-04-19 12:21:42
Yuta,

Thanks for the updated JIS handling. I will incorporate it into the Lexer.

As Matthew has indicated, the EncodingChangeException is thrown to let the user know that some nodes already handed out by the parser are incorrect according to the encoding. This is really the fault of the HTTP server, which should have sent the correct encoding as part of the Content-Type header string. But, given that you have no control over the server, the exception is the only solution.

After the exception is thrown, the parser has set its encoding to the new value, so you should be able to just reset and reparse. See for example the handling in StringBean:

    catch (EncodingChangeException ece)
    {
        mIsPre = false;
        mIsScript = false;
        mIsStyle = false;
        try
        {
            // try again with the encoding now in force
            mParser.reset ();
            mBuffer = new StringBuffer (4096);
            mParser.visitAllNodesWith (this);
            updateStrings (mBuffer.toString ());
        }
        catch (ParserException pe)
        {
            updateStrings (pe.toString ());
        }
        finally
        {
            mBuffer = new StringBuffer (4096);
        }
    }

You'll notice that it is up to the user code (StringBean for example) to reset its own state so that the reparse doesn't start from an arbitrary state.

Derrick
From: Yuta O. <ok...@ar...> - 2006-04-20 09:20:39
Thank you for your advice! I modified our code to reset the parser and call visitAllNodesWith() again, and parsing now completes successfully with the corrected encoding.

I also have a correction about the JIS handling. I made scanJIS() recognize "[ESC] ( I" as the end of a JIS-encoded string, but that is a mistake. According to ISO-2022-JP, it is necessary to return to the ASCII charset at the end of a line and at the end of the text. The JIS X 0201-1976 "Kana" charset, which is a single-byte charset, is not the ASCII charset.

Note that the code I modified only supports the Japanese charsets. There are many types of charsets (e.g. Chinese, Korean, Latin, etc.) which use other escape sequences. If another problem with escape sequence handling comes up, the following URLs should help to settle it.

Wikipedia - ISO/IEC 2022
http://en.wikipedia.org/wiki/ISO_2022

International Register of Coded Character Sets
http://www.itscj.ipsj.or.jp/ISO-IR/
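For readers who want to try the corrected end-of-run rule outside the Lexer, here is a minimal stand-alone sketch. It is not the code that went into HTML parser: JisRunScanner and endOfJisRun are hypothetical names, the input is a plain CharSequence rather than the library's Page and Cursor, and, as an extra assumption beyond the posted patch, an [ESC] seen while matching restarts the match instead of being dropped. Following the correction above, "[ESC] ( I" is not treated as the end of the run.

```java
/** Hypothetical stand-alone helper; it is not part of org.htmlparser. */
final class JisRunScanner {

    private static final char ESC = 0x1b;

    /**
     * Scans forward from {@code from}, which should point just past the opening
     * "[ESC] $ B" or "[ESC] $ @", and returns the index just past the escape
     * sequence that switches back to a Roman single-byte set
     * ("[ESC] ( B", "[ESC] ( J" or "[ESC] ( H").
     * Returns {@code text.length()} if the run is never closed.
     */
    static int endOfJisRun(CharSequence text, int from) {
        int state = 0; // 0: looking for ESC, 1: seen ESC, 2: seen ESC (
        for (int i = from; i < text.length(); i++) {
            char ch = text.charAt(i);
            switch (state) {
                case 0:
                    if (ch == ESC) {
                        state = 1;
                    }
                    break;
                case 1:
                    state = (ch == '(') ? 2 : (ch == ESC ? 1 : 0);
                    break;
                default: // state 2
                    if (ch == 'B' || ch == 'J' || ch == 'H') {
                        return i + 1;
                    }
                    // "[ESC] ( I" (Kana) lands here: the run continues.
                    state = (ch == ESC) ? 1 : 0;
                    break;
            }
        }
        return text.length();
    }

    public static void main(String[] args) {
        // "[ESC] $ B", four data characters, "[ESC] ( B", then the closing tag.
        String title = "\u001b$BF|K\\\u001b(B</TITLE>";
        int end = endOfJisRun(title, 3);
        System.out.println(title.substring(end)); // prints "</TITLE>"
    }
}
```

Because the double-byte data characters of JIS X 0208 never include the [ESC] code itself, looking for the complete three-character sequence is enough to tell a real charset switch from data that merely contains "(" or "B".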