[Htmlparser-developer] JIS encoding problem
From: Yuta O. <ok...@ar...> - 2006-04-19 09:49:18
Dear All,

I'm Yuta Okamoto, a part-time employee of Ariel Networks, Inc. I'm writing to ask about problems with HTML documents that include "JIS encoding" (ISO-2022-JP) strings.

In Japan, there are many types and versions of character sets. JIS encoding, one of the popular Japanese charsets, is defined as a subset of ISO-2022.

We're developing an application using the HTML parser library and have run into some problems. For example, consider an HTML document that includes JIS-encoded strings, as below:

    <HTML>
    <HEAD>
    <TITLE>[JIS encoding strings]</TITLE>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-2022-jp">
    ...
    </HEAD>
    <BODY>
    ...
    </BODY>
    </HTML>

In this case, HTML parser can't recognize "</TITLE>" and treats the following tags and strings as the content of "TITLE".

To find the reason, I got the source of HTML parser and traced its processing. As a result, I found the causes in org.htmlparser.lexer.Lexer.parseString() and scanJIS().

Within JIS-encoded strings, several kinds of "escape sequence" defined by ISO-2022 are used to switch character sets. For example,

    [ESC] $ B [double byte characters] [ESC] ( B

where "[ESC] $ B" means "switch to JIS X 0208-1983 (new JIS) charset" and "[ESC] ( B" means "switch to US-ASCII charset". For more detail, please see ISO-2022, RFC1468 or RFC1554.

HTML parser recognizes a string enclosed by ISO-2022 escape sequences. However, it recognizes only strings beginning with "[ESC] $ B" and ending with "[ESC] ( J", which means "switch to JIS X 0201-1976 ("Roman" set)". In the above example, the string ends with "[ESC] ( B" instead, so HTML parser can't find the end of the JIS-encoded string before the end of the document.

To resolve this, I revised "org.htmlparser.lexer.Lexer.java", and this problem is improved.

But it's one thing after another. When HTML parser finds a "Content-Type" META tag, it changes the current charset and, in org.htmlparser.lexer.InputStreamSource.setEncoding(), re-reads the text before the META tag to compare it with the buffer already read using the default encoding.
In this case, HTML parser throws a ParserException (EncodingChangeException), because it compares the "[ESC]" at the first character of the old buffer with the double-byte character at the first character of the new buffer.

I'm at a loss here. What should I do?

In the meantime, I attach the revised code to this mail. Please see below.

Regards,
Okamoto

----------
/**
 * Advance the cursor through a JIS escape sequence.<p>
 *
 * NOTE:<br>
 * A list of ISO-2022 escape sequences for charset switching.<br>
 * For more detail, see ISO-2022, RFC1468 or RFC1554.<p>
 *
 * [ double byte characters ]
 * <ul>
 * <li>(*) JIS X 0208-1978(old JIS): [ESC] $ @
 * <li>(*) JIS X 0208-1983(new JIS): [ESC] $ B
 * <li>JIS X 0208-1990: [ESC] & @ [ESC] $ B
 * <li>JIS X 0212-1990: [ESC] $ ( D
 * <li>1st plane of JIS X 0213:2000: [ESC] $ ( O
 * <li>1st plane of JIS X 0213:2004: [ESC] $ ( Q
 * <li>2nd plane of JIS X 0213:2000: [ESC] $ ( P
 * </ul>
 *
 * <p>[ single byte characters ]
 * <ul>
 * <li>(*) ISO/IEC 646 IRV(US-ASCII): [ESC] ( B
 * <li>(*) JIS X 0201-1976 ("Roman" set)
 *   <ul>
 *   <li>[ESC] ( J
 *   <li>[ESC] ( H (NOT RECOMMENDED; rarely used)
 *   </ul>
 * <li>JIS X 0201-1976 ("Kana" set): [ESC] ( I (NOT RECOMMENDED; rarely used)
 * </ul>
 *
 * <p>(*): commonly used
 *
 * @param cursor A cursor positioned within the escape sequence.
 * @exception ParserException If a problem occurs reading from the source.
 */
protected void scanJIS (Cursor cursor)
    throws
        ParserException
{
    boolean done;
    char ch;
    int state;

    done = false;
    state = 0;
    while (!done)
    {
        ch = mPage.getCharacter (cursor);
        if (Page.EOF == ch)
            done = true;
        else
            switch (state)
            {
                case 0: // looking for the start of an escape sequence
                    if (0x1b == ch) // escape
                        state = 1;
                    break;
                case 1: // seen [ESC], expecting '('
                    if ('(' == ch)
                        state = 2;
                    else
                        state = 0;
                    break;
                case 2: // seen [ESC] (, expecting a single byte charset designator
                    if ('B' == ch || 'J' == ch || 'H' == ch || 'I' == ch)
                        done = true;
                    else
                        state = 0;
                    break;
                default:
                    throw new IllegalStateException ("state " + state);
            }
    }
}

/**
 * Parse a string node.
 * Scan characters until "</", "<%", "<!" or < followed by a
 * letter is encountered, or the input stream is exhausted, in which
 * case <code>null</code> is returned.
 * @param start The position at which to start scanning.
 * @param quotesmart If <code>true</code>, strings ignore quoted contents.
 * @return The parsed node.
 * @exception ParserException If a problem occurs reading from the source.
 */
protected Node parseString (int start, boolean quotesmart)
    throws
        ParserException
{
    boolean done;
    char ch;
    char quote;

    done = false;
    quote = 0;
    while (!done)
    {
        ch = mPage.getCharacter (mCursor);
        if (Page.EOF == ch)
            done = true;
        else if (0x1b == ch) // escape
        {
            ch = mPage.getCharacter (mCursor);
            if (Page.EOF == ch)
                done = true;
            else if ('$' == ch)
            {
                ch = mPage.getCharacter (mCursor);
                if (Page.EOF == ch)
                    done = true;
                // JIS X 0208-1978 and JIS X 0208-1983
                else if ('@' == ch || 'B' == ch)
                    scanJIS (mCursor);
                /*
                // JIS X 0212-1990
                else if ('(' == ch)
                {
                    ch = mPage.getCharacter (mCursor);
                    if (Page.EOF == ch)
                        done = true;
                    else if ('D' == ch)
                        scanJIS (mCursor);
                    else
                    {
                        mCursor.retreat ();
                        mCursor.retreat ();
                        mCursor.retreat ();
                    }
                }
                */
                else
                {
                    // not a recognized sequence: put back both characters
                    mCursor.retreat ();
                    mCursor.retreat ();
                }
            }
            else
                mCursor.retreat ();
        }
        else if ( ...
    }
}
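
To make the effect of the accepted terminators easier to see outside the library, here is a minimal standalone sketch of the same scanning rule. It is not the htmlparser code: the class JisSkipSketch and the methods skipJis and findTagStart are hypothetical names used only for illustration, and the input is assumed to have been read one byte per character, as with a single-byte default encoding. The sample string contains a double-byte character whose second byte is 0x3C ("<"), which shows why the lexer has to skip the whole JIS section rather than look for "<" inside it.

```java
public class JisSkipSketch {
    private static final char ESC = 0x1b;

    /**
     * Skips a double-byte JIS section. 'from' is the index just after the
     * opening [ESC] $ B or [ESC] $ @. 'finals' lists the final characters
     * accepted after "[ESC] (" as the end of the section.
     * Returns the index just past the closing escape sequence, or
     * text.length() if no accepted closing sequence is found.
     */
    static int skipJis(String text, int from, String finals) {
        int state = 0; // 0: scanning, 1: seen ESC, 2: seen ESC (
        for (int i = from; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch == ESC) {
                state = 1;
            } else if (state == 1) {
                state = (ch == '(') ? 2 : 0;
            } else if (state == 2) {
                if (finals.indexOf(ch) >= 0) {
                    return i + 1;
                }
                state = 0;
            }
        }
        return text.length();
    }

    /**
     * Returns the index of the first '<' that lies outside any JIS section,
     * or -1 if there is none.
     */
    static int findTagStart(String text, String finals) {
        int i = 0;
        while (i < text.length()) {
            char ch = text.charAt(i);
            boolean opensJis = ch == ESC
                && i + 2 < text.length()
                && text.charAt(i + 1) == '$'
                && (text.charAt(i + 2) == '@' || text.charAt(i + 2) == 'B');
            if (opensJis) {
                i = skipJis(text, i + 3, finals);
            } else if (ch == '<') {
                return i;
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // [ESC] $ B, two double-byte characters ("!<" and "F|"), [ESC] ( B, then a tag
        String doc = "\u001b$B!<F|\u001b(B</TITLE>";
        System.out.println(findTagStart(doc, "BJHI")); // 10: "</TITLE>" is found
        System.out.println(findTagStart(doc, "J"));    // -1: section runs to the end
    }
}
```

With only "J" accepted, the section is never closed and the tag is lost, which matches the behaviour I described for the original code. With "B", "J", "H" and "I" accepted, the scan stops at "[ESC] ( B" and the closing tag is seen.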