[Htmlparser-developer] JIS encoding problem
From: Yuta O. <ok...@ar...> - 2006-04-19 09:49:18
Dear All,
I'm Yuta Okamoto, a part-time employee of Ariel Networks, Inc.
I'm writing to ask about a problem with HTML documents that contain "JIS
encoding" (ISO-2022-JP) strings. In Japan, there are many types and versions
of character sets. JIS encoding, one of the popular Japanese charsets, is
defined as a subset of ISO-2022.
We're developing an application using the HTML parser library, and have run
into some problems. For example, consider an HTML document that includes
JIS-encoded strings, as below:
<HTML>
<HEAD>
<TITLE>[JIS encoding strings]</TITLE>
<meta http-equiv="Content-Type" content="text/html; charset=iso-2022-jp">
...
</HEAD>
<BODY> ... </BODY>
</HTML>
In this case, HTML parser fails to recognize "</TITLE>" and treats the tags
and strings that follow it as the content of "TITLE". To find the reason, I
got the source of HTML parser and traced its processing.
As a result, I found the causes in org.htmlparser.lexer.Lexer.parseString()
and scanJIS(). Within JIS-encoded strings, several kinds of "escape
sequence" defined by ISO-2022 are used to switch character sets. For example,
[ESC] $ B [double byte characters] [ESC] ( B
where "[ESC] $ B" means "switch to JIS X 0208-1983 (new JIS) charset" and
"[ESC] ( B" means "switch to US-ASCII charset". For more details, please see
ISO-2022, RFC1468 or RFC1554.
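The bracketing described above can be illustrated with a small standalone
sketch. It is not part of HTML parser: the class name JisEscapeSketch and
its methods are hypothetical, and it only knows the four commonly used
sequences. It walks a string that still contains the raw escape characters
(as it would if the bytes were read with a single-byte default encoding) and
reports which charset each escape sequence switches to.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in HTML parser.
class JisEscapeSketch {
    static final char ESC = 0x1b;

    /** Label for the three-character escape sequence at index i, or null if none is recognized. */
    static String classify(String s, int i) {
        if (i + 2 >= s.length() || s.charAt(i) != ESC) return null;
        char a = s.charAt(i + 1);
        char b = s.charAt(i + 2);
        if (a == '$' && b == '@') return "JIS X 0208-1978";
        if (a == '$' && b == 'B') return "JIS X 0208-1983";
        if (a == '(' && b == 'B') return "US-ASCII";
        if (a == '(' && b == 'J') return "JIS X 0201-1976 Roman";
        return null;
    }

    /** Collects, in order, the charsets selected by the escape sequences found in s. */
    static List<String> designations(String s) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            String label = classify(s, i);
            if (label != null) {
                found.add(label);
                i += 2; // skip the two characters that follow ESC
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // "F|K\" stands in for two double-byte characters seen as raw bytes.
        String raw = "<TITLE>" + ESC + "$BF|K\\" + ESC + "(B</TITLE>";
        System.out.println(designations(raw));
    }
}
```

In this sample the double-byte run is opened by "[ESC] $ B" and closed by
"[ESC] ( B", so a scanner that only accepts "[ESC] ( J" as the closing
sequence would not see the end of the run.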
HTML parser recognizes a string enclosed by ISO-2022 escape sequences.
However, it only recognizes strings beginning with "[ESC] $ B" and ending
with "[ESC] ( J", which means "switch to JIS X 0201-1976 ("Roman" set)". When
the string ends with "[ESC] ( B" instead, as in the above example, HTML parser
can't find the end of the JIS-encoded string before the end of the document.
To resolve this, I revised "org.htmlparser.lexer.Lexer.java", and this
problem went away.
But it's one thing after another. When HTML parser finds a "Content-Type"
META tag, it changes the current charset and re-reads the input preceding the
META tag, comparing it with the buffer already read using the default
encoding, in org.htmlparser.lexer.InputStreamSource.setEncoding(). In this
case, HTML parser throws a ParserException (EncodingChangeException), because
it compares the "[ESC]" character in the old buffer with the double-byte
character at the corresponding position in the new buffer.
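The mismatch described above can be reproduced outside HTML parser with a
minimal sketch. It assumes the default encoding behaves like a single-byte
charset such as ISO-8859-1, and that the JDK in use ships the ISO-2022-JP
charset (the code checks for it). The class name EncodingChangeSketch and
the helper firstMismatch are hypothetical; this is not the comparison code
in InputStreamSource.setEncoding().

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the logic used by InputStreamSource.
class EncodingChangeSketch {
    static final char ESC = 0x1b;

    /** Index of the first differing character, or -1 if the strings are equal. */
    static int firstMismatch(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++)
            if (a.charAt(i) != b.charAt(i))
                return i;
        return a.length() == b.length() ? -1 : n;
    }

    public static void main(String[] args) {
        // Raw bytes of a title holding two JIS X 0208 characters.
        byte[] bytes = ("<TITLE>" + ESC + "$BF|K\\" + ESC + "(B</TITLE>")
            .getBytes(StandardCharsets.ISO_8859_1);

        // What was buffered before the META tag was seen.
        String before = new String(bytes, StandardCharsets.ISO_8859_1);

        if (Charset.isSupported("ISO-2022-JP")) {
            // The same bytes re-read after switching the charset.
            String after = new String(bytes, Charset.forName("ISO-2022-JP"));
            System.out.println("first mismatch at index "
                + firstMismatch(before, after));
        } else {
            System.out.println("ISO-2022-JP is not available on this JDK");
        }
    }
}
```

The ASCII prefix "<TITLE>" decodes identically under both charsets, so the
first difference falls where the old buffer holds "[ESC]" and the new buffer
holds a decoded double-byte character.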
I'm at a loss. What should I do? In the meantime, I have attached the
revised code to this mail. Please see below.
Regards,
Okamoto
----------
/**
* Advance the cursor through a JIS escape sequence.<p>
*
* NOTE:<br>
* A list of ISO-2022 escape sequences for charset switching.<br>
* For more detail, see ISO-2022, RFC1468 or RFC1554.<p>
*
* [ double byte characters ]
* <ul>
* <li>(*) JIS X 0208-1978(old JIS): [ESC] $ @
* <li>(*) JIS X 0208-1983(new JIS): [ESC] $ B
* <li>JIS X 0208-1990: [ESC] & @ [ESC] $ B
* <li>JIS X 0212-1990: [ESC] $ ( D
* <li>1st plane of JIS X 0213:2000: [ESC] $ ( O
* <li>1st plane of JIS X 0213:2004: [ESC] $ ( Q
* <li>2nd plane of JIS X 0213:2000: [ESC] $ ( P
* </ul>
*
* <p>[ single byte characters ]
* <ul>
* <li>(*) ISO/IEC 646 IRV(US-ASCII): [ESC] ( B
* <li>(*) JIS X 0201-1976 ("Roman" set)
* <ul>
* <li>[ESC] ( J
* <li>[ESC] ( H (NOT RECOMMENDED but rarely used)
* </ul>
* <li>JIS X 0201-1976 ("Kana" set): [ESC] ( I (NOT RECOMMENDED but rarely used)
* </ul>
*
* <p>(*): commonly used
*
* @param cursor A cursor positioned within the escape sequence.
* @exception ParserException If a problem occurs reading from the source.
*/
protected void scanJIS (Cursor cursor)
    throws
        ParserException
{
    boolean done;
    char ch;
    int state;
    done = false;
    state = 0;
    while (!done)
    {
        ch = mPage.getCharacter (cursor);
        if (Page.EOF == ch)
            done = true;
        else
            switch (state)
            {
                case 0:
                    if (0x1b == ch) // escape
                        state = 1;
                    break;
                case 1:
                    if ('(' == ch)
                        state = 2;
                    else
                        state = 0;
                    break;
                case 2:
                    if ('B' == ch || 'J' == ch || 'H' == ch || 'I' == ch)
                        done = true;
                    else
                        state = 0;
                    break;
                default:
                    throw new IllegalStateException ("state " + state);
            }
    }
}
/**
* Parse a string node.
* Scan characters until "</", "<%", "<!" or < followed by a
* letter is encountered, or the input stream is exhausted, in which
* case <code>null</code> is returned.
* @param start The position at which to start scanning.
* @param quotesmart If <code>true</code>, strings ignore quoted contents.
* @return The parsed node.
* @exception ParserException If a problem occurs reading from the source.
*/
protected Node parseString (int start, boolean quotesmart)
    throws
        ParserException
{
    boolean done;
    char ch;
    char quote;
    done = false;
    quote = 0;
    while (!done)
    {
        ch = mPage.getCharacter (mCursor);
        if (Page.EOF == ch)
            done = true;
        else if (0x1b == ch) // escape
        {
            ch = mPage.getCharacter (mCursor);
            if (Page.EOF == ch)
                done = true;
            else if ('$' == ch)
            {
                ch = mPage.getCharacter (mCursor);
                if (Page.EOF == ch)
                    done = true;
                // JIS X 0208-1978 and JIS X 0208-1983
                else if ('@' == ch || 'B' == ch)
                    scanJIS (mCursor);
                /*
                // JIS X 0212-1990
                else if ('(' == ch)
                {
                    ch = mPage.getCharacter (mCursor);
                    if (Page.EOF == ch)
                        done = true;
                    else if ('D' == ch)
                        scanJIS (mCursor);
                    else
                    {
                        mCursor.retreat ();
                        mCursor.retreat ();
                        mCursor.retreat ();
                    }
                }
                */
                else
                {
                    mCursor.retreat ();
                    mCursor.retreat ();
                }
            }
            else
                mCursor.retreat ();
        }
        else if ( ...
    }
}