[Htmlparser-developer] JIS encoding problem

Dear All,

I'm Yuta Okamoto, a part-time employee of Ariel Networks, Inc.

I'm writing to ask about problems with HTML documents that include "JIS
encoding" (ISO-2022-JP) strings. In Japan, there are many types and versions
of character sets. JIS encoding, one of the popular Japanese charsets, is
defined as a subset of ISO-2022.

We're developing an application using the HTML parser library and have run
into some problems. For example, consider an HTML document that includes JIS
encoded strings, as below:

    <HTML>
    <HEAD>
    <TITLE>[JIS encoding strings]</TITLE>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-2022-jp">
    ...
    </HEAD>
    <BODY> ... </BODY>
    </HTML>

In this case, HTML parser can't recognize "</TITLE>" and treats the
following tags and strings as the content of "TITLE". To find the reason, I
got the source of HTML parser and traced its processing.

As a result, I found the causes in org.htmlparser.lexer.Lexer.parseString()
and scanJIS(). Within JIS encoded strings, several kinds of "escape
sequence" defined by ISO-2022 are used to switch character sets. For example,

    [ESC] $ B [double byte characters] [ESC] ( B

Here "[ESC] $ B" means "switch to the JIS X 0208-1983 (new JIS) charset",
and "[ESC] ( B" means "switch to the US-ASCII charset". For more detail,
please see ISO-2022, RFC 1468 or RFC 1554.
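
To illustrate how these escape sequences change the meaning of the bytes
that follow, here is a minimal sketch, not taken from HTML parser. The class
JisEscapeDemo and its tokenize() method are hypothetical names, and the input
is assumed to hold one char per byte. It handles only the four common
sequences and prints double-byte characters as hex.

```java
import java.util.ArrayList;
import java.util.List;

class JisEscapeDemo {
    static final char ESC = 0x1b;

    // Splits ISO-2022-JP data (one char per byte) into escape sequences,
    // double-byte characters (shown as hex) and single-byte characters.
    static List<String> tokenize(String raw) {
        List<String> tokens = new ArrayList<>();
        boolean doubleByte = false;
        int i = 0;
        while (i < raw.length()) {
            char ch = raw.charAt(i);
            if (ch == ESC && i + 2 < raw.length()) {
                char a = raw.charAt(i + 1);
                char b = raw.charAt(i + 2);
                if (a == '$' && (b == '@' || b == 'B')) {
                    doubleByte = true;           // switch to JIS X 0208
                    tokens.add("[ESC] $ " + b);
                    i += 3;
                    continue;
                }
                if (a == '(' && (b == 'B' || b == 'J')) {
                    doubleByte = false;          // switch to a single-byte set
                    tokens.add("[ESC] ( " + b);
                    i += 3;
                    continue;
                }
            }
            if (doubleByte && i + 1 < raw.length()) {
                // Two bytes form one character while in double-byte mode.
                tokens.add(String.format("%02X%02X", (int) ch, (int) raw.charAt(i + 1)));
                i += 2;
            } else {
                tokens.add(String.valueOf(ch));
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The first double-byte character has 0x3C ('<') as its second byte.
        String raw = ESC + "$B" + "$<F|" + ESC + "(B" + "<B>";
        System.out.println(tokenize(raw));
        // prints [[ESC] $ B, 243C, 467C, [ESC] ( B, <, B, >]
    }
}
```

Note that the byte 0x3C inside the double-byte run is part of a character,
not the start of a tag. This is why the lexer has to skip the whole run
before it looks for "<" again.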

HTML parser recognizes a string enclosed by ISO-2022 escape sequences.
However, it recognizes only a string beginning with "[ESC] $ B" and ending
with "[ESC] ( J", which means "switch to JIS X 0201-1976 ("Roman" set)". In
the above example, HTML parser therefore can't find the end of the JIS
encoded string before the end of the document. To resolve this, I revised
"org.htmlparser.lexer.Lexer.java", and that improved this problem.

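The effect of the terminator check can be reproduced in isolation. The
following is a simplified sketch, not the library's code. JisRunScan and
endOfRun() are hypothetical names, and the scanner works on a String instead
of a Page and Cursor. It uses the same three-state idea as scanJIS(), with
the accepted final characters passed in as a parameter.

```java
class JisRunScan {
    // Returns the index just past the first "[ESC] ( x" whose final
    // character x is listed in finals, or s.length() if none is found.
    static int endOfRun(String s, int from, String finals) {
        int state = 0;
        for (int i = from; i < s.length(); i++) {
            char ch = s.charAt(i);
            switch (state) {
                case 0: // looking for [ESC]
                    if (ch == 0x1b) state = 1;
                    break;
                case 1: // seen [ESC], expecting '('
                    state = (ch == '(') ? 2 : 0;
                    break;
                default: // seen "[ESC] (", expecting a final character
                    if (finals.indexOf(ch) >= 0) return i + 1;
                    state = 0;
                    break;
            }
        }
        return s.length();
    }

    public static void main(String[] args) {
        char esc = 0x1b;
        // [ESC] $ B + two double-byte characters + [ESC] ( B + "</TITLE>"
        String doc = esc + "$BF|K\\" + esc + "(B</TITLE>";
        // Accepting only 'J' misses the terminator and runs to the end.
        System.out.println(endOfRun(doc, 3, "J"));    // prints 18
        // Accepting 'B', 'J', 'H' and 'I' stops right before "</TITLE>".
        System.out.println(endOfRun(doc, 3, "BJHI")); // prints 10
    }
}
```
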
But it's one thing after another. When HTML parser finds a "Content-Type"
META tag, it changes the current charset and, in
org.htmlparser.lexer.InputStreamSource.setEncoding(), reads the text before
the META tag once again to compare it with the buffer already read using the
default encoding. In this case, HTML parser throws a ParserException
(EncodingChangeException), because it compares the "[ESC]" at the first
character of the old buffer with the double-byte character at the first
character of the new buffer.
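
The mismatch can be shown outside the library with a small sketch. It is
only an illustration of the comparison, not the code in setEncoding().
EncodingChangeDemo and firstMismatch() are hypothetical names. I assume the
default encoding is ISO-8859-1 and that the JRE provides the ISO-2022-JP
charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingChangeDemo {
    // Index of the first differing character, or -1 if the common
    // prefix of the two strings is identical.
    static int firstMismatch(String before, String after) {
        int n = Math.min(before.length(), after.length());
        for (int i = 0; i < n; i++)
            if (before.charAt(i) != after.charAt(i))
                return i;
        return -1;
    }

    public static void main(String[] args) {
        char esc = 0x1b;
        // "<TITLE>" + [ESC] $ B + two double-byte characters + [ESC] ( B
        String raw = "<TITLE>" + esc + "$BF|K\\" + esc + "(B";
        byte[] bytes = raw.getBytes(StandardCharsets.ISO_8859_1);

        // Assumption: the parser first reads the bytes as ISO-8859-1.
        String before = new String(bytes, StandardCharsets.ISO_8859_1);
        // Assumption: this JRE provides the ISO-2022-JP charset.
        String after = new String(bytes, Charset.forName("ISO-2022-JP"));

        // The old buffer holds [ESC] where the new one holds a decoded
        // double-byte character, so the two buffers cannot agree.
        System.out.println("first mismatch at index " + firstMismatch(before, after));
    }
}
```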

I'm overwhelmed by that. What should I do? In the meantime, I have attached
the revised code to this mail. Please see below.

Regards,
Okamoto

----------

    /**
     * Advance the cursor through a JIS escape sequence.<p>
     *
     * NOTE:<br>
     * A list of ISO-2022 escape sequences for charset switching.<br>
     * For more detail, see ISO-2022, RFC1468 or RFC1554.<p>
     *
     * [ double byte characters ]
     * <ul>
     * <li>(*) JIS X 0208-1978(old JIS): [ESC] $ @
     * <li>(*) JIS X 0208-1983(new JIS): [ESC] $ B
     * <li>JIS X 0208-1990: [ESC] & @ [ESC] $ B
     * <li>JIS X 0212-1990: [ESC] $ ( D
     * <li>1st plane of JIS X 0213:2000: [ESC] $ ( O
     * <li>1st plane of JIS X 0213:2004: [ESC] $ ( Q
     * <li>2nd plane of JIS X 0213:2000: [ESC] $ ( P
     * </ul>
     *
     * <p>[ single byte characters ]
     * <ul>
     * <li>(*) ISO/IEC 646 IRV(US-ASCII): [ESC] ( B
     * <li>(*) JIS X 0201-1976 ("Roman" set)
     * <ul>
     * <li>[ESC] ( J
     * <li>[ESC] ( H (NOT RECOMMENDED but rarely used)
     * </ul>
     * <li>JIS X 0201-1976 ("Kana" set): [ESC] ( I (NOT RECOMMENDED but
     * rarely used)
     * </ul>
     *
     * <p>(*): commonly used
     *
     * @param cursor A cursor positioned within the escape sequence.
     * @exception ParserException If a problem occurs reading from the
     * source.
     */
    protected void scanJIS (Cursor cursor)
        throws
            ParserException
    {
        boolean done;
        char ch;
        int state;

        done = false;
        state = 0;
        while (!done)
        {
            ch = mPage.getCharacter (cursor);
            if (Page.EOF == ch)
                done = true;
            else
                switch (state)
                {
                    case 0:
                        if (0x1b == ch) // escape
                            state = 1;
                        break;
                    case 1:
                        if ('(' == ch)
                            state = 2;
                        else
                            state = 0;
                        break;
                    case 2:
                        if ('B' == ch || 'J' == ch || 'H' == ch || 'I' == ch)
                            done = true;
                        else
                            state = 0;
                        break;
                    default:
                        throw new IllegalStateException ("state " + state);
                }
        }
    }

    /**
     * Parse a string node.
     * Scan characters until "&lt;/", "&lt;%", "&lt;!" or &lt; followed by a
     * letter is encountered, or the input stream is exhausted, in which
     * case <code>null</code> is returned.
     * @param start The position at which to start scanning.
     * @param quotesmart If <code>true</code>, strings ignore quoted
     * contents.
     * @return The parsed node.
     * @exception ParserException If a problem occurs reading from the
     * source.
     */
    protected Node parseString (int start, boolean quotesmart)
        throws
            ParserException
    {
        boolean done;
        char ch;
        char quote;

        done = false;
        quote = 0;
        while (!done)
        {
            ch = mPage.getCharacter (mCursor);
            if (Page.EOF == ch)
                done = true;
            else if (0x1b == ch) // escape
            {
                ch = mPage.getCharacter (mCursor);
                if (Page.EOF == ch)
                    done = true;
                else if ('$' == ch)
                {
                    ch = mPage.getCharacter (mCursor);
                    if (Page.EOF == ch)
                        done = true;
                    // JIS X 0208-1978 and JIS X 0208-1983
                    else if ('@' == ch || 'B' == ch)
                        scanJIS (mCursor);
                    /*
                    // JIS X 0212-1990
                    else if ('(' == ch)
                    {
                        ch = mPage.getCharacter (mCursor);
                        if (Page.EOF == ch)
                            done = true;
                        else if ('D' == ch)
                            scanJIS (mCursor);
                        else
                        {
                            mCursor.retreat ();
                            mCursor.retreat ();
                            mCursor.retreat ();
                        }
                    }
                    */
                    else
                    {
                        mCursor.retreat ();
                        mCursor.retreat ();
                    }
                }
                else
                    mCursor.retreat ();
            }
            else if ( ...
        }
    }