[Jtidy-devel] [ jtidy-Bugs-3349161 ] problem parsing CDATA


Bugs item #3349161, was opened at 2011-07-01 15:56
Message generated for change (Comment added) made by furman82
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=113153&aid=3349161&group_id=13153

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Tidy functionality
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Aaron Herstein (aarongh2012)
Assigned to: Nobody/Anonymous (nobody)
Summary: problem parsing CDATA

Initial Comment:
Parsing this page: http://www.nytimes.com/2011/04/14/world/asia/14quake.html?_r=2 throws a StringIndexOutOfBoundsException with this stack trace:

java.lang.StringIndexOutOfBoundsException: String index out of range: 16385
	at java.lang.String.checkBounds(Unknown Source)
	at java.lang.String.<init>(Unknown Source)
	at org.w3c.tidy.TidyUtils.getString(TidyUtils.java:658)
	at org.w3c.tidy.Lexer.getCDATA(Lexer.java:1835)
	at org.w3c.tidy.ParserImpl$ParseScript.parse(ParserImpl.java:667)
	at org.w3c.tidy.ParserImpl.parseTag(ParserImpl.java:203)
	at org.w3c.tidy.ParserImpl$ParseBlock.parse(ParserImpl.java:2464)
	at org.w3c.tidy.ParserImpl.parseTag(ParserImpl.java:203)
	at org.w3c.tidy.ParserImpl$ParseBlock.parse(ParserImpl.java:2464)
	at org.w3c.tidy.ParserImpl.parseTag(ParserImpl.java:203)
	at org.w3c.tidy.ParserImpl$ParseBlock.parse(ParserImpl.java:2464)
	at org.w3c.tidy.ParserImpl.parseTag(ParserImpl.java:203)
	at org.w3c.tidy.ParserImpl$ParseBlock.parse(ParserImpl.java:2464)
	at org.w3c.tidy.ParserImpl.parseTag(ParserImpl.java:203)
	at org.w3c.tidy.ParserImpl$ParseBlock.parse(ParserImpl.java:2464)
	at org.w3c.tidy.ParserImpl.parseTag(ParserImpl.java:203)
	at org.w3c.tidy.ParserImpl$ParseBlock.parse(ParserImpl.java:2464)
	at org.w3c.tidy.ParserImpl.parseTag(ParserImpl.java:203)
	at org.w3c.tidy.ParserImpl$ParseBlock.parse(ParserImpl.java:2464)
	at org.w3c.tidy.ParserImpl.parseTag(ParserImpl.java:203)
	at org.w3c.tidy.ParserImpl$ParseBlock.parse(ParserImpl.java:2464)
	at org.w3c.tidy.ParserImpl.parseTag(ParserImpl.java:203)
	at org.w3c.tidy.ParserImpl$ParseBody.parse(ParserImpl.java:971)
	at org.w3c.tidy.ParserImpl.parseTag(ParserImpl.java:203)
	at org.w3c.tidy.ParserImpl$ParseHTML.parse(ParserImpl.java:483)
	at org.w3c.tidy.ParserImpl.parseDocument(ParserImpl.java:3401)
	at org.w3c.tidy.Tidy.parse(Tidy.java:435)
	at org.w3c.tidy.Tidy.parse(Tidy.java:658)

----------------------------------------------------------------------

Comment By: Matt Furman (furman82)
Date: 2011-09-08 14:12

Message:
I also ran into this issue and "fixed" it locally.

The flaw appears to be in addByte in Lexer.java. The function assumes
that the buffer only gets examined one byte at a time, but in the CDATA
function (getCDATA) the call to TidyUtils.getString passes in a length
that is greater than 1. I overloaded the appropriate functions so that
callers can pass in the number of bytes the buffer needs to grow by.

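    /**
     * Adds a byte to lexer buffer.
     * @param c byte to add
     */
    public void addByte(int c)
    {
        addByte(c, 1);
    }

    /**
     * Adds a byte to lexer buffer, growing the buffer first if fewer than
     * size bytes are free.
     * @param c byte to add
     * @param size number of bytes the buffer must have room for
     */
    public void addByte(int c, int size)
    {
        if (this.lexsize + size >= this.lexlength)
        {
            // double the capacity (starting at 8192) until size more bytes fit
            while (this.lexsize + size >= this.lexlength)
            {
                if (this.lexlength == 0)
                {
                    this.lexlength = 8192;
                }
                else
                {
                    this.lexlength = this.lexlength * 2;
                }
            }

            // copy the old contents over and repoint nodes at the new array
            byte[] temp = this.lexbuf;
            this.lexbuf = new byte[this.lexlength];
            if (temp != null)
            {
                System.arraycopy(temp, 0, this.lexbuf, 0, temp.length);
                updateNodeTextArrays(temp, this.lexbuf);
            }
        }

        this.lexbuf[this.lexsize++] = (byte) c;
        this.lexbuf[this.lexsize] = (byte) '\0'; // debug
    }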

Once I changed the necessary associated functions, it seemed to do the
trick.
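
For anyone who wants to see the idea in isolation, here is a minimal
standalone sketch. It is not JTidy code: the class GrowableByteBuffer and
its methods ensureCapacity, getString, capacity and size are hypothetical
names made up for illustration, and the ISO-8859-1 charset is an
assumption. It separates "reserve room for N bytes" from "append one
byte", which is a slightly different arrangement from the overload above.
It shows that reading a multi-byte range past the current capacity fails
until the room has been reserved.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for the lexer buffer; not part of JTidy. */
final class GrowableByteBuffer {
    private byte[] buf;  // plays the role of lexbuf
    private int size;    // plays the role of lexsize

    /** Grows the array until extra more bytes, plus a terminator, fit. */
    public void ensureCapacity(int extra) {
        int capacity = (buf == null) ? 0 : buf.length;
        int needed = size + extra + 1;
        if (needed <= capacity) {
            return;
        }
        // Start at 8192 and double, mirroring the growth policy above.
        int newCapacity = (capacity == 0) ? 8192 : capacity;
        while (newCapacity < needed) {
            newCapacity *= 2;
        }
        byte[] grown = new byte[newCapacity];
        if (buf != null) {
            System.arraycopy(buf, 0, grown, 0, size);
        }
        buf = grown;
    }

    /** Appends one byte, reserving room for it first. */
    public void addByte(int c) {
        ensureCapacity(1);
        buf[size++] = (byte) c;
        buf[size] = 0; // terminator, as in the original
    }

    /** Decodes length bytes starting at start; throws if out of range. */
    public String getString(int start, int length) {
        return new String(buf, start, length, StandardCharsets.ISO_8859_1);
    }

    public int capacity() {
        return (buf == null) ? 0 : buf.length;
    }

    public int size() {
        return size;
    }

    public static void main(String[] args) {
        GrowableByteBuffer b = new GrowableByteBuffer();
        for (int i = 0; i < 8190; i++) {
            b.addByte('a');
        }
        try {
            // 8190 + 10 runs past the 8192-byte array
            b.getString(8190, 10);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("without reserving: " + e.getClass().getSimpleName());
        }
        b.ensureCapacity(10); // reserve room for the whole range first
        System.out.println("capacity after reserving: " + b.capacity());
        System.out.println("chars read: " + b.getString(8190, 10).length());
    }
}
```

The design choice here is to grow once for the whole range instead of
once per byte, so a later multi-byte read stays inside the array.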

----------------------------------------------------------------------
