JTidy / Bugs / #205 StringIndexOutOfBoundsException while lexing script content

aditsu - 2010-01-04

Fixed in svn, thanks.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Comment has been marked as spam.
Undo

View and moderate all "bugs Discussion" comments posted by this user

Mark all as spam, and block user from posting to "Bugs"

Anonymous - 2010-03-15

It has not been fixed, at least not in revision 927 :

java.lang.StringIndexOutOfBoundsException caught: java.lang.StringIndexOutOfBoundsException: String index out of range: 16385
at java.lang.String.checkBounds(String.java:401)
at java.lang.String.<init>(String.java:442)
at org.w3c.tidy.TidyUtils.getString(null:-1)
at org.w3c.tidy.Lexer.getCDATA(null:-1)
at org.w3c.tidy.ParserImpl$ParseScript.parse(null:-1)
at org.w3c.tidy.ParserImpl.parseTag(null:-1)
at org.w3c.tidy.ParserImpl$ParseBody.parse(null:-1)
at org.w3c.tidy.ParserImpl.parseTag(null:-1)
at org.w3c.tidy.ParserImpl$ParseHTML.parse(null:-1)
at org.w3c.tidy.ParserImpl.parseDocument(null:-1)
at org.w3c.tidy.Tidy.parse(null:-1)
at org.w3c.tidy.Tidy.parse(null:-1)
at org.w3c.tidy.Tidy.parseDOM(null:-1)</init>

Problem is caused by

matches = container.element.equalsIgnoreCase(TidyUtils.getString(lexbuf, start,
container.element.length()));

in situations when start + container.element.length() > lexlength

To fix that problem above lines should be replaced by

matches = container.element.equalsIgnoreCase(TidyUtils.getString(lexbuf, start,
lexsize - start - 1));

both in block handling state CDATA_STARTTAG and CDATA_ENDTAG

It has not been fixed, at least not in revision 927 : java.lang.StringIndexOutOfBoundsException caught: java.lang.StringIndexOutOfBoundsException: String index out of range: 16385 at java.lang.String.checkBounds(String.java:401) at java.lang.String.<init>(String.java:442) at org.w3c.tidy.TidyUtils.getString(null:-1) at org.w3c.tidy.Lexer.getCDATA(null:-1) at org.w3c.tidy.ParserImpl$ParseScript.parse(null:-1) at org.w3c.tidy.ParserImpl.parseTag(null:-1) at org.w3c.tidy.ParserImpl$ParseBody.parse(null:-1) at org.w3c.tidy.ParserImpl.parseTag(null:-1) at org.w3c.tidy.ParserImpl$ParseHTML.parse(null:-1) at org.w3c.tidy.ParserImpl.parseDocument(null:-1) at org.w3c.tidy.Tidy.parse(null:-1) at org.w3c.tidy.Tidy.parse(null:-1) at org.w3c.tidy.Tidy.parseDOM(null:-1) Problem is caused by matches = container.element.equalsIgnoreCase(TidyUtils.getString(lexbuf, start, container.element.length())); in situations when start + container.element.length() > lexlength To fix that problem above lines should be replaced by matches = container.element.equalsIgnoreCase(TidyUtils.getString(lexbuf, start, lexsize - start - 1)); both in block handling state CDATA_STARTTAG and CDATA_ENDTAG

Add attachments
Cancel
You seem to have CSS turned off. Please don't fill out this field.

You seem to have CSS turned off. Please don't fill out this field.

New Attachment:

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

aditsu - 2010-03-15

The bug was reported against revision 938 and you're complaining that it's not fixed in revision 927?!? That doesn't make any sense.
Anyway, the bug was fixed in revision 1094.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Comment has been marked as spam.
Undo

View and moderate all "bugs Discussion" comments posted by this user

Mark all as spam, and block user from posting to "Bugs"

Anonymous - 2010-03-15

sorry, but this is source that I am seeing in Lexer.java from jtidy-r938-sources.zip downloaded from sourceforge and also in one that I can see in svn repository (e.g. http://jtidy.svn.sourceforge.net/viewvc/jtidy/trunk/jtidy/src/main/java/org/w3c/tidy/Lexer.java?view=log)

Last edit: Anonymous 2014-08-01

sorry, but this is source that I am seeing in Lexer.java from jtidy-r938-sources.zip downloaded from sourceforge and also in one that I can see in svn repository (e.g. http://jtidy.svn.sourceforge.net/viewvc/jtidy/trunk/jtidy/src/main/java/org/w3c/tidy/Lexer.java?view=log)

Add attachments
Cancel
You seem to have CSS turned off. Please don't fill out this field.

You seem to have CSS turned off. Please don't fill out this field.

New Attachment:

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Michael Schierl - 2010-03-15

The bug was not fixed by aditsu in r1094, it was just "taped over" (or "silenced") by making sure that getString will not throw SIOOBE any longer:

http://jtidy.svn.sourceforge.net/viewvc/jtidy/trunk/jtidy/src/main/java/org/w3c/tidy/TidyUtils.java?r1=1094&r2=1093&pathrev=1094

When I saw this "fix" I knew that I'll have to migrate to some other HTML sanitizer than JTidy, at least for cases where the sanitizing happens automatically without anyone (except customers) looking at the output afterwards. (yes I know the NO WARRANTY clause of most open source licenses, so I will not complain about it).

cigaly: If you managed to track down the root cause (I tried it but gave up after searching for a few hours), feel free to attach a patch to this bug. I will also test it and if it works fine with the "real" files that show that bug (with r1094 reversed) I will at least apply it to my private version, if aditsu does not want to apply it).

I don't think forking is the right way to handle those issues, but if there is no alternative to it (r1094 is not, for me), I'll do my private fork.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

aditsu - 2010-03-16

cigaly: you're talking about the revision of a single file, Lexer.java, but the fix is not there.
I agree with schierlm about one thing: the change in r1094 was in TidyUtils, not in Lexer.
However, I fully disagree about the other comments.

First of all, I would like to remind everybody that this is a port of Tidy (aka HTML Tidy), which was written in C. JTidy strives to be completely compatible with it.

cigaly says the problem is in these lines of Lexer.java:
matches = container.element.equalsIgnoreCase(TidyUtils.getString(lexbuf, start,
container.element.length()));
The corresponding lines of lexer code in Tidy are:
matches = TY_(tmbstrncasecmp)(container->element, lexer->lexbuf + start,
TY_(tmbstrlen)(container->element)) == 0;
As you can see, it's a direct translation, except strncasecmp which is replaced by equalsIgnoreCase and TidyUtils.getString.
And the error happened because TidyUtils.getString (in conjunction with equalsIgnoreCase) didn't work exactly like tmbstrncasecmp.

Now, by changing the code to:
matches = container.element.equalsIgnoreCase(TidyUtils.getString(lexbuf, start,
lexsize - start - 1));
the meaning of the code will change and will be different from Tidy. That's already a big problem, however the greatest problem is that TidyUtils.getString remains unfixed, so it can still cause similar problems when it is called from (many) other places.
Therefore, I see "lexsize - start - 1" as the "taping over" (I'm not even sure it works correctly), and changing TidyUtils.getString to stop at the end of the buffer (which is null-terminated in C) as the correct fix, and that is why I chose it. It wasn't a "let's silence this error" decision. There may be better ways to port tmbstrncasecmp, but at least this fix can make the current way work correctly.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Michael Schierl - 2010-03-16

Hello Adrian,

Thank you very much for your fast, polite and informative comment. We need more developers like you :-)

I can understand your point (assuming that most calls of getString are in similar situations where some compare function that stops at a null byte was replaced).

You might want to adjust the method comment and maybe rename the length parameter to max length, and maybe even check for embedded nulls and cut the string at the first one, as tmbstrncasecmp("foo\0bar", "foo\0foo", 7) will return 0 as the strings are the same up to the first 0 byte.

I hope you have double-checked the other calls to getString, as this cutting might easily hide an obscure bug even deeper :)

Some minor issue: the length() function and the tmbstrlen (which counts UTF-8 bytes) will return different values if non-ASCII-characters are involved; therefore a comparison that worked in C might fail in Java, as getString will convert too few bytes. But I understand this will really take time to fix properly, so just using the C version for international documents might be the better short-term solution :-)

[I wrote a long comment about that including some real-world examples, but trying to send the comment ran into a server error, maybe because of some weird characters, so I am writing it for the second time now, without any examples. Sorry for that...]

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

StringIndexOutOfBoundsException while lexing script content

Group

Searches

Help

#205 StringIndexOutOfBoundsException while lexing script content

Discussion