JTidy version: jtidy-r938.jar
Consider this example file:
public class JTidyBug {
public static void main(String[] args) throws Exception {
final String SOURCE = "\n" +
"<script>\n" + <br>
"var o={x=9};\n" + <br>
"var q=x<o.x;\n" + <br>
"</script>";
char[] padding = new char[8165];
java.util.Arrays.fill(padding, 'x');
String source = new String(padding)+SOURCE;
org.w3c.tidy.Tidy tidy = new org.w3c.tidy.Tidy();
tidy.setShowWarnings(false);
tidy.parse(new java.io.ByteArrayInputStream(source.getBytes("ISO-8859-1")), System.out);
}
}
Expected result: A tidied HTML is output that is similar to that one produced when replacing the 8165 in the code above by 8160
Actual result:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 8193
at java.lang.String.checkBounds(String.java:402)
at java.lang.String.<init>(String.java:443)
at org.w3c.tidy.TidyUtils.getString(Unknown Source)
at org.w3c.tidy.Lexer.getCDATA(Unknown Source)
at org.w3c.tidy.ParserImpl$ParseScript.parse(Unknown Source)
at org.w3c.tidy.ParserImpl.parseTag(Unknown Source)
at org.w3c.tidy.ParserImpl$ParseBody.parse(Unknown Source)
at org.w3c.tidy.ParserImpl.parseTag(Unknown Source)
at org.w3c.tidy.ParserImpl$ParseHTML.parse(Unknown Source)
at org.w3c.tidy.ParserImpl.parseDocument(Unknown Source)
at org.w3c.tidy.Tidy.parse(Unknown Source)
at org.w3c.tidy.Tidy.parse(Unknown Source)
at JTidyBug.main(JTidyBug.java:13)</init>
Fixed in svn, thanks.
View and moderate all "bugs Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Bugs"
It has not been fixed, at least not in revision 927 :
java.lang.StringIndexOutOfBoundsException caught: java.lang.StringIndexOutOfBoundsException: String index out of range: 16385
at java.lang.String.checkBounds(String.java:401)
at java.lang.String.<init>(String.java:442)
at org.w3c.tidy.TidyUtils.getString(null:-1)
at org.w3c.tidy.Lexer.getCDATA(null:-1)
at org.w3c.tidy.ParserImpl$ParseScript.parse(null:-1)
at org.w3c.tidy.ParserImpl.parseTag(null:-1)
at org.w3c.tidy.ParserImpl$ParseBody.parse(null:-1)
at org.w3c.tidy.ParserImpl.parseTag(null:-1)
at org.w3c.tidy.ParserImpl$ParseHTML.parse(null:-1)
at org.w3c.tidy.ParserImpl.parseDocument(null:-1)
at org.w3c.tidy.Tidy.parse(null:-1)
at org.w3c.tidy.Tidy.parse(null:-1)
at org.w3c.tidy.Tidy.parseDOM(null:-1)</init>
Problem is caused by
matches = container.element.equalsIgnoreCase(TidyUtils.getString(lexbuf, start,
container.element.length()));
in situations when start + container.element.length() > lexlength
To fix that problem above lines should be replaced by
matches = container.element.equalsIgnoreCase(TidyUtils.getString(lexbuf, start,
lexsize - start - 1));
both in block handling state CDATA_STARTTAG and CDATA_ENDTAG
The bug was reported against revision 938 and you're complaining that it's not fixed in revision 927?!? That doesn't make any sense.
Anyway, the bug was fixed in revision 1094.
View and moderate all "bugs Discussion" comments posted by this user
Mark all as spam, and block user from posting to "Bugs"
sorry, but this is source that I am seeing in Lexer.java from jtidy-r938-sources.zip downloaded from sourceforge and also in one that I can see in svn repository (e.g. http://jtidy.svn.sourceforge.net/viewvc/jtidy/trunk/jtidy/src/main/java/org/w3c/tidy/Lexer.java?view=log)
Last edit: Anonymous 2014-08-01
The bug was not fixed by aditsu in r1094, it was just "taped over" (or "silenced") by making sure that getString will not throw SIOOBE any longer:
http://jtidy.svn.sourceforge.net/viewvc/jtidy/trunk/jtidy/src/main/java/org/w3c/tidy/TidyUtils.java?r1=1094&r2=1093&pathrev=1094
When I saw this "fix" I knew that I'll have to migrate to some other HTML sanitizer than JTidy, at least for cases where the sanitizing happens automatically without anyone (except customers) looking at the output afterwards. (yes I know the NO WARRANTY clause of most open source licenses, so I will not complain about it).
cigaly: If you managed to track down the root cause (I tried it but gave up after searching for a few hours), feel free to attach a patch to this bug. I will also test it and if it works fine with the "real" files that show that bug (with r1094 reversed) I will at least apply it to my private version, if aditsu does not want to apply it).
I don't think forking is the right way to handle those issues, but if there is no alternative to it (r1094 is not, for me), I'll do my private fork.
cigaly: you're talking about the revision of a single file, Lexer.java, but the fix is not there.
I agree with schierlm about one thing: the change in r1094 was in TidyUtils, not in Lexer.
However, I fully disagree about the other comments.
First of all, I would like to remind everybody that this is a port of Tidy (aka HTML Tidy), which was written in C. JTidy strives to be completely compatible with it.
cigaly says the problem is in these lines of Lexer.java:
matches = container.element.equalsIgnoreCase(TidyUtils.getString(lexbuf, start,
container.element.length()));
The corresponding lines of lexer code in Tidy are:
matches = TY_(tmbstrncasecmp)(container->element, lexer->lexbuf + start,
TY_(tmbstrlen)(container->element)) == 0;
As you can see, it's a direct translation, except strncasecmp which is replaced by equalsIgnoreCase and TidyUtils.getString.
And the error happened because TidyUtils.getString (in conjunction with equalsIgnoreCase) didn't work exactly like tmbstrncasecmp.
Now, by changing the code to:
matches = container.element.equalsIgnoreCase(TidyUtils.getString(lexbuf, start,
lexsize - start - 1));
the meaning of the code will change and will be different from Tidy. That's already a big problem, however the greatest problem is that TidyUtils.getString remains unfixed, so it can still cause similar problems when it is called from (many) other places.
Therefore, I see "lexsize - start - 1" as the "taping over" (I'm not even sure it works correctly), and changing TidyUtils.getString to stop at the end of the buffer (which is null-terminated in C) as the correct fix, and that is why I chose it. It wasn't a "let's silence this error" decision. There may be better ways to port tmbstrncasecmp, but at least this fix can make the current way work correctly.
Hello Adrian,
Thank you very much for your fast, polite and informative comment. We need more developers like you :-)
I can understand your point (assuming that most calls of getString are in similar situations where some compare function that stops at a null byte was replaced).
You might want to adjust the method comment and maybe rename the length parameter to max length, and maybe even check for embedded nulls and cut the string at the first one, as tmbstrncasecmp("foo\0bar", "foo\0foo", 7) will return 0 as the strings are the same up to the first 0 byte.
I hope you have double-checked the other calls to getString, as this cutting might easily hide an obscure bug even deeper :)
Some minor issue: the length() function and the tmbstrlen (which counts UTF-8 bytes) will return different values if non-ASCII-characters are involved; therefore a comparison that worked in C might fail in Java, as getString will convert too few bytes. But I understand this will really take time to fix properly, so just using the C version for international documents might be the better short-term solution :-)
[I wrote a long comment about that including some real-world examples, but trying to send the comment ran into a server error, maybe because of some weird characters, so I am writing it for the second time now, without any examples. Sorry for that...]