From: SourceForge.net <no...@so...> - 2008-07-22 07:54:13
Bugs item #1968435, was opened at 2008-05-21 06:31
Message generated for change (Comment added) made by mguillem
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=952178&aid=1968435&group_id=195122

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: scanner
Group: 1.9.7
>Status: Closed
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: David Kellum (dekellum)
>Assigned to: Marc Guillemot (mguillem)
Summary: ArrayIndexOutOfBoundsException on attached

Initial Comment:
Another badly malformed HTML doc (attached) produces the following (1.9.7):

java.lang.ArrayIndexOutOfBoundsException: -1
    at org.cyberneko.html.HTMLScanner.read(HTMLScanner.java:1118)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2626)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2463)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2353)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1955)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:877)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:495)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:448)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

And on SVN trunk:

java.lang.ArrayIndexOutOfBoundsException: -1
    at org.cyberneko.html.HTMLScanner.read(HTMLScanner.java:1119)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2627)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2464)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2354)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1956)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:878)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:495)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:448)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

----------------------------------------------------------------------

>Comment By: Marc Guillemot (mguillem)
Date: 2008-07-22 09:54

Message:
Logged In: YES
user_id=402164
Originator: NO

Seems that recent improvements have solved this problem as well: I can't
reproduce it any more with latest sources from SVN.

----------------------------------------------------------------------

Comment By: Jacob Kjome (jacobk)
Date: 2008-05-21 17:30

Message:
Logged In: YES
user_id=517746
Originator: NO

Given your last comment that this is resolved by replacing the trailing CR
with CRLF, I think this might be related to bug #1939338 (NekoHTML line
ending reading bug)[1]. NekoHTML seems to re-read the last part of the
file after it has already read in the last character in the file when
using \n or \r, but not when using \r\n. This is a total blocker for me,
as I use special input streams and readers that automatically close
themselves when the last character has been read. This causes an exception
to be thrown saying that the stream/reader has already been closed. I
don't get this problem when using Xerces as the parser, only when using
NekoHTML. I hope this new bug report will provide the impetus to finally
address this issue.

[1]
https://sourceforge.net/tracker/index.php?func=detail&aid=1939338&group_id=195122&atid=952178

----------------------------------------------------------------------

Comment By: David Kellum (dekellum)
Date: 2008-05-21 14:55

Message:
Logged In: YES
user_id=1672413
Originator: YES

ArrayIndexOutOfBoundsException is also avoided by replacing the trailing
CR with a CRLF.
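
For anyone stuck on an affected version, that workaround might be sketched roughly as below. This is only an illustration under stated assumptions: the class and method names (TrailingCrWorkaround, endsWithLoneCr, normalizeTrailingCr) are hypothetical and not part of NekoHTML, and the document is assumed to be already held in a String before it is handed to the parser.

```java
// Hypothetical helper, not part of NekoHTML: pre-processes a document so
// that a trailing lone CR becomes CRLF before the text reaches the parser.
final class TrailingCrWorkaround {
    private TrailingCrWorkaround() {}

    /** True if the text ends with CR. A CRLF ending finishes with LF, so a final CR is always a lone CR. */
    static boolean endsWithLoneCr(String html) {
        return !html.isEmpty() && html.charAt(html.length() - 1) == '\r';
    }

    /** Appends LF when the text ends with a lone CR; otherwise returns the text unchanged. */
    static String normalizeTrailingCr(String html) {
        return endsWithLoneCr(html) ? html + "\n" : html;
    }

    public static void main(String[] args) {
        // Same shape as the attached file: internal CRLF breaks, final trailing CR.
        String doc = "<html>\r\n<body>\r\n</body>\r";
        String fixed = normalizeTrailingCr(doc);
        System.out.println(endsWithLoneCr(doc));   // prints "true"
        System.out.println(endsWithLoneCr(fixed)); // prints "false"
    }
}
```

This only sidesteps the symptom on the input side; it does not change how the scanner handles the end of its buffer.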

----------------------------------------------------------------------

Comment By: David Kellum (dekellum)
Date: 2008-05-21 14:47

Message:
Logged In: YES
user_id=1672413
Originator: YES

Yes, the original has internal CRLF line breaks and a final trailing CR. I
too find that I can avoid the ArrayIndexOutOfBoundsException by removing
the trailing CR.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-05-21 12:05

Message:
Logged In: NO

Interesting, the file seems to contain illegal characters at the end, which
are causing the problem. If I save it as is, I can reproduce the problem
locally, but if I open it with an editor and save it from there without
modifying anything, the editor fixes the incorrect bytes and the problem
disappears.

----------------------------------------------------------------------