The attached page parses without problems in streaming mode if the character buffer size is fixed at 65536, but parsing throws the following exception if the buffer is unlimited:
Exception in thread "main" java.lang.IllegalStateException: StreamedText position 111 has been discarded
at net.htmlparser.jericho.StreamedText.checkPos(StreamedText.java:205)
at net.htmlparser.jericho.StreamedText.charAt(StreamedText.java:99)
at net.htmlparser.jericho.Source.getNameEnd(Source.java:1441)
at net.htmlparser.jericho.StartTagTypeGenericImplementation.constructTagAt(StartTagTypeGenericImplementation.java:120)
at net.htmlparser.jericho.StartTagTypeMarkupDeclaration.constructTagAt(StartTagTypeMarkupDeclaration.java:38)
at net.htmlparser.jericho.TagType.getTagAt(TagType.java:681)
at net.htmlparser.jericho.StreamedSource$StreamedSourceIterator.findNextParsedSegment(StreamedSource.java:644)
at net.htmlparser.jericho.StreamedSource$StreamedSourceIterator.loadNextParsedSegment(StreamedSource.java:623)
at net.htmlparser.jericho.StreamedSource$StreamedSourceIterator.hasNext(StreamedSource.java:605)
at it.unimi.di.law.bubing.parser.HTMLParser.parse(HTMLParser.java:489)
at it.unimi.di.law.bubing.parser.HTMLParser.main(HTMLParser.java:731)
Note that it is quite unexpected that providing more memory can cause an exception.
Definitely a bug. I'll try to have a look tomorrow but I might only get to it later in the week.
Fixed in version 3.4.
Until version 3.4 is officially released, the development version is available here:
http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip
The fix requires that the buffer is big enough to contain any two consecutive tags and the text between them, so files that used to parse ok with a fixed buffer size may now result in a buffer overflow.
Please let me know if the fix causes any problems.
I don't understand. The bug is about unlimited buffers. Are you really saying that performance with fixed buffers will be worse because of the fix?
The bug wasn't specifically related to having an unlimited (automatically growing) buffer. The parser was prematurely marking a position as discardable. The only reason it didn't happen with your fixed buffer is that the buffer was big enough to hold the whole source document. The dodgy DOCTYPE tag at the start of the document makes the parser search the entire document for its end delimiter.
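For anyone trying to picture the failure mode, here is a rough toy model of it. This is not Jericho's code: ToyStreamedBuffer, discardBefore and simulate are hypothetical names made up for this sketch, and the small fixed capacity simply stands in for any buffer that reclaims space instead of keeping the whole document. It scans forward from an unterminated tag at position 0 and then reads back position 1, with and without the premature discard mark.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.BufferOverflowException;

/** Toy model of a streaming character buffer. NOT Jericho's StreamedText. */
final class ToyStreamedBuffer {
    private final Reader reader;
    private final char[] buffer;
    private int bufferBegin = 0;  // absolute position stored in buffer[0]
    private int end = 0;          // absolute position one past the last char read
    private int minRequired = 0;  // positions below this may be thrown away

    ToyStreamedBuffer(Reader reader, int capacity) {
        this.reader = reader;
        this.buffer = new char[capacity];
    }

    /** Marks every position before pos as no longer needed. */
    void discardBefore(int pos) {
        minRequired = Math.max(minRequired, pos);
    }

    char charAt(int pos) throws IOException {
        if (pos < bufferBegin) {
            throw new IllegalStateException("position " + pos + " has been discarded");
        }
        while (pos >= end) readMore();
        return buffer[pos - bufferBegin];
    }

    private void readMore() throws IOException {
        int used = end - bufferBegin;
        if (used == buffer.length) {
            // Buffer is full: reclaim space by dropping discardable characters.
            int shift = Math.min(minRequired - bufferBegin, used);
            if (shift <= 0) throw new BufferOverflowException(); // nothing can be dropped
            System.arraycopy(buffer, shift, buffer, 0, used - shift);
            bufferBegin += shift;
            used -= shift;
        }
        int n = reader.read(buffer, used, buffer.length - used);
        if (n < 0) throw new IndexOutOfBoundsException("position is past end of input");
        end += n;
    }

    /**
     * Scans a 112-character document that starts with an unterminated tag,
     * then reads back position 1 (as a name lookup in the tag would).
     * prematureDiscard = true imitates the bug: positions are marked
     * discardable while the tag at position 0 is still being examined.
     */
    static String simulate(boolean prematureDiscard, int capacity) {
        StringBuilder doc = new StringBuilder("<!DOCTYPE x ");
        for (int i = 0; i < 100; i++) doc.append('a');
        ToyStreamedBuffer text =
            new ToyStreamedBuffer(new StringReader(doc.toString()), capacity);
        try {
            for (int i = 0; i < doc.length(); i++) {
                if (prematureDiscard) text.discardBefore(i);
                text.charAt(i); // forward search for the missing end delimiter
            }
            text.charAt(1);     // go back into the tag
            return "ok";
        } catch (IllegalStateException | BufferOverflowException e) {
            return e.getClass().getSimpleName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("premature discard, capacity 1024: " + simulate(true, 1024));
        System.out.println("premature discard, capacity 16:   " + simulate(true, 16));
        System.out.println("no premature discard, capacity 16:   " + simulate(false, 16));
        System.out.println("no premature discard, capacity 1024: " + simulate(false, 1024));
    }
}
```

In this model the premature mark is harmless while the buffer holds the whole document, and turns into an IllegalStateException as soon as the buffer has to reclaim space. Without the premature mark, the small buffer reports an overflow instead, which is the trade-off described above for fixed buffers that are too small.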
This bug has now been closed - Version 3.4 released.