#80 Jericho throws a "position discarded" exception with unlimited buffer

Milestone: General

Status: closed

Owner: nobody

Labels: None

Priority: 5

Updated: 2015-10-24

Created: 2015-02-02

Creator: Sebastiano Vigna

Private: No

The attached page is parsed (in streaming mode) without problems if the character buffer size is fixed at 65536, but parsing throws the following exception if the buffer is unlimited:

Exception in thread "main" java.lang.IllegalStateException: StreamedText position 111 has been discarded
at net.htmlparser.jericho.StreamedText.checkPos(StreamedText.java:205)
at net.htmlparser.jericho.StreamedText.charAt(StreamedText.java:99)
at net.htmlparser.jericho.Source.getNameEnd(Source.java:1441)
at net.htmlparser.jericho.StartTagTypeGenericImplementation.constructTagAt(StartTagTypeGenericImplementation.java:120)
at net.htmlparser.jericho.StartTagTypeMarkupDeclaration.constructTagAt(StartTagTypeMarkupDeclaration.java:38)
at net.htmlparser.jericho.TagType.getTagAt(TagType.java:681)
at net.htmlparser.jericho.StreamedSource$StreamedSourceIterator.findNextParsedSegment(StreamedSource.java:644)
at net.htmlparser.jericho.StreamedSource$StreamedSourceIterator.loadNextParsedSegment(StreamedSource.java:623)
at net.htmlparser.jericho.StreamedSource$StreamedSourceIterator.hasNext(StreamedSource.java:605)
at it.unimi.di.law.bubing.parser.HTMLParser.parse(HTMLParser.java:489)
at it.unimi.di.law.bubing.parser.HTMLParser.main(HTMLParser.java:731)

Note that it is quite unexpected that providing more memory can cause an exception.

1 Attachment

bug.html

Discussion

Martin Jericho - 2015-02-02

status: unread --> pending

Martin Jericho - 2015-02-02

Definitely a bug. I'll try to have a look tomorrow but I might only get to it later in the week.

Martin Jericho - 2015-02-05

Fixed in version 3.4.

Until version 3.4 is officially released, the development version is available here:
http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip

The fix requires the buffer to be big enough to contain any two consecutive tags and the text between them, so files that used to parse OK with a fixed buffer size may now result in a buffer overflow.

Please let me know if the fix causes any problems.

Sebastiano Vigna - 2015-02-05

I don't understand. The bug is about unlimited buffers. Are you really saying that performance with fixed buffers will be worse because of the fix?

Martin Jericho - 2015-02-05

The bug wasn't specifically related to having an unlimited (automatically growing) buffer. The parser was prematurely marking a position as discardable. The only reason it didn't happen with your fixed buffer is that the buffer was big enough to hold the whole source document. The dodgy DOCTYPE tag at the start of the document makes the parser search the entire document for its end delimiter.
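
To illustrate the failure mode described above, here is a minimal toy sketch of a streamed text window. It is not Jericho's code: the class SlidingText, its discardBefore method, and the readTagName helper are hypothetical names invented for this example. It only shows how discarding a position too early during a look-ahead makes a later read of that position fail, which is roughly the shape of the exception in the report.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class SlidingTextDemo {
    /** Toy streamed text (hypothetical, not Jericho's StreamedText). */
    static final class SlidingText {
        private final Reader reader;
        private final StringBuilder window = new StringBuilder();
        private int windowStart = 0; // absolute position of window.charAt(0)

        SlidingText(Reader reader) {
            this.reader = reader;
        }

        /** Returns the character at an absolute position, reading ahead as needed. */
        char charAt(int pos) throws IOException {
            if (pos < windowStart) {
                throw new IllegalStateException("position " + pos + " has been discarded");
            }
            while (pos >= windowStart + window.length()) {
                int c = reader.read();
                if (c < 0) {
                    throw new IndexOutOfBoundsException("position " + pos + " is past the end of input");
                }
                window.append((char) c);
            }
            return window.charAt(pos - windowStart);
        }

        /** Declares that no position before pos will be requested again. */
        void discardBefore(int pos) {
            int drop = Math.min(pos - windowStart, window.length());
            if (drop > 0) {
                window.delete(0, drop);
                windowStart += drop;
            }
        }
    }

    /** Looks ahead for '>' and then re-reads the character after the tag start. */
    public static String readTagName(boolean discardTooEarly) throws IOException {
        // Malformed DOCTYPE: its own '>' is missing, so the scan runs into the next tag.
        SlidingText text = new SlidingText(new StringReader("<!DOCTYPE html <p>hello</p>"));
        int tagStart = 0;
        int end = tagStart;
        while (text.charAt(end) != '>') {
            end++;
        }
        if (discardTooEarly) {
            // The simulated bug: the tag at tagStart is still being examined.
            text.discardBefore(end);
        }
        try {
            return "ok: " + text.charAt(tagStart + 1);
        } catch (IllegalStateException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readTagName(false)); // ok: !
        System.out.println(readTagName(true));  // error: position 1 has been discarded
    }
}
```

In this toy model the error depends only on when positions are released, not on how large the window is allowed to grow, which matches the point that the bug was not specific to unlimited buffers.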

Martin Jericho - 2015-10-24

status: pending --> closed

Martin Jericho - 2015-10-24

This bug has now been closed - Version 3.4 released.
