
#80 Jericho throws a "position discarded" exception with unlimited buffer

Milestone: General
Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-10-24
Created: 2015-02-02
Private: No

The attached page parses without problems in streaming mode when the character buffer size is fixed at 65536, but parsing throws the following exception when the buffer is unlimited:

Exception in thread "main" java.lang.IllegalStateException: StreamedText position 111 has been discarded
at net.htmlparser.jericho.StreamedText.checkPos(StreamedText.java:205)
at net.htmlparser.jericho.StreamedText.charAt(StreamedText.java:99)
at net.htmlparser.jericho.Source.getNameEnd(Source.java:1441)
at net.htmlparser.jericho.StartTagTypeGenericImplementation.constructTagAt(StartTagTypeGenericImplementation.java:120)
at net.htmlparser.jericho.StartTagTypeMarkupDeclaration.constructTagAt(StartTagTypeMarkupDeclaration.java:38)
at net.htmlparser.jericho.TagType.getTagAt(TagType.java:681)
at net.htmlparser.jericho.StreamedSource$StreamedSourceIterator.findNextParsedSegment(StreamedSource.java:644)
at net.htmlparser.jericho.StreamedSource$StreamedSourceIterator.loadNextParsedSegment(StreamedSource.java:623)
at net.htmlparser.jericho.StreamedSource$StreamedSourceIterator.hasNext(StreamedSource.java:605)
at it.unimi.di.law.bubing.parser.HTMLParser.parse(HTMLParser.java:489)
at it.unimi.di.law.bubing.parser.HTMLParser.main(HTMLParser.java:731)

Note that it is quite unexpected that providing more memory can cause an exception.

1 Attachments

Discussion

  • Martin Jericho - 2015-02-02
    • status: unread --> pending
     
  • Martin Jericho - 2015-02-02

    Definitely a bug. I'll try to have a look tomorrow but I might only get to it later in the week.

     
  • Martin Jericho - 2015-02-05

    Fixed in version 3.4.

    Until version 3.4 is officially released, the development version is available here:
    http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip

    The fix requires that the buffer is big enough to contain any two consecutive tags and the text between them, so files that used to parse ok with a fixed buffer size may now result in a buffer overflow.
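
To get a rough feel for that requirement, the sketch below estimates the largest span from the start of one tag to the end of the next. It is only an illustration: the class `BufferSizeEstimate`, the method `minBufferSize`, and the crude `<[^>]*>` tag pattern are assumptions made for this example, and Jericho's own tag recognition is much more thorough, so the number it returns is an estimate rather than a guarantee.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Jericho. */
final class BufferSizeEstimate {
    // Crude tag matcher, assumed for illustration only.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Largest span, in chars, from the start of one tag to the end of the next. */
    static int minBufferSize(CharSequence html) {
        Matcher m = TAG.matcher(html);
        int max = 0;
        int prevBegin = -1; // start of the previous tag, or -1 before the first tag
        while (m.find()) {
            // For the first tag the span is just the tag itself.
            int begin = (prevBegin >= 0) ? prevBegin : m.start();
            max = Math.max(max, m.end() - begin);
            prevBegin = m.start();
        }
        return max;
    }
}
```

For `<a>hi</a>` the two tags and the text between them cover the whole string, so the estimate is 9. A tag whose end delimiter is a long way off inflates the estimate accordingly.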

    Please let me know if the fix causes any problems.

     
  • Sebastiano Vigna

    I don't understand. The bug is about unlimited buffers. Are you really saying that performance with fixed buffers will be worse because of the fix?

     
  • Martin Jericho - 2015-02-05

    The bug wasn't specifically related to having an unlimited (automatically growing) buffer. It was prematurely marking a position as discardable. The only reason it didn't happen with your fixed buffer was that the buffer was big enough to hold the whole source document. The dodgy DOCTYPE tag at the start of the document makes the parser search the entire document for its end delimiter.
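
The effect can be reproduced with a toy sliding window. This is a minimal sketch and not Jericho's actual `StreamedText`: the classes `SlidingText` and `DiscardDemo`, the `setMinRequiredBegin` method, and the rule of dropping discardable characters only when the buffer fills up are all assumptions made for illustration.

```java
import java.nio.BufferOverflowException;

/** Toy sliding character window. A hypothetical model, not Jericho's StreamedText. */
final class SlidingText {
    private final String source;       // stands in for the underlying Reader
    private final boolean expandable;  // true models the "unlimited" buffer
    private char[] buffer;
    private int bufferBegin = 0;       // source position held in buffer[0]
    private int end = 0;               // source position one past the last buffered char
    private int minRequiredBegin = 0;  // positions below this may be thrown away

    SlidingText(String source, int initialSize, boolean expandable) {
        this.source = source;
        this.buffer = new char[initialSize];
        this.expandable = expandable;
    }

    /** The parser's promise that it will not read below pos again. */
    void setMinRequiredBegin(int pos) { minRequiredBegin = pos; }

    char charAt(int pos) {
        if (pos < bufferBegin)
            throw new IllegalStateException("position " + pos + " has been discarded");
        if (pos >= source.length()) throw new IndexOutOfBoundsException();
        while (pos >= end) readNextChar();
        return buffer[pos - bufferBegin];
    }

    private void readNextChar() {
        if (end - bufferBegin == buffer.length) makeRoom();
        buffer[end - bufferBegin] = source.charAt(end);
        end++;
    }

    /** Called only when the buffer is full; drops everything below minRequiredBegin. */
    private void makeRoom() {
        int keepFrom = Math.max(bufferBegin, Math.min(minRequiredBegin, end));
        // A fixed buffer with nothing discardable cannot make room.
        if (!expandable && keepFrom == bufferBegin) throw new BufferOverflowException();
        char[] target = expandable ? new char[buffer.length * 2] : buffer;
        System.arraycopy(buffer, keepFrom - bufferBegin, target, 0, end - keepFrom);
        buffer = target;
        bufferBegin = keepFrom;
    }
}

final class DiscardDemo {
    /** Mimics a tag scan: search to the end for a delimiter, then go back for the tag name. */
    static char scan(SlidingText text, int length, int mark) {
        text.setMinRequiredBegin(mark);                   // a mark above 0 is premature here
        for (int i = 0; i < length; i++) text.charAt(i);  // hunt for the end delimiter
        return text.charAt(2);                            // back to the tag name
    }

    public static void main(String[] args) {
        String doc = "<!DOCTYPE html PUBLIC and the end delimiter never comes";
        // Fixed buffer that holds the whole document: nothing is ever discarded.
        System.out.println(scan(new SlidingText(doc, doc.length(), false), doc.length(), 10));
        // Growing buffer: the first growth drops positions 0..9, so going back throws.
        try {
            scan(new SlidingText(doc, 16, true), doc.length(), 10);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model, a mark of 0 (nothing discardable until the tag is finished) lets the growing buffer succeed, while a fixed buffer of 16 chars then throws `BufferOverflowException`, because the whole unfinished tag has to stay in the buffer.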

     
  • Martin Jericho - 2015-10-24
    • status: pending --> closed
     
  • Martin Jericho - 2015-10-24

    This bug has now been closed - Version 3.4 released.

     
