#63 The parser uses exceptions for normal operations


We are using Jericho in a highly multithreaded environment with a heavy CPU workload. Inspecting thread dumps, it turns out that most of our threads are actually executing fillInStackTrace() for an exception that is caught here, in findNextParsedSegment():

        } catch (IndexOutOfBoundsException ex) {
            // normal way to catch end of stream.

The problem here is that throwing an exception is expensive, very expensive: each throw has to capture the full stack trace. Exceptions should be used for exceptional cases only (see Item 39 of "Effective Java").
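
To illustrate the difference, here is a minimal sketch that contrasts the two ways of detecting the end of the input. It is not Jericho's actual code: the class EndOfInputSketch and both method names are hypothetical, and it assumes an in-memory CharSequence whose length() is known up front.

```java
public final class EndOfInputSketch {

    // Exception-driven variant: keeps reading until charAt() throws.
    // Every call pays for constructing an exception and filling in its stack trace.
    public static int countTagStartsWithException(CharSequence text) {
        int count = 0;
        try {
            for (int i = 0; ; i++) {
                if (text.charAt(i) == '<') {
                    count++;
                }
            }
        } catch (IndexOutOfBoundsException ex) {
            // End of input reached; used here as ordinary control flow.
        }
        return count;
    }

    // Bounds-checked variant: the end of input is an ordinary loop condition,
    // so no exception is created on the normal path.
    public static int countTagStartsWithBoundsCheck(CharSequence text) {
        int count = 0;
        for (int i = 0, n = text.length(); i < n; i++) {
            if (text.charAt(i) == '<') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String page = "<html><body><p>hi</p></body></html>";
        System.out.println(countTagStartsWithException(page));
        System.out.println(countTagStartsWithBoundsCheck(page));
    }
}
```

Both methods return the same count; only the second avoids the throw. With a streamed source the total length may not be known in advance, so the explicit check would have to rely on whatever end-of-input signal the source can provide.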

It is likely we are the only ones detecting this problem because we parse >10000 web pages/s. But it is a bit disappointing to see our parallel threads mostly stuck throwing exceptions instead of parsing :(.


  • Martin Jericho

    Martin Jericho - 2013-01-08
    • status: unread --> pending
    • milestone: -->
  • Martin Jericho

    Martin Jericho - 2013-01-08

    Hi Sebastiano,

    You're absolutely right and I assume there was a good reason I implemented it that way, but having a quick look at the code now I can't see what it might have been. I'll see if I can fix it in the next couple of days.


  • Martin Jericho

    Martin Jericho - 2013-01-10

    Fixed in version 3.4.

    Until version 3.4 is officially released, the development version is available here:

    The fix does require breaking the CharSequence interface rules in the streamed version, which may cause problems with existing or future code that relies on the normal interface behaviour, but I think the performance improvement is likely worth the risk.

  • Martin Jericho

    Martin Jericho - 2013-01-10
    • status: pending --> open-fixed
  • Sebastiano Vigna

    Thank you! With the development version, we are observing an overall 15% increase in the number of parsed pages per second, which is actually quite impressive.

  • Martin Jericho

    Martin Jericho - 2015-10-24
    • status: open-fixed --> closed
    • Group: --> General
  • Martin Jericho

    Martin Jericho - 2015-10-24

    This bug has now been closed - Version 3.4 released.

