#63 The parser uses exceptions for normal operations

Milestone: None
Status: open-fixed
Owner: nobody
Labels: None
Priority: 5
Updated: 2013-01-10
Created: 2013-01-07
Creator: Sebastiano Vigna
Private: No

We are using Jericho in a highly multithreaded environment with a heavy CPU workload. Inspecting thread dumps, we found that most of our threads are actually busy in fillInStackTrace(), for an exception that is caught here by findNextParsedSegment():

        } catch (IndexOutOfBoundsException ex) {
            // normal way to catch end of stream.
        }

The problem here is that throwing an exception is expensive, very expensive. Exceptions should be used for exceptional cases only (see Item 39 of "Effective Java").
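
To illustrate the point, here is a minimal sketch (not Jericho's actual code) contrasting the two ways of detecting the end of input. The class and method names are hypothetical. Both methods return the same count, but the first one constructs an exception, and fills in its stack trace, every time a scan reaches the end.

```java
final class EndOfInputScan {

    // Anti-pattern: relies on IndexOutOfBoundsException to detect the end of input.
    static int countTagStartsWithException(CharSequence text) {
        int count = 0;
        int i = 0;
        try {
            while (true) {
                // charAt throws once i == text.length()
                if (text.charAt(i) == '<') count++;
                i++;
            }
        } catch (IndexOutOfBoundsException ex) {
            // End of input reached; an exception object with a full
            // stack trace was created just to leave the loop.
        }
        return count;
    }

    // Preferred: an explicit bounds check, so no exception on the normal path.
    static int countTagStartsWithBoundsCheck(CharSequence text) {
        int count = 0;
        for (int i = 0, n = text.length(); i < n; i++) {
            if (text.charAt(i) == '<') count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        System.out.println(countTagStartsWithException(html));
        System.out.println(countTagStartsWithBoundsCheck(html));
    }
}
```

With many threads each parsing thousands of small documents per second, the first variant pays the stack-trace cost once per document, which is consistent with what the thread dumps show.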

It is likely we are the only ones detecting this problem because we parse >10000 web pages/s. But it is a bit disappointing to see our parallel threads stuck in throwing exceptions, mostly, instead of parsing :(.

Discussion

  • Martin Jericho
    2013-01-08

    • status: unread --> pending
    • milestone: -->
     
  • Martin Jericho
    2013-01-08

    Hi Sebastiano,

    You're absolutely right and I assume there was a good reason I implemented it that way, but having a quick look at the code now I can't see what it might have been. I'll see if I can fix it in the next couple of days.

    Cheers
    Martin

     
  • Martin Jericho
    2013-01-10

    Fixed in version 3.4.

    Until version 3.4 is officially released, the development version is available here:
    http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip

    The fix does require breaking the CharSequence interface rules in the streamed version, which may cause problems with either existing or future code that relies on the normal interface behaviour, but I think the performance improvement is likely worth the risk.
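
The thread does not say which CharSequence rule the fix breaks. As a purely hypothetical illustration of the trade-off, the sketch below shows one way a streamed text could avoid exceptions: charAt returns a sentinel character past the end of the stream, where the CharSequence contract requires an IndexOutOfBoundsException. The class name, the sentinel value, and the buffering strategy are all assumptions, not Jericho's implementation.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch only; not the actual Jericho StreamedText.
final class SentinelStreamedText implements CharSequence {
    // Assumed sentinel; a real U+FFFF in the input would be ambiguous.
    static final char END = '\uFFFF';

    private final Reader reader;
    private final StringBuilder buffer = new StringBuilder();
    private boolean endReached = false;

    SentinelStreamedText(Reader reader) {
        this.reader = reader;
    }

    // Reads from the stream until the given index is buffered or the stream ends.
    private void fillTo(int index) {
        try {
            while (!endReached && buffer.length() <= index) {
                int c = reader.read();
                if (c == -1) endReached = true;
                else buffer.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Contract violation: returns END past the end of the stream instead of throwing.
    @Override
    public char charAt(int index) {
        fillTo(index);
        return index < buffer.length() ? buffer.charAt(index) : END;
    }

    // Drains the whole stream, since the length is unknown until the end is reached.
    @Override
    public int length() {
        fillTo(Integer.MAX_VALUE);
        return buffer.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        fillTo(end - 1);
        return buffer.subSequence(start, end);
    }

    public static void main(String[] args) {
        CharSequence text = new SentinelStreamedText(new StringReader("<p>hi</p>"));
        int tags = 0;
        // The scan loop ends on the sentinel, without any exception being thrown.
        for (int i = 0; text.charAt(i) != END; i++) {
            if (text.charAt(i) == '<') tags++;
        }
        System.out.println(tags);
    }
}
```

Code that passes such an object to a general-purpose API expecting the standard charAt behaviour could misread the sentinel as data, which is the kind of risk the comment above describes.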

     
  • Martin Jericho
    2013-01-10

    • status: pending --> open-fixed
     
  • Thank you! With the development version, we are observing an overall 15% increase in the number of parsed pages per second, which is actually quite impressive.