The parser uses exceptions for normal operations

Brought to you by: mjericho

#63 The parser uses exceptions for normal operations

Milestone: General

Status: closed

Owner: nobody

Labels: None

Priority: 5

Updated: 2015-10-24

Created: 2013-01-07

Creator: Sebastiano Vigna

Private: No

We are using Jericho in a highly multithreaded environment with a heavy CPU workload. Inspecting thread dumps, it turns out that most of our threads are actually executing fillInStackTrace() for an exception that is caught here by findNextParsedSegment():

        } catch (IndexOutOfBoundsException ex) {
            // normal way to catch end of stream.
        }

The problem here is that throwing an exception is expensive, very expensive. Exceptions should be used for exceptional cases only (see Item 39 of "Effective Java").
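To make the point concrete, here is a minimal sketch contrasting an exception-terminated scan with a bounds-checked one. It is not Jericho's code: the class EndOfInputScan and both method names are hypothetical, and counting '<' characters merely stands in for real parsing work.

```java
// Hypothetical illustration only; this is not Jericho's parser code.
final class EndOfInputScan {

    // Exception-driven loop: the end of input is detected by letting charAt()
    // throw, so every scan pays for creating an exception and filling in its
    // stack trace.
    static int countTagOpenersWithException(CharSequence text) {
        int count = 0;
        int pos = 0;
        try {
            while (true) {
                if (text.charAt(pos) == '<') count++;
                pos++;
            }
        } catch (IndexOutOfBoundsException ex) {
            // end of input reached
        }
        return count;
    }

    // Bounds-checked loop: same result, but no exception on the normal path.
    static int countTagOpenersWithBoundsCheck(CharSequence text) {
        int count = 0;
        for (int pos = 0, end = text.length(); pos < end; pos++) {
            if (text.charAt(pos) == '<') count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        System.out.println(countTagOpenersWithException(html));
        System.out.println(countTagOpenersWithBoundsCheck(html));
    }
}
```

Both methods return the same count; only the second avoids constructing an exception each time the end of the input is reached.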

It is likely we are the only ones detecting this problem because we parse >10000 web pages/s. But it is a bit disappointing to see our parallel threads stuck in throwing exceptions, mostly, instead of parsing :(.

Discussion

Martin Jericho - 2013-01-08

status: unread --> pending

milestone: -->

Martin Jericho - 2013-01-08

Hi Sebastiano,

You're absolutely right and I assume there was a good reason I implemented it that way, but having a quick look at the code now I can't see what it might have been. I'll see if I can fix it in the next couple of days.

Cheers
Martin


Martin Jericho - 2013-01-10

Fixed in version 3.4.

Until version 3.4 is officially released, the development version is available here:
http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip

The fix does require breaking the CharSequence interface rules in the streamed version, which may cause problems with either existing or future code that relies on the normal interface behaviour, but I think the performance improvement is likely worth the risk.
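The comment above does not say which CharSequence rule is broken. Purely as an illustration of the kind of trade-off involved, the sketch below shows one way a streamed CharSequence could signal end of input without throwing: charAt() returns a sentinel character past the end instead of raising IndexOutOfBoundsException, which deviates from the CharSequence contract. The class SentinelStreamedText and the constant END_OF_STREAM are hypothetical and are not taken from Jericho.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical sketch; this is not Jericho's streamed text implementation.
final class SentinelStreamedText implements CharSequence {
    // Assumed sentinel value; U+FFFF is a noncharacter in Unicode.
    static final char END_OF_STREAM = '\uFFFF';

    private final Reader reader;
    private final StringBuilder buffer = new StringBuilder();
    private boolean exhausted = false;

    SentinelStreamedText(Reader reader) {
        this.reader = reader;
    }

    // Deviates from the CharSequence contract: an index at or past the end of
    // the stream yields END_OF_STREAM instead of IndexOutOfBoundsException.
    @Override
    public char charAt(int index) {
        if (index < 0) throw new IndexOutOfBoundsException("index: " + index);
        while (index >= buffer.length() && !exhausted) fill();
        return index < buffer.length() ? buffer.charAt(index) : END_OF_STREAM;
    }

    // Reads one more character into the buffer, or marks the stream exhausted.
    private void fill() {
        try {
            int c = reader.read();
            if (c == -1) exhausted = true;
            else buffer.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Drains the rest of the stream so that the total length is known.
    @Override
    public int length() {
        while (!exhausted) fill();
        return buffer.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        while (end > buffer.length() && !exhausted) fill();
        return buffer.substring(start, end);
    }

    @Override
    public String toString() {
        length();
        return buffer.toString();
    }
}
```

A caller can then loop until charAt() returns END_OF_STREAM rather than catching an exception. The risk is the one noted above: code that expects the standard CharSequence behaviour would silently receive the sentinel.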


Martin Jericho - 2013-01-10

status: pending --> open-fixed

Sebastiano Vigna - 2013-01-10

Thank you! With the development version, we are observing an overall 15% increase in the number of parsed pages per second, which is actually quite impressive.


Martin Jericho - 2015-10-24

status: open-fixed --> closed

Group: --> General

Martin Jericho - 2015-10-24

This bug has now been closed - Version 3.4 released.
