The parser uses exceptions for normal operations
We are using Jericho in a highly multithreaded environment with a heavy CPU workload. Inspecting thread dumps, we found that most of our threads are actually busy in fillInStackTrace(), for an exception that is caught here by findNextParsedSegment():
} catch (IndexOutOfBoundsException ex) {
// normal way to catch end of stream.
}
The problem here is that throwing an exception is expensive, very expensive. Exceptions should be used for exceptional cases only (see Item 39 of "Effective Java").
We are probably the only ones to notice this problem, because we parse >10000 web pages/s. But it is a bit disappointing to see our parallel threads spending most of their time throwing exceptions instead of parsing :(.
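To illustrate the point, here is a minimal sketch contrasting the two ways of detecting the end of the input. It is not Jericho code: the class EndOfInputDemo and both methods are hypothetical names made up for this example, and the "parsing" is reduced to counting '<' characters.

```java
public class EndOfInputDemo {

    // Hypothetical scanner that relies on the exception to detect the end of the input.
    // Every call ends by throwing and catching an IndexOutOfBoundsException,
    // which typically involves filling in a stack trace.
    static int countTagStartsWithException(CharSequence text) {
        int count = 0;
        int pos = 0;
        try {
            while (true) {
                if (text.charAt(pos) == '<') {
                    count++;
                }
                pos++;
            }
        } catch (IndexOutOfBoundsException ex) {
            // normal way to catch end of input in this variant
        }
        return count;
    }

    // Same scan with an explicit bounds check: no exception on the normal path.
    static int countTagStartsWithBoundsCheck(CharSequence text) {
        int count = 0;
        for (int pos = 0, end = text.length(); pos < end; pos++) {
            if (text.charAt(pos) == '<') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        System.out.println(countTagStartsWithException(html));   // prints 4
        System.out.println(countTagStartsWithBoundsCheck(html)); // prints 4
    }
}
```

Both methods return the same result; the second one simply never throws on the normal path. This sketch assumes the length of the input is known up front, which may not hold for a streamed source.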
Hi Sebastiano,
You're absolutely right. I assume there was a good reason I implemented it that way, but having had a quick look at the code just now, I can't see what it might have been. I'll see if I can fix it in the next couple of days.
Cheers
Martin
Fixed in version 3.4.
Until version 3.4 is officially released, the development version is available here:
http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip
The fix does require breaking the CharSequence interface rules in the streamed version, which may cause problems with existing or future code that relies on the normal interface behaviour, but I think the performance improvement is likely worth the risk.
Thank you! With the development version, we are observing an overall 15% increase in the number of parsed pages per second, which is actually quite impressive.
This bug has now been closed - Version 3.4 released.