Work at SourceForge, help us to make it a better place! We have an immediate need for a Support Technician in our San Francisco or Denver office.
I've been using Jericho for a while due to its high stability and performance.
However, lately I've been running into unexpected performance issues. I've run profiling to figure out the root of the problem and have found that HTML text is searched over and over using the method ParseText.indexOf.
Reading the code, it seems many places in the code call this function. This came as a surprise to me, as I have seen many instances of caches, holding next tags and so on. All of which are filled when calling fullSequentialParse.
Unfortunately, it seems like common search methods such as Source.getAllStartTags or StartTag.getElement are not internally using these caches.
Disclaimer: The code is very complex and definitely not one I could write myself. However, after reading a lot of code in my life, it seems that on the surface, something can be done better. Due its complexity, I would love if a dev could take a look and correct me if I am wrong on this subject. If it is correct and just an issue of developer time, I would love coding and submitting a patch.
It seems there is code duplication for finding tags, rather than being consolidated in a few methods all of which could use the caches.
The most glaring example I have found is StartTag.getElement, which calls getEndTagInternal which in turn calls source.getNextEndTag, which calls EndTag.getNext, which there performs an uncached lookup of the end tag using indexOf. This comes as a surprise, due to the method Tag.getNextTag() which finds the next tag and uses all the caches.
A check could be then ran if it is an EndTag and use that as the next EndTag.
Thanks in advnace!
I have implemented some improvements in version 3.4.
Until version 3.4 is officially released, the development version is
The Source.getAllStartTags method always used the cache, but you are correct that StartTag.getElement was using indexOf to find the end tag instead of stepping through all the end tags in the cache after a fullSequentialParse. This was actually a deliberate design decision as it was thought that in many circumstances indexOf would be faster than iterating through multiple tags and checking their names individually. Because of your question I did some quick comparisons on a sample page and found that indexOf was ten times slower than iterating through tags after a fullSequentialParse, so I have now optimised name based searches based on that fact.
Note that even when indexOf is used to find possible tags the cache is still always being used to get the tag at a certain position, so it wasn't reparsing the entire tag multiple times.
I also noticed a couple of other minor opportunities for performance improvements.
Unit test coverage of this library is very poor so it is possible that these performance improvements have introduced regression bugs. Please report if you notice any when using the new dev version.
What is not clear to me, is in which case would you not want the search function to search for simply the next tag (being start or end).
As far as parsing goes, please correct me if I'm wrong, the moment you want to find a tag X, you have to parse everything up to where it is (so that you won't, by mistake, find a tag inside a script or style tag, or even a comment).
For that, it seems like only a single "find a tag, start or end or anything" is needed, rather than one for start, one for end, one from a position (which also might be inside a script/style) and so on. While a full parse of a tag is usually not needed (unless it's a script / style / etc tag), it could be done lazily.
I have the (untested) belief that finding just "the next tag" and filtering out the the ones needed will have similar performance to special casing of finding specifically a start/end tag, but greatly streamlining the code, consolidating most of the searches and allowing for a centralized place to add optimizations.
Of course I talk only from experience with my specific case which requires multiple queries from top to bottom of the page, rather than a simpler case that passes once on the page to replace server tags etc.
For great performance,
Check out the documentation of the TagType.isValidPosition method, which explains why the parser does not have to parse every preceding tag to determine if a tag is in a valid position.
I have implemented further performance improvements related to name and attribute base searches after a full sequential parse.