Jericho HTML Parser is a simple but powerful Java library for analysing and modifying HTML. It ignores any server-side code, unrecognised markup or invalid HTML in a document, so it can still analyse and modify selected parts and reproduce the rest verbatim.
The library distinguishes itself from other HTML parsers by its three major features:
1. No parse tree of the entire document is ever generated. In this sense the toolkit is strictly speaking not a true parser. The document source text is searched only for the markup relevant to the current operation. This allows the toolkit to analyse and modify documents containing JSP, ASP, PHP, incorrect or badly formatted HTML, or any other server or client side code, script, macro or markup. Most other parsers can't handle content that they are not explicitly programmed to accept.
2. The beginning and end positions in the source text of all parsed segments are accessible, allowing modification of only selected segments of the document without having to reconstruct the entire document from a parse tree. This feature, in combination with the one above, makes the toolkit extremely powerful in its simplicity.
3. An entire set of FormField objects can automatically be generated from the source document. These provide a very useful means for determining how to store and present data that is submitted from an arbitrary HTML form.
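The idea behind the first two features can be illustrated with a minimal sketch that uses only the JDK. It is not Jericho's own API: the class SegmentEditSketch and its methods findElementContent and replaceSpan are hypothetical names invented for this illustration, and a regular expression stands in for the library's real search logic. The sketch finds one element by searching the source text, records its begin and end positions, and splices a replacement into that span while leaving everything else untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: none of these names belong to Jericho.
final class SegmentEditSketch {

    // A located piece of the source text, identified only by its offsets.
    static final class Span {
        final int begin; // index of the first character of the span
        final int end;   // index just past the last character of the span

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    // Searches the raw text for the first <name ...>...</name> pair and
    // returns the offsets of its content, or null if there is no such pair.
    // Nothing else in the document is inspected, so unrelated server-side
    // code or malformed markup elsewhere cannot make the search fail.
    static Span findElementContent(String source, String name) {
        String quoted = Pattern.quote(name);
        Pattern pattern = Pattern.compile(
                "<" + quoted + "\\b[^>]*>(.*?)</" + quoted + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher matcher = pattern.matcher(source);
        return matcher.find() ? new Span(matcher.start(1), matcher.end(1)) : null;
    }

    // Rebuilds the document by copying the text before and after the span
    // verbatim and placing the replacement between them.
    static String replaceSpan(String source, Span span, String replacement) {
        return source.substring(0, span.begin) + replacement + source.substring(span.end);
    }

    public static void main(String[] args) {
        String page = "<% String user = request.getParameter(\"u\"); %>\n"
                + "<html><head><TITLE>Old</TITLE></head><body><p>unclosed<br></body>";
        Span title = findElementContent(page, "title");
        if (title != null) {
            System.out.println(replaceSpan(page, title, "New"));
        }
    }
}
```

In this sketch only the text between the title tags changes. The JSP scriptlet and the unclosed paragraph are never interpreted, so they pass through to the output exactly as written, which is the behaviour the feature list describes.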
All classes and methods are comprehensively documented, and because the code base is small, the library could easily be ported to other languages.