Jericho HTML Parser is a Java library that allows analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.
Version 3.1 includes important bug fixes and the following enhancements:
- A new stream-based parsing option using the StreamedSource class, which
allows memory-efficient processing of large files using an event iterator.
This is essentially a StAX alternative that can process HTML and
non-validating XML, and it offers several other features not available
in other streaming parsers.
- Tag search methods based on HTML class and attribute value regular
expressions.
- Bug Fixes:
- Infinite loop on Segment.getAllStartTags()
- Infinite loop on Segment.getAllElements()
- Segment.getFirst* methods returned segments outside the bounding
segment.
- Segment.getAllElements methods did not return all enclosed elements
in some circumstances.
- Fixed documentation errors in Segment.getAllElements methods.
- Added StreamedSource class.
- CHANGES THAT COULD AFFECT THE BEHAVIOUR OF EXISTING PROGRAMS:
- Changed ParseText from class to interface.
- Segment.getNodeIterator() now returns character references as
- Added tag search methods based on attribute value regular expressions.
- Added tag search methods based on HTML class attribute.
- Added static Source.LegacyNodeIteratorCompatabilityMode property
temporarily to restore Segment.getNodeIterator() functionality to
that of previous versions.
- Removed char-based search methods in ParseText.
- Added CharacterReference.appendCharTo(Appendable) method.
- Added OutputDocument(Segment) constructor.
- Added StreamedSourceCopy sample program.
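
The event-iterator style of processing mentioned for StreamedSource can be illustrated with a small, self-contained sketch. This is not Jericho code and does not use its API: ToySegmentIterator and segmentsOf are hypothetical names invented for this example, and the tokenizer only distinguishes "<...>" runs from plain text. It is meant to show the general idea of pulling one segment at a time from a Reader, so that the whole document never has to be held in memory, while passing malformed input through verbatim.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Toy event iterator (hypothetical, NOT Jericho's StreamedSource). */
final class ToySegmentIterator implements Iterator<String> {
    private final Reader reader;
    private int lookahead; // next unread character, or -1 at end of input

    ToySegmentIterator(Reader reader) {
        this.reader = reader;
        this.lookahead = read();
    }

    private int read() {
        try {
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return lookahead != -1;
    }

    @Override
    public String next() {
        if (lookahead == -1) {
            throw new NoSuchElementException();
        }
        StringBuilder segment = new StringBuilder();
        if (lookahead == '<') {
            // Tag-like event: consume up to and including the closing '>'.
            // An unterminated tag is returned as-is when input ends.
            while (lookahead != -1) {
                char c = (char) lookahead;
                segment.append(c);
                lookahead = read();
                if (c == '>') {
                    break;
                }
            }
        } else {
            // Text event: consume up to, but not including, the next '<'.
            while (lookahead != -1 && lookahead != '<') {
                segment.append((char) lookahead);
                lookahead = read();
            }
        }
        return segment.toString();
    }

    /** Demo helper: collects every segment. A real consumer would handle each one and discard it. */
    static List<String> segmentsOf(String html) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new ToySegmentIterator(new StringReader(html));
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

Because each call to next() reads only as far as the end of the current segment, memory use depends on the size of the largest segment rather than the size of the file. Concatenating the returned segments reproduces the input exactly, including an unterminated tag at the end of the stream.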