Jericho HTML Parser is a simple but powerful Java library for analysing and manipulating parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.
Version 2.5 includes important bug fixes, and introduces the following minor enhancements:
- Elements inside SCRIPT elements are ignored
- Improved encoding detection and analysis
- Improved parsing of attributes containing server tags
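To illustrate why ignoring elements inside SCRIPT matters, here is a minimal sketch using only the standard library. It is not Jericho's implementation and does not use its API; the class and method names (ScriptAwareTagCount, countStartTags) are hypothetical, and the regular expressions are a deliberate simplification of real HTML parsing. It shows how markup-like text inside a script string would otherwise be counted as a real element.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Jericho.
public class ScriptAwareTagCount {
    // Matches a whole SCRIPT element including its content (simplified).
    private static final Pattern SCRIPT_ELEMENT = Pattern.compile(
        "<script\\b[^>]*>.*?</script\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Counts start tags with the given name, optionally skipping
    // anything that appears inside SCRIPT elements.
    static int countStartTags(String html, String name, boolean ignoreScriptContent) {
        String text = ignoreScriptContent
            ? SCRIPT_ELEMENT.matcher(html).replaceAll("")
            : html;
        Pattern startTag = Pattern.compile(
            "<" + Pattern.quote(name) + "\\b[^>]*>", Pattern.CASE_INSENSITIVE);
        Matcher m = startTag.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<p><b>real</b></p>"
            + "<script>document.write('<b>fake</b>');</script>";
        System.out.println(countStartTags(html, "b", false)); // prints 2
        System.out.println(countStartTags(html, "b", true));  // prints 1
    }
}
```

The second call reflects the behaviour described above: the B tag written inside the script's string literal is script text, not an element of the document.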
Changes since version 2.4:
- Bug Fixes:
  - [1747493] RenderToText does not handle multiple <br> correctly.
  - RenderToText does not handle whitespace after <br> correctly.
  - "Resetting to invalid mark" exception during encoding detection.
  - INPUT elements of type "button" and "reset" incorrectly interpreted as form controls of type FormControlType.TEXT.
  - Valid end tags containing white space rejected.
- Elements inside <script> elements are now ignored.
- Improved encoding detection.
- Added Source.getPreliminaryEncodingInfo() method.
- Improved parsing of attributes containing server tags.
- Changed Source.isXML() algorithm.
- Added Renderer.ConvertNonBreakingSpaces property.
- Added TextExtractor.ConvertNonBreakingSpaces property.
- Added TextExtractor.ExcludeNonHTMLElements property.
- Added overridable TextExtractor.excludeElement(StartTag) method.
- TextExtractor now includes value of content attribute.
- Deprecated OverlappingOutputSegmentsException class.
- Added OutputDocument.getRegisteredOutputSegments() method.
- Added OutputDocument.getDebugInfo() method.
- Added fullSequentialParseData parameter to TagType.isValidPosition.
- Removed all methods/classes deprecated in 2.2.
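As a rough illustration of the bug fix about end tags containing white space: HTML permits white space between the tag name and the closing ">" of an end tag, but not between "</" and the name. The sketch below uses only the standard library and is not Jericho's code; the class and method names (EndTagWhitespace, isEndTag) are hypothetical, and the tag-name pattern is simplified to ASCII letters and digits.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Jericho.
public class EndTagWhitespace {
    // "</", a tag name, optional white space, then ">".
    // Assumption: tag names are restricted to ASCII letters and digits here.
    private static final Pattern END_TAG =
        Pattern.compile("</([A-Za-z][A-Za-z0-9]*)\\s*>");

    // Returns true if the whole string is a single well-formed end tag.
    static boolean isEndTag(String s) {
        return END_TAG.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEndTag("</div>"));   // prints true
        System.out.println(isEndTag("</div >"));  // prints true
        System.out.println(isEndTag("</div\n>")); // prints true
        System.out.println(isEndTag("</ div>"));  // prints false
    }
}
```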