Jericho HTML Parser is a powerful Java library that allows analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.
Version 2.6 includes important bug fixes and the following enhancements:
- Non-server tags are no longer recognised inside server tags
- Recognition of Microsoft downlevel-revealed conditional comments
- Ability to remove all unnecessary white space from source document
- Various other enhancements to existing features
Change Log:
- Bug Fixes:
  - [1906051] Exponential recursion when non-server tags are present
    inside attribute values during full sequential parse
    (introduced v2.5).
  - [1927391] Renderer had indenting problems.
  - [1991529] Wrong encoding with DISPLAY_VALUE and select Tags.
  - An element whose start tag and end tag have different names, such
    as a Mason component called with content, had no end tag.
  - SourceFormatter did not preserve original indentation inside server
    tags as specified in documentation.
  - A start tag containing a server tag immediately before its closing
    delimiter was not parsed correctly.
  - StartTag.tidy() removed server tags outside of attribute values.
  - Nested elements formed from non-normal tag types were not parsed
    correctly.
  - CharStreamSourceUtil.toString(charStreamSource) broke if
    charStreamSource.getEstimatedMaximumOutputLength() < -1.
- CHANGES THAT COULD AFFECT THE BEHAVIOUR OF EXISTING PROGRAMS:
  - Non-server tags are no longer recognised inside server tags.
    (see the TagType.isValidPosition documentation for details)
  - Elements inside <script> elements are now ignored up until the first
    occurrence of the character sequence "</script" (previously "</")
    during a full sequential parse.
  - Added static Config.ConvertNonBreakingSpaces property, which
    affects the default behaviour of several methods.
  - StartTag.isEmptyElementTag() now checks that the start tag is not
    one that has an optional or required end tag.
  - Element.isEmptyElementTag() is now implemented to be identical to
    StartTag.isEmptyElementTag().
- Added StartTag.isSyntacticalEmptyElementTag() method.
- Improved performance of internal stream writing methods.
- Added StartTagType.SERVER_COMMON_ESCAPED standard tag type.
- Added MicrosoftTagTypes.DOWNLEVEL_REVEALED_CONDITIONAL_COMMENT
  extended tag type.
- Added Source(URLConnection) constructor.
- Added Source.findNextStartTag(pos,name,startTagType) method.
- Added Source.findPreviousStartTag(pos,name,startTagType) method.
- Added SourceCompactor class and CompactSource sample program.
- Added Segment.getNodeIterator() method.
- Reduced risk of stack overflow when parsing large documents without
  full sequential parse by avoiding recursive comment search.
- Added TextExtractor.includeAttribute(StartTag,Attribute) method.
- TextExtractor now includes attribute contents in order of appearance
  in the source document.
- TextExtractor now includes contents of href attributes if the
  IncludeAttributes property is set.
- Added Renderer.IncludeHyperlinkURLs property.
- Renderer no longer includes A element href if it is equal to "#"
  or starts with "javascript:".
- Added Segment.getSource() method.
- Added EndTagType.getEndTagName(String startTagName) method.
- Added OutputDocument.writeTo(Writer, int begin, int end) method.
- OutputDocument now ignores output segments enclosed by other
  output segments.
- FormFields.getDataSet() Map entries are now ordered to match the
  order of appearance of the keys in the source document.
- FormFields.getValues() now returns a List rather than a Collection.
- FormField.getValues() now returns a List rather than a Collection.
- Added WriterLogger.log(String level, String message) method.
- Upgraded to the following logger APIs:
  slf4j-api-1.5.2, commons-logging-api-1.1.1, log4j-1.2.15
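
Illustration:
Two of the behaviour changes listed above are simple text rules, and a rough sketch may make them easier to picture. The code below is plain Java with no dependency on the library. It is not Jericho's implementation, and the class and method names (ReleaseNotesSketch, scriptContentEnd, scriptContentEndOld, includeHyperlinkUrl) are hypothetical. Case-insensitive matching of "</script" and the handling of a null href are assumptions made for the sketch; the change log does not specify either.

```java
final class ReleaseNotesSketch {

    // Finds target in s starting at index from, ignoring case.
    // Returns -1 if there is no match.
    static int indexOfIgnoreCase(String s, String target, int from) {
        for (int i = Math.max(from, 0); i + target.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, target, 0, target.length())) {
                return i;
            }
        }
        return -1;
    }

    // Rule as described for v2.6: script content runs up to the first
    // occurrence of "</script". Case-insensitivity is an assumption.
    static int scriptContentEnd(String source, int contentStart) {
        int end = indexOfIgnoreCase(source, "</script", contentStart);
        return end == -1 ? source.length() : end;
    }

    // Rule as described for earlier versions: content ends at the first "</".
    static int scriptContentEndOld(String source, int contentStart) {
        int end = source.indexOf("</", contentStart);
        return end == -1 ? source.length() : end;
    }

    // Renderer rule as described above: an href is left out if it is
    // equal to "#" or starts with "javascript:".
    // Treating null as "leave out" is an assumption.
    static boolean includeHyperlinkUrl(String href) {
        if (href == null) {
            return false;
        }
        return !href.equals("#") && !href.startsWith("javascript:");
    }

    public static void main(String[] args) {
        String source = "<script>if (a</b) x();</script><p>";
        int start = "<script>".length();
        System.out.println("old rule: " + source.substring(start, scriptContentEndOld(source, start)));
        System.out.println("new rule: " + source.substring(start, scriptContentEnd(source, start)));
        System.out.println("include \"#\": " + includeHyperlinkUrl("#"));
        System.out.println("include \"#top\": " + includeHyperlinkUrl("#top"));
    }
}
```

In the sample source the script text contains the sequence "</" inside an expression. Under the older rule the script content would be cut short at that point, while the newer rule carries on to the real end tag.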