Jericho HTML Parser 2.2 released

Jericho HTML Parser is a simple but powerful Java library for analysing and manipulating parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides useful HTML form utilities.

Version 2.2 includes important bug fixes, and introduces the following major features:

1. The Source.fullSequentialParse() method provides a much more efficient means of parsing the entire source document.

2. The Source.indent(String indentText, boolean tidyTags, boolean collapseWhiteSpace, boolean indentAllElements) method reproduces source text with indenting that represents the document element hierarchy of the source document.

3. The Source.getChildElements(), Element.getChildElements(), Element.getParentElement() and Element.getDepth() methods provide a means of navigating the document element hierarchy.
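
To make the hierarchy navigation and indenting ideas above concrete, here is a minimal, self-contained sketch. It does not use the Jericho library: SketchElement and its indent helper are hypothetical stand-ins that only borrow the method names getParentElement(), getChildElements() and getDepth() from the announcement. It assumes a top-level element has depth 0, and it prints one tag name per line rather than real source text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for an element in a document hierarchy; not a Jericho class. */
final class SketchElement {
    private final String name;
    private final SketchElement parent;
    private final List<SketchElement> children = new ArrayList<>();

    /** Creates an element and registers it as a child of the given parent (null for a top-level element). */
    SketchElement(String name, SketchElement parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);
        }
    }

    String getName() {
        return name;
    }

    /** Returns the enclosing element, or null for a top-level element. */
    SketchElement getParentElement() {
        return parent;
    }

    /** Returns a read-only view of the direct children, in document order. */
    List<SketchElement> getChildElements() {
        return Collections.unmodifiableList(children);
    }

    /** Counts the ancestors of this element; assumed to be 0 for a top-level element. */
    int getDepth() {
        int depth = 0;
        for (SketchElement e = parent; e != null; e = e.parent) {
            depth++;
        }
        return depth;
    }

    /** Renders the subtree as one tag per line, prefixed by indentText repeated getDepth() times. */
    static String indent(SketchElement root, String indentText) {
        StringBuilder out = new StringBuilder();
        appendIndented(root, indentText, out);
        return out.toString();
    }

    private static void appendIndented(SketchElement e, String indentText, StringBuilder out) {
        for (int i = 0; i < e.getDepth(); i++) {
            out.append(indentText);
        }
        out.append('<').append(e.name).append(">\n");
        for (SketchElement child : e.children) {
            appendIndented(child, indentText, out);
        }
    }

    public static void main(String[] args) {
        SketchElement html = new SketchElement("html", null);
        SketchElement body = new SketchElement("body", html);
        new SketchElement("p", body);
        System.out.print(indent(html, "  "));
    }
}
```

The depth is computed by walking parent links instead of being stored, which keeps the sketch short; a real parser would more likely cache it.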

Changes since version 2.1:
- Bug Fixes:
  - Fault in caching mechanism resulted in missed tags in rare
    circumstances. (SubCache.findNextTag method)
  - [1407179] Segment.extractText() threw NullPointerException if
    the last character position was part of a tag.
- Segment.extractText() now converts some tags to whitespace and
  ignores text inside SCRIPT and STYLE elements.
- Added Segment.extractText(boolean includeAttributes) option.
- Added Source.fullSequentialParse() method.
- Added CharStreamSource interface for dealing with char output.
- Added Source.indent(String indentText, boolean tidyTags,
  boolean collapseWhiteSpace, boolean indentAllElements) method.
- Added Segment.getChildElements() method.
- Added Element.getParentElement() method.
- Added Element.getDepth() method.
- Named tag search methods now only return unregistered tags if the
  specified name is not a valid XML tag name.
- Changed Attributes.DefaultMaxErrorCount system default from 1 to 2.
- Added EndTag.getElement() method.
- Added Tag.getElement() abstract method.
- Added Tag.getNameSegment() method.
- Added Tag.getUserData() and Tag.setUserData(Object) methods.
- Added Tag.findNextTag() method.
- Added Tag.findPreviousTag() method.
- Added Tag.tidy() and Tag.tidy(boolean toXHTML) methods.
- Added and renamed many methods in OutputDocument class to make the
  interface more intuitive.
- Added HTMLElements.getNestingForbiddenElementNames() method.
- Illegally nested elements with required end tags now terminate at
  start of illegally nested start tag, avoiding possible stack overflow
  in the common case of multiple unterminated <a name=...> elements.
- Tag search methods called with a pos argument that is out of range
  now return null or empty results rather than throwing an exception.
- Renamed output(Writer) method in OutputSegment to writeTo(Writer).
- Deprecated Tag.regenerateHTML() method.
- Deprecated Source.getNextTagIterator() method.
- Deprecated AttributesOutputSegment class.
- Deprecated StringOutputSegment class.
- Removed BlankOutputSegment class from public API.
- Removed CharOutputSegment class from public API.
- Removed IOutputSegment which was deprecated in 2.0.
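
One of the behavioural changes listed above is that tag search methods now return null or empty results for an out-of-range pos instead of throwing an exception. The sketch below illustrates that contract only. LenientSearchSketch and findNextTagStart are hypothetical names, not part of the Jericho API, and the "search" is just a scan for the next '<' character rather than real tag recognition.

```java
/** Hypothetical illustration of a search that returns null for an out-of-range position. */
final class LenientSearchSketch {

    /**
     * Returns the index of the next '<' at or after pos, or null if pos lies
     * outside the source text or no '<' follows it. Never throws for a bad pos.
     */
    static Integer findNextTagStart(String source, int pos) {
        if (pos < 0 || pos >= source.length()) {
            return null; // out of range: report "nothing found" instead of throwing
        }
        int index = source.indexOf('<', pos);
        return index >= 0 ? Integer.valueOf(index) : null;
    }

    public static void main(String[] args) {
        String source = "text <b>bold</b>";
        System.out.println(findNextTagStart(source, 0));
        System.out.println(findNextTagStart(source, -1));
        System.out.println(findNextTagStart(source, 1000));
    }
}
```

Callers can then loop with a plain null check instead of wrapping every search in a try/catch block.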

Posted by Martin Jericho 2006-06-19
