Jericho HTML Parser 2.0 released

Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

Version 2.0 is a complete rewrite of previous versions, including the core parser and caching mechanism.

The major new feature in 2.0, in addition to the HTML form manipulation features added in 1.5-dev, is the ability to easily define new tag types for recognition by the parser. Performance is also greatly improved and the documentation has been expanded significantly.

Changes since version 1.4.1:

- Complete rewrite of the parsing engine to allow the encapsulation of
different tag types into the new TagType class.
- Requires Java 1.4 or later.
- All programs written for previous versions of the library will have
to be recompiled with the new version, regardless of whether any
changes are required. This is because several methods, including the
Source constructor, now expect a CharSequence as an argument instead
of a String.
- Changes that could require modifications to existing programs:
  - The toString() method of Segment and all subclasses now returns the
    source text of the segment instead of a string useful for debugging
    purposes. This change was necessary because Segment now
    implements CharSequence.
  - For consistency, the toString() methods of all IOutputSegment
    implementations now return the output string instead of a string
    useful for debugging purposes.
  - The return type of the OutputDocument.getSourceText() method is now
    CharSequence instead of String.
  - Character references in Attribute.getValue() are now decoded.
  - StartTag.isEmptyElementTag() no longer checks whether the end tag
    is required.
  - Element.getContent() now returns a zero-length segment instead of
    null in the case of an empty element.
  - FormField.getPredefinedValues() now returns an empty collection
    instead of null if the form field has no predefined values.
  - Segment.findAllStartTags() now returns server tags that are found
    inside other tags.
  - The Attributes segment now ends immediately after the last attribute
    instead of immediately before the end-of-tag delimiter.
  - Modified Segment.isWhiteSpace(char) to match the HTML specification.
  - CharacterReference.encode(CharSequence) no longer encodes
    apostrophes by default.
  - Tags of type SERVER_COMMON now always have the name "%" regardless
    of whether an identifier immediately follows it.
  - Modified and enhanced aspects of StartTag searches relating to
    special tags.
  - P elements are now terminated by TABLE elements.
    See the HTMLElementName.P documentation for more information.
  - Removed the public fields in the Attribute class that were
    deprecated in 1.2.
  - Removed the Source.getSourceTextLowerCase() method, which was
    deprecated in 1.3.
  - Removed the Source.findEnd(int pos, SpecialTag) method, which was
    accidentally added as a public method in 1.4.
- Deprecated numerous methods (details in javadoc)
- Deprecated IOutputSegment interface and replaced with OutputSegment
- Improved caching system
- Added recognition of markup declarations
- Added recognition of CDATA sections
- Added recognition of SGML marked sections
- Doctype declarations containing markup declarations now supported
- Segment class now implements CharSequence and Comparable
- Added getDebugInfo() to Segment and all subclasses to replace the
previous functionality of the toString() method
- OutputSegment interface now extends CharSequence
- Added getDebugInfo() to the OutputSegment interface to replace the
previous functionality of the toString() method
- Attributes class now implements List
- FormFields class now implements Collection
- Added HTMLElementName interface and HTMLElements class
- Added RowColumnVector class and associated methods in Source class
- Added FormControl class
- Added various methods to the FormField, FormFields and OutputDocument
classes related to FormControl objects and the manipulation and output
of form submission values.
- Added Config and related classes
- Added TagType class and subclasses
- Added various tag search methods to the Source and Segment classes
including searches by TagType, attribute values, and other criteria.
- Added AttributesOutputSegment class
- Added Util class
- Added OverlappingOutputSegmentsException class
- Added many other methods to existing classes
- Documentation improvements
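
The toString() and getDebugInfo() changes listed above follow from the CharSequence contract, which requires toString() to return exactly the characters of the sequence. The following is a minimal sketch of that pattern using only the Java standard library. SegmentSketch and TextSpan are hypothetical names invented for illustration and are not part of the Jericho API, the debug string format is an assumption, and the code uses generics and annotations, so it needs a newer Java than the 1.4 minimum stated above.

```java
public class SegmentSketch {

    /** Hypothetical stand-in for a span of source text; NOT a Jericho class. */
    static final class TextSpan implements CharSequence, Comparable<TextSpan> {
        private final CharSequence source;
        private final int begin; // inclusive
        private final int end;   // exclusive

        TextSpan(CharSequence source, int begin, int end) {
            if (begin < 0 || end < begin || end > source.length()) {
                throw new IndexOutOfBoundsException("begin=" + begin + ", end=" + end);
            }
            this.source = source;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public int length() {
            return end - begin;
        }

        @Override
        public char charAt(int index) {
            if (index < 0 || index >= length()) {
                throw new IndexOutOfBoundsException("index=" + index);
            }
            return source.charAt(begin + index);
        }

        @Override
        public CharSequence subSequence(int start, int stop) {
            if (start < 0 || stop < start || stop > length()) {
                throw new IndexOutOfBoundsException("start=" + start + ", stop=" + stop);
            }
            return new TextSpan(source, begin + start, begin + stop);
        }

        // CharSequence contract: toString() returns exactly the characters
        // of the sequence, so it can no longer carry debugging details.
        @Override
        public String toString() {
            return source.subSequence(begin, end).toString();
        }

        // Debugging output moves to its own method (format is an assumption).
        String getDebugInfo() {
            return "(" + begin + "-" + end + ") \"" + this + "\"";
        }

        // Orders spans by begin position, then by end position.
        @Override
        public int compareTo(TextSpan other) {
            int byBegin = Integer.compare(begin, other.begin);
            return byBegin != 0 ? byBegin : Integer.compare(end, other.end);
        }
    }

    public static void main(String[] args) {
        TextSpan span = new TextSpan("<p>Hello</p>", 3, 8);
        System.out.println(span);                // Hello
        System.out.println(span.getDebugInfo()); // (3-8) "Hello"
        // As a CharSequence, the span can be passed straight to standard APIs.
        System.out.println(new StringBuilder().append(span).append('!')); // Hello!
    }
}
```

Existing code that logged a segment with toString() to see its position would, under this pattern, call getDebugInfo() instead.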

Posted by Martin Jericho 2005-11-11