The latest integration build of the most popular HTML parser on SourceForge,
HTML Parser version 1.4, is now available:
http://sourceforge.net/project/showfiles.php?group_id=24399&package_id=47712
This can be considered an alpha candidate of the final 1.4 release, and
has much improved stability, speed, and HTML page transformation
capabilities.
We can only go so far by running the 532 unit tests we have, so we are
widening the audience in the hope you'll pick it up and put it through
its paces with your own applications.
Changes since Version 1.3
-------------------------
Decorators
The node decorator package has been added to provide support for the
delegate model.
Lexer
A new lexer I/O subsystem has been added. It provides accurate line number
and character position data; tag and attribute names keep their original
case, and attributes keep their original order. Line numbers reported by
tags are now zero-based, not one-based. The node count for parsing goes up
in most cases because whitespace is strictly maintained, i.e. every piece of
whitespace (e.g. a newline) now counts as a StringNode too. Attributes are
now stored in a Vector, which means the element 0 Attribute is actually the
name of the tag, rather than the name being held in a $TAGNAME entry in a
Hashtable. The htmllexer.jar is this new I/O subsystem broken out and made
JDK 1.1 compliant. The htmlparser.jar, which includes everything in
htmllexer.jar, is not necessarily intended to be used in JDK 1.1
environments. Some support for JIS escape sequences has been added.
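To make the change in attribute storage concrete, here is a minimal sketch of
an ordered attribute list whose element 0 holds the tag name. It does not use
the HTML Parser classes: Attr, AttributeVectorSketch, sample(), tagName() and
value() are hypothetical names invented for illustration, and the sketch
assumes a modern JDK rather than JDK 1.1.

```java
import java.util.Vector;

class AttributeVectorSketch {
    /** Hypothetical stand-in for an attribute: a name plus an optional value. */
    static final class Attr {
        final String name;
        final String value; // null for the tag-name entry
        Attr(String name, String value) { this.name = name; this.value = value; }
    }

    /** Builds the ordered list for <A HREF="x.html" Target="_top">. */
    static Vector<Attr> sample() {
        Vector<Attr> attrs = new Vector<>();
        attrs.add(new Attr("A", null));         // element 0: the tag name
        attrs.add(new Attr("HREF", "x.html"));  // original case and order kept
        attrs.add(new Attr("Target", "_top"));
        return attrs;
    }

    /** The tag name is simply the name of element 0. */
    static String tagName(Vector<Attr> attrs) {
        return attrs.get(0).name;
    }

    /** Case-insensitive lookup that skips element 0 (the tag name). */
    static String value(Vector<Attr> attrs, String name) {
        for (int i = 1; i < attrs.size(); i++) {
            if (attrs.get(i).name.equalsIgnoreCase(name)) {
                return attrs.get(i).value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Vector<Attr> attrs = sample();
        System.out.println(tagName(attrs));       // prints "A"
        System.out.println(value(attrs, "href")); // prints "x.html"
    }
}
```

Code that used to look up a special entry for the tag name would, under this
layout, read element 0 and start any attribute scan at index 1.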
Tags
Zero-argument tag constructors have been added. Attribute maintenance
(add/remove/edit) has been improved. There is no EndTag class any more, just
a generic tag that responds true to isEndTag(). Form tag handling has been
improved to pick up <input> and <textarea> tags nested within other tags.
Applet tag handling has been improved with regard to parameters and
codebases.
Scanners
The concept of scanners has been completely reworked. Applications now
register tags, not scanners, to express interest in parsing only some tags.
The default is now to parse all tags, which is equivalent to the old
registerDOMTags(), so some extra nesting of tags will need to be handled.
CompositeTagScanner logic has been improved to try to match unclosed open
tags when an unexpected end tag is encountered. This change also moved
recursion off the JDK stack, eliminating most stack overflow errors. Also, a
CompositeTag's "startTag()" is "this", and the CompositeTagScanner just adds
children.
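The general idea of matching unclosed open tags without recursion can be
sketched with an explicit stack. This is not the CompositeTagScanner code,
only an illustration of the technique under simplifying assumptions: tokens
are bare tag names, a leading '/' marks an end tag, names are compared
exactly, and the class and method names (OpenTagStackSketch, closeOrder) are
hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class OpenTagStackSketch {
    /**
     * Returns the order in which tags get closed. A trailing '*' marks a tag
     * that was closed implicitly because its own end tag never arrived.
     */
    static List<String> closeOrder(String[] tokens) {
        Deque<String> open = new ArrayDeque<>(); // explicit stack, no recursion
        List<String> closed = new ArrayList<>();
        for (String token : tokens) {
            if (!token.startsWith("/")) {
                open.push(token);                // start tag: remember it
                continue;
            }
            String name = token.substring(1);
            if (!open.contains(name)) {
                continue;                        // stray end tag: ignore it
            }
            // Unexpected end tag: implicitly close everything opened after
            // the matching start tag, then close the match itself.
            while (!open.peek().equals(name)) {
                closed.add(open.pop() + "*");
            }
            closed.add(open.pop());
        }
        while (!open.isEmpty()) {
            closed.add(open.pop() + "*");        // still open at end of input
        }
        return closed;
    }

    public static void main(String[] args) {
        String[] tokens = { "table", "tr", "td", "/table" };
        System.out.println(closeOrder(tokens));  // prints [td*, tr*, table]
    }
}
```

Because the pending tags live in a heap-allocated deque, nesting depth is
bounded by available memory rather than by the thread's call stack.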
Filters
A powerful new filtering capability has been added, which makes extracting
specific tags very easy.
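As a rough illustration of what filter-based extraction looks like in
general, the sketch below walks a small node tree and collects the nodes a
filter accepts. It is deliberately independent of the library: Node, Filter,
tagNamed(), either() and extract() are hypothetical names, and the real
filter classes may differ in both naming and behaviour.

```java
import java.util.ArrayList;
import java.util.List;

class NodeFilterSketch {
    /** Hypothetical minimal node: a tag name and child nodes. */
    static final class Node {
        final String tagName;
        final List<Node> children = new ArrayList<>();
        Node(String tagName, Node... kids) {
            this.tagName = tagName;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    /** A filter decides, node by node, what to keep. */
    interface Filter {
        boolean accept(Node node);
    }

    /** Accepts nodes with the given tag name, ignoring case. */
    static Filter tagNamed(String name) {
        return node -> node.tagName.equalsIgnoreCase(name);
    }

    /** Accepts nodes that either filter accepts. */
    static Filter either(Filter a, Filter b) {
        return node -> a.accept(node) || b.accept(node);
    }

    /** Collects accepted nodes in document order (parent before children). */
    static List<Node> extract(Node root, Filter filter) {
        List<Node> matches = new ArrayList<>();
        collect(root, filter, matches);
        return matches;
    }

    private static void collect(Node node, Filter filter, List<Node> matches) {
        if (filter.accept(node)) {
            matches.add(node);
        }
        for (Node child : node.children) {
            collect(child, filter, matches);
        }
    }

    /** Sample tree: html > body > (a, p > (A, img)). */
    static Node sampleTree() {
        return new Node("html",
                new Node("body",
                        new Node("a"),
                        new Node("p", new Node("A"), new Node("img"))));
    }

    public static void main(String[] args) {
        System.out.println(extract(sampleTree(), tagNamed("a")).size()); // prints 2
    }
}
```

Small filters combine into larger ones, so the application states what it
wants and leaves the tree traversal to a single extract call.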
Applications
New example applications: Thumbelina and SiteCapturer.
Derrick Oswald