Thread: [Htmlparser-announce] HTML Parser Integration Release 1.4-20040104

The latest integration build of the most popular HTML parser on SourceForge,
HTML Parser version 1.4, is now available:

http://sourceforge.net/project/showfiles.php?group_id=24399&package_id=47712

This can be considered an alpha candidate of the final 1.4 release, and 
has much improved stability, speed, and HTML page transformation 
capabilities.
We can only go so far by running the 532 unit tests we have, so we are
widening the audience in the hope you'll pick it up and put it through
its paces with your own applications.

Changes since Version 1.3
-------------------------
Decorators
    The node decorator package has been added to provide support for the
    delegate model.
Lexer
    A new lexer I/O subsystem has been added. It provides accurate line number
    and character position data; tag and attribute names keep their original
    case, and attributes keep their original order. Line numbers reported by
    tags are now zero based, not one based. The node count for parsing goes up
    in most cases because whitespace is strictly maintained, i.e. every bit of
    whitespace (e.g. a newline) now counts as a StringNode too. Attributes are
    now stored in a Vector, which means the element 0 Attribute is actually the
    name of the tag, rather than having the $TAGNAME entry in a Hashtable. The
    htmllexer.jar is this new I/O subsystem broken out and made JDK 1.1
    compliant; the htmlparser.jar, which includes everything in htmllexer.jar,
    is not necessarily intended to be used in JDK 1.1 environments. Some
    support for JIS escape sequences has been added.
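
To make the attribute storage change concrete, here is a minimal sketch of
the idea. It is not the HTML Parser code: the class and method names are
hypothetical, attributes are kept as raw "name=value" strings, and quoted
values containing spaces are not handled. It uses generics for readability,
so unlike htmllexer.jar it is not JDK 1.1 compliant.

```java
import java.util.Vector;

// Hypothetical illustration only; these are not the HTML Parser classes.
class AttributeVectorSketch {

    // Splits the inside of a simple tag, e.g. "IMG src=a.gif Alt=logo", on
    // whitespace. Element 0 is the tag name; the remaining elements are the
    // attributes in source order, with their original case untouched.
    static Vector<String> tokenize(String tagContents) {
        Vector<String> parts = new Vector<String>();
        for (String token : tagContents.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                parts.add(token);
            }
        }
        return parts;
    }

    // The tag name is simply element 0, instead of a special $TAGNAME key
    // in a hash table. Returns null when there is nothing to read.
    static String tagName(Vector<String> parts) {
        return parts.isEmpty() ? null : parts.get(0);
    }

    public static void main(String[] args) {
        Vector<String> parts = tokenize("IMG src=a.gif Alt=logo");
        System.out.println(tagName(parts));                  // IMG
        System.out.println(parts.subList(1, parts.size()));  // [src=a.gif, Alt=logo]
    }
}
```

Because the storage is ordered, code that looks attributes up by position has
to skip element 0, and code that used to read the $TAGNAME key reads element
0 instead.
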
Tags
    Zero-argument tag constructors have been added. Attribute maintenance
    (add/remove/edit) has been improved. There is no EndTag class any more,
    just a generic tag that returns true from isEndTag(). Form tag handling
    has been improved, including getting <input> and <textarea> tags nested
    within other tags. Applet tag handling has been improved with regard to
    parameters and codebases.
Scanners
    The concept of scanners has been completely reworked. Applications now
    register tags, not scanners, to express interest in parsing only some tags.
    The default is now to parse all tags, which is equivalent to the old
    registerDOMTags(), so some extra nesting of tags will need to be handled.
    CompositeTagScanner logic has been improved to try to match unclosed open
    tags when an unexpected end tag is encountered. This change also moved
    recursion off the JDK stack, eliminating most stack overflow errors. Also,
    a CompositeTag's "startTag()" is "this", and the CompositeTagScanner just
    adds children.
Filters
    A powerful new filtering capability has been added, which makes extracting
    specific tags very easy.
Applications
    New example applications, Thumbelina and SiteCapturer, have been added.
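
As a rough illustration of the Scanners entry above, the sketch below shows
one way to match unclosed open tags against an unexpected end tag using an
explicit stack rather than recursion. It is an assumption about the general
technique, not the CompositeTagScanner implementation: the class name, the
method name, and the string token format (a leading '/' marks an end tag)
are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; this is not CompositeTagScanner.
class OpenTagStackSketch {

    // Returns the names of the tags that were never explicitly closed,
    // in the order they were closed implicitly.
    static List<String> implicitlyClosed(String[] tokens) {
        // Open tags live on a heap-allocated stack, so deep nesting cannot
        // exhaust the JVM call stack the way recursion can.
        Deque<String> open = new ArrayDeque<String>();
        List<String> closed = new ArrayList<String>();
        for (String token : tokens) {
            if (!token.startsWith("/")) {
                open.push(token);
                continue;
            }
            String name = token.substring(1);
            if (!open.contains(name)) {
                continue; // stray end tag with no matching open tag: ignore it
            }
            // Unexpected end tag: close every unclosed tag above its match.
            while (!open.peek().equals(name)) {
                closed.add(open.pop());
            }
            open.pop(); // the matching open tag itself
        }
        // Anything still open at the end of the input was never closed.
        while (!open.isEmpty()) {
            closed.add(open.pop());
        }
        return closed;
    }

    public static void main(String[] args) {
        String[] tokens = {"table", "tr", "td", "/tr", "/table"};
        System.out.println(implicitlyClosed(tokens)); // [td]
    }
}
```

In the example, the end tag for "tr" arrives while "td" is still open, so
"td" is closed first and "tr" is then matched normally.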

Derrick Oswald
