[Htmlparser-announce] HTML Parser Integration Release 1.4-20040104
From: Derrick O. <Der...@Ro...> - 2004-01-04 23:11:08
The latest integration build of the most popular HTML parser on SourceForge, HTML Parser version 1.4, is now available:

http://sourceforge.net/project/showfiles.php?group_id=24399&package_id=47712

This can be considered an alpha candidate of the final 1.4 release, and has much improved stability, speed, and HTML page transformation capabilities. We can only go so far by running the 532 unit tests we have, so we are widening the audience in the hope you'll pick it up and put it through its paces with your own applications.

Changes since Version 1.3
-------------------------

Decorators
The node decorator package has been added to provide support for the delegate model.

Lexer
A new lexer i/o subsystem has been added. It provides accurate line number and character position data, tag and attribute names maintain their original case, and attributes maintain their original order.
Line numbers reported by tags are now zero based, not one based.
The node count for parsing goes up in most cases because whitespace is strictly maintained, i.e. every whitespace (e.g. a newline) now counts as a StringNode too.
Attributes are now stored in a Vector, which means the element 0 Attribute is actually the name of the tag, rather than having the $TAGNAME entry in a Hashtable.
The htmllexer.jar is this new i/o subsystem broken out and made JDK 1.1 compliant; the htmlparser.jar, which includes everything in htmllexer.jar, is not necessarily intended to be used in JDK 1.1 environments.
Some support for JIS escape sequences has been added.

Tags
Zero arg tag constructors have been added.
Attribute maintenance (add/remove/edit) has been improved.
There is no EndTag class any more, just a generic tag that responds true to isEndTag().
Form tag handling has been improved to pick up <input> and <textarea> tags nested within other tags.
Applet tag handling has been improved regarding parameters and codebases.

Scanners
The concept of scanners has been completely reworked. Applications now register tags, not scanners, to express interest in parsing only some tags. The default is now to parse all tags, which is equivalent to the old registerDOMTags(), so some extra nesting of tags will need to be handled.
CompositeTagScanner logic has been improved to try to match unclosed open tags when an unexpected end tag is encountered. This change also moved recursion off the JDK stack, eliminating most StackOverflow exceptions.
Also, a CompositeTag's "startTag()" is "this", and the CompositeTagScanner just adds children.

Filters
A powerful new filtering capability has been added, which makes extracting specific tags very easy.

Applications
New example applications: Thumbelina and SiteCapturer.

Derrick Oswald
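
For anyone migrating code that reads attributes, the storage change noted under Lexer can be pictured with the minimal sketch below. It does not use the HTML Parser classes: SketchAttribute and AttributeStorageSketch are hypothetical stand-ins written for this illustration, the tag and attribute values are made up, and the generics are modern Java rather than JDK 1.1. It only contrasts a table keyed by "$TAGNAME" with an ordered Vector whose element 0 holds the tag name.

```java
import java.util.Hashtable;
import java.util.Vector;

// Hypothetical stand-in for an attribute; NOT the library's own Attribute class.
final class SketchAttribute {
    final String name;   // for element 0 this is the tag name
    final String value;  // null when there is no value (as for the tag name)

    SketchAttribute(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

// Hypothetical helper contrasting the two storage layouts described in the notes.
final class AttributeStorageSketch {

    // 1.3-style layout: the tag name sits under a special "$TAGNAME" key,
    // and a Hashtable does not remember the order of the attributes.
    static Hashtable<String, String> oldStyle() {
        Hashtable<String, String> table = new Hashtable<>();
        table.put("$TAGNAME", "a");
        table.put("href", "index.html");
        table.put("class", "nav");
        return table;
    }

    // 1.4-style layout: an ordered Vector whose element 0 is the tag name,
    // followed by the attributes in their original order.
    static Vector<SketchAttribute> newStyle() {
        Vector<SketchAttribute> attributes = new Vector<>();
        attributes.add(new SketchAttribute("a", null));
        attributes.add(new SketchAttribute("href", "index.html"));
        attributes.add(new SketchAttribute("class", "nav"));
        return attributes;
    }

    // The tag name is simply the name of element 0.
    static String tagName(Vector<SketchAttribute> attributes) {
        return attributes.get(0).name;
    }

    // Look up an attribute value, skipping element 0 (the tag name).
    // Returns null when no attribute with that name is present.
    static String valueOf(Vector<SketchAttribute> attributes, String name) {
        for (int i = 1; i < attributes.size(); i++) {
            SketchAttribute attribute = attributes.get(i);
            if (attribute.name.equalsIgnoreCase(name)) {
                return attribute.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("old tag name: " + oldStyle().get("$TAGNAME"));
        Vector<SketchAttribute> attributes = newStyle();
        System.out.println("new tag name: " + tagName(attributes));
        System.out.println("href: " + valueOf(attributes, "href"));
    }
}
```

Code that previously looked up "$TAGNAME" in the table would, under this picture, read element 0 instead and begin any attribute loop at index 1.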