[Htmlparser-announce] HTML Parser Production Release 1.4 available
From: Derrick O. <Der...@Ro...> - 2004-03-16 11:40:15
Version 1.4 of the most popular HTML parser on SourceForge is now available.

Ten months of development have culminated in a very robust, extensible product that has been tested, and is already being used, by thousands of developers. While prior versions concentrated on data extraction from web pages, Version 1.4 of the HTMLParser has substantial improvements in the area of transforming web pages, with simplified tag creation and editing, and verbatim toHtml() method output. Significant improvements have also been made in character set handling, providing support for developers worldwide.

Changes since Version 1.3
-------------------------

Translation
Character entity encoding and decoding has been revamped, leading to higher throughput and less memory churn.

Beans
The StringBean can now be used as a visitor for parsers external to the bean.

Decorators
The node decorator package has been added to provide support for the delegate model.

Lexer
A new lexer i/o subsystem has been added. It provides accurate line number and character position data; tag and attribute names maintain their original case, and attributes maintain their original order.
Line numbers reported by tags are now zero based, not one based.
The node count for parsing goes up in most cases because whitespace is strictly maintained, i.e. every whitespace (e.g. a newline) now counts as a StringNode too.
Attributes are now stored in a Vector, which means the element 0 Attribute is actually the name of the tag, rather than having the $TAGNAME entry in a Hashtable.
The htmllexer.jar is this new i/o subsystem broken out and made JDK 1.1 compliant; the htmlparser.jar, which includes everything in htmllexer.jar, is not necessarily intended to be used in JDK 1.1 environments.
Some support for JIS escape sequences has been added.

Tags
Zero arg tag constructors have been added.
Attribute maintenance (add/remove/edit) has been improved.
There is no EndTag class any more.
Instead, a generic tag responds true to isEndTag().
Form tag handling has been improved, so that <input> and <textarea> tags nested within other tags are now found.
Applet tag handling has been improved with regard to parameters and codebases.

Scanners
The concept of scanners has been completely reworked. Applications register tags, not scanners, to express interest in parsing only some tags. The default is now to parse all tags, which is equivalent to the old registerDOMTags(), so some extra nesting of tags will need to be handled.
CompositeTagScanner logic has been improved to try to match unclosed open tags when an unexpected end tag is encountered. This change also moved recursion off the JDK stack, eliminating most StackOverflow exceptions. Also, a CompositeTag's "startTag()" is "this", and the CompositeTagScanner just adds children.
The ScriptScanner will now decrypt Microsoft Script Encoder encrypted script tags. The plaintext is available via ScriptTag.getScriptCode().

Filters
A powerful new filtering capability has been added, which makes extracting specific tags very easy.

Applications
There are new example applications, Thumbelina and SiteCapturer.
A mainline has been added to the Translate class to encode/decode stdin to stdout.

The developers of the HTML Parser hope you enjoy it.
http://sourceforge.net/projects/htmlparser

Please post any requests for enhancements in version 1.5 to
http://sourceforge.net/tracker/?group_id=24399&atid=381402
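
For readers wondering what the Scanners note about matching unclosed open tags without recursion might look like in practice, here is a minimal sketch. It is not the CompositeTagScanner code and uses none of the HTML Parser classes; the class name UnclosedTagSketch, the method balance(), and the convention that a leading "/" marks an end tag are all assumptions made up for this illustration. It only shows the general idea of keeping open tags on an explicit stack rather than on the JDK call stack.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration only; not part of the HTML Parser API.
class UnclosedTagSketch {

    /**
     * Walks a flat list of tag names (a leading '/' marks an end tag) and
     * returns the open/close events after recovering from unclosed tags.
     */
    static List<String> balance(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>(); // explicit stack, no recursion
        List<String> events = new ArrayList<>();
        for (String token : tokens) {
            if (!token.startsWith("/")) {
                open.push(token);
                events.add("open " + token);
                continue;
            }
            String name = token.substring(1);
            if (!open.contains(name)) {
                // End tag with no matching open tag anywhere on the stack.
                events.add("ignore " + token);
                continue;
            }
            // Close any unclosed tags sitting above the match, then the match.
            while (true) {
                String top = open.pop();
                if (top.equals(name)) {
                    events.add("close " + top);
                    break;
                }
                events.add("auto-close " + top);
            }
        }
        // Anything still open at end of input is closed implicitly.
        while (!open.isEmpty()) {
            events.add("auto-close " + open.pop());
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(balance(Arrays.asList("div", "b", "i", "/div", "/span")));
    }
}
```

In this sketch the depth of nesting is bounded by heap memory rather than by the thread's call stack, which is the property the note above refers to. How the real scanner decides which open tag an unexpected end tag belongs to may well differ.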