[Htmlparser-announce] HTML Parser Production Release 1.4 available

Version 1.4 of the most popular HTML parser on SourceForge is now
available. Ten months of development have culminated in a very robust, 
extensible product that has been tested, and is already being used, by 
thousands of developers.

While prior versions concentrated on data extraction from web pages, 
Version 1.4 of the HTMLParser has substantial improvements in the area 
of transforming web pages, with simplified tag creation and editing, and 
verbatim toHtml() method output.

Significant improvements have also been made in character set handling, 
providing support for developers worldwide.

Changes since Version 1.3
-------------------------
Translation
  Character entity encoding and decoding has been revamped, leading to
  higher throughput and less memory churn.
Beans
  The StringBean can now be used as a visitor for parsers external to the
  bean.
Decorators
  The node decorator package has been added to provide support for the
  delegate model.
Lexer
  A new lexer I/O subsystem has been added. It provides accurate line number
  and character position data; tag and attribute names keep their original
  case, and attributes keep their original order. Line numbers reported by
  tags are now zero-based, not one-based. The node count for parsing goes up
  in most cases because whitespace is strictly maintained, i.e. every piece
  of whitespace (e.g. a newline) now counts as a StringNode too. Attributes
  are now stored in a Vector, which means the element 0 Attribute is actually
  the name of the tag, rather than having the $TAGNAME entry in a Hashtable.
  The htmllexer.jar is this new I/O subsystem broken out and made JDK 1.1
  compliant. The htmlparser.jar, which includes everything in htmllexer.jar,
  is not necessarily intended to be used in JDK 1.1 environments. Some
  support for JIS escape sequences has been added.
Tags
  Zero-argument tag constructors have been added. Attribute maintenance
  (add/remove/edit) has been improved. There is no EndTag class any more,
  just a generic tag that returns true from isEndTag(). Form tag handling
  has been improved, picking up <input> and <textarea> tags nested within
  other tags. Applet tag handling has been improved with regard to
  parameters and codebases.
Scanners
  The concept of scanners has been completely reworked. Applications register
  tags, not scanners, to express interest in parsing only some tags. The
  default is now to parse all tags, which is equivalent to the old
  registerDOMTags(), so some extra nesting of tags will need to be handled.
  CompositeTagScanner logic has been improved to try to match unclosed open
  tags when an unexpected end tag is encountered. This change also moved
  recursion off the JDK stack, eliminating most stack overflow errors.
  Also, a CompositeTag's "startTag()" is "this", and the CompositeTagScanner
  just adds children. The ScriptScanner will now decrypt Microsoft Script
  Encoder encrypted script tags. The plaintext is available via
  ScriptTag.getScriptCode().
Filters
  A powerful new filtering capability has been added, which makes extracting
  specific tags very easy.
Applications
  New example applications: Thumbelina and SiteCapturer.
  A mainline has been added to the Translate class to encode/decode stdin to
  stdout.
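
For anyone migrating code that used to look up the $TAGNAME entry, the new
attribute layout described under Lexer can be pictured roughly as follows.
This is only a self-contained sketch: AttributeVectorSketch, Attr, sampleTag,
tagName and attribute are hypothetical names invented for illustration and
are not classes or methods of the HTML Parser itself. It assumes nothing
beyond what the notes above state, namely that element 0 holds the tag name
and that the remaining elements keep their original case and order.

```java
import java.util.Vector;

/** Hypothetical stand-in for a parsed tag; NOT a class from the HTML Parser library. */
public class AttributeVectorSketch {

    /** One name/value pair; the value is null for the tag-name entry. */
    static final class Attr {
        final String name;
        final String value;

        Attr(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Builds the attribute list one might expect for <A HREF="index.html" Target="_top">. */
    static Vector<Attr> sampleTag() {
        Vector<Attr> attrs = new Vector<>();
        attrs.add(new Attr("A", null));            // element 0: the tag name itself
        attrs.add(new Attr("HREF", "index.html")); // original case is preserved
        attrs.add(new Attr("Target", "_top"));     // original order is preserved
        return attrs;
    }

    /** The tag name lives at element 0 instead of under a $TAGNAME key. */
    static String tagName(Vector<Attr> attrs) {
        return attrs.get(0).name;
    }

    /** Case-insensitive lookup that skips element 0, the tag name. */
    static String attribute(Vector<Attr> attrs, String name) {
        for (int i = 1; i < attrs.size(); i++) {
            if (attrs.get(i).name.equalsIgnoreCase(name)) {
                return attrs.get(i).value;
            }
        }
        return null; // no such attribute
    }

    public static void main(String[] args) {
        Vector<Attr> attrs = sampleTag();
        System.out.println(tagName(attrs));
        System.out.println(attribute(attrs, "href"));
    }
}
```

The point of the sketch is that code iterating over attributes should start
at index 1 if it does not want to treat the tag name as an attribute.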

The developers of the HTML Parser hope you enjoy it.
http://sourceforge.net/projects/htmlparser

Please post any requests for enhancements in version 1.5 to
http://sourceforge.net/tracker/?group_id=24399&atid=381402
