[Htmlparser-announce] HTML Parser 1.4 beta is now available

The most popular HTML Parser on SourceForge has released its version 1.4
beta as promised, an appropriate 9 months after the version 1.3 final
release.
It is available as Integration Build 1.4-20040216; see:
http://sourceforge.net/project/showfiles.php?group_id=24399&package_id=47712 

No bugs were reported from the last 1000 downloads (although it's
hard to tell because SourceForge has stopped reporting download and
pageview stats), and there are only two outstanding tasks left to do.

If bug churn remains as quiescent as it is now, the final 1.4 release
should be in mid-March.

Here are the release notes:

*Notes:*
Integration build. Beta candidate for 1.4 final release.
Failing Unit Tests:
  none
Open Bugs:
  874000 LinkScanner and FormScanner cannot be used together
Pending Bugs:
  none

Changes since Version 1.3
-------------------------
Translation
    Character entity encoding and decoding has been revamped, leading to
    higher throughput and less memory churn.
Beans
    The StringBean can now be used as a visitor for parsers external to the bean.
Decorators
    The node decorator package has been added to provide support for the
    delegate model.
Lexer
    A new lexer i/o subsystem has been added. It provides accurate line number
    and character position data; tag and attribute names keep their original
    case, and attributes keep their original order. Line numbers reported by
    tags are now zero based, not one based. The node count for parsing goes up
    in most cases because whitespace is strictly maintained, i.e. every piece
    of whitespace (e.g. a newline) now counts as a StringNode too. Attributes
    are now stored in a Vector, which means the element 0 Attribute is actually
    the name of the tag, rather than a $TAGNAME entry in a Hashtable. The
    htmllexer.jar is this new i/o subsystem broken out and made JDK 1.1
    compliant. The htmlparser.jar, which includes everything in htmllexer.jar,
    is not necessarily intended to be used in JDK 1.1 environments. Some
    support for JIS escape sequences has been added.
Tags
    Zero arg tag constructors have been added. Attribute maintenance
    (add/remove/edit) has been improved. There is no EndTag class any more,
    just a generic tag that responds true to isEndTag(). Form tag handling has
    been improved to pick up <input> and <textarea> tags nested within other
    tags. Applet tag handling has been improved regarding parameters and
    codebases.
Scanners
    The concept of scanners has been completely reworked. Applications register
    tags, not scanners, to express interest in parsing only some tags. The
    default is now to parse all tags, which is equivalent to the old
    registerDOMTags(), so some extra nesting of tags will need to be handled.
    CompositeTagScanner logic has been improved to try to match unclosed open
    tags when an unexpected end tag is encountered. This change also moved
    recursion off the JDK stack, eliminating most StackOverflow exceptions.
    Also, a CompositeTag's "startTag()" is "this", and the CompositeTagScanner
    just adds children.
Filters
    A new powerful filtering capability has been added, which makes extracting
    specific tags very easy.
Applications
    New example applications Thumbelina and SiteCapturer.
    A mainline has been added to the Translate class to encode/decode stdin to
    stdout.
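
For readers adapting code to the Lexer change above, here is a minimal sketch
of what "element 0 is the tag name" means in practice. It does not use the
library: the Attr class and the tagName(), attributeValue() and sample()
helpers are hypothetical stand-ins written for a current JDK, and the real
Attribute class may well look different.

```java
import java.util.Vector;

// Hypothetical stand-in for the attribute storage described in the notes;
// none of these names come from the library itself.
public class AttributeVectorSketch {
    // Minimal name/value pair (assumption: the real Attribute class differs).
    static final class Attr {
        final String name;
        final String value; // null for the tag-name entry
        Attr(String name, String value) { this.name = name; this.value = value; }
    }

    // Element 0 holds the tag name, so the name is read from there
    // instead of from a $TAGNAME key in a Hashtable.
    static String tagName(Vector<Attr> attributes) {
        return attributes.get(0).name;
    }

    // Linear, case-insensitive lookup that skips element 0.
    static String attributeValue(Vector<Attr> attributes, String name) {
        for (int i = 1; i < attributes.size(); i++) {
            Attr a = attributes.get(i);
            if (a.name.equalsIgnoreCase(name)) return a.value;
        }
        return null;
    }

    // Entries for <IMG src="a.png" Alt="logo">, in source order and original case.
    static Vector<Attr> sample() {
        Vector<Attr> v = new Vector<Attr>();
        v.add(new Attr("IMG", null));
        v.add(new Attr("src", "a.png"));
        v.add(new Attr("Alt", "logo"));
        return v;
    }

    public static void main(String[] args) {
        Vector<Attr> attrs = sample();
        System.out.println(tagName(attrs));               // IMG
        System.out.println(attributeValue(attrs, "ALT")); // logo
    }
}
```

Code that used to walk all attributes should start at index 1 if it does not
want to see the tag name as an attribute.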

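The Scanners note above mentions matching unclosed open tags on an unexpected
end tag and moving recursion off the JDK stack. The sketch below shows one way
that general idea can work, using an explicit stack of open tag names. It is
not the CompositeTagScanner code; the class name, the closeOrder() helper and
the bare-name token format are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative only: hypothetical names, not the library's scanner logic.
public class UnclosedTagSketch {
    // Tokens are bare tag names; an end tag is written with a leading '/'.
    // Returns the order in which tags get closed; '*' marks an implicit close.
    static List<String> closeOrder(String[] tokens) {
        Deque<String> open = new ArrayDeque<String>(); // explicit stack, no recursion
        List<String> closed = new ArrayList<String>();
        for (String token : tokens) {
            if (!token.startsWith("/")) {
                open.push(token);            // start tag: remember it as open
                continue;
            }
            String name = token.substring(1);
            if (!open.contains(name)) {
                continue;                    // stray end tag with no open match: ignore
            }
            // Implicitly close every unclosed tag sitting above the match.
            while (!open.peek().equals(name)) {
                closed.add(open.pop() + "*");
            }
            closed.add(open.pop());          // the tag the end tag actually names
        }
        // Anything still open at end of input is closed implicitly as well.
        while (!open.isEmpty()) {
            closed.add(open.pop() + "*");
        }
        return closed;
    }

    public static void main(String[] args) {
        // <select><option><option></select>
        String[] tokens = { "select", "option", "option", "/select" };
        System.out.println(closeOrder(tokens)); // [option*, option*, select]
    }
}
```

Because the open tags live in a heap-allocated stack, the depth of unclosed
tags is limited by memory rather than by the thread's call stack, which is the
kind of input behind the select/option overflow in bug 745566 below.
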
Bug Fixes
---------
891058 Bug in lexer
865279 Documentation
851882 zero length alt tag causes bug in ImageScanner
839264 toHtml() parse error in Javascripts with "form" keyword
833592 DOCTYPE element is not parsed correctly
832530 empty attribute causes parser to fail
826764 ParserException occurs only when using setInputHTML() instea
825820 Words conjoined
825645 <input> not getting parsed inside table
813838 links not parsed correctly
805598 attribute src in tag img sometimes not correctly parsed
801118 two " characters at the end of an attribute value problem
798554 Applet Tag does not update codebase data
798553 setInputHtml does not set text
798552 Sample for node iterator incorrect
789439 Japanese page causes OutOfMemory Exception
788746 parser crashes on comments like <!-- foobar --!>
786869 LinkExtractor Sample not working
784767 irc://server/channel urls are HTTPLike?
778781 SRC-attribute suppression in IMG-tags
772700 Jsp Tags are not parsed correctly when in quoted attributes
765413 typo
761798 Error reading next element.
757337 Standalone attributes should remain standalone
755929 Empty string attr. value causes attr parsing to be stopped
753012 IMG SRC not parsed v1.3 & v1.4
753003 <IMG> within <A> missed when followed by <MAP>
750117 StackOverFlow while Node-Iteration
749295 Problem Parsing Table
745566 StackOverflowError on select with too many unclosed options
744610 getLink() Erroneous for Relative Links from Files on Windows