Version 1.4 of the most popular HTML parser on SourceForge is now available. Ten months of development have culminated in a very robust, extensible product that has been tested, and is already being used on live projects, by thousands of developers. HTML Parser is a library, written in Java, that allows you to parse HTML (HTML 4.0 is supported). Developers appreciate how easy it is to use, and its flexible architecture makes it easy to extend.
While prior versions concentrated on data extraction from web pages, Version 1.4 of HTML Parser brings substantial improvements in transforming web pages, with simplified tag creation and editing, and verbatim toHtml() method output.
Significant improvements have also been made in character set handling, providing support for developers worldwide.
Changes since Version 1.3
Character entity encoding and decoding have been revamped, leading to
higher throughput and less memory churn.
The StringBean can now be used as a visitor for parsers external to the bean.
The node decorator package has been added to provide support for the
A new lexer I/O subsystem has been added. It provides accurate line number
and character position data; tag and attribute names keep their original
case, and attributes keep their original order. Line numbers reported by
tags are now zero-based, not one-based. The node count for parsing goes up
in most cases because whitespace is strictly maintained, i.e. every piece of
whitespace (e.g. a newline) now counts as a StringNode too. Attributes are
now stored in a Vector, which means the element 0 Attribute is actually the
name of the tag, rather than having the $TAGNAME entry in a Hashtable. The
htmllexer.jar is this new I/O subsystem broken out and made JDK 1.1
compliant; the htmlparser.jar, which includes everything in htmllexer.jar,
is not necessarily intended to be used in JDK 1.1 environments. Some support
for JIS escape sequences has been added.
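The Vector-based attribute layout described above can be pictured with a small, self-contained sketch. It does not use the library's own classes: the Attr type and the helper methods below are hypothetical stand-ins, invented only to show what "element 0 is the tag name" and "original case and order are kept" mean in practice.

```java
import java.util.Vector;

class AttributeVectorSketch {
    /** Hypothetical stand-in for a parsed attribute; not the library's class. */
    static final class Attr {
        final String name;   // stored exactly as written in the page
        final String value;  // null for the tag-name entry

        Attr(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** Element 0 of the vector is the tag name itself. */
    static String tagName(Vector<Attr> attributes) {
        return attributes.elementAt(0).name;
    }

    /** Case-insensitive lookup that skips element 0 (the tag name). */
    static String valueOf(Vector<Attr> attributes, String name) {
        for (int i = 1; i < attributes.size(); i++) {
            Attr a = attributes.elementAt(i);
            if (a.name.equalsIgnoreCase(name)) {
                return a.value;
            }
        }
        return null;
    }

    /** Builds the assumed layout for <IMG src="a.png" Alt="logo">. */
    static Vector<Attr> sample() {
        Vector<Attr> attributes = new Vector<>();
        attributes.addElement(new Attr("IMG", null));      // element 0: tag name
        attributes.addElement(new Attr("src", "a.png"));   // source order kept
        attributes.addElement(new Attr("Alt", "logo"));    // source case kept
        return attributes;
    }

    public static void main(String[] args) {
        Vector<Attr> attrs = sample();
        System.out.println(tagName(attrs));           // IMG
        System.out.println(valueOf(attrs, "alt"));    // logo
        System.out.println(attrs.elementAt(2).name);  // Alt
    }
}
```

Code that previously looked up the tag name under a $TAGNAME key would, under this layout, read element 0 instead and start any attribute scan at index 1.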
Zero-argument tag constructors have been added. Attribute maintenance
(add/remove/edit) has been improved. There is no EndTag class any more, just
a generic tag that responds true to isEndTag(). Form tag handling has been
improved to pick up <input> and <textarea> tags nested within other tags.
Applet tag handling has been improved with regard to parameters and codebases.
The concept of scanners has been completely reworked. Applications register
tags, not scanners, to express interest in parsing only some tags. The default
is now to parse all tags, which is equivalent to the old registerDOMTags(),
so some extra nesting of tags will need to be handled. CompositeTagScanner
logic has been improved to try to match unclosed open tags when an
unexpected end tag is encountered. This change also moved recursion off the
JDK stack, eliminating most StackOverflowError failures. Also, a CompositeTag's
"startTag()" is "this", and the CompositeTagScanner just adds children.
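As a rough illustration of the idea behind that change, the sketch below tracks open tags on an explicit stack instead of recursing, and closes any unclosed tags when an end tag for an outer element arrives. It is not the CompositeTagScanner code: the class, the method name, and the simplified input format (bare tag names, with a leading '/' marking an end tag) are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class OpenTagStackSketch {
    /**
     * Takes tag names in document order; a leading '/' marks an end tag.
     * Returns the order in which elements are closed. Names closed only
     * because an enclosing end tag (or end of input) arrived carry a "*".
     */
    static List<String> closeOrder(String[] tags) {
        Deque<String> open = new ArrayDeque<>();  // explicit stack, no recursion
        List<String> closed = new ArrayList<>();
        for (String tag : tags) {
            if (!tag.startsWith("/")) {
                open.push(tag);
                continue;
            }
            String name = tag.substring(1);
            if (!open.contains(name)) {
                continue;  // stray end tag with no open partner: ignore it
            }
            // Pop unclosed tags until the matching open tag is on top.
            while (!open.peek().equals(name)) {
                closed.add(open.pop() + "*");
            }
            closed.add(open.pop());
        }
        // Anything still open at end of input is closed implicitly.
        while (!open.isEmpty()) {
            closed.add(open.pop() + "*");
        }
        return closed;
    }

    public static void main(String[] args) {
        String[] tags = {"div", "p", "b", "/div", "/span", "ul"};
        System.out.println(closeOrder(tags));  // [b*, p*, div, ul*]
    }
}
```

Because nesting depth is held in a heap-allocated deque, deeply nested or badly unbalanced input grows the deque rather than the call stack, which is the general reason an iterative approach avoids stack overflows.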
The ScriptScanner will now decrypt Microsoft Script Encoder encrypted script
tags. The plaintext is available via ScriptTag.getScriptCode().
A powerful new filtering capability has been added, which makes extracting
specific tags very easy.
New example applications, Thumbelina and SiteCapturer, have been added.
A mainline has been added to the Translate class to encode/decode stdin to
The developers of the HTML Parser hope you enjoy it.