HTML Parser Production Release 1.6 available

Version 1.6 of the most popular HTML parser on SourceForge is now available, after a year of user-requested fixes and enhancements and over thirty thousand downloads since version 1.5 was released.

HTML Parser is a Java library used to parse HTML in either a linear or nested fashion.

The HTML Parser community would like to thank the many users and developers who have provided constructive feedback. We hope this production release provides an exemplary product and a positive user experience for the coming year.

This is likely to be the last release in the 1.x series. Moving forward, the HTML Parser project will be using Subversion, changing its licensing model to be more Apache-friendly, moving to a Maven build process, and refactoring to take advantage of recent Java enhancements.

Changes since Version 1.5

New Functionality
Support has been added for the commonly requested composite tags P and H1-H6.
Definition list tags (dl, dt, dd) are also now included in the standard
set of tags recognized by the parser.
The FilterBean now has a 'recursive' property to control descent through
children when applying filters.
The NodeList class is a little more standard now with a remove(node) method.
The Node interface has been augmented with get first/last child and
get previous/next sibling methods to ease traversing the HTML document.
The TextNode class has an added isWhiteSpace method that returns true
when it contains no printable characters.
NodeTreeWalker, a utility class to traverse a tree of Node objects in
either depth-first or breadth-first order, has been added.
An XorFilter has been added to round out our NOT, AND and OR filters,
along with new constructors to OrFilter/AndFilter that take an array of
filters.
Deflate encoding is now handled correctly and there is now an option to
have the ConnectionManager follow redirections manually so that cookie
processing can occur between redirections.
There is a new override for toHtml() that avoids issuing generated end tags.
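
To illustrate the two traversal orders mentioned for NodeTreeWalker above, here is a minimal sketch in plain Java. It does not use the library: SketchNode and TraversalSketch are hypothetical names invented for this illustration, and the real NodeTreeWalker API may look quite different.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a parsed node; NOT the library's Node interface.
class SketchNode {
    final String name;
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String name, SketchNode... kids) {
        this.name = name;
        for (SketchNode kid : kids) {
            children.add(kid);
        }
    }
}

// Hypothetical helper showing the two orders; NOT the library's NodeTreeWalker.
class TraversalSketch {
    // Depth-first (pre-order): visit a node, then each child's whole subtree.
    static List<String> depthFirst(SketchNode root) {
        List<String> order = new ArrayList<>();
        Deque<SketchNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            SketchNode node = stack.pop();
            order.add(node.name);
            // Push children in reverse so the first child is visited first.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return order;
    }

    // Breadth-first: visit every node at one depth before going deeper.
    static List<String> breadthFirst(SketchNode root) {
        List<String> order = new ArrayList<>();
        Deque<SketchNode> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            SketchNode node = queue.remove();
            order.add(node.name);
            queue.addAll(node.children);
        }
        return order;
    }

    // Builds html -> (head -> title), (body -> p, h1).
    static SketchNode sampleTree() {
        return new SketchNode("html",
            new SketchNode("head", new SketchNode("title")),
            new SketchNode("body", new SketchNode("p"), new SketchNode("h1")));
    }
}
```

On the sample tree, depth-first order finishes the head subtree before reaching body, while breadth-first order visits head and body before any of their children.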

Some refactoring to allow the htmllexer jar file to be compiled by gcj.
Moved non-JUnit test code to Request For Enhancement (RFE) as attachments,
so all the code in the tests package should now compile.
Removed all deprecated classes and methods.

Bug Fixes
#1496863 StringBean collapse() adds extra whitespace
#1488951 RemarkNode.toPlainTextString() incorrect behaviour
#1467712 Page#getCharset never works
#1461473 Relative links starting with ?
#1457371 Script tag consumes too much from document being parsed
#1445795 return as TextNode when processing jsp
#1445309 XML processing instructions are returned as text
#1376851 Null-valued cookies cause exception
#1375230 some javascript breaks stringbean
#1345049 HTMLParser should not terminate a comment with --->
#1344687 A bug when set cookies
#1334408 Exception occurs based on string length
#1322686 when illegal charset specified
#1227213 Particular SCRIPT tags close too late

#1436082 Follow redirections with cookie processing
#1338534 Support get first/last child, previous/next sibling

Requests For Enhancements
#1394144 handle deflate encoding
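
For readers unfamiliar with deflate content encoding, the following is a rough sketch of what decoding such a response body involves, using only the JDK's java.util.zip classes. It is not the parser's own code: DeflateSketch and its methods are hypothetical names, and the sketch assumes the zlib-wrapped form of deflate.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper for illustration; NOT part of the HTML Parser library.
class DeflateSketch {
    // Compress bytes in zlib-wrapped deflate form, as a server might before sending.
    static byte[] deflate(byte[] plain) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(plain);
        }
        return buffer.toByteArray();
    }

    // Decompress a deflate-encoded body back into the original bytes.
    static byte[] inflate(byte[] compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        }
        return buffer.toByteArray();
    }

    // Compress then decompress a page, returning the recovered text.
    static String roundTrip(String html) {
        try {
            byte[] encoded = deflate(html.getBytes(StandardCharsets.UTF_8));
            return new String(inflate(encoded), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A client would wrap the connection's input stream in the same way when the response declares deflate encoding, then hand the decoded stream to the parser.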

Posted by Derrick Oswald 2006-06-10
