
HTML Parser / News: Recent posts

Happy Birthday

Today, the HTML Parser project marks its 10th year of existence.
It has been downloaded more than 395,612 times over its history.
We hope it continues to provide the benefits that developers have enjoyed since it started: an easy-to-use, fast, flexible HTML parser.
Thanks to all those who have contributed to the project - and to all those who have used it.

Posted by Derrick Oswald 2011-04-03

Happy Birthday

The HTMLParser project celebrates 6 years of existence on April 3, 2007. Wish the parser a happy birthday.

In March 2007, the HTMLParser project reached the milestone of 5000 downloads in a month. This continues the trend of approximately 25% compound annual growth over its lifetime.

Posted by Derrick Oswald 2007-03-31

All new HTML Parser 2.0

The very popular HTML Parser project (http://sourceforge.net/projects/htmlparser) on Sourceforge has been updated with a new license, new build environment, new repository and a new web site. To identify this radical change, the version has been revved to 2.0.

In response to requests from the Apache community, the htmlparser license has changed from the GNU Library or Lesser General Public License to the more Apache-friendly Common Public License 1.0 (http://opensource.org/licenses/cpl1.0.txt)....

Posted by Derrick Oswald 2006-09-17

HTML Parser Production Release 1.6 available

Version 1.6 of the most popular HTML parser on Sourceforge is now available after a year of user requested fixes and enhancements and over thirty thousand downloads since version 1.5 was released.
http://sourceforge.net/project/showfiles.php?group_id=24399

HTML Parser is a Java library used to parse HTML in either a linear or nested fashion.

The HTML Parser community would like to thank the many users and developers that have provided constructive feedback, and we hope this production release provides an exemplary product and a positive user experience for the coming year....

Posted by Derrick Oswald 2006-06-10

HTML Parser Integration Release 1.6-20060527

Roll-up release. This is candidate 1 for the final version 1.6 release. If nothing pops up, 1.6 should roll out in a couple of weeks.

New functionality: an XorFilter class, which rounds out the logical filters, was added by Ian.
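
As a rough illustration of what an exclusive-or filter does, here is a minimal sketch built on the standard java.util.function.Predicate. The LogicalFilters class and its xor method are hypothetical names chosen for this example; they are not the library's XorFilter, which works on parser nodes.

```java
import java.util.function.Predicate;

// Hypothetical helper, not part of HTML Parser: shows exclusive-or
// composition of two predicates, here applied to tag names.
class LogicalFilters {
    // Accepts a value when exactly one of the two predicates accepts it.
    static <T> Predicate<T> xor(Predicate<T> left, Predicate<T> right) {
        return value -> left.test(value) ^ right.test(value);
    }

    public static void main(String[] args) {
        Predicate<String> isHeading = name -> name.matches("H[1-6]");
        Predicate<String> endsWithOne = name -> name.endsWith("1");
        Predicate<String> exactlyOne = xor(isHeading, endsWithOne);

        System.out.println(exactlyOne.test("H2")); // only the heading test matches
        System.out.println(exactlyOne.test("H1")); // both tests match
        System.out.println(exactlyOne.test("P"));  // neither test matches
    }
}
```

Together with and, or and not, this gives the full set of basic logical combinators over a pair of conditions.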

Fixes bugs:
#1493884 Lexer returns a TagNode with a 'null' name
#1457371 Script tag consumes too much from document being parsed
#1488951 RemarkNode.toPlainTextString() incorrect behaviour
#1345049 HTMLParser should not terminate a comment with --->
#1467712 Page#getCharset never works
#1461473 Relative links starting with ?

Posted by Derrick Oswald 2006-05-27

HTML Parser accepts donations

The HTML Parser project has altered its open source model slightly, and is now accepting donations.

Over half of the most active projects on Sourceforge accept donations and the HTML Parser project is its most popular Java library for HTML parsing.

Donations will be used for the purchase and maintenance of a PKI code signing certificate.

"We would like to offer 'Java Web Start' examples direct from the web site, and the signing certificate will allow us to provide a better user experience." says Derrick Oswald, project lead.... read more

Posted by Derrick Oswald 2006-04-23

HTML Parser Integration Release 1.6-20060319

Minor update and bugfix release.

Adds NodeTreeWalker, a utility class to traverse a tree of Node objects using either depth-first or breadth-first tree order.
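
The difference between the two tree orders can be sketched with a small stand-alone example. TreeNode and TreeOrders are hypothetical classes written for this illustration; they do not reproduce the NodeTreeWalker API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a parsed node; not the library's Node interface.
class TreeNode {
    final String name;
    final List<TreeNode> children = new ArrayList<>();

    TreeNode(String name, TreeNode... kids) {
        this.name = name;
        for (TreeNode kid : kids) {
            children.add(kid);
        }
    }
}

class TreeOrders {
    // Depth-first (pre-order): visit a node, then each of its subtrees in turn.
    static List<String> depthFirst(TreeNode root) {
        List<String> visited = new ArrayList<>();
        Deque<TreeNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            TreeNode node = stack.pop();
            visited.add(node.name);
            // Push children in reverse so the first child is popped first.
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }
        return visited;
    }

    // Breadth-first: visit every node at one depth before going deeper.
    static List<String> breadthFirst(TreeNode root) {
        List<String> visited = new ArrayList<>();
        Deque<TreeNode> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            TreeNode node = queue.remove();
            visited.add(node.name);
            queue.addAll(node.children);
        }
        return visited;
    }
}
```

For a tree of html containing head (with title) and body (with p), the depth-first order reaches title before body, while the breadth-first order visits head and body before either of their children.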

Fixes bugs:
#1445795 return as TextNode when processing jsp
#1445309 XML processing instructions are returned as text
#1376851 Null-valued cookies cause exception
#1375230 some javascript breaks stringbean

Posted by Derrick Oswald 2006-03-20

HTML Parser ported to .NET

A company called NetOMatix has created a port of the HTML Parser library:

http://www.netomatix.com/Products/DocumentManagement/HTMLParserNet.aspx

A cursory glance indicates the API is much the same, allowing nearly direct ports of existing code to the .NET platform.

Posted by Derrick Oswald 2006-02-20

HTML Parser Integration Release 1.6-20051112

Support has been added for commonly requested composite tags, P, H1-H6, and definition list tags (DL, DT, DD). The node interface has been augmented with get first/last child and get previous/next sibling methods to ease traversing the HTML document.
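
To show why child and sibling accessors ease traversal, here is a minimal sketch of a node that exposes them. LinkedNode and all of its method names are assumptions made for this example; consult the library's Node interface for the actual signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type for illustration; the method names are assumptions
// and are not copied from the library's Node interface.
class LinkedNode {
    final String name;
    LinkedNode parent;
    final List<LinkedNode> children = new ArrayList<>();

    LinkedNode(String name) {
        this.name = name;
    }

    // Appends a child and returns this node so calls can be chained.
    LinkedNode add(LinkedNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    LinkedNode firstChild() {
        return children.isEmpty() ? null : children.get(0);
    }

    LinkedNode lastChild() {
        return children.isEmpty() ? null : children.get(children.size() - 1);
    }

    LinkedNode nextSibling() {
        return siblingAt(+1);
    }

    LinkedNode previousSibling() {
        return siblingAt(-1);
    }

    // Looks up the sibling at the given offset in the parent's child list.
    private LinkedNode siblingAt(int offset) {
        if (parent == null) {
            return null;
        }
        int index = parent.children.indexOf(this) + offset;
        boolean inRange = index >= 0 && index < parent.children.size();
        return inRange ? parent.children.get(index) : null;
    }

    // Walks the children left to right using only the child and sibling links.
    static String childNames(LinkedNode node) {
        StringBuilder names = new StringBuilder();
        for (LinkedNode c = node.firstChild(); c != null; c = c.nextSibling()) {
            if (names.length() > 0) {
                names.append(' ');
            }
            names.append(c.name);
        }
        return names.toString();
    }
}
```

With accessors like these, client code can step through a definition list from DT to DD without touching the parent's child list or tracking an index.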

Fixes bugs:
#1344687 A bug when set cookies
#1334408 Exception occurs based on string length
#1322686 when illegal charset specified

Posted by Derrick Oswald 2005-11-12

HTML Parser Integration Release 1.6-20050925

Minor update that applies a patch submitted by Keiron McCammon to fix bug #1227213 "Particular SCRIPT tags close too late", adds changes to FilterBean suggested by Martin Hudson, and adds a remove(Node) method to the NodeList class as suggested by Matthew Buckett.

Posted by Derrick Oswald 2005-09-25

HTML Parser Production Release 1.5 available

Version 1.5 of the most popular HTML parser on sourceforge is now available. Some significant new APIs have been added since 1.4 was released, such as ConnectionManager, SAX parsing, new filters and interfaces. But what's really cool is the new FilterBuilder that allows you to interactively generate a Java class that extracts information from a web page. Three months of downloads without a reported bug indicate this is one of the most stable releases yet....

Posted by Derrick Oswald 2005-06-14

Eclipse Wikipedia Editor

Axel Kramer writes:

Hi

Like to say thanks for your great Htmlparser tool.
I'm using your library in the Eclipse Wikipedia Editor [1]
I created a HTML to Wikipedia text converter [2] from your
StringExtractor example.

[1]http://www.plog4u.org/index.php/Using_Eclipse_Wikipedia_Editor
[2]http://www.plog4u.org/index.php/Using_Eclipse_Wikipedia_Editor:Working_with_the_Editor#The_Editor_Context_Menu...

Posted by Derrick Oswald 2005-04-02

Building Modular Applications with Seppia

HTML Parser was used in an example for an article on Seppia, which is glue code for Java components:
http://www.onjava.com/pub/a/onjava/2005/03/16/seppia.html

Posted by Derrick Oswald 2005-03-18

HTML Parser Integration Release 1.5-20050313 available

This is a bug fix release that should be considered the first candidate for a version 1.5 final. If no radical bugs are found in the next couple of weeks, we'll ship it and move on to version 1.6.

This release addresses a partial parse issue for pages that contain characters that cannot be represented in the page encoding. The nio.charset.CharsetDecoder replaces these characters with zero (by default), which corresponded to the end-of-file indicator of the Page class. Now (char)Source.EOF (-1) is used when the end of stream is encountered....
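
The kind of collision described here, where a character value doubles as the end-of-file marker, can be sketched in a few lines. SentinelDemo and its methods are hypothetical and do not reproduce the Page or Source classes; the sketch only shows why a zero sentinel truncates input that contains a zero character.

```java
// Hypothetical sketch of an end-of-file sentinel collision; these names are
// not the library's Page or Source classes.
class SentinelDemo {
    static final int EOF = -1;

    // Returns the character at index, or the sentinel once past the end.
    static char charAt(String text, int index, char sentinel) {
        return index < text.length() ? text.charAt(index) : sentinel;
    }

    // Counts the characters read before the sentinel value is seen.
    static int countUntil(String text, char sentinel) {
        int count = 0;
        while (charAt(text, count, sentinel) != sentinel) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The zero character stands in for a character the decoder replaced.
        String page = "ab\0cd";
        System.out.println(countUntil(page, (char) 0));   // stops at the zero
        System.out.println(countUntil(page, (char) EOF)); // reads to the real end
    }
}
```

Note that (char) -1 is the value U+FFFF, so a sentinel of this kind still assumes that value never occurs in real page content.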

Posted by Derrick Oswald 2005-03-13

HTML Parser Integration Release 1.5-20050306 available

This is a bug fix release that specifically addresses a long standing script and style parsing problem. The fix adds a parseCDATA method to the Lexer class that adheres to appendix B.3.2, "Specifying non-HTML data", of the HTML specification regarding recognizing the ETAGO (</) at the end of script and style CDATA (see http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data).
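
A simplified way to picture ETAGO recognition is a scan for the first "</" that is followed by a letter. The CdataScanner class below is a hypothetical sketch under that assumption; it is not the Lexer.parseCDATA implementation and ignores the other cases a real lexer has to handle.

```java
// Hypothetical helper; a simplification, not the Lexer.parseCDATA method.
class CdataScanner {
    // Returns the index of the first "</" that is followed by an ASCII letter,
    // or -1 if the text contains no such end-tag open delimiter.
    static int findEtago(String text, int start) {
        for (int i = start; i + 2 < text.length(); i++) {
            if (text.charAt(i) == '<'
                    && text.charAt(i + 1) == '/'
                    && isNameStart(text.charAt(i + 2))) {
                return i;
            }
        }
        return -1;
    }

    static boolean isNameStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    public static void main(String[] args) {
        String html = "<script>if (a </ b) document.write(\"<b>x<\\/b>\");</script>";
        int begin = html.indexOf('>') + 1;
        int end = findEtago(html, begin);
        // Prints the script body without the enclosing tags.
        System.out.println(html.substring(begin, end));
    }
}
```

In the usage example, neither "</ b" (followed by a space) nor the escaped "<\/b>" ends the script content; only the closing script tag does.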

Other bugs addressed include wrapping InputStreams with the org.htmlparser.lexer.Stream class to get around mark()/reset() issues, fixing a JDK 1.4 compile issue, providing a better error message while a Java bug regarding Byte Order Marks is pending, and implementing a change suggested by David Andersen to handle null ContentType.

Posted by Derrick Oswald 2005-03-07

HTML Parser Integration Release 1.5-20050213 available

This long overdue integration build adds two main enhancements: ConnectionManager and FilterBuilder.

The ConnectionManager is part of the org.htmlparser.http package, which handles proxies, passwords and cookies. This addition fulfills three Requests for Enhancement:
1017249 HTML Client Doesn't Support Cookies but will follow redirect
1010586 Add support for password protected URL
1000739 Add support for proxy scenario...

Posted by Derrick Oswald 2005-02-14

HTML Parser Integration Release 1.5-20040728 available

This semi-regular integration build provides a first look at a SAX parser implementation. It's currently just wrapping the DOM parser. It uses interfaces from the SAX project (http://sourceforge.net/projects/sax/). The 'code to the interface' refactoring is complete, along with some other housekeeping. Added a real StringSource that reads directly from a String rather than creating an intermediate byte array. This avoids character encoding losses....
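
The encoding loss that an intermediate byte array can introduce is easy to reproduce with the standard library. The StringReading class below is a hypothetical sketch, not the library's StringSource; it compares a round trip through bytes with reading the characters directly.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; these helpers are not the library's StringSource.
class StringReading {
    // Round-trips the text through a byte array in the given charset.
    // Characters the charset cannot represent are replaced on the way in.
    static String viaBytes(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    // Reads the characters straight from the String, with no encoding step.
    static String direct(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new StringReader(text)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The euro sign has no representation in ISO-8859-1.
        String html = "<p>price: \u20AC5</p>";
        System.out.println(html.equals(viaBytes(html, StandardCharsets.ISO_8859_1)));
        System.out.println(html.equals(direct(html)));
    }
}
```

Reading directly from the String keeps every character intact regardless of which charset the page would otherwise be assumed to use.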

Posted by Derrick Oswald 2004-07-29

HTML Parser Production Release 1.42 available

This patch release fixes three bugs.

One bug involved the decoding of URLs with the Translate.decode() method, which was incorrect. This had already been addressed in the integration builds (HEAD).

Another bug involved the SiteCapturer program failing in the face of an EncodingChangeException. This exception is raised when the <META> tag indicates a different character set than the one assumed at the start of parsing, and retracing the stream yields different characters than those the client has already consumed. The SiteCapturer now handles this exception by resetting the parser and trying again....
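
The reset-and-retry pattern can be sketched without the library. Everything below is hypothetical: RetryOnEncodingChange, its nested EncodingChanged exception, and the naive search for "charset=" are assumptions for illustration and do not mirror the parser's or SiteCapturer's actual code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of "reset and retry"; none of these names are the
// library's classes, and the charset detection is deliberately naive.
class RetryOnEncodingChange {
    static class EncodingChanged extends RuntimeException {
        final Charset declared;

        EncodingChanged(Charset declared) {
            super("document declares " + declared);
            this.declared = declared;
        }
    }

    // Decodes the page with the assumed charset. Fails if the page declares
    // another charset and decoding with it would yield different text.
    static String parse(byte[] page, Charset assumed) {
        String text = new String(page, assumed);
        int at = text.indexOf("charset=");
        if (at < 0) {
            return text;
        }
        int from = at + "charset=".length();
        int to = text.indexOf('"', from);
        if (to < 0) {
            return text;
        }
        // Charset.forName throws for unknown names; a real parser would cope.
        Charset declared = Charset.forName(text.substring(from, to));
        boolean differs = !new String(page, declared).equals(text);
        if (!declared.equals(assumed) && differs) {
            throw new EncodingChanged(declared);
        }
        return text;
    }

    // Starts over once from the beginning, using the declared charset.
    static String parseWithRetry(byte[] page, Charset assumed) {
        try {
            return parse(page, assumed);
        } catch (EncodingChanged e) {
            return parse(page, e.declared);
        }
    }

    public static void main(String[] args) {
        String html = "<meta content=\"text/html; charset=UTF-8\">caf\u00e9";
        byte[] page = html.getBytes(StandardCharsets.UTF_8);
        String parsed = parseWithRetry(page, StandardCharsets.ISO_8859_1);
        System.out.println(parsed.equals(html));
    }
}
```

A page that contains only characters decoded identically under both charsets never triggers the exception in this sketch, which matches the condition that the retraced characters must actually differ.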

Posted by Derrick Oswald 2004-07-28

HTML Parser Integration Release 1.5-20040613 available

This semi-regular integration build provides refactored classes to reduce component size and allow 'code to the interface' programming; almost. Additional filters for cascading style sheet selectors (CssSelectorNodeFilter) and regular expressions (RegExFilter) have been added. Besides the bug fix for SCRIPT tags with apostrophes in comments, three enhancement requests have been implemented (of note, the parser now accepts gzip/deflate content encodings). The logo has also been updated....
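
Handling gzip and deflate content encodings usually comes down to wrapping the response stream before it is read. The ContentDecoding class below is a hypothetical sketch using only java.util.zip; it is not the parser's own connection code, and it assumes that "deflate" bodies are zlib-wrapped.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical sketch; not the library's connection handling.
class ContentDecoding {
    // Wraps the raw body according to the Content-Encoding header value.
    static InputStream decode(InputStream body, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding)) {
            return new GZIPInputStream(body);
        }
        if ("deflate".equalsIgnoreCase(contentEncoding)) {
            return new InflaterInputStream(body); // assumes zlib-wrapped data
        }
        return body; // no encoding, or one this sketch does not handle
    }

    // Decodes the whole body and returns it as UTF-8 text.
    static String decodeToString(byte[] body, String contentEncoding) {
        try (InputStream in = decode(new ByteArrayInputStream(body), contentEncoding)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: compresses text the way a gzip-encoding server would.
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

A caller would pass the value of the response's Content-Encoding header, so an unencoded response falls through unchanged.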

Posted by Derrick Oswald 2004-06-15

HTML Parser Production Release 1.41 available

The most popular Java HTML parser on SourceForge has issued a maintenance release. Version 1.41 fixes one bug in scanning <SCRIPT> tags, where a quote character within a comment would cause incorrect parsing. The small number of bugs reported after thousands of downloads speaks to the stability and ease of use of the htmlparser package. Get yours today!

Posted by Derrick Oswald 2004-05-22

HTML Parser Production Release 1.4 available

Version 1.4 of the most popular HTML parser on sourceforge is now available. Ten months of development have culminated in a very robust, extensible product that has been tested, and is already being used, by thousands of developers. HTML Parser is a library, written in Java, which allows you to parse HTML (HTML 4.0 supported). It has been used by people on live projects. Developers appreciate how easy it is to use. The architecture is flexible, allowing you to extend it easily....

Posted by Derrick Oswald 2004-03-16