HTML Parser Integration Release 1.5-20040728 available

This semi-regular integration build provides a first look at a SAX parser implementation. For now it simply wraps the DOM parser, exposing it through the interfaces from the SAX project (http://sourceforge.net/projects/sax/). The 'code to the interface' refactoring is complete, along with some other housekeeping. This build also adds a real StringSource that reads directly from a String rather than creating an intermediate byte array, which avoids character encoding losses.
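
To illustrate why reading directly from a String matters, here is a minimal sketch using only the standard Java library. It is not the library's StringSource class; the class and method names (StringSourceSketch, viaByteArray, viaString) are hypothetical, and the choice of ISO-8859-1 as the intermediate charset is an assumption made to show the loss.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not the library's StringSource.
class StringSourceSketch {

    /** Drains a Reader into a String. */
    static String readAll(Reader reader) {
        StringBuilder out = new StringBuilder();
        char[] buffer = new char[1024];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** Indirect path: String -> bytes in some charset -> decoded again. */
    static String viaByteArray(String html, Charset charset) {
        // Characters the charset cannot represent are replaced here.
        byte[] bytes = html.getBytes(charset);
        return readAll(new InputStreamReader(new ByteArrayInputStream(bytes), charset));
    }

    /** Direct path: characters are read straight from the String. */
    static String viaString(String html) {
        return readAll(new StringReader(html));
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes to keep the source ASCII.
        String html = "<p>\u65E5\u672C\u8A9E</p>";
        System.out.println(viaString(html).equals(html));                    // true
        System.out.println(viaByteArray(html, StandardCharsets.ISO_8859_1)); // <p>???</p>
    }
}
```

The loss depends on the intermediate charset: a round trip through UTF-8 would be lossless, but a character source backed by the String itself never has to pick a charset at all.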

Changes since Version 1.4
-------------------------
New APIs
    Implemented a rudimentary SAX parser. It currently exposes the DOM parser via the SAX project interfaces.
Configuration Management
    Removed the need for the Translate class to be packaged with htmllexer.jar, which results in a lighter-weight component.
    Updated the logo and included the LGPL license.
    Fixed the Windows batch files.
Refactoring
    Removed the need for LinkProcessor by moving its functionality to the Page class.
    Added Tag, Text and Remark interfaces and moved the concrete node implementations to the nodes package, removing the lexer.nodes package.
    Most internals now use the Tag interface.
    Removed the org.htmlparser.tags.Tag class and moved the remaining (minor) functionality to the TagNode class, so tags now inherit directly from TagNode or CompositeTag.
    ** NOTE: If you have subclassed org.htmlparser.tags.Tag, subclass org.htmlparser.nodes.TagNode instead. **
    Removed the deprecated methods getTagBegin/getTagEnd and deleted unused classes: PeekingIterator and its implementation.
    Added ObjectTag (like an applet tag).
    Added a real StringSource that reads directly from a String rather than creating a byte array. This avoids character encoding losses.
Filters
    Added CssSelectorNodeFilter and RegExFilter.
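
The "New APIs" entry above describes a SAX layer that wraps a tree-building parser. The general idea can be sketched with the JDK alone: walk an already-built tree and report it through org.xml.sax.ContentHandler callbacks. This is only an illustration under stated assumptions. It uses the JDK's W3C DOM as a stand-in for the parser's own node tree, and the DomToSaxSketch class with its emit, trace and sampleDocument methods is hypothetical, not part of the library.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; the W3C DOM stands in for the parser's node tree.
class DomToSaxSketch {

    /** Walks a subtree depth-first and reports it as SAX events. */
    static void emit(Node node, ContentHandler handler) throws SAXException {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: {
                handler.startDocument();
                emitChildren(node, handler);
                handler.endDocument();
                break;
            }
            case Node.ELEMENT_NODE: {
                String name = node.getNodeName();
                // Attributes are left empty to keep the sketch short.
                handler.startElement("", name, name, new AttributesImpl());
                emitChildren(node, handler);
                handler.endElement("", name, name);
                break;
            }
            case Node.TEXT_NODE: {
                char[] text = node.getNodeValue().toCharArray();
                handler.characters(text, 0, text.length);
                break;
            }
            default:
                break; // comments and other node types are ignored here
        }
    }

    private static void emitChildren(Node parent, ContentHandler handler) throws SAXException {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            emit(c, handler);
        }
    }

    /** Records element and text events as strings, for demonstration. */
    static List<String> trace(Node root) {
        List<String> events = new ArrayList<>();
        DefaultHandler recorder = new DefaultHandler() {
            @Override public void startElement(String uri, String local, String qName, Attributes atts) {
                events.add("start:" + qName);
            }
            @Override public void endElement(String uri, String local, String qName) {
                events.add("end:" + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                events.add("text:" + new String(ch, start, length));
            }
        };
        try {
            emit(root, recorder);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    /** Builds a tiny html/body/"hello" tree in memory. */
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            body.appendChild(doc.createTextNode("hello"));
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // [start:html, start:body, text:hello, end:body, end:html]
        System.out.println(trace(sampleDocument()));
    }
}
```

Because the tree is built in full before any event fires, a wrapper of this kind offers the SAX programming model without the memory savings of a true streaming parser.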

Enhancement Requests
--------------------
943593 LinkProcessor.extract(link,base) weird behaviour?
943197 Accept gzip / deflate content encodings
874000 Remove specialized tag signatures from NodeVisitor
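
For request 943197, the following is a minimal sketch of how a client might unwrap a response body according to its Content-Encoding header, using only java.util.zip. How the library itself handles this is not stated here; the ContentEncodingSketch class and its decode, decodeToString and compress methods are hypothetical, and the body is assumed to be UTF-8 text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical illustration; not the library's connection handling.
class ContentEncodingSketch {

    /** Wraps the raw stream in a decompressor chosen by the header value. */
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        String enc = contentEncoding == null
                ? ""
                : contentEncoding.trim().toLowerCase(Locale.ROOT);
        switch (enc) {
            case "gzip":
            case "x-gzip":
                return new GZIPInputStream(raw);
            case "deflate":
                // Assumes zlib-wrapped data; some servers send raw deflate instead.
                return new InflaterInputStream(raw);
            default:
                return raw; // identity or unknown: pass through unchanged
        }
    }

    /** Decodes a complete body to text (UTF-8 assumed for the sketch). */
    static String decodeToString(byte[] body, String contentEncoding) {
        try (InputStream in = decode(new ByteArrayInputStream(body), contentEncoding)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Test helper: compresses text with gzip (true) or zlib deflate (false). */
    static byte[] compress(String text, boolean gzip) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream out = gzip ? new GZIPOutputStream(sink) : new DeflaterOutputStream(sink)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] body = compress("<html>hi</html>", true);
        System.out.println(decodeToString(body, "gzip")); // <html>hi</html>
    }
}
```

A client would normally also advertise support by sending an Accept-Encoding request header, for example "gzip, deflate", before relying on the response header.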

Bug Fixes
---------
998195 SiteCapturer just crashed
995703 Parser Crash
988846 LinkBean getLinks() segmentation fault (duplicate of above)
973137 Double-byte characters are garbled after parsing
936392 ScriptTag visitor fails for comments with '
919738 Text has not been extracted correctly using StringBean

Posted by Derrick Oswald 2004-07-29