HTML Parser News

Brought to you by: derrickoswald

HTML Parser Production Release 1.5 available

Version 1.5 of the most popular HTML parser on SourceForge is now available. Several significant new APIs have been added since 1.4 was released, such
as ConnectionManager, SAX parsing, new filters and interfaces. But what's really cool is the new FilterBuilder, which lets you interactively generate a Java class that extracts information from a web page. Three months of downloads without a reported bug indicate this is one of the most stable
releases yet.

Changes since Version 1.4
-------------------------
New APIs
Implemented a rudimentary SAX parser. It currently exposes the DOM parser via the SAX project.
Added a new http package, the primary class being ConnectionManager, which
handles proxies, passwords and cookies. Some testing is still needed.
Also removed some line separator cruft.
Added parseCDATA to the Lexer, used in script and style scanners.
Note that this is significantly new behaviour that now adheres to appendix
B.3.2 Specifying non-HTML data of the HTML reference:
http://www.w3.org/TR/html4/appendix/notes.html#notes-specifying-data
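
To make the B.3.2 rule concrete, here is a minimal sketch of it: script or style data ends at the first "</" that is followed by a letter. The class and method names (EtagoSketch, endOfCdata) are hypothetical and only illustrate the rule; this is not the Lexer's parseCDATA implementation.

```java
// Sketch only: illustrates the HTML 4 appendix B.3.2 rule, not the library's code.
final class EtagoSketch {

    /** True for the name start characters [a-zA-Z]. */
    static boolean isNameStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    /**
     * Returns the index of the first "</" followed by a name start character,
     * searching from 'from', or -1 if there is none.
     */
    static int endOfCdata(String html, int from) {
        for (int i = Math.max(from, 0); i + 2 < html.length(); i++) {
            if (html.charAt(i) == '<'
                    && html.charAt(i + 1) == '/'
                    && isNameStart(html.charAt(i + 2))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The "</b>" inside the JavaScript string literal ends the data under the strict rule.
        String page = "<script>document.write(\"</b>\");</script>";
        int start = "<script>".length();
        int end = endOfCdata(page, start);
        System.out.println("script data: " + page.substring(start, end));
    }
}
```

Under this rule the script data in the example stops inside the string literal, which is the kind of broken ETAGO that the STRICT flag on ScriptScanner (see Refactoring) is concerned with.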
Configuration Management
Removed the need for the Translate class to be packaged with htmllexer.jar.
This results in a lighter weight component.
Updated the logo and included the LGPL license.
Fixed the Windows batch files.
Added optional "classes" property to build.xml. This directory is where
class files are put. It defaults to src.
To use:
ant -Dclasses=classdir <target>
where classdir is/will-be a peer directory to src.
Fixed various end user experience issues.
Refactoring
Added a static STRICT flag to ScriptScanner to revert to legacy handling of
broken ETAGO (</). If STRICT is true, scan according to the HTML specification;
if false, scan with a quote-smart state machine which heuristically
yields the correct parse.
Obviated LinkProcessor and moved its functionality to the Page class.
Added Tag, Text and Remark interfaces and moved concrete node
implementations to the nodes package, removing the lexer.nodes package.
Most internals now use the Tag interface.
Removed the org.htmlparser.tags.Tag class and moved the remaining (minor)
functionality to the TagNode class.
So now tags inherit directly from TagNode or CompositeTag.
** NOTE: If you have subclassed org.htmlparser.tags.Tag, use org.htmlparser.nodes.TagNode now.**
Removed deprecated methods getTagBegin/getTagEnd and deleted unused classes:
PeekingIterator and its implementation.
Added ObjectTag (like an applet tag).
Added a real StringSource that reads directly from a String rather than
creating a byte array. This avoids character encoding losses.
Incorporated patch #1004985 Page.java, by making getCharset() and findCharset() static.
Incorporated some speed optimizations based on profiling.
Deprecated node decorators.
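
The StringSource entry above mentions character encoding losses when a String is first turned into a byte array. The sketch below shows that effect using only the JDK. The class and method names are hypothetical, and ISO-8859-1 is an assumed encoding chosen for illustration; this is not the library's StringSource.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch only: contrasts a byte-array round trip with reading the String directly.
final class StringSourceSketch {

    /** Reads every character from the reader into a String. */
    static String drain(Reader reader) {
        try (Reader r = reader) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Encodes to bytes and decodes again; characters outside ISO-8859-1 become '?'. */
    static String viaBytes(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return drain(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1));
    }

    /** Reads the characters straight from the String, with no encoding step. */
    static String direct(String text) {
        return drain(new StringReader(text));
    }

    public static void main(String[] args) {
        String html = "<p>\u4e2d\u6587</p>"; // two CJK characters inside a paragraph
        System.out.println("intact via bytes:  " + html.equals(viaBytes(html)));
        System.out.println("intact via String: " + html.equals(direct(html)));
    }
}
```

Reading the characters directly skips the encode and decode steps, so nothing depends on whether the chosen charset can represent the text.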
Filters
Added CssSelectorNodeFilter and RegExFilter.
Added the filter builder tool.
Added link pattern filters LinkRegexFilter and LinkStringFilter.
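
As a rough illustration of the idea behind the link pattern filters, the sketch below keeps only URLs that match a regular expression or contain a given string. It uses plain JDK classes. The names (LinkFilterSketch, linkRegex, linkString, keep) are hypothetical, and the matching semantics (find anywhere in the URL, optional case sensitivity) are assumptions rather than the documented behaviour of LinkRegexFilter and LinkStringFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Sketch only: plain JDK predicates standing in for link pattern filters.
final class LinkFilterSketch {

    /** Accepts URLs in which the regular expression matches anywhere. */
    static Predicate<String> linkRegex(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return url -> pattern.matcher(url).find();
    }

    /** Accepts URLs that contain the fragment, optionally ignoring case. */
    static Predicate<String> linkString(String fragment, boolean caseSensitive) {
        if (caseSensitive) {
            return url -> url.contains(fragment);
        }
        String needle = fragment.toLowerCase(Locale.ROOT);
        return url -> url.toLowerCase(Locale.ROOT).contains(needle);
    }

    /** Returns the URLs accepted by the filter, in their original order. */
    static List<String> keep(List<String> urls, Predicate<String> filter) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (filter.test(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://a.example/report.pdf",
                "http://a.example/index.html",
                "http://b.example/Docs/guide.html");
        System.out.println(keep(urls, linkRegex("\\.pdf$")));
        System.out.println(keep(urls, linkString("docs", false)));
    }
}
```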

Enhancement Requests
--------------------
1160345 NodeList.visitAllNodesWith
1017249 HTML Client Doesn't Support Cookies but will follow redirect
1010586 Add support for password protected URL
1000739 Add support for proxy scenario
1000063 FilterBean
943593 LinkProcessor.extract(link,base) weird behaviour?
943197 Accept gzip / deflate content encodings
874000 Remove specialized tag signatures from NodeVisitor

Bug Fixes
---------
1161137 Non English Character web page
1160010 NullPointerException in addCookies
1153508 CVS sources do not compile
1121401 No Parsing with yahoo!
1104627 Parser Crash reading javascript
1061869 Crashing when trying to capture link to XLS document
1056438 Byte Order Mark
1044707 mark()/reset() issues
1024045 StringBean crashes on an URL
1021925 StyleTag with missing linefeed prevents page from parsing
1018884 'compile' ant task from build.xml messes up ./src directory
1005409 Input file not free by parser.
998195 SiteCatpurer just crashed
995703 Parser Crash
988846 Linkbean getLinks() segmentation fault (duplicate of above)
973137 Double-bytes characters are messed after parsing
936392 ScriptTag visitor fails for comments with '
919738 Text has not been extracted correctly using StringBean

Posted 2005-06-14