Bug fixes:
A new serializer, TraversalDomSerializer, has been added. It is experimental and its output does not yet exactly match that of the regular DomSerializer, but it may be useful when you need to reduce HtmlCleaner's memory footprint while processing extremely large pages.
HtmlCleaner release 2.14
This contains the following bug fixes:
149 StackOverflowError
148 Giving mixed-case filenames doesn't work on case-sensitive filesystems
147 Correction of ul structure
146 2.13 does not correct table structure
144 schema.org elements such as meta and link are removed
140 CRITICAL: endless loop in some tags (ref #129, #126)
139 option tag displayed after optgroup
136 ClassCastException
What another release already??
Well, a big thanks to Wolfgang Koppenberger who spotted a problem in 2.11 with OPTION tags which needed fixing and releasing right away.
Apologies to anyone using 2.11 who encountered that issue.
Adds much better HTML5 support, pipelining of HTML from stdin (and XML to stdout), and more!
Here's the changelog:
The new version brings most of the required features and a number of bug fixes. HtmlCleaner is now thread-safe, introduces HTML-based serializers, and extends the API to ease document manipulation. The parser is about 20% faster and now runs on Java 1.5+, benefiting from language improvements.
- Parsing transformations have been added to make it easy to skip or change specified tags or attributes during the cleanup process.
- A few more constructors have been added to the HtmlCleaner class, making it possible to reuse the same cleaner properties across multiple cleaner instances.
- Code cleanup.
Together with the new milestone version 2.0, the project web site has been completely redesigned, with a better look and better organized information.
<a href="http://htmlcleaner.sourceforge.net/">Go to HtmlCleaner web site</a>
The new version comes with a number of improvements and fixes. Some of them are:
- Complete code refactoring, making the Cleaner's API better and more flexible.
- Methods for DOM manipulation added.
- Basic XPath support added.
- New parameters introduced to control cleaner's behavior.
- New flag parameter ignoreQuestAndExclam is introduced offering control over special tags - <?TAGNAME....>, <!TAGNAME....>.
- Bug fixes.
- Added Reader based HtmlCleaner constructors.
- New parameter pruneTags is introduced offering a way to remove undesired tags with all the children from XML tree after parsing and cleaning.
- Bug fixes.
- Several bug fixes.
- Added option to escape XML content in DOM serializer - HtmlCleaner.createDOM(boolean escapeXml)
- New flag allowHtmlInsideAttributes is introduced in order to give the parser flexibility in handling attribute values.
- Several bug fixes.
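The pruneTags parameter listed above removes unwanted tags together with all of their children after parsing and cleaning. The idea can be sketched as follows. This is only an illustration: the Node class and the prune method are hypothetical stand-ins, not HtmlCleaner's own classes or its actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class PruneDemo {
    /** Hypothetical minimal tree node, used here only for illustration. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node... kids) {
            this.name = name;
            children.addAll(Arrays.asList(kids));
        }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder("<" + name + ">");
            for (Node child : children) {
                sb.append(child);
            }
            return sb.append("</").append(name).append(">").toString();
        }
    }

    /** Removes every descendant whose name is in tagsToPrune, including its whole subtree. */
    static void prune(Node node, Set<String> tagsToPrune) {
        Iterator<Node> it = node.children.iterator();
        while (it.hasNext()) {
            Node child = it.next();
            if (tagsToPrune.contains(child.name)) {
                it.remove();                // drops the child and everything below it
            } else {
                prune(child, tagsToPrune);  // keep the child, but check its descendants
            }
        }
    }

    public static void main(String[] args) {
        Node root = new Node("html",
                new Node("head", new Node("script")),
                new Node("body", new Node("p"), new Node("script", new Node("b"))));
        prune(root, new HashSet<>(Arrays.asList("script")));
        System.out.println(root);
    }
}
```

Pruning after the tree is built, rather than while parsing, means nested content inside a pruned tag never has to be inspected separately.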
* New browser-compact serializer added, which preserves a single whitespace where multiple occur.
* New flag namespacesAware is introduced in order to control namespace prefixes and namespace declarations. It should be used instead of omitXmlnsAttributes that existed in previous versions and had limited functionality.
* New flag allowMultiWordAttributes is introduced giving HtmlCleaner's parser flexibility to (dis)allow tag attributes consisting of multiple words.
* New flag useEmptyElementTags is introduced in order to control output of tags with empty body
(<xxx/> vs <xxx></xxx>).
* Several bug fixes.
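The difference controlled by useEmptyElementTags can be shown with a small sketch. The serialize method below is a hypothetical helper written for this note, not part of HtmlCleaner, and it assumes the body text is already escaped.

```java
class EmptyElementDemo {
    /**
     * Serializes one element. When the body is empty, the flag chooses
     * between the short form <xxx/> and the long form <xxx></xxx>.
     */
    static String serialize(String name, String body, boolean useEmptyElementTags) {
        if (body.isEmpty() && useEmptyElementTags) {
            return "<" + name + "/>";
        }
        return "<" + name + ">" + body + "</" + name + ">";
    }

    public static void main(String[] args) {
        System.out.println(serialize("xxx", "", true));
        System.out.println(serialize("xxx", "", false));
    }
}
```

The long form can matter when the XML is later rendered as HTML, where some browsers do not treat a self-closed non-void element as closed.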
- Several bugs fixed.
- New flags added to control behaviour of unknown/deprecated tags.
- New flag added to optionally remove HTML envelope from resulting XML.
- JDOM serializer added.
- Latest source may be checked out from https://htmlcleaner.svn.sourceforge.net/svnroot/htmlcleaner.
- Source can be browsed at http://htmlcleaner.svn.sourceforge.net/viewvc/htmlcleaner/
Serialization of XML to Java DOM supported with createDOM() method of HtmlCleaner class.
Hexadecimal entities escaping supported (e.g. &#x9;).
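The note does not say exactly how hexadecimal entities are handled. Assuming it means that a reference such as &#x9; is recognised and left intact while a bare ampersand is still escaped, the behaviour might look like the sketch below. The class and method names are made up for illustration and this is not HtmlCleaner's code.

```java
import java.util.regex.Pattern;

class HexEntityEscaper {
    // An ampersand that does NOT start a hexadecimal character reference like &#x9; or &#X41;
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!#[xX][0-9a-fA-F]+;)");

    /** Escapes bare ampersands but leaves hexadecimal character references untouched. */
    static String escapeAmpersands(String text) {
        return BARE_AMP.matcher(text).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeAmpersands("a & b&#x9;c"));
    }
}
```

A real escaper would also have to deal with decimal references, named entities and the other XML special characters; they are left out here to keep the example short.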
- Compact XML serializer improved.
- Minor XML escaping bug fixed.
- An HTML tokenizing bug fixed.
- Methods of the class TagNode made public in order to enable creating custom XML serializers.
- Method writeXml(XmlSerializer) added to HtmlCleaner class in order to support creating custom XML serializers.
- HtmlCleaner Ant task added
- XML compact serializer added - strips all unneeded whitespace from the result
- A few minor bugs fixed
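The changelog does not define which whitespace counts as unneeded. As a rough sketch, assuming it means runs of whitespace inside text and whitespace at the ends, compaction could work as below. The helper names are hypothetical and the code is not taken from HtmlCleaner.

```java
class WhitespaceCompactor {
    /** Collapses every run of whitespace (spaces, tabs, line breaks) into one space. */
    static String collapse(String text) {
        return text.replaceAll("\\s+", " ");
    }

    /** Collapses runs of whitespace and also drops leading and trailing whitespace. */
    static String compact(String text) {
        return collapse(text).trim();
    }

    public static void main(String[] args) {
        System.out.println("[" + collapse("  Hello \n\t world  ") + "]");
        System.out.println("[" + compact("  Hello \n\t world  ") + "]");
    }
}
```

The first variant keeps a single space where several occurred, which is what a browser would render; the second is more aggressive and also trims the ends.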
HtmlCleaner is an open-source HTML parser written in Java. For a given HTML document it produces well-formed XML.