The new version brings most of the required features and a number of bug fixes. HtmlCleaner is now thread-safe, introduces HTML-based serializers, and extends its API to ease document manipulation. The parser is about 20% faster and now runs on Java 1.5+, benefiting from language improvements.
- Parsing transformations were added to make it easy to skip or change specified tags or attributes during the cleanup process.
- A few more constructors were added to the HtmlCleaner class, making it possible to reuse the same cleaner properties across multiple cleaner instances.
- Code cleanup.
Together with the new milestone version 2.0, the project web site has been completely redesigned, with a better look and better-organized information.
<a href="http://htmlcleaner.sourceforge.net/">Go to HtmlCleaner web site</a>
The new version comes with a number of improvements and fixes. Some of them are:
- Complete code refactoring, making the Cleaner's API better and more flexible.
- Methods for DOM manipulation added.
- Basic XPath support added.
- New parameters introduced to control cleaner's behavior.
- New flag parameter ignoreQuestAndExclam is introduced offering control over special tags - <?TAGNAME....>, <!TAGNAME....>.
- Bug fixes.
- Added Reader based HtmlCleaner constructors.
- New parameter pruneTags is introduced offering a way to remove undesired tags with all the children from XML tree after parsing and cleaning.
- Bug fixes.
- Several bug fixes.
- Added option to escape XML content in DOM serializer - HtmlCleaner.createDOM(boolean escapeXml)
- New flag allowHtmlInsideAttributes is introduced in order to give the parser flexibility in handling attribute values.
- Several bug fixes.
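The effect of the pruneTags parameter described above can be approximated outside HtmlCleaner. The following is a minimal sketch, assuming the cleaned output is already well-formed XML and using only the JDK's DOM API; the class PruneTagsSketch and its methods are hypothetical names for illustration and are not part of HtmlCleaner.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not HtmlCleaner code: mimics the effect of pruneTags
// on an already well-formed XML tree.
class PruneTagsSketch {

    // Parses well-formed XML into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Removes every element with one of the given names, children included.
    static void prune(Document doc, String... tagNames) {
        for (String name : tagNames) {
            NodeList matches = doc.getElementsByTagName(name);
            // The NodeList is live, so copy it before removing anything.
            List<Node> toRemove = new ArrayList<>();
            for (int i = 0; i < matches.getLength(); i++) {
                toRemove.add(matches.item(i));
            }
            for (Node node : toRemove) {
                Node parent = node.getParentNode();
                if (parent != null) {
                    parent.removeChild(node);
                }
            }
        }
    }
}
```

Calling PruneTagsSketch.prune(doc, "script", "style") on a parsed page leaves the remaining markup in place and drops the listed elements together with everything nested inside them.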
* New browser-compact serializer added that preserves a single whitespace character where multiple occur.
* New flag namespacesAware is introduced in order to control namespace prefixes and namespace declarations. It should be used instead of omitXmlnsAttributes that existed in previous versions and had limited functionality.
* New flag allowMultiWordAttributes is introduced giving HtmlCleaner's parser flexibility to (dis)allow tag attributes consisting of multiple words.
* New flag useEmptyElementTags is introduced in order to control the output of tags with an empty body (<xxx/> vs <xxx></xxx>).
* Several bug fixes.
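The difference between stripping whitespace and preserving a single whitespace character can be illustrated with a small sketch. It is not HtmlCleaner's implementation: the class and method names are hypothetical, and the exact rules are an assumption, namely that a compact policy drops whitespace at the edges of a text fragment while a browser-compact policy keeps one space there, since a browser would render it.

```java
// Hypothetical helpers, not HtmlCleaner code: they only illustrate two
// possible whitespace policies for a text fragment.
class WhitespaceSketch {

    // Assumed "compact" policy: drop edge whitespace, then collapse inner runs.
    static String compact(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Assumed "browser-compact" policy: every run of whitespace, including
    // one at the edges, becomes exactly one space.
    static String browserCompact(String text) {
        return text.replaceAll("\\s+", " ");
    }
}
```

Under these assumptions, a whitespace-only fragment between two inline elements disappears with compact but survives as one space with browserCompact, which keeps adjacent words from running together when the result is displayed.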
- Several bugs fixed.
- New flags added to control behaviour of unknown/deprecated tags.
- New flag added to optionally remove HTML envelope from resulting XML.
- JDOM serializer added.
- Latest source may be checked out from https://htmlcleaner.svn.sourceforge.net/svnroot/htmlcleaner.
- Source can be browsed at http://htmlcleaner.svn.sourceforge.net/viewvc/htmlcleaner/
Serialization of XML to Java DOM supported with createDOM() method of HtmlCleaner class.
Hexadecimal entities escaping supported (e.g. &#x9;).
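The note does not say how hexadecimal entities are handled internally. As one plausible reading, the sketch below recognizes hexadecimal numeric character references and decodes them to the characters they denote. The class HexEntitySketch is a hypothetical name, and the code uses only the JDK's regex classes rather than any HtmlCleaner API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not HtmlCleaner code: decodes hexadecimal numeric
// character references such as &#x41; into the characters they denote.
class HexEntitySketch {

    // "&#x" or "&#X", one to six hex digits, then ";".
    private static final Pattern HEX_ENTITY =
            Pattern.compile("&#[xX]([0-9a-fA-F]{1,6});");

    static String decode(String text) {
        Matcher m = HEX_ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1), 16);
            // Leave references outside the Unicode range untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Text that does not match the pattern, including malformed references, passes through unchanged.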
- Compact XML serializer improved.
- Minor XML escaping bug fixed.
- An HTML tokenizing bug fixed.
- Methods of the class TagNode made public in order to enable creating custom XML serializers.
- Method writeXml(XmlSerializer) added to HtmlCleaner class in order to support creating custom XML serializers.
Minor bug in advanced XML escaping fixed.
- HtmlCleaner Ant task added
- XML compact serializer added; it strips all unneeded whitespace from the result
- A few minor bugs fixed
HtmlCleaner is an open-source HTML parser written in Java. For a given HTML input, it produces well-formed XML.