HtmlCleaner / News: Recent posts

HtmlCleaner 2.24 is here!

Bug fixes:

  • 220 Information is lost in case of double escaping in attributes
  • 219 H3 closing tag incorrectly placed
  • 217 StackOverflow in DomSerializer
  • 216 elementNames(org.htmlcleaner.HtmlCleanerTest) test failure

A new serializer, TraversalDomSerializer, has been added. This experimental serializer currently produces output that is not exactly the same as that of the regular DomSerializer, but it may be useful where you need to reduce the memory footprint of HtmlCleaner when processing extremely large pages...

Posted by Scott Wilson 2020-04-29

HtmlCleaner release 2.14 is here

This release contains the following bug fixes:

  • 149 StackOverflowError
  • 148 Giving mixed-case filenames doesn't work on case-sensitive filesystems
  • 147 Correction of ul structure
  • 146 2.13 does not correct table structure
  • 144 schema.org elements such as meta and link are removed
  • 140 CRITICAL: endless loop in some tags (ref #129, #126)
  • 139 option tag displayed after optgroup
  • 136 ClassCastException...

Posted by Scott Wilson 2015-08-24

HtmlCleaner 2.12 is out!

What, another release already??

Well, a big thanks to Wolfgang Koppenberger, who spotted a problem with OPTION tags in 2.11 that needed fixing and releasing right away.

Apologies to anyone using 2.11 who encountered that issue.

Posted by Scott Wilson 2015-05-15

HtmlCleaner 2.11 released

Adds much better HTML5 support, pipelining of HTML from stdin (and XML to stdout), and more!

Here's the changelog:

  • Feature 19: Support use of stdin and stdout for pipes on command line
  • Feature 10: Make OSGI-compatible bundle
  • Feature 15: Improved HTML5 support
  • Fixed issue 135: Some pages cause two different NullPointerExceptions
  • Fixed issue 134: Some pages cause IndexOutOfBoundsException
  • Fixed issue 133: Some pages cause NullPointerException
  • Fixed issue 132: ClassCastException: ArrayList cannot be cast to org.htmlcleaner.BaseToken...

Posted by Scott Wilson 2015-05-12

HtmlCleaner release 2.2

The new version brings most of the required features and a number of bug fixes. HtmlCleaner is now thread-safe, it introduces HTML-based serializers, and the API has been extended to ease document manipulation. The parser is about 20% faster and now runs on Java 1.5+, benefiting from language improvements.

Posted by Vladimir Nikic 2010-12-28

HtmlCleaner release 2.1

- Parsing transformations have been developed to make it easy to skip or change specified tags or attributes during the cleanup process.
- A few more constructors have been added to the HtmlCleaner class, making it possible to reuse the same cleaner properties across multiple cleaner instances.
- Code cleanup.

Posted by Vladimir Nikic 2008-09-02

Web site redesigned

Together with the new milestone version 2.0, the project web site has been completely redesigned, giving it a better look and better-organized information.
Go to the HtmlCleaner web site: http://htmlcleaner.sourceforge.net/

Posted by Vladimir Nikic 2008-07-15

HtmlCleaner release 2.0

The new version comes with a number of improvements and fixes. Some of them are:

- Complete code refactoring, making the Cleaner's API better and more flexible.
- Methods for DOM manipulation added.
- Basic XPath support added.
- New parameters introduced to control the cleaner's behavior.

Posted by Vladimir Nikic 2008-07-15

HtmlCleaner release 1.6

- A new flag parameter, ignoreQuestAndExclam, is introduced, offering control over special tags: <?TAGNAME....>, <!TAGNAME....>.
- Bug fixes.

Posted by Vladimir Nikic 2007-12-26

HtmlCleaner release 1.55

- Added Reader-based HtmlCleaner constructors.
- A new parameter, pruneTags, is introduced, offering a way to remove undesired tags along with all their children from the XML tree after parsing and cleaning.
- Bug fixes.

Posted by Vladimir Nikic 2007-09-27

HtmlCleaner release 1.5

- Several bug fixes.
- Added an option to escape XML content in the DOM serializer: HtmlCleaner.createDOM(boolean escapeXml)

Posted by Vladimir Nikic 2007-09-08

HtmlCleaner release 1.4

- A new flag, allowHtmlInsideAttributes, is introduced to give the parser flexibility in handling attribute values.
- Several bug fixes.

Posted by Vladimir Nikic 2007-08-24

HtmlCleaner release 1.3

* New browser-compact serializer added, which preserves a single whitespace character where multiple occur.
* A new flag, namespacesAware, is introduced to control namespace prefixes and namespace declarations. It should be used instead of omitXmlnsAttributes, which existed in previous versions and had limited functionality.
* A new flag, allowMultiWordAttributes, is introduced, giving HtmlCleaner's parser the flexibility to (dis)allow tag attributes consisting of multiple words.
* A new flag, useEmptyElementTags, is introduced to control the output of tags with an empty body (<xxx/> vs <xxx></xxx>).
* Several bug fixes.

Posted by Vladimir Nikic 2007-07-12

HtmlCleaner release 1.2

- Several bugs fixed.
- New flags added to control behaviour of unknown/deprecated tags.
- New flag added to optionally remove HTML envelope from resulting XML.
- JDOM serializer added.

Posted by Vladimir Nikic 2007-05-05

SVN support added

Posted by Vladimir Nikic 2007-04-16

HtmlCleaner release 1.13

Serialization of XML to Java DOM is now supported by the createDOM() method of the HtmlCleaner class.

Posted by Vladimir Nikic 2007-04-13

HtmlCleaner release 1.12

Escaping of hexadecimal entities is now supported (e.g. &#x09;).
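
For readers unfamiliar with the notation, the sketch below shows what a numeric character reference such as &#x09; stands for by decoding decimal and hexadecimal references in plain Java. It is an illustration only and not HtmlCleaner's implementation; the class name NumericEntityDemo and the method decode are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; this is not code from HtmlCleaner. */
final class NumericEntityDemo {
    // Matches decimal (&#9;) and hexadecimal (&#x09;) character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    /** Replaces each numeric character reference with the character it denotes. */
    static String decode(String text) {
        Matcher m = NUMERIC_REF.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)   // hexadecimal form
                    : Integer.parseInt(m.group(2), 10);  // decimal form
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                out.append(m.group()); // out-of-range value: leave the text untouched
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // &#x09; is the tab character (code point 9).
        System.out.println(decode("a&#x09;b").equals("a\tb")); // prints "true"
        System.out.println(decode("&#x41;&#66;"));             // prints "AB"
    }
}
```

Text that does not match the reference syntax (for example &#xZZ;) is passed through unchanged.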

Posted by Vladimir Nikic 2007-01-28

HtmlCleaner release 1.1

- Compact XML serializer improved.
- Minor XML escaping bug fixed.

Posted by Vladimir Nikic 2007-01-11

HtmlCleaner v1.0.5 released

- An HTML tokenizing bug fixed.
- Methods of the class TagNode made public in order to enable creating custom XML serializers.
- Method writeXml(XmlSerializer) added to HtmlCleaner class in order to support creating custom XML serializers.

Posted by Vladimir Nikic 2007-01-02

HtmlCleaner version 1.0 released

Minor bug in advanced XML escaping fixed.

Posted by Vladimir Nikic 2006-12-23

HtmlCleaner version 0.9 released

- HtmlCleaner Ant task added
- XML compact serializer added - strips all unneeded whitespace from the result
- A few minor bugs fixed

Posted by Vladimir Nikic 2006-12-05

Initial version of HtmlCleaner released

HtmlCleaner is an open-source HTML parser written in Java. Given HTML input, it produces well-formed XML.

Posted by Vladimir Nikic 2006-11-27