When the input document uses the HTML5 DocType ("html"), DomSerializer silently fails to add any subnodes to the result document. This appears to be an issue with Java's DOMImplementation; however, we should have a workaround in place for it.
Punting to 2.9 for now - this needs to go in release notes
I just started investigating htmlcleaner as a replacement for jtidy. The HTML files I am cleaning use the HTML5 DocType '<!DOCTYPE html>', yet I seem to be able to use DomSerializer to convert them from TagNode to Document. Am I missing something here? I don't want to switch right now if DomSerializer fails under some circumstances.
Hi Tahseen,
The issue is with the underlying Java DOM implementation rather than HtmlCleaner itself; I suspect this was fixed at a certain point in the JDK, JRE or default XML API implementation, but there doesn't seem to be an easy way to find out when apart from compiling and testing under different environments to see where the problem occurs. (I normally build and test under JDK6; next I'll upgrade to JDK7 and see if it still occurs.)
As you now have it working in your current environment, I suspect it won't fail unless you revert to an earlier JDK, JRE or XML implementation.
There is a workaround, which is to add the public and system identifiers as " " (a single space) rather than "" or null when creating the Document. However, I'd rather not code that into HC if this is already solved for most modern environments.
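For reference, a minimal sketch of that workaround using only the standard JAXP/DOM API. The class and method names here are hypothetical and this is not HtmlCleaner's own code; it only illustrates passing a single space for both identifiers when the DocumentType is created.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;

// Hypothetical helper class, named for illustration only.
class Html5DoctypeWorkaround {

    // Builds an empty Document with an "html" doctype, using " " (a single
    // space) for the public and system identifiers instead of "" or null.
    static Document createHtml5Document() throws Exception {
        DOMImplementation impl = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .getDOMImplementation();
        DocumentType doctype = impl.createDocumentType("html", " ", " ");
        // No namespace; the root element is named "html".
        return impl.createDocument(null, "html", doctype);
    }

    public static void main(String[] args) throws Exception {
        Document doc = createHtml5Document();
        Element body = doc.createElement("body");
        doc.getDocumentElement().appendChild(body);
        // Print the number of child nodes under the root to check that
        // subnodes are actually being added in this environment.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

Running this under the JDK/JRE in question, and comparing against the same code with "" or null identifiers, should show whether that environment is affected.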
Hope this helps,
S
Hi Scott,
Yes, I noticed that and tried with JDK 6, but the bug was probably fixed before 1.6.0_51. I assumed so, but given my lack of knowledge I wanted your input. Thanks for such a quick reply and of course the library itself :).