Hi,
we found strange behaviour in release 2.23, where some nodes disappear from the parsed DOM:
    import javax.xml.parsers.ParserConfigurationException;

    import org.htmlcleaner.CleanerProperties;
    import org.htmlcleaner.DomSerializer;
    import org.htmlcleaner.HtmlCleaner;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.w3c.dom.ls.DOMImplementationLS;
    import org.w3c.dom.ls.LSSerializer;

    public class BugReport {
        public static void main(String[] args) {
            HtmlCleaner cleaner = new HtmlCleaner();
            CleanerProperties props = cleaner.getProperties();
            props.setAllowHtmlInsideAttributes(true);
            props.setAllowMultiWordAttributes(true);
            props.setOmitComments(true);
            // Input document: a <div> wrapping a <select> inside a <dl>
            String html = "<html>\n<body>\n<dl> \n"
                    + "<div class=\"a\">\n"
                    + "<label class=\"b\">bb<em>*</em></label>\n"
                    + "<select onchange=\"c\" \nid=\"cc\" name=\"ccc\" \n class=\"cccc\">\n"
                    + "<option value=\"xxx\" id=\"foo\">d</option>\n"
                    + "</select>\n</div>\n</dl>\n</body>\n</html>";
            try {
                Document doc = new DomSerializer(cleaner.getProperties()).createDOM(cleaner.clean(html));
                LSSerializer lsSerializer = ((DOMImplementationLS) (doc.getImplementation().getFeature("LS", "3.0"))).createLSSerializer();
                NodeList childNodes = doc.getChildNodes();
                StringBuilder sb = new StringBuilder();
                System.out.println(childNodes.getLength());
                for (int i = 0; i < childNodes.getLength(); ++i) {
                    for (int x = 0; x < childNodes.item(i).getChildNodes().getLength(); ++i) {
                        printSub(childNodes.item(i).getChildNodes().item(x), lsSerializer);
                    }
                    sb.append(lsSerializer.writeToString(childNodes.item(i)));
                }
                System.out.println(sb.toString());
            } catch (ParserConfigurationException e) {
                e.printStackTrace();
            }
        }

        // Prints each child of the given node, serialized, followed by a separator line
        public static void printSub(Node node, LSSerializer s) {
            if (node.getChildNodes() != null) {
                for (int i = 0; i < node.getChildNodes().getLength(); ++i) {
                    System.out.println(s.writeToString(node.getChildNodes().item(i)));
                    System.out.println("-------------------------");
                }
            }
        }
    }
Thanks for the report, Dennis. I'll check it out in the morning and post an update with what I find.
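This line is in error (the inner loop increments i instead of x):
    for (int x = 0; x < childNodes.item(i).getChildNodes().getLength(); ++i) {
Should be:
    for (int x = 0; x < childNodes.item(i).getChildNodes().getLength(); ++x) {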
Hi Scott, thanks for the fast response. I'll check it again.
Hi Scott,
to clarify what Dennis meant in his bug report, I adapted his code:
The input html string is the following:
With version 2.6.1 the parsed DOM tree looks as follows:
With version 2.23 the parsed DOM tree looks as follows:
So the tree structure is broken: 'dl' and 'div' are on the same level, although 'div' should be a child of 'dl'. In addition, the 'option' node is missing.
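The two symptoms can also be checked programmatically rather than by reading the dump. The following is only a sketch using JDK classes: DlTreeCheck and its methods are hypothetical helpers, not part of HtmlCleaner, and the two sample trees are simplified stand-ins (attributes removed) for the expected and broken DOMs described above. In a real test, the Document would come from DomSerializer.createDOM instead of parse.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of HtmlCleaner) for checking the reported symptoms.
final class DlTreeCheck {

    // Parses well-formed markup with the JDK XML parser; stands in for the DOM built by HtmlCleaner.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    // True if the first element named 'child' has an element named 'parent' as its direct parent.
    static boolean isDirectChild(Document doc, String child, String parent) {
        NodeList nodes = doc.getElementsByTagName(child);
        if (nodes.getLength() == 0) {
            return false;
        }
        Node p = nodes.item(0).getParentNode();
        return p != null && parent.equals(p.getNodeName());
    }

    // True if at least one element with the given name exists anywhere in the document.
    static boolean hasElement(Document doc, String name) {
        return doc.getElementsByTagName(name).getLength() > 0;
    }

    public static void main(String[] args) {
        // Assumed shapes, simplified from the description in this thread.
        Document expected = parse(
                "<html><body><dl><div><select><option>d</option></select></div></dl></body></html>");
        Document broken = parse(
                "<html><body><dl/><div><select/></div></body></html>");

        System.out.println(isDirectChild(expected, "div", "dl")); // true
        System.out.println(hasElement(expected, "option"));       // true
        System.out.println(isDirectChild(broken, "div", "dl"));   // false
        System.out.println(hasElement(broken, "option"));         // false
    }
}
```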
Thanks for the detailed report. I'll look into it.
OK, I think I can see what is happening.
I'm not sure when DIV became allowed in a DL alongside DT and DD; it looks like a recent spec change. In any case, HC is out of step with HTML 5.2, so I've updated the rule.
Because the only things allowed in DL were DT and DD, it was moving everything else outside, screwing up the tree. Allowing DIV and also adding a preferred content of DIV seems to improve the model quite a bit.
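To illustrate the effect of the rule change described above, here is a toy model of an "allowed children" check. It is a sketch only and does not reflect HtmlCleaner's actual tag definitions or internals: DlContentModel, OLD_RULES, NEW_RULES and hoisted are all hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy content model (hypothetical, not HtmlCleaner code): which children a parent tag may keep.
final class DlContentModel {

    // Old rule as described in the thread: DL accepts only DT and DD.
    static final Map<String, Set<String>> OLD_RULES = Map.of("dl", Set.of("dt", "dd"));

    // Updated rule: DIV is also accepted inside DL.
    static final Map<String, Set<String>> NEW_RULES = Map.of("dl", Set.of("dt", "dd", "div"));

    // Returns the children that the rules do not allow under 'parent',
    // i.e. the ones a cleaner following these rules would move outside.
    static List<String> hoisted(Map<String, Set<String>> rules, String parent, List<String> children) {
        List<String> moved = new ArrayList<>();
        Set<String> allowed = rules.get(parent);
        if (allowed == null) {
            return moved; // no restriction recorded for this parent
        }
        for (String child : children) {
            if (!allowed.contains(child)) {
                moved.add(child);
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        List<String> children = List.of("div");
        System.out.println(hoisted(OLD_RULES, "dl", children)); // [div]
        System.out.println(hoisted(NEW_RULES, "dl", children)); // []
    }
}
```

Under the old table the DIV is reported as moved out of the DL, which matches the broken tree in the report; under the new table it stays in place.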
Test passing in my current code looks like this (removed attributes for clarity):
I'll commit this change ASAP.