HtmlCleaner / Discussion / Help: A possible bug with DomSerializer

Mohsen Saboorian - 2014-06-26

Please see this link first.

There seems to be an issue with recent versions of HtmlCleaner's DomSerializer (new DomSerializer(cp, false)). I clean some HTML documents and export an org.w3c.dom.Document using a DomSerializer constructed with false as its second argument and the following cp settings:

cp.setTransResCharsToNCR(false);
cp.setRecognizeUnicodeChars(true);
cp.setTransSpecialEntitiesToNCR(false);

Now when I call textNode.getNodeValue() to get a text node's value, &nbsp; is returned (while I need its Unicode value). To work around this, I construct the DomSerializer with true as its second parameter. Then another problem arises: the ZWNJ character, although it was present as a raw Unicode character in the original document, is returned as &zwnj; when I call textNode.getNodeValue(). This works correctly with htmlcleaner-2.2 but fails with all newer versions up to the current one (2.8). Also note that when I create new DomSerializer(cp, true), the following code
((TagNode) cleaned.evaluateXPath(xpath)[0]).getText()
returns the proper text without an escaped &zwnj;, but I use another XPath library to obtain text nodes, and it depends on textNode.getNodeValue().
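To illustrate why an escaped entity shows up verbatim in getNodeValue(), here is a minimal sketch that uses only the JDK's DOM API, not HtmlCleaner. It assumes (this is not confirmed in the thread) that the serializer hands already-escaped text to the DOM; the class name TextNodeDemo and the method roundTrip are hypothetical names chosen for this example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class TextNodeDemo {
    // Stores the given string in a DOM text node and reads it back.
    // A DOM text node holds its data verbatim: it does not parse entity references.
    static String roundTrip(String data) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element body = doc.createElement("body");
            doc.appendChild(body);
            body.appendChild(doc.createTextNode(data));
            return body.getFirstChild().getNodeValue();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }
}
```

With this sketch, roundTrip("a&zwnj;b") gives back the seven-character string a&zwnj;b, while roundTrip("a\u200cb") gives back the raw ZWNJ. So if the text reaches the DOM already escaped, any library that relies on getNodeValue() will see the entity text rather than the character.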

Scott Wilson - 2014-06-26

Thanks for the report Mohsen. I think there are several problems in how character encoding is handled, due to the various ways different people want to use the output. See also the discussion on this bug:

https://sourceforge.net/p/htmlcleaner/bugs/118/

Mohsen Saboorian - 2014-06-27

Scott, thanks for the quick response. Actually, I don't see the point of the second parameter of DomSerializer. I believe the cp parameter should suffice for a serializer. When I specify cp.setRecognizeUnicodeChars(true), I expect it to convert any non-Unicode representations (such as NCRs or special entities) to their Unicode values. Perhaps another parameter is required for DomSerializer to specify its output encoding.
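As a sketch of the conversion I expect, and a possible post-processing workaround for values returned by textNode.getNodeValue(), the following JDK-only helper turns numeric character references and a few named entities into their Unicode characters. It is not HtmlCleaner's implementation: the class EntityDecoder, its decode method, and the deliberately tiny entity table are hypothetical, and it needs Java 9 or later.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityDecoder {
    // Hypothetical, minimal table; a complete one would list all HTML named entities.
    private static final Map<String, Integer> NAMED = Map.of(
            "nbsp", 0x00A0,
            "zwnj", 0x200C,
            "zwj", 0x200D);

    // Group 1: hex NCR digits, group 2: decimal NCR digits, group 3: entity name.
    private static final Pattern ENTITY =
            Pattern.compile("&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));");

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer codePoint;
            try {
                if (m.group(1) != null) {
                    codePoint = Integer.parseInt(m.group(1), 16);
                } else if (m.group(2) != null) {
                    codePoint = Integer.parseInt(m.group(2));
                } else {
                    codePoint = NAMED.get(m.group(3)); // null if the name is unknown
                }
            } catch (NumberFormatException e) {
                codePoint = null; // number too large to parse
            }
            // Unknown or invalid references are left exactly as they were.
            String replacement = (codePoint != null && Character.isValidCodePoint(codePoint))
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, EntityDecoder.decode("[BEFORE_ZWNJ]&zwnj;[AFTER_ZWNJ]&nbsp;[END]") yields the string with a raw U+200C and U+00A0 in place of the two entities. Note that the lookup is case-sensitive, so &ZWNJ; would be left untouched.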

Scott Wilson - 2014-06-27

I'm not quite sure either why DomSerializer has an escapeXml parameter - none of the other serializers use it, they just interpret the current properties.

I'll have a go at the test case using ZWNJ and see if I can trace where things are getting confused.

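Here is a test case. Note that I cannot get rid of both &zwnj; and &nbsp; in the output at the same time: each setting of the second parameter leaves one of them escaped. This test passes with htmlcleaner-2.8, so the assertions below capture the current (unwanted) behavior.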

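import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;

// Run with assertions enabled (java -ea); otherwise the assert statements are skipped.
public static void zwnjBug() throws Exception {
    CleanerProperties cp = new CleanerProperties();
    HtmlCleaner hc = new HtmlCleaner(cp);

    // The input contains a raw ZWNJ character (U+200C) and an &nbsp; entity.
    TagNode cleaned = hc.clean("<html><body>[BEFORE_ZWNJ]\u200c[AFTER_ZWNJ]&nbsp;[END]</body></html>");

    // escapeXml = false: ZWNJ stays raw, but &nbsp; is left as a literal entity.
    DomSerializer ds = new DomSerializer(cp, false);
    Document dom = ds.createDOM(cleaned);
    assert "[BEFORE_ZWNJ]\u200c[AFTER_ZWNJ]&nbsp;[END]".equals(dom.getChildNodes().item(0).getTextContent());

    // escapeXml = true: &nbsp; becomes U+00A0, but the raw ZWNJ is turned into &zwnj;.
    ds = new DomSerializer(cp, true);
    dom = ds.createDOM(cleaned);
    assert "[BEFORE_ZWNJ]&zwnj;[AFTER_ZWNJ]\u00a0[END]".equals(dom.getChildNodes().item(0).getTextContent());
}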

Mohsen Saboorian - 2014-07-19

Scott, is there any good news on this issue? Should I file a bug report?
