A possible bug with DomSerializer

Help
2014-06-26
2014-07-19
  • Mohsen Saboorian

    Please see this link first.

    There seems to be an issue with recent versions of HtmlCleaner's DomSerializer (new DomSerializer(cp, false)). I clean some HTML documents and export an org.w3c.dom.Document using a DomSerializer constructed with false as its second argument and a cp configured as follows:

    cp.setTransResCharsToNCR(false);
    cp.setRecognizeUnicodeChars(true);
    cp.setTransSpecialEntitiesToNCR(false);

    Now when I call textNode.getNodeValue() to get a text node's value, &nbsp; is returned (while I need its Unicode value). To fix this, I construct the DomSerializer with true as its second parameter. Now another problem arises: the ZWNJ character, although it was present as its Unicode value in the original document, is returned as &zwnj; when I call textNode.getNodeValue(). This works correctly with htmlcleaner-2.2 but fails with all newer versions up to now (which is 2.8). Also note that when I create new DomSerializer(cp, true), the following code
    ((TagNode) cleaned.evaluateXPath(xpath)[0]).getText()
    returns the proper text without an escaped &zwnj;, but I use another XPath library to obtain text nodes, and it depends on textNode.getNodeValue().
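
    Until the serializer itself is sorted out, one possible stopgap is to post-process the DOM and replace the literal entity strings in its text nodes. The sketch below is only that: it uses plain JDK DOM APIs, the class name TextNodeUnescaper is hypothetical, and the entity table covers just the two entities mentioned in this thread. It assumes the text nodes contain the literal strings &zwnj; and &nbsp; as described above.

```java
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class TextNodeUnescaper {
    // Hypothetical, deliberately tiny table: only the entities from this thread.
    private static final Map<String, String> ENTITIES = Map.of(
            "&zwnj;", "\u200c",
            "&nbsp;", "\u00a0");

    /** Recursively replaces literal entity strings in text nodes with their characters. */
    public static void unescape(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String value = node.getNodeValue();
            for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
                value = value.replace(e.getKey(), e.getValue());
            }
            node.setNodeValue(value);
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            unescape(child);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in DOM built with the JDK; in practice this would be the
        // Document returned by DomSerializer.createDOM(...).
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Node body = dom.appendChild(dom.createElement("body"));
        body.appendChild(dom.createTextNode("[BEFORE_ZWNJ]&zwnj;[AFTER_ZWNJ]&nbsp;[END]"));

        unescape(dom);

        String expected = "[BEFORE_ZWNJ]\u200c[AFTER_ZWNJ]\u00a0[END]";
        System.out.println(expected.equals(body.getTextContent()));
    }
}
```

    Calling TextNodeUnescaper.unescape(dom) before handing the Document to the other XPath library would make textNode.getNodeValue() return the Unicode characters. Note that this would also rewrite text that was meant to contain a literal "&zwnj;", so it is a workaround rather than a fix.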

     
  • Scott Wilson

    Scott Wilson - 2014-06-26

    Thanks for the report Mohsen. I think there are several problems in how character encoding is handled, due to the various ways different people want to use the output. See also the discussion on this bug:

    https://sourceforge.net/p/htmlcleaner/bugs/118/

     
  • Mohsen Saboorian

    Scott, thanks for the quick response. Actually, I do not see the point of the second parameter of DomSerializer. I believe a CleanerProperties (cp) parameter should suffice for the serializer. When I specify cp.setRecognizeUnicodeChars(true), I expect it to convert any non-Unicode representations (such as NCRs or special entities) to their Unicode values. Perhaps another parameter is required for DomSerializer to specify its output encoding.
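
    To make the expectation concrete, here is a minimal sketch of what "convert NCRs to their Unicode values" means for numeric character references. It is an illustration only, not how HtmlCleaner implements setRecognizeUnicodeChars; the class name NcrDecoder is hypothetical, and named entities are not handled here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NcrDecoder {
    // Matches decimal (&#8204;) and hexadecimal (&#x200C;) numeric character references.
    // Digit counts are capped so Integer.parseInt cannot overflow.
    private static final Pattern NCR =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    /** Replaces every valid NCR in the input with the character it denotes. */
    public static String decode(String input) {
        Matcher m = NCR.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Leave references to invalid code points untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("[BEFORE_ZWNJ]&#8204;[AFTER_ZWNJ]&#xA0;[END]");
        System.out.println("[BEFORE_ZWNJ]\u200c[AFTER_ZWNJ]\u00a0[END]".equals(decoded));
    }
}
```

    With this behaviour, both &#8204; and &#x200C; would come out as the ZWNJ character itself, which is what I would expect to see from textNode.getNodeValue().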

     
  • Scott Wilson

    Scott Wilson - 2014-06-27

    I'm not quite sure either why DomSerializer has an escapeXml parameter - none of the other serializers use it, they just interpret the current properties.

    I'll have a go at the test case using ZWNJ and see if I can trace where things are getting confused.

     
  • Mohsen Saboorian

    Here is a test case. Note that I cannot get rid of both &zwnj; and &nbsp; in the output. This test passes with htmlcleaner-2.8:

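    import org.htmlcleaner.CleanerProperties;
    import org.htmlcleaner.DomSerializer;
    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;
    import org.w3c.dom.Document;

    // Run with assertions enabled (java -ea), otherwise the asserts are skipped.
    public static void zwnjBug() throws Exception {
        CleanerProperties cp = new CleanerProperties();
        HtmlCleaner hc = new HtmlCleaner(cp);

        TagNode cleaned = hc.clean("<html><body>[BEFORE_ZWNJ]\u200c[AFTER_ZWNJ]&nbsp;[END]</body></html>");

        // escapeXml = false: ZWNJ survives, but &nbsp; is left as a literal entity.
        DomSerializer ds = new DomSerializer(cp, false);
        Document dom = ds.createDOM(cleaned);
        assert "[BEFORE_ZWNJ]\u200c[AFTER_ZWNJ]&nbsp;[END]".equals(dom.getChildNodes().item(0).getTextContent());

        // escapeXml = true: &nbsp; becomes U+00A0, but ZWNJ comes back as a literal &zwnj;.
        ds = new DomSerializer(cp, true);
        dom = ds.createDOM(cleaned);
        assert "[BEFORE_ZWNJ]&zwnj;[AFTER_ZWNJ]\u00a0[END]".equals(dom.getChildNodes().item(0).getTextContent());
    }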
    
     
  • Mohsen Saboorian

    Scott, is there any good news on this issue? Should I file a bug report?

     
