
#218 Nodes are disappearing

Group: v2.30
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2023-06-19
Created: 2020-01-22
Private: No

Hi,

We found strange behaviour in release v2.23, where some nodes disappear from the parsed DOM:

public class BugReport {
    public static void main(String[] args) {
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setOmitComments(true);
        try {

            Document doc = new DomSerializer(cleaner.getProperties()).createDOM(cleaner.clean(("<html>\n<body>\n<dl> \n<div class=\"a\">\n<label class=\"b\">bb<em>*</em></label>\n<select onchange=\"c\" \nid=\"cc\" name=\"ccc\" \n class=\"cccc\">\n<option value=\"xxx\" id=\"foo\">d</option>\n</select>\n</div>\n</dl>\n</body>\n</html>")));
            LSSerializer lsSerializer = ((DOMImplementationLS)(doc.getImplementation().getFeature("LS", "3.0"))).createLSSerializer();
            NodeList childNodes = doc.getChildNodes();
            StringBuilder sb = new StringBuilder();
            System.out.println(childNodes.getLength());
            for (int i = 0; i < childNodes.getLength(); ++i) {
                for (int x = 0; x < childNodes.item(i).getChildNodes().getLength(); ++i) {
                    printSub(childNodes.item(i).getChildNodes().item(x), lsSerializer);
                }
                sb.append(lsSerializer.writeToString(childNodes.item(i)));
            }
            System.out.println(sb.toString());
        } catch (ParserConfigurationException e) {

        }
    }

    public static void printSub(Node node, LSSerializer s) {
        if (node.getChildNodes() != null) {
            for (int i = 0; i < node.getChildNodes().getLength(); ++i) {
                System.out.println(s.writeToString(node.getChildNodes().item(i)));
                System.out.println("-------------------------");
            }
        }
    }
}

Discussion

  • Scott Wilson

    Scott Wilson - 2020-01-23

    Thanks for the report, Dennis. I'll check it out in the morning and post an update with what I find.

     
  • Scott Wilson

    Scott Wilson - 2020-01-24

    This line is in error:

                for (int x = 0; x < childNodes.item(i).getChildNodes().getLength(); ++i) {
    

    Should be:

                for (int x = 0; x < childNodes.item(i).getChildNodes().getLength(); ++x) {
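For reference, here is a minimal sketch of the corrected nested loop. It uses only the JDK's own XML parser rather than HtmlCleaner, and the class and method names (`NestedLoopSketch`, `grandchildNames`, `parse`) are made up for illustration. The point is just that the inner loop has to advance its own counter `x`; incrementing `i` there leaves `x` stuck at 0 and walks the outer index past the end of the list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NestedLoopSketch {

    // Collects the node names of all grandchildren of the given node.
    public static List<String> grandchildNames(Node root) {
        List<String> names = new ArrayList<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); ++i) {
            NodeList grandchildren = children.item(i).getChildNodes();
            // The inner loop increments x, not the outer counter i.
            for (int x = 0; x < grandchildren.getLength(); ++x) {
                names.add(grandchildren.item(x).getNodeName());
            }
        }
        return names;
    }

    // Hypothetical helper: parses well-formed XML with the JDK parser.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head/><body/></html>");
        System.out.println(grandchildNames(doc)); // prints [head, body]
    }
}
```

The same loop shape should work on a document produced by `DomSerializer`, since it only relies on the standard `org.w3c.dom` interfaces.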
    
     
  • Dennis Ratzke

    Dennis Ratzke - 2020-01-27

    Hi Scott, thanks for the fast response. I'll check it again.

     
  • Scott Wilson

    Scott Wilson - 2020-04-07
    • Group: v2.23 --> v2.24
     
  • Scott Wilson

    Scott Wilson - 2020-04-29
    • Group: v2.24 --> v2.25
     
  • Lukas Lehmann

    Lukas Lehmann - 2020-07-07

    Hi Scott,

    To clarify what Dennis meant in his bug report, I adapted his code:

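    import javax.xml.parsers.ParserConfigurationException;

    import org.htmlcleaner.CleanerProperties;
    import org.htmlcleaner.DomSerializer;
    import org.htmlcleaner.HtmlCleaner;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;

    public class BugReport {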
        public static void main(String[] args) {
            HtmlCleaner cleaner = new HtmlCleaner();
            CleanerProperties props = cleaner.getProperties();
            props.setAllowHtmlInsideAttributes(true);
            props.setAllowMultiWordAttributes(true);
            props.setOmitComments(true);
            try {
                String htmlString = "<html>\n<body>\n<dl> \n<div class=\"a\">\n<label class=\"b\">bb<em>*</em></label>\n<select onchange=\"c\" \nid=\"cc\" name=\"ccc\" \n class=\"cccc\">\n<option value=\"xxx\" id=\"foo\">d</option>\n</select>\n</div>\n</dl>\n</body>\n</html>";
                System.out.println("Input html string:\n\n" + htmlString + "\n\n");
                Document doc = new DomSerializer(cleaner.getProperties()).createDOM(cleaner.clean(htmlString));
                System.out.println("Parsed document tree structure:\n");
                printTree(doc);
            } catch (ParserConfigurationException e) {
                e.printStackTrace(); // do not swallow parser setup failures silently
            }
        }
    
        private static void printTree(Node node) {
            printTreeLevel(node, 0);
        }
    
        private static void printTreeLevel(Node node, int level) {
            for (int i = 0; i < level; i++) {
                System.out.print("--");
            }
            if (node.getNodeName().startsWith("#")) {
                printSpecialTreeNode(node);
            } else {
                printTagTreeNode(node);
            }
            for (int i = 0; i < node.getChildNodes().getLength(); i++) {
                printTreeLevel(node.getChildNodes().item(i), level + 1);
            }
        }
    
        private static void printTagTreeNode(Node node) {
            System.out.print("<"  + node.getNodeName());
            for (int i = 0; i < node.getAttributes().getLength(); i++) {
                Node attribute = node.getAttributes().item(i);
                System.out.print(" " + attribute.getNodeName() + "='" + attribute.getNodeValue() + "'");
            }
            System.out.println(">");
        }
    
        private static void printSpecialTreeNode(Node node) {
            if ("#text".equals(node.getNodeName())) { // compare string contents, not references
                System.out.println(node.getNodeName() + ": '" + node.getTextContent().replace("\n", "\\n") + "'");
            } else {
                System.out.println(node.getNodeName());
            }
        }
    }
    

    The input HTML string is:

    <html>
    <body>
    <dl> 
    <div class="a">
    <label class="b">bb<em>*</em></label>
    <select onchange="c" 
    id="cc" name="ccc" 
     class="cccc">
    <option value="xxx" id="foo">d</option>
    </select>
    </div>
    </dl>
    </body>
    </html>
    

    With version 2.6.1 the parsed DOM tree looks as follows:

    #document
    --<html>
    ----<head>
    ----<body>
    ------#text: '\n'
    ------#text: '\n'
    ------<dl>
    --------#text: ' \n'
    --------<div class='a'>
    ----------#text: '\n'
    ----------<label class='b'>
    ------------#text: 'bb'
    ------------<em>
    --------------#text: '*'
    ----------#text: '\n'
    ----------<select class='cccc' id='cc' name='ccc' onchange='c'>
    ------------#text: '\n'
    ------------<option id='foo' value='xxx'>
    --------------#text: 'd'
    ------------#text: '\n'
    ----------#text: '\n'
    --------#text: '\n'
    ------#text: '\n'
    ------#text: '\n'
    

    With version 2.23 the parsed DOM tree looks as follows:

    #document
    --<html>
    ----<head>
    ----<body>
    ------#text: '\n'
    ------#text: '\n'
    ------<div class='a'>
    --------<label class='b'>
    ----------<em>
    ------------<select class='cccc' id='cc' name='ccc' onchange='c'>
    ------<dl>
    --------#text: ' \n'
    --------#text: '\n'
    --------#text: 'bb'
    --------#text: '*'
    --------#text: '\n'
    --------#text: '\n'
    --------#text: 'd'
    --------#text: '\n'
    --------#text: '\n'
    --------#text: '\n'
    ------#text: '\n'
    ------#text: '\n'
    

    So the tree structure is broken: 'dl' and 'div' end up on the same level, although 'div' should be a child of 'dl'. In addition, the 'option' node is missing.
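The two symptoms can also be checked programmatically. Below is a rough sketch that tests a `org.w3c.dom.Document` for both conditions. The class and method names (`DlStructureCheck`, `divIsChildOfDl`, `hasOption`, `parseXml`) are hypothetical, and the demo builds its document with the JDK XML parser (the sample markup happens to be well-formed XML), not with HtmlCleaner. To test HtmlCleaner itself, one would pass in the document returned by `DomSerializer.createDOM` instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DlStructureCheck {

    // Returns the first element with the given tag name, or null if there is none.
    static Element first(Document doc, String tag) {
        NodeList hits = doc.getElementsByTagName(tag);
        return hits.getLength() == 0 ? null : (Element) hits.item(0);
    }

    // True if the first <div> in the document is a direct child of a <dl>.
    public static boolean divIsChildOfDl(Document doc) {
        Element div = first(doc, "div");
        if (div == null) {
            return false;
        }
        Node parent = div.getParentNode();
        return parent != null && "dl".equals(parent.getNodeName());
    }

    // True if the document contains at least one <option> element.
    public static boolean hasOption(Document doc) {
        return first(doc, "option") != null;
    }

    // Hypothetical helper: builds a reference tree with the JDK XML parser.
    public static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Same nesting as the input above, with attributes and whitespace removed.
        Document doc = parseXml("<html><body><dl><div><label>bb<em>*</em></label>"
                + "<select><option>d</option></select></div></dl></body></html>");
        System.out.println("div is child of dl: " + divIsChildOfDl(doc));
        System.out.println("option present:     " + hasOption(doc));
    }
}
```

On a tree shaped like the 2.23 output above, where 'div' sits next to 'dl' and 'option' is gone, both checks would return false.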

     
    • Scott Wilson

      Scott Wilson - 2020-07-07

      Thanks for the detailed report - I'll look into it.

       
    • Scott Wilson

      Scott Wilson - 2020-07-07

      OK, I think I can see what is happening.

      I'm not sure when DIV became allowed in a DL alongside DT and DD; it looks like a recent spec change. In any case, HC is out of step with HTML 5.2, so I've updated the rule.

       
      • Scott Wilson

        Scott Wilson - 2020-07-07

        Because the only things allowed in a DL were DT and DD, HC was moving everything else outside of it, screwing up the tree. Allowing DIV, and also adding DIV as preferred content, seems to improve the model quite a bit.
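As a toy illustration of that effect (this is not HC's actual tag model or code; `DlContentRuleSketch` and `hoisted` are invented names): treat the rule as a set of tag names a DL may keep, and move every other child out. With only DT and DD allowed, a DIV child is hoisted; once DIV is added to the set, it stays put. The "preferred content" part of the change is not modelled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DlContentRuleSketch {

    // Returns the children that would be moved out of the <dl>
    // because the rule does not allow them inside it.
    public static List<String> hoisted(Set<String> allowedInDl, List<String> children) {
        List<String> moved = new ArrayList<>();
        for (String child : children) {
            if (!allowedInDl.contains(child)) {
                moved.add(child);
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        Set<String> oldRule = new HashSet<>(Arrays.asList("dt", "dd"));
        Set<String> newRule = new HashSet<>(Arrays.asList("dt", "dd", "div"));
        List<String> children = Arrays.asList("dt", "dd", "div");

        System.out.println("old rule hoists: " + hoisted(oldRule, children)); // [div]
        System.out.println("new rule hoists: " + hoisted(newRule, children)); // []
    }
}
```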

        The test that now passes in my current code looks like this (attributes removed for clarity):

                initial = "<dl>\n" +
                        "<div>\n" + 
                        "<label>bb<em>*</em></label>\n" + 
                        "<select>\n" + 
                        "<option>d</option>\n" + 
                        "</select>\n" + 
                        "</div>\n" + 
                        "</dl>\n";
                expected = "<html><head></head><body>" + 
                        "<dl>\n"+
                        "<div>\n" +
                        "<label>bb<em>*</em></label>\n" +
                        "<select>\n" +
                        "<option>d</option>\n" +
                        "</select>\n" +
                        "</div>\n" +
                        "</dl>\n"+
                        "</body></html>";
                assertCleanedHtml(initial, expected);
        

        I'll commit this change ASAP.

         
  • Scott Wilson

    Scott Wilson - 2021-09-24
    • Group: v2.25 --> v2.26
     
  • Scott Wilson

    Scott Wilson - 2023-04-29
    • Group: v2.26 --> v2.29
     
  • Scott Wilson

    Scott Wilson - 2023-06-19
    • Group: v2.29 --> v2.30
     
