Menu

#100 HTMLCleaner rearrange attribute order by alphabetical order

v 2.9
closed-wont-fix
nobody
None
5
2014-08-17
2013-12-05
No

I think it should only do cleaning and this reordering shouldn't be done by default since it doesn't make the HTML more valid. It could be an option if someone still wants it (even though I have a hard time understanding the use case ;)).

Thanks!

Discussion

  • Scott Wilson

    Scott Wilson - 2013-12-06
    • status: open --> open-accepted
    • Group: v 2.7 --> v 2.8
     
  • Scott Wilson

    Scott Wilson - 2013-12-06

    Moving to fix in 2.8

     
  • Scott Wilson

    Scott Wilson - 2014-03-18

    Moving to 2.9

     
  • Scott Wilson

    Scott Wilson - 2014-03-18
    • Group: v 2.8 --> v 2.9
     
  • Vincent Massol

    Vincent Massol - 2014-03-18

    Getting ready for a 2.8 release? :)

    Too bad this is slipping from 2.8 since I consider this a big flaw of HC to change the order of attributes by default (it's not supposed to do that if not asked since for ex if you give a perfectly valid XHTML document you don't want it modified). Now I can perfectly understand you don't have the time to work on this ATM :)

     
  • Scott Wilson

    Scott Wilson - 2014-03-18

    Don't worry, it won't be a long wait until 2.9 :)

     
  • Scott Wilson

    Scott Wilson - 2014-03-23

    I can't seem to replicate the sorting behaviour... it seems to preserve attribute order, which I'd expect as its backed by a LinkedList. Is there a particular Serializer where sorting occurs?

        @Test
        public void testAttributesNoSortingXml() throws IOException{
            CleanerProperties cleanerProperties = new CleanerProperties();
            cleanerProperties.setOmitCdataOutsideScriptAndStyle(true);
            cleanerProperties.setAddNewlineToHeadAndBody(false);
            cleaner = new HtmlCleaner(cleanerProperties);
            serializer = new SimpleXmlSerializer(cleaner.getProperties());
    
            String input = "<div a=\"1\" x=\"2\" z=\"3\" b=\"4\"></div>";
            assertHTML(input, input);
        }
    
     
  • Vincent Massol

    Vincent Massol - 2014-03-24

    Hi Scott,

    Thanks for looking into this. The use case I had is described at https://sourceforge.net/p/htmlcleaner/bugs/99/#cfc2 (The "fill" attribute's position is changed).

    Thanks

     
  • Scott Wilson

    Scott Wilson - 2014-03-24

    That example is the one I'm using as "test18.html" along with "test18_expected.html" and doesn't seem to show this behaviour any more; perhaps this is a side effect of fixing other issues, or there is a particular combination of settings that causes it?

     
  • Vincent Massol

    Vincent Massol - 2014-03-24

    I've tested again with 2.7 and by upgrading to 2.8 and the attribute is still not preserved using this input:

    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
      <head>
        <title>Test SVG cleaning</title>
      </head>
      <body>
        <p>before</p>
        <svg xmlns="http://www.w3.org/2000/svg" version="1.1">
          <circle cx="100" cy="50" r="40" stroke="black" stroke-width="2" fill="red"/>
        </svg>
        <p>after</p>
      </body>
    </html>
    ~~~~~~~
    
    Specifically I get:
    
    <html><head></head><body>

    before

    after

    </body></html> ~~~~~~

    As you can see the "fill" parameter position is modified.

    Thanks

     

    Last edit: Vincent Massol 2014-03-24
  • Scott Wilson

    Scott Wilson - 2014-03-24

    That is really strange. I try the same test and get the attributes back in the same order as the source HTML. Specifically:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
        <title>Test SVG cleaning</title>
    
      </head>
    <body>
    
        <p>before</p>
        <svg xmlns="http://www.w3.org/2000/svg" version="1.1">
          <circle cx="100" cy="50" r="40" stroke="black" stroke-width="2" fill="red" />
        </svg>
        <p>after</p>
    
    </body></html>
    

    What are the specific cleaner settings and Serializer implementation you're using?

     
  • Vincent Massol

    Vincent Massol - 2014-03-24

    ~~~~~~
    CleanerProperties defaultProperties = new CleanerProperties();
    defaultProperties.setOmitUnknownTags(true);

        // HTML Cleaner uses the compact notation by default but we don't want that since:
        // - it's more work and not required since not compact notation is valid XHTML
        // - expanded elements can also be rendered fine in browsers that only support HTML.
        defaultProperties.setUseEmptyElementTags(false);
    
        // Wrap script and style content in CDATA blocks
        defaultProperties.setUseCdataForScriptAndStyle(true);
    
        // We need this for example to ignore CDATA sections not inside script or style elements.
        defaultProperties.setIgnoreQuestAndExclam(true);
    
        // Remove CDATA outside of script and style since according to the spec it has no effect there.
        defaultProperties.setOmitCdataOutsideScriptAndStyle(true);
    

    ~~~~~~~

    I'm using the XWikiDOMSerializer (https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/XWikiDOMSerializer.java) but I've just tried using the default DomSerializer and I get the same problem...

    IMO the problem is in DomSerializer (I believe I copied some of its code when writing XWikiDOMSerializer so it's likely both suffer from the same issue!).

     
  • Scott Wilson

    Scott Wilson - 2014-03-24

    Got it - I did this (using the same cleaner settings):

    String initial = readFile("src/test/resources/test18.html");        
                DomSerializer ser = new DomSerializer(cleaner.getProperties());
                Document doc = ser.createDOM(cleaner.clean(initial));
                Element circle = (Element) doc.getElementsByTagName("circle").item(0);
                for (int i=0;i<circle.getAttributes().getLength();i++){
                    System.out.println(circle.getAttributes().item(i));
                }
    

    And in console I get:

    cx="100"
    cy="50"
    fill="red"
    r="40"
    stroke="black"
    stroke-width="2"
    

    ... which suggests that the DOM created by DomSerializer is returning attributes in alphabetical order, even though I'm pretty sure its processing them in document order.

    Now I've got a test case I can see if this if fixable...

     
  • Scott Wilson

    Scott Wilson - 2014-03-24

    OK, having had a look into it I think the problem is the DOM itself. Attributes in DOM are backed with a NamedNodeMap; this doesn't guarantee any particular ordering - and it seems that the Java implementation uses something that sorts them.

    For example, looking at the Xerces source code, I can see every time you add a new attribute it places it in the ArrayList backing it in alpha name order: http://svn.apache.org/viewvc/xerces/java/trunk/src/org/apache/xerces/dom/NamedNodeMapImpl.java?view=markup

    So I don't think this is something we can fix in HC.

     
  • Scott Wilson

    Scott Wilson - 2014-05-23
    • status: open-accepted --> closed-wont-fix
     

Log in to post a comment.