I think it should only do cleaning and this reordering shouldn't be done by default since it doesn't make the HTML more valid. It could be an option if someone still wants it (even though I have a hard time understanding the use case ;)).
Too bad this is slipping from 2.8, since I consider it a big flaw of HC to change the order of attributes by default (it's not supposed to do that unless asked; for example, if you give it a perfectly valid XHTML document, you don't want it modified). That said, I can perfectly understand that you don't have the time to work on this ATM :)
I can't seem to replicate the sorting behaviour... it seems to preserve attribute order, which I'd expect as it's backed by a LinkedList. Is there a particular Serializer where sorting occurs?
    @Test
    public void testAttributesNoSortingXml() throws IOException {
        CleanerProperties cleanerProperties = new CleanerProperties();
        cleanerProperties.setOmitCdataOutsideScriptAndStyle(true);
        cleanerProperties.setAddNewlineToHeadAndBody(false);
        cleaner = new HtmlCleaner(cleanerProperties);
        serializer = new SimpleXmlSerializer(cleaner.getProperties());
        // Attributes are deliberately not in alphabetical order.
        String input = "<div a=\"1\" x=\"2\" z=\"3\" b=\"4\"></div>";
        // The serialized output should be identical to the input.
        assertHTML(input, input);
    }
That example is the one I'm using as "test18.html" along with "test18_expected.html" and doesn't seem to show this behaviour any more; perhaps this is a side effect of fixing other issues, or there is a particular combination of settings that causes it?
~~~~~~
CleanerProperties defaultProperties = new CleanerProperties();
defaultProperties.setOmitUnknownTags(true);
// HTML Cleaner uses the compact notation by default, but we don't want that since:
// - it's more work and not required, as the non-compact notation is valid XHTML
// - expanded elements also render fine in browsers that only support HTML.
defaultProperties.setUseEmptyElementTags(false);
// Wrap script and style content in CDATA blocks
defaultProperties.setUseCdataForScriptAndStyle(true);
// We need this for example to ignore CDATA sections not inside script or style elements.
defaultProperties.setIgnoreQuestAndExclam(true);
// Remove CDATA outside of script and style since according to the spec it has no effect there.
defaultProperties.setOmitCdataOutsideScriptAndStyle(true);
~~~~~~~
I'm using the XWikiDOMSerializer (https://github.com/xwiki/xwiki-commons/blob/master/xwiki-commons-core/xwiki-commons-xml/src/main/java/org/xwiki/xml/internal/html/XWikiDOMSerializer.java) but I've just tried using the default DomSerializer and I get the same problem...
IMO the problem is in DomSerializer (I believe I copied some of its code when writing XWikiDOMSerializer so it's likely both suffer from the same issue!).
Got it - I did this (using the same cleaner settings):
    String initial = readFile("src/test/resources/test18.html");
    DomSerializer ser = new DomSerializer(cleaner.getProperties());
    Document doc = ser.createDOM(cleaner.clean(initial));
    Element circle = (Element) doc.getElementsByTagName("circle").item(0);
    // Print the attributes in the order the DOM exposes them.
    for (int i = 0; i < circle.getAttributes().getLength(); i++) {
        System.out.println(circle.getAttributes().item(i));
    }
The console output suggests that the DOM created by DomSerializer is returning attributes in alphabetical order, even though I'm pretty sure it's processing them in document order.
Now that I've got a test case, I can see if this is fixable...
OK, having had a look into it, I think the problem is the DOM itself. Attributes in DOM are backed by a NamedNodeMap, which doesn't guarantee any particular ordering, and it seems that the Java implementation uses something that sorts them.
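To make the NamedNodeMap point concrete, here is a minimal sketch that uses only the JDK's DOM API, without HtmlCleaner. The class name `AttributeOrderDemo`, the helper `domAttributeNames`, and the attribute values are made up for illustration. It sets attributes in one order and reads them back by index; whatever order comes back is an implementation detail of the DOM in use, so it may well differ from the insertion order.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;

// Hypothetical demo class, not part of HtmlCleaner.
class AttributeOrderDemo {

    /** Returns attribute names in whatever order the DOM implementation exposes them. */
    static List<String> domAttributeNames(String[][] attrs) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable DOM implementation", e);
        }
        Element circle = doc.createElement("circle");
        // Set the attributes in the caller's order.
        for (String[] attr : attrs) {
            circle.setAttribute(attr[0], attr[1]);
        }
        // Read them back by index; NamedNodeMap makes no promise about this order.
        NamedNodeMap map = circle.getAttributes();
        List<String> names = new ArrayList<>();
        for (int i = 0; i < map.getLength(); i++) {
            names.add(map.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) {
        String[][] attrs = {{"r", "40"}, {"fill", "red"}, {"cx", "50"}};
        System.out.println(domAttributeNames(attrs));
    }
}
```

Code that needs a specific attribute is safer looking it up by name (`getNamedItem`) than relying on the index order.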
Moving to fix in 2.8
Moving to 2.9
Getting ready for a 2.8 release? :)
Don't worry, it won't be a long wait until 2.9 :)
Hi Scott,
Thanks for looking into this. The use case I had is described at https://sourceforge.net/p/htmlcleaner/bugs/99/#cfc2 (The "fill" attribute's position is changed).
Thanks
I've tested again with 2.7, and again after upgrading to 2.8, and the attribute order is still not preserved with this input:
before
after
</body></html>
As you can see, the position of the "fill" attribute is modified.
Thanks
Last edit: Vincent Massol 2014-03-24
That is really strange. I tried the same test and got the attributes back in the same order as the source HTML. Specifically:
What are the specific cleaner settings and Serializer implementation you're using?
Looking at the Xerces source code, I can see that every time you add a new attribute, it is inserted into the ArrayList backing the NamedNodeMap in alphabetical order by name: http://svn.apache.org/viewvc/xerces/java/trunk/src/org/apache/xerces/dom/NamedNodeMapImpl.java?view=markup
So I don't think this is something we can fix in HC.
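The insertion behaviour described for Xerces can be illustrated with a simplified sketch. This is not the Xerces code: `SortedNameList` is a hypothetical class that only mimics the idea of finding the sorted position for each new name and inserting it there, which is why the original document order is lost as soon as the attributes are stored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of name-sorted insertion; not taken from Xerces.
class SortedNameList {
    private final List<String> names = new ArrayList<>();

    /** Inserts a name at its sorted position; a name already present is left as-is. */
    void add(String name) {
        int pos = Collections.binarySearch(names, name);
        if (pos < 0) {
            // binarySearch returns (-(insertion point) - 1) when the name is absent.
            names.add(-pos - 1, name);
        }
    }

    List<String> asList() {
        return Collections.unmodifiableList(names);
    }

    public static void main(String[] args) {
        SortedNameList list = new SortedNameList();
        for (String n : new String[] {"r", "fill", "cx"}) {
            list.add(n);
        }
        System.out.println(list.asList()); // prints [cx, fill, r]
    }
}
```

Because the sorted position is computed at insertion time, nothing in such a structure records the order in which names arrived, so a serializer reading from it cannot recover the source order.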