Examples:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <title>Test SVG cleaning</title>
  </head>
  <body>
    <p>before</p>
    <svg xmlns="http://www.w3.org/2000/svg" version="1.1">
      <circle cx="100" cy="50" r="40" stroke="black" stroke-width="2" fill="red"/>
    </svg>
    <p>after</p>
  </body>
</html>
or
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en" xmlns:svg="http://www.w3.org/2000/svg">
  <head>
    <title>Test SVG cleaning</title>
  </head>
  <body>
    <p>before</p>
    <svg:svg version="1.1">
      <svg:circle cx="100" cy="50" r="40" stroke="black" stroke-width="2" fill="red"/>
    </svg:svg>
    <p>after</p>
  </body>
</html>
The full details are available at http://jira.xwiki.org/browse/XWIKI-9753
Thanks! Note that this is actually preventing us from releasing XWiki 5.3 ATM so if you know of a workaround that would be awesome! :)
Thanks a lot
Thanks Vincent.
I've had a go with the two examples, and the output looks OK - the SVG tags are retained, as are the namespace declarations and prefixes. Is there a good example demonstrating all these problems?
OK, I had a look at issue 1 - the title tag. This is actually quite tricky to deal with, as at the time the HTML is cleaned, we don't know the actual namespace of a token when we're trying to identify it (which is bad news for handling conflicting tag names).
A workaround for this specific issue is to make title non-unique in DefaultTagProvider, though obviously that has side effects in that it will allow multiple title tags to occur.
Thanks for looking so quickly into this Scott!
FTR I've created this unit test in XWiki land:
And when I execute it, I get the following output:
So the SVG tag is completely stripped.
Do you have setOmitUnknownTags = true ?
Indeed I have setOmitUnknownTags set to true because we want to remove unknown tags that don't generate valid XHTML.
However, in this case svg is not an unknown tag because it has a namespace. For me, "OmitUnknownTags" applies to unknown HTML tags, i.e. those from the default XHTML namespace, while tags from other namespaces should be preserved.
i.e. HTMLCleaner should clean only tags from the default (XHTML) namespace
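To make the requested behaviour concrete, here is a minimal sketch of the decision rule, not HtmlCleaner code. The class name, the method name and the tiny list of known tags are all hypothetical, chosen only for illustration.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: should an "omit unknown tags" rule drop this element? */
class UnknownTagPolicy {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Illustrative subset only; a real cleaner keeps a full table of HTML tags.
    static final Set<String> KNOWN_HTML = Set.of("html", "head", "title", "body", "p");

    /** Returns true only for elements in the XHTML (or no) namespace whose name is not known. */
    static boolean shouldOmit(String namespaceUri, String localName) {
        boolean isHtml = namespaceUri == null
                || namespaceUri.isEmpty()
                || XHTML_NS.equals(namespaceUri);
        if (!isHtml) {
            return false; // foreign markup such as SVG is left alone
        }
        return !KNOWN_HTML.contains(localName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(shouldOmit(XHTML_NS, "blink"));
        System.out.println(shouldOmit("http://www.w3.org/2000/svg", "circle"));
    }
}
```

With a rule like this, an unknown tag in the XHTML namespace is dropped while svg and circle survive, provided the namespace of each element is known when the rule runs.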
Last edit: Vincent Massol 2013-12-02
OK, next up - case sensitive names.
At present TagNode sets these to lowercase.
However, we can retain the original name, and then leave it up to the serializer whether to use the lowercase name or the original case.
For example, XmlSerializer can look to see if the tagNode is recognized in DefaultTagProvider, and if it isn't, serialize the original name instead of the default lowercase one.
Of course this then opens up the possibility of error if the open and close tags no longer match :(
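A rough sketch of that idea follows. It assumes a hypothetical tag class that is not HtmlCleaner's TagNode, plus a made-up set of known HTML names. Emitting the closing tag from the same stored name as the opening tag is one way to avoid the mismatch mentioned above.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: keep both spellings and let the serializer pick one. */
class CasePreservingTag {
    // Illustrative subset of names a tag provider would recognize.
    static final Set<String> KNOWN_HTML = Set.of("p", "div", "title");

    final String originalName; // exactly as written in the source
    final String lowerName;    // normalized form used for lookups

    CasePreservingTag(String name) {
        this.originalName = name;
        this.lowerName = name.toLowerCase(Locale.ROOT);
    }

    /** Recognized HTML tags are lowercased; anything else keeps its original case. */
    String serializedName() {
        return KNOWN_HTML.contains(lowerName) ? lowerName : originalName;
    }

    /** Open and close tags are built from the same name, so they always match. */
    String serialize(String content) {
        String name = serializedName();
        return "<" + name + ">" + content + "</" + name + ">";
    }

    public static void main(String[] args) {
        System.out.println(new CasePreservingTag("DIV").serialize("x"));
        System.out.println(new CasePreservingTag("linearGradient").serialize(""));
    }
}
```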
Now even with setOmitUnknownTags set to true I still see some problems in addition to the ones mentioned above:
Several extra new lines
The "fill" parameter's position is changed.
See https://www.evernote.com/shard/s119/sh/71fb30f1-eb5c-4d19-af6e-fb30e4a837e2/6ba4418409fe0a4969f7cd60186cf16d
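On the attribute position point, a test that compares cleaned output with expected output can be made insensitive to attribute order by comparing DOM nodes instead of strings. The sketch below uses only the JDK's DOM parser and Node.isEqualNode; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Sketch: compare two XML fragments while ignoring attribute order. */
class AttributeOrderCheck {
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse: " + xml, e);
        }
    }

    /** isEqualNode matches attributes by name and value, not by position. */
    static boolean sameElement(String a, String b) {
        return parse(a).isEqualNode(parse(b));
    }

    public static void main(String[] args) {
        String original = "<circle cx=\"100\" r=\"40\" fill=\"red\"/>";
        String reordered = "<circle fill=\"red\" cx=\"100\" r=\"40\"/>";
        System.out.println(sameElement(original, reordered));
    }
}
```

Whitespace-only text nodes still count as children in such a comparison, so the extra newlines would need to be normalized separately.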
I think the issue here is that the order of processing doesn't support this kind of case very well. The tree is constructed and tokens cleaned up, and only then do we have a model for checking which namespace a token is in. However, by that point we've already run many of the cleaning rules from which we would want to exempt foreign markup.
"the order of attribute specifications in a start-tag or empty-element tag is not significant." - http://www.w3.org/TR/REC-xml/#sec-starttags
This is a quick fix that would work for one, but not both, of the examples:
It's difficult to see how we can check for namespace inheritance because at this point we haven't actually built the tree, so we don't know the ancestors of the current tag.
I think I may have actually cracked it... I had a go at modifying how the internal tree is built taking account of xmlns attributes and ns prefixes, and I think I've got something that will work for the main issue of having HC behave in a way that is namespace-aware, while at the same time pruning unknown HTML tags.
To be honest I'm not 100% sure about putting this into a release as it's a behaviour change and I need to make sure other users are happy with this.
However, feel free to try the attached patch and see if it does what you expect.
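For readers following along, the general technique of tracking xmlns declarations while the tree is being built can be sketched as a stack of prefix scopes. This is an illustration under assumed names only; it is not the attached patch and does not reflect HtmlCleaner's internals.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: resolve tag namespaces from xmlns attributes of open ancestors. */
class NamespaceScopes {
    // One map of prefix -> namespace URI per open element; "" is the default namespace.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    /** Call when an element opens: record any xmlns / xmlns:prefix attributes it declares. */
    void push(Map<String, String> attributes) {
        Map<String, String> declared = new HashMap<>();
        for (Map.Entry<String, String> attr : attributes.entrySet()) {
            String name = attr.getKey();
            if (name.equals("xmlns")) {
                declared.put("", attr.getValue());
            } else if (name.startsWith("xmlns:")) {
                declared.put(name.substring("xmlns:".length()), attr.getValue());
            }
        }
        scopes.push(declared);
    }

    /** Call when an element closes. */
    void pop() {
        scopes.pop();
    }

    /** Returns the namespace URI for a tag name such as "circle" or "svg:circle", or null. */
    String resolve(String qualifiedName) {
        int colon = qualifiedName.indexOf(':');
        String prefix = colon < 0 ? "" : qualifiedName.substring(0, colon);
        // ArrayDeque iterates from the most recently pushed scope outwards.
        for (Map<String, String> scope : scopes) {
            String uri = scope.get(prefix);
            if (uri != null) {
                return uri;
            }
        }
        return null;
    }
}
```

Both examples at the top of this ticket resolve to the SVG namespace under this scheme: the first through the default xmlns on the svg element, the second through the svg prefix declared on html.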
Thanks so much Scott! You're awesome :)
I'll try this first thing tomorrow morning.
ok I've tested it:
(the CDATA is not removed)
BTW your HC 2.5 tag is wrong in SVN (see https://sourceforge.net/p/htmlcleaner/code/HEAD/tree/tags/htmlcleaner-2.5/). It contains code after 2.5, check the pom.xml for ex which points to 2.7-SNAPSHOT (the sources also don't match 2.5).
I've tried your patch on HC 2.7-SNAPSHOT and it applied cleanly. It works fine for an example with a HEAD>TITLE element (except for the attribute position swap but that's not critical at the moment). However a full HTML with TITLE fails: https://www.evernote.com/shard/s119/sh/31af349e-e736-40ed-a681-c4b7e0ecf6be/faad474d29290c82a5bb4c382647678d
Thanks
Last edit: Vincent Massol 2013-12-04
Thanks Vincent - can you paste the input you used for that test and I'll see if I can figure out what the problem is.
I've used the input:
Thanks
OK, here's the output:
Apart from a few whitespace differences, that looks as I'd expect.
This is the testcase code:
ok my bad, I made a mistake in my test. I've created a new test input:
And this shows that several issues remain:
* The Title is removed inside the SVG element.
* HTML attributes are stripped.
* Several extra newlines.
Screenshot:
https://www.evernote.com/shard/s119/sh/7876478a-42d2-421d-9780-b53ea7c88660/9a60e113784cc72a16f560ef2d5a0f84
FWIW my example was taken from https://developer.mozilla.org/en-US/docs/Web/SVG/Element/title
We're getting close :)
Thanks
Last edit: Vincent Massol 2013-12-05
Yay!
I'll check in the changes so far; I've reflected on the change in behaviour and I think it's consistent with what you would expect the combination of "OmitUnknown" and "NamespaceAware" to do. I'll add some extra documentation about this on the website with the next release.
I think the Title issue is fairly easy to fix now that namespace awareness is woven deeper into the cleaning algorithm - I'll take a look at it tonight.
The HTML attributes issue needs its own bug really.
The newlines are kind of irritating but are added in a way that doesn't affect the meaning of the document. Maybe again worth adding a new issue to track this, as it's not as critical as the others.
Regarding the HTML attributes, I've opened https://sourceforge.net/p/htmlcleaner/bugs/100/ already.
I've just created https://sourceforge.net/p/htmlcleaner/bugs/101/ for the extra new lines as I agree it's less critical (but very irritating indeed).
Thanks!
I've committed the fix for improved namespace awareness, and a fix for the tag name clashes over "title", plus test cases.
Thanks Scott. I've retested it and it works. I've also created https://sourceforge.net/p/htmlcleaner/bugs/103/ since that's also a problem.
Was the tag name case change fixed as well? How about the svg:style rules being moved into the html:head? In the snapshot that Vincent built for XWiki both were still happening.
I've attached a real document where all the issues are occurring.
Neko HTML has three options for tag names: lowercase, uppercase, and match, where match means that the name is kept as it is in the opening tag, but the closing tag is always updated to match the opening tag exactly.
Another option would be to only lowercase HTML tags and leave tags from a different namespace as is, possibly with a "match" logic to fix mismatching tag pairs.
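The lowercase / uppercase / match behaviour described above could look roughly like the following. The enum and method names are hypothetical and are not taken from Neko HTML or HtmlCleaner.

```java
import java.util.Locale;

/** Hypothetical sketch of a configurable tag-name case policy. */
class TagNameCase {
    enum Policy { LOWERCASE, UPPERCASE, MATCH }

    /** Normalizes the opening tag name according to the policy. */
    static String apply(Policy policy, String openName) {
        switch (policy) {
            case LOWERCASE:
                return openName.toLowerCase(Locale.ROOT);
            case UPPERCASE:
                return openName.toUpperCase(Locale.ROOT);
            default:
                return openName; // MATCH: keep the opening tag as written
        }
    }

    /**
     * Returns {openName, closeName}. The closing name as written in the source is
     * ignored on purpose: it is always rewritten to equal the opening name.
     */
    static String[] normalizePair(Policy policy, String openName, String closeNameAsWritten) {
        String name = apply(policy, openName);
        return new String[] { name, name };
    }

    public static void main(String[] args) {
        String[] pair = normalizePair(Policy.MATCH, "linearGradient", "LINEARGRADIENT");
        System.out.println("<" + pair[0] + "></" + pair[1] + ">");
    }
}
```

Applying LOWERCASE only to tags in the XHTML namespace and MATCH to everything else would give the second option.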
Yet another option would be to add real support for SVG and MathML, since these two are tightly bound to HTML5.
As for the svg:style changes, it is important to keep them in place since: