HtmlCleaner / Bugs / #99 SVG elements in HTML are incorrectly modified

Description has changed:

Diff:

--- old
+++ new
@@ -1,4 +1,4 @@
-* SVG <title> elements are dropped since HTMLCleaner only allows one <title> element, and only in the <head>, even though these are another type of title elements from a different namespace; as a result, the text inside the title is left inline, without the surrounding tag
+* SVG <title> elements are dropped since HTMLCleaner only allows one &lt;title&gt; element, and only in the &lt;head&gt;, even though these are another type of title elements from a different namespace; as a result, the text inside the title is left inline, without the surrounding tag
 * SVG has case sensitive tag names, and it does contain a few camelCase tags, but HTMLCleaner considers that HTML is case insensitive, so it automatically transforms all tag names to lowercase, breaking SVGs
 * Embedded SVG styles are moved in the HTML head section, which breaks the styling of the images.

Description has changed:

Diff:

--- old
+++ new
@@ -1,4 +1,4 @@
-* SVG <title> elements are dropped since HTMLCleaner only allows one &lt;title&gt; element, and only in the &lt;head&gt;, even though these are another type of title elements from a different namespace; as a result, the text inside the title is left inline, without the surrounding tag
+* SVG &lt;title&gt; elements are dropped since HTMLCleaner only allows one &lt;title&gt; element, and only in the &lt;head&gt;, even though these are another type of title elements from a different namespace; as a result, the text inside the title is left inline, without the surrounding tag
 * SVG has case sensitive tag names, and it does contain a few camelCase tags, but HTMLCleaner considers that HTML is case insensitive, so it automatically transforms all tag names to lowercase, breaking SVGs
 * Embedded SVG styles are moved in the HTML head section, which breaks the styling of the images.

Scott Wilson - 2013-12-02

Thanks Vincent.

I've had a go with the two examples, and the output looks OK - the SVG tags are retained, as are the namespace declarations and prefixes. Is there a good example demonstrating all these problems?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-12-02

OK, I had a look at issue 1 - the title tag. This is actually quite tricky to deal with, as at the time the HTML is cleaned, we don't know the actual namespace of a token when we're trying to identify it (which is bad news for handling conflicting tag names).

A workaround for this specific issue is to make title non-unique in DefaultTagProvider, though obviously that has side effects in that it will allow multiple title tags to occur.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Thanks for looking so quickly into this Scott!

FTR I've created this unit test in XWiki land:

    @Test
    public void cleanSVGTags() throws Exception
    {
        String input = "<p>before</p>\n"
            + "<svg xmlns=\"http://www.w3.org/2000/svg\" version=\"1.1\">\n"
            + "<circle cx=\"100\" cy=\"50\" r=\"40\" stroke=\"black\" stroke-width=\"2\" fill=\"red\"/>\n"
            + "</svg>\n"
            + "<p>after</p>\n";
        assertHTML("xxx", HEADER_FULL + input + FOOTER);
    }

And when I execute it, I get the following output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html><head></head><body>
<p>before</p>

<p>after</p>

</body></html>

So the SVG tag is completely stripped.

Scott Wilson - 2013-12-02

Do you have setOmitUnknownTags = true ?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2013-12-02

Indeed I have setOmitUnkownTags to true because we want to remove unknown tags that don't generate valid XHTML.

However in this case, the svg is not an unknown tag because it has a namespace. For me "OmitUnknownTags" is for unknown HTML tags, i.e. from the default XHTML namespace. While the tags from other namespaces should be preserved.

i.e. HTMLCleaner should clean only tags from the default (XHTML) namespace

Last edit: Vincent Massol 2013-12-02

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-12-02

OK, next up - case sensitive names.

At present TagNode sets these to lowercase.

However, we can retain the original name, and then leave it up to the serializer whether to use the lowercase name or the original case.

For example, XmlSerializer can look to see if the tagNode is recognized in DefaultTagProvider, and if it isn't, serialize the original name instead of the default lowercase one.

Of course this then opens up the possibility of error if the open and close tags no longer match :(

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2013-12-02

Now even with setOmitUnkownTags to true I still see some problems in addition to the ones mentioned above:
Several extra new lines
The "fill" parameter's position is changed.

See https://www.evernote.com/shard/s119/sh/71fb30f1-eb5c-4d19-af6e-fb30e4a837e2/6ba4418409fe0a4969f7cd60186cf16d

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-12-02

I think the issue here is that the order of processing doesn't seem to support this kind of case very well. The tree is constructed and tokens cleaned up, and only then do we have a model for checking which namespace a token is in. However, by that point we've already run many of the cleaning rules for which we want to make an exception for when we have foreign markup.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-12-02

"the order of attribute specifications in a start-tag or empty-element tag is not significant." - http://www.w3.org/TR/REC-xml/#sec-starttags

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-12-02

This is a quick fix that would work for one, but not both, of the examples:

// unknown HTML tag and unknown tags are not allowed // however if we have set namespace-aware and the tag contains a ":" we allow it } else if ( tag == null && properties.isOmitUnknownTags()) { if (!(startTagToken.getName().contains(":") && properties.isNamespacesAware()) ){ nodeIterator.set(null); properties.fireUglyHtml(true, startTagToken, ErrorType.Unknown); }

Its difficult to see how we can check for namespace inheritance because at this point we haven't actually built the tree, so we don't know the ancestors of the current tag.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-12-02

I think I may have actually cracked it... I had a go at modifying how the internal tree is built taking account of xmlns attributes and ns prefixes, and I think I've got something that will work for the main issue of having HC behave in a way that is namespace-aware, while at the same time pruning unknown HTML tags.

To be honest I'm not 100% on putting this into a release as its a behaviour change and I need to make sure other users are happy with this.

However, feel free to try the attached patch and see if it does what you expect.

ns_aware.txt

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2013-12-02

Thanks so much Scott! You're awesome :)

I'll try this first thing tomorrow morning.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2013-12-04

ok I've tested it:

I've tried applying the patch to HC 2.5 but it didn't apply cleanly. XWiki is still on HC 2.5 since all versions thereafter don't work for us. Not by much but the following test fails:

assertHTML("<p/>", "<p><![CDATA[&]]></p>")

(the CDATA is not removed)

BTW your HC 2.5 tag is wrong in SVN (see https://sourceforge.net/p/htmlcleaner/code/HEAD/tree/tags/htmlcleaner-2.5/). It contains code after 2.5, check the pom.xml for ex which points to 2.7-SNAPSHOT (the sources also don't match 2.5).

I've tried your patch on HC 2.7-SNAPSHOT and it applied cleanly. It works fine for an example with a HEAD>TITLE element (except for the attribute position swap but that's not critical at the moment). However a full HTML with TITLE fails: https://www.evernote.com/shard/s119/sh/31af349e-e736-40ed-a681-c4b7e0ecf6be/faad474d29290c82a5bb4c382647678d

Thanks

Last edit: Vincent Massol 2013-12-04
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-12-04

Thanks Vincent - can you paste the input you used for that test and I'll see if I can figure out what the problem is.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

I've used the input:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <title>Test SVG cleaning</title>
  </head>
  <body>
    <p>before</p>
    <svg xmlns="http://www.w3.org/2000/svg" version="1.1">
      <circle cx="100" cy="50" r="40" stroke="black" stroke-width="2" fill="red"/>
    </svg>
    <p>after</p>
  </body>
</html>

Thanks

OK, here's the output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
    <title>Test SVG cleaning</title>

</head>
<body>

    <p>before</p>
    <svg xmlns="http://www.w3.org/2000/svg" version="1.1">
      <circle cx="100" cy="50" r="40" stroke="black" stroke-width="2" fill="red" />
    </svg>
    <p>after</p>

</body></html>

Apart from a few whitespace differences, that looks as I'd expect.

This is the testcase code:

@Test
public void preserveSVGtagsWithTitle() throws IOException{

    cleaner.getProperties().setOmitXmlDeclaration(false);
    cleaner.getProperties().setOmitDoctypeDeclaration(false);
    cleaner.getProperties().setNamespacesAware(true);

    String initial = readFile("src/test/resources/test21.html");
    String expected = readFile("src/test/resources/test21.html");

    assertCleaned(initial,expected);
}

ok my bad, I made a mistake in my test. I've created a new test input:

    @Test
    public void cleanTitleWithNamespace() throws Exception
    {
        // Test with TITLE in HEAD
        String input = "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\" xml:lang=\"en\">\n"
            + "  <head>\n"
            + "    <title>Title test</title>\n"
            + "  </head>\n"
            + "  <body>\n"
            + "    <p>before</p>\n"
            + "    <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"300\" width=\"500\">\n"
            + "      <g>\n"
            + "        <title>SVG Title Demo example</title>\n"
            + "        <rect height=\"50\" style=\"fill:none; stroke:blue; stroke-width:1px\" width=\"200\" x=\"10\" "
            + "y=\"10\"></rect>\n"
            + "      </g>\n"
            + "    </svg>\n"
            + "    <p>after</p>\n";
        Assert.assertEquals(HEADER + input + FOOTER, HTMLUtils.toString(this.cleaner.clean(new StringReader(input))));
    }

And this shows several issues remains:
the Title is removed inside the SVG element.
HTML attributes are stripped
* Several extra newlines

Screenshot:
https://www.evernote.com/shard/s119/sh/7876478a-42d2-421d-9780-b53ea7c88660/9a60e113784cc72a16f560ef2d5a0f84

FWIW my example was taken from https://developer.mozilla.org/en-US/docs/Web/SVG/Element/title

We're getting close :)

Thanks

Last edit: Vincent Massol 2013-12-05

Scott Wilson - 2013-12-05

Yay!

I'll check in the changes so far; I've reflected on the change in behaviour and I think its consistent with what you would expect the combination of "OmitUnknown" and "NamespaceAware" to do; I'll add some extra documentation about this on the website with the next release.

I think the Title issue is fairly easy to fix now that namespace awareness is woven deeper into the cleaning algorithm - I'll take a look at it tonight.

The HTML attributes issue needs its own bug really.

The newlines are kind of irritating but are added in a way that doesn't affect the meaning of the document. Maybe again worth adding a new issue to track this as its not as critical as the others.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2013-12-05

Regarding the HTML attribues, I've opened https://sourceforge.net/p/htmlcleaner/bugs/100/ already.

I've just created https://sourceforge.net/p/htmlcleaner/bugs/101/ for the extra new lines as I agree it's less critical (but very irritating indeed).

Thanks!

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-12-06

status: open --> closed-fixed

assigned_to: Scott Wilson

Group: v 2.5 --> v 2.7
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2013-12-06

I've committed the fix for improved namespace awareness, and a fix for the tag name clashes over "title", plus test cases.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Vincent Massol - 2013-12-06

Thanks Scott. I've retested it and it works. I've also created https://sourceforge.net/p/htmlcleaner/bugs/103/ since that's also a problem.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sergiu Dumitriu - 2013-12-09

Was the tag name case change fixed as well? How about the svg:style rules being moved in the html:head? In the snapshot that Vincent built for XWiki both were still happening.

I've attached a real document where all the issues are occurring.

Neko HTML has three options for tag names: lowecase, uppercase, match, where match means that the name is kept as it is in the opening tag, but the closing tag is always updated to match the opening tag exactly.

Another option would be to only lowercase HTML tags and leave tags from a different namespace as is, possibly with a "match" logic to fix mismatching tag pairs.

Yet another option would be to add real support for SVG and MathML, since these two are tightly bound to HTML5.

As for the svg:style changes, it is important to keep them in place since:

These styles should apply to nodes in the svg namespace, yet moving them into the html namespace means they won't function correctly without adding a specific namespace selector.

In XWiki we're passing the cleaned HTML to an XSLT processor, then to Apache FOP, which further passes each embedded SVG images to Apache Batik, so it's important that the SVGs are fully self-contained, without any dependencies on the outer HTML, including styles.

p0000001.html
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

SVG elements in HTML are incorrectly modified

Group

Searches

Help

#99 SVG elements in HTML are incorrectly modified

Discussion