Menu

#99 SVG elements in HTML are incorrectly modified

v 2.7
closed-fixed
None
5
2017-09-26
2013-12-02
No
  • SVG <title> elements are dropped since HTMLCleaner only allows one <title> element, and only in the <head>, even though these are another type of title elements from a different namespace; as a result, the text inside the title is left inline, without the surrounding tag
  • SVG has case sensitive tag names, and it does contain a few camelCase tags, but HTMLCleaner considers that HTML is case insensitive, so it automatically transforms all tag names to lowercase, breaking SVGs
  • Embedded SVG styles are moved in the HTML head section, which breaks the styling of the images.

Examples:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <title>Test SVG cleaning</title>
  </head>
  <body>
    <p>before</p>
    <svg xmlns="http://www.w3.org/2000/svg" version="1.1">
      <circle cx="100" cy="50" r="40" stroke="black" stroke-width="2" fill="red"/>
    </svg>
    <p>after</p>
  </body>
</html>

or

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"
    xmlns:svg="http://www.w3.org/2000/svg">
  <head>
    <title>Test SVG cleaning</title>
  </head>
  <body>
    <p>before</p>
    <svg:svg version="1.1">
      <svg:circle cx="100" cy="50" r="40" stroke="black" stroke-width="2" fill="red"/>
    </svg:svg>
    <p>after</p>
  </body>
</html>

The full details are available at http://jira.xwiki.org/browse/XWIKI-9753

Thanks! Note that this is actually preventing us from releasing XWiki 5.3 ATM so if you know of a workaround that would be awesome! :)

Thanks a lot

Discussion

1 2 > >> (Page 1 of 2)
  • Scott Wilson

    Scott Wilson - 2013-12-02
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,4 +1,4 @@
    -* SVG <title> elements are dropped since HTMLCleaner only allows one <title> element, and only in the <head>, even though these are another type of title elements from a different namespace; as a result, the text inside the title is left inline, without the surrounding tag
    +* SVG <title> elements are dropped since HTMLCleaner only allows one &lt;title&gt; element, and only in the &lt;head&gt;, even though these are another type of title elements from a different namespace; as a result, the text inside the title is left inline, without the surrounding tag
     * SVG has case sensitive tag names, and it does contain a few camelCase tags, but HTMLCleaner considers that HTML is case insensitive, so it automatically transforms all tag names to lowercase, breaking SVGs
     * Embedded SVG styles are moved in the HTML head section, which breaks the styling of the images.
    
     
  • Scott Wilson

    Scott Wilson - 2013-12-02
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,4 +1,4 @@
    -* SVG <title> elements are dropped since HTMLCleaner only allows one &lt;title&gt; element, and only in the &lt;head&gt;, even though these are another type of title elements from a different namespace; as a result, the text inside the title is left inline, without the surrounding tag
    +* SVG &lt;title&gt; elements are dropped since HTMLCleaner only allows one &lt;title&gt; element, and only in the &lt;head&gt;, even though these are another type of title elements from a different namespace; as a result, the text inside the title is left inline, without the surrounding tag
     * SVG has case sensitive tag names, and it does contain a few camelCase tags, but HTMLCleaner considers that HTML is case insensitive, so it automatically transforms all tag names to lowercase, breaking SVGs
     * Embedded SVG styles are moved in the HTML head section, which breaks the styling of the images.
    
     
  • Scott Wilson

    Scott Wilson - 2013-12-02

    Thanks Vincent.

    I've had a go with the two examples, and the output looks OK - the SVG tags are retained, as are the namespace declarations and prefixes. Is there a good example demonstrating all these problems?

     
  • Scott Wilson

    Scott Wilson - 2013-12-02

    OK, I had a look at issue 1 - the title tag. This is actually quite tricky to deal with, as at the time the HTML is cleaned, we don't know the actual namespace of a token when we're trying to identify it (which is bad news for handling conflicting tag names).

    A workaround for this specific issue is to make title non-unique in DefaultTagProvider, though obviously that has side effects in that it will allow multiple title tags to occur.

     
  • Vincent Massol

    Vincent Massol - 2013-12-02

    Thanks for looking so quickly into this Scott!

    FTR I've created this unit test in XWiki land:

        @Test
        public void cleanSVGTags() throws Exception
        {
            String input = "<p>before</p>\n"
                + "<svg xmlns=\"http://www.w3.org/2000/svg\" version=\"1.1\">\n"
                + "<circle cx=\"100\" cy=\"50\" r=\"40\" stroke=\"black\" stroke-width=\"2\" fill=\"red\"/>\n"
                + "</svg>\n"
                + "<p>after</p>\n";
            assertHTML("xxx", HEADER_FULL + input + FOOTER);
        }
    

    And when I execute it, I get the following output:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html><head></head><body>
    <p>before</p>
    
    <p>after</p>
    
    </body></html>
    

    So the SVG tag is completely stripped.

     
  • Scott Wilson

    Scott Wilson - 2013-12-02

    Do you have setOmitUnknownTags = true ?

     
  • Vincent Massol

    Vincent Massol - 2013-12-02

    Indeed I have setOmitUnkownTags to true because we want to remove unknown tags that don't generate valid XHTML.

    However in this case, the svg is not an unknown tag because it has a namespace. For me "OmitUnknownTags" is for unknown HTML tags, i.e. from the default XHTML namespace. While the tags from other namespaces should be preserved.

    i.e. HTMLCleaner should clean only tags from the default (XHTML) namespace

     

    Last edit: Vincent Massol 2013-12-02
  • Scott Wilson

    Scott Wilson - 2013-12-02

    OK, next up - case sensitive names.

    At present TagNode sets these to lowercase.

    However, we can retain the original name, and then leave it up to the serializer whether to use the lowercase name or the original case.

    For example, XmlSerializer can look to see if the tagNode is recognized in DefaultTagProvider, and if it isn't, serialize the original name instead of the default lowercase one.

    Of course this then opens up the possibility of error if the open and close tags no longer match :(

     
  • Scott Wilson

    Scott Wilson - 2013-12-02

    I think the issue here is that the order of processing doesn't seem to support this kind of case very well. The tree is constructed and tokens cleaned up, and only then do we have a model for checking which namespace a token is in. However, by that point we've already run many of the cleaning rules for which we want to make an exception for when we have foreign markup.

     
  • Scott Wilson

    Scott Wilson - 2013-12-02

    "the order of attribute specifications in a start-tag or empty-element tag is not significant." - http://www.w3.org/TR/REC-xml/#sec-starttags

     
  • Scott Wilson

    Scott Wilson - 2013-12-02

    This is a quick fix that would work for one, but not both, of the examples:

                // unknown HTML tag and unknown tags are not allowed
                // however if we have set namespace-aware and the tag contains a ":" we allow it
                } else if ( tag == null && properties.isOmitUnknownTags()) {
                    if (!(startTagToken.getName().contains(":") && properties.isNamespacesAware()) ){
                     nodeIterator.set(null);
                     properties.fireUglyHtml(true, startTagToken, ErrorType.Unknown);
                    }
    

    Its difficult to see how we can check for namespace inheritance because at this point we haven't actually built the tree, so we don't know the ancestors of the current tag.

     
  • Scott Wilson

    Scott Wilson - 2013-12-02

    I think I may have actually cracked it... I had a go at modifying how the internal tree is built taking account of xmlns attributes and ns prefixes, and I think I've got something that will work for the main issue of having HC behave in a way that is namespace-aware, while at the same time pruning unknown HTML tags.

    To be honest I'm not 100% on putting this into a release as its a behaviour change and I need to make sure other users are happy with this.

    However, feel free to try the attached patch and see if it does what you expect.

     
  • Vincent Massol

    Vincent Massol - 2013-12-02

    Thanks so much Scott! You're awesome :)

    I'll try this first thing tomorrow morning.

     
  • Vincent Massol

    Vincent Massol - 2013-12-04

    ok I've tested it:

    • I've tried applying the patch to HC 2.5 but it didn't apply cleanly. XWiki is still on HC 2.5 since all versions thereafter don't work for us. Not by much but the following test fails:
     assertHTML("<p/>", "<p><![CDATA[&]]></p>")
    

    (the CDATA is not removed)

    Thanks

     

    Last edit: Vincent Massol 2013-12-04
  • Scott Wilson

    Scott Wilson - 2013-12-04

    Thanks Vincent - can you paste the input you used for that test and I'll see if I can figure out what the problem is.

     
  • Vincent Massol

    Vincent Massol - 2013-12-04

    I've used the input:

    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
      <head>
        <title>Test SVG cleaning</title>
      </head>
      <body>
        <p>before</p>
        <svg xmlns="http://www.w3.org/2000/svg" version="1.1">
          <circle cx="100" cy="50" r="40" stroke="black" stroke-width="2" fill="red"/>
        </svg>
        <p>after</p>
      </body>
    </html>
    

    Thanks

     
  • Scott Wilson

    Scott Wilson - 2013-12-04

    OK, here's the output:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
        <title>Test SVG cleaning</title>
    
    </head>
    <body>
    
        <p>before</p>
        <svg xmlns="http://www.w3.org/2000/svg" version="1.1">
          <circle cx="100" cy="50" r="40" stroke="black" stroke-width="2" fill="red" />
        </svg>
        <p>after</p>
    
    </body></html>
    

    Apart from a few whitespace differences, that looks as I'd expect.

    This is the testcase code:

    @Test
    public void preserveSVGtagsWithTitle() throws IOException{
    
        cleaner.getProperties().setOmitXmlDeclaration(false);
        cleaner.getProperties().setOmitDoctypeDeclaration(false);
        cleaner.getProperties().setNamespacesAware(true);
    
        String initial = readFile("src/test/resources/test21.html");
        String expected = readFile("src/test/resources/test21.html");
    
        assertCleaned(initial,expected);
    }
    
     
  • Vincent Massol

    Vincent Massol - 2013-12-05

    ok my bad, I made a mistake in my test. I've created a new test input:

        @Test
        public void cleanTitleWithNamespace() throws Exception
        {
            // Test with TITLE in HEAD
            String input = "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\" xml:lang=\"en\">\n"
                + "  <head>\n"
                + "    <title>Title test</title>\n"
                + "  </head>\n"
                + "  <body>\n"
                + "    <p>before</p>\n"
                + "    <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"300\" width=\"500\">\n"
                + "      <g>\n"
                + "        <title>SVG Title Demo example</title>\n"
                + "        <rect height=\"50\" style=\"fill:none; stroke:blue; stroke-width:1px\" width=\"200\" x=\"10\" "
                + "y=\"10\"></rect>\n"
                + "      </g>\n"
                + "    </svg>\n"
                + "    <p>after</p>\n";
            Assert.assertEquals(HEADER + input + FOOTER, HTMLUtils.toString(this.cleaner.clean(new StringReader(input))));
        }
    

    And this shows several issues remains:
    the Title is removed inside the SVG element.
    HTML attributes are stripped
    * Several extra newlines

    Screenshot:
    https://www.evernote.com/shard/s119/sh/7876478a-42d2-421d-9780-b53ea7c88660/9a60e113784cc72a16f560ef2d5a0f84

    FWIW my example was taken from https://developer.mozilla.org/en-US/docs/Web/SVG/Element/title

    We're getting close :)

    Thanks

     

    Last edit: Vincent Massol 2013-12-05
  • Scott Wilson

    Scott Wilson - 2013-12-05

    Yay!

    I'll check in the changes so far; I've reflected on the change in behaviour and I think its consistent with what you would expect the combination of "OmitUnknown" and "NamespaceAware" to do; I'll add some extra documentation about this on the website with the next release.

    I think the Title issue is fairly easy to fix now that namespace awareness is woven deeper into the cleaning algorithm - I'll take a look at it tonight.

    The HTML attributes issue needs its own bug really.

    The newlines are kind of irritating but are added in a way that doesn't affect the meaning of the document. Maybe again worth adding a new issue to track this as its not as critical as the others.

     
  • Scott Wilson

    Scott Wilson - 2013-12-06
    • status: open --> closed-fixed
    • assigned_to: Scott Wilson
    • Group: v 2.5 --> v 2.7
     
  • Scott Wilson

    Scott Wilson - 2013-12-06

    I've committed the fix for improved namespace awareness, and a fix for the tag name clashes over "title", plus test cases.

     
  • Vincent Massol

    Vincent Massol - 2013-12-06

    Thanks Scott. I've retested it and it works. I've also created https://sourceforge.net/p/htmlcleaner/bugs/103/ since that's also a problem.

     
  • Sergiu Dumitriu

    Sergiu Dumitriu - 2013-12-09

    Was the tag name case change fixed as well? How about the svg:style rules being moved in the html:head? In the snapshot that Vincent built for XWiki both were still happening.

    I've attached a real document where all the issues are occurring.

    Neko HTML has three options for tag names: lowecase, uppercase, match, where match means that the name is kept as it is in the opening tag, but the closing tag is always updated to match the opening tag exactly.

    Another option would be to only lowercase HTML tags and leave tags from a different namespace as is, possibly with a "match" logic to fix mismatching tag pairs.

    Yet another option would be to add real support for SVG and MathML, since these two are tightly bound to HTML5.

    As for the svg:style changes, it is important to keep them in place since:

    1. These styles should apply to nodes in the svg namespace, yet moving them into the html namespace means they won't function correctly without adding a specific namespace selector.
    2. In XWiki we're passing the cleaned HTML to an XSLT processor, then to Apache FOP, which further passes each embedded SVG images to Apache Batik, so it's important that the SVGs are fully self-contained, without any dependencies on the outer HTML, including styles.
     
1 2 > >> (Page 1 of 2)

Log in to post a comment.