Menu

#219 H3 closing tag incorrectly placed

v2.24
closed-fixed
nobody
None
5
2020-07-06
2020-02-17
No

If we run the the following HTML through HTMLCleaner:

<a href="https://www.moneysavingexpert.com/car-insurance/"><br>
<h3>Compare cheap car insurance quotes online - MSE</h3>
<div>
<cite>www.moneysavingexpert.com › car-insurance</cite>
</div>
</a>

We get the following:

<a href="https://www.moneysavingexpert.com/car-insurance/">
    <h3>
        Compare cheap car insurance quotes online - MSE
        <div>
            <cite>www.moneysavingexpert.com › car-insurance</cite>
        </div>
    </h3>
    <br/>
</a>

Note how is incorreclty placed after

1 Attachments

Discussion

  • Scott Wilson

    Scott Wilson - 2020-02-17

    Thanks Daniel, I'll check it out.

     
  • Scott Wilson

    Scott Wilson - 2020-02-17

    OK, the culprit is the xmlns='foo' attribute in the root element. This basically switches the parser from reading HTML elements to reading elements as being 'foreign markup'. So because of this it doesn't apply the rules that put the H3 and DIV in the correct nesting when it moves out the unclosed BR tag (again, because it thinks this is now a foo:BR tag it doesn't know its a HTML BR tag so doesn't auto-close it). If you remove the xmlns attribute, or set it to the XHTML namespace such as 'http://www.w3.org/1999/xhtml', then it all behaves properly.

    Does this problem occur in any 'real' HTML or is it just a test case?

     
  • Daniel Gonzalez

    Daniel Gonzalez - 2020-02-17

    Hi Scott,

    Thanks for your response.
    Yes this problem occurred in real HTML.

    The following HTML will reproduce it:

    <html>
    <head></head>
    <body>
    <svg></svg>
    <a href="https://www.moneysavingexpert.com/car-insurance/" "><br>
    <h3">Compare cheap car insurance quotes online - MSE</h3>
    <div"><cite">www.moneysavingexpert.com › car-insurance</cite></div>
    </a>
    </body>
    </html>
    

    Seems that the tags cause the issue as without it the H3 end tag is in the correct place.

    Thanks

    Danny

     
  • Daniel Gonzalez

    Daniel Gonzalez - 2020-02-17

    Hi Scott,
    Seems that sourceforge is removing some of my comments if I refer to html tags.
    What I wanted to say is that I believe it is the svg tags that is causing the issue in the above example.

    Thanks

    Danny

     
  • Scott Wilson

    Scott Wilson - 2020-02-17

    Got it. I think the problem here is specifically the lenient processing of SVG tags, where we use an 'implied namespace' to include them in HTML. Looks like this namespace doesn't get 'turned off' when the SVG element is closed.

    Adding a check to pop the 'implied namespace' to htmlcleaner.makeTree seems to fix it:

                            //
                            // Get the open start tag. If it contained an xmlns, then we remove it from the current namespace stack
                            //
                            if (closed.size()>0){
                              TagNode startingTag = (TagNode) closed.get(0);
                              if (startingTag.hasAttribute("xmlns")){
                                  cleanTimeValues.namespace.pop();
                              }
                              //
                              // ***NEW CODE HERE***
                              // If we used an 'assumed namespace' for this tag, remove it from the NS stack
                              //
                              TagInfo startTagInfo = getTagInfoProvider().getTagInfo(startingTag.getName());
                              if (startTagInfo != null 
                                      && startTagInfo.getAssumedNamespace() != null 
                                      && !cleanTimeValues.namespace.isEmpty()
                                      && startTagInfo.getAssumedNamespace().equals(cleanTimeValues.namespace.lastElement()) 
                                      && !startingTag.hasAttribute("xmlns")){
                                  cleanTimeValues.namespace.pop();
                              }
                            }
    

    I'll need to run a few more tests to make sure this is 'safe', but it looks like it may do the trick.

     
  • Scott Wilson

    Scott Wilson - 2020-04-07

    Committed - this will be in the next release.

     
  • Scott Wilson

    Scott Wilson - 2020-04-07
    • status: open --> closed-fixed
     
  • Scott Wilson

    Scott Wilson - 2020-04-07
    • Group: v2.23 --> v2.24
     
  • Remi Rosenthal

    Remi Rosenthal - 2020-07-06

    Hi Scott,

    I have been working with Danny on this bug internally.
    After updating to v2.24, we still seem to get the same incorrect output for the Java file HTMLCleanerBug.java attached to the original post. Could you please take a look?

    Remi

     
    • Scott Wilson

      Scott Wilson - 2020-07-06

      Hi Remi,

      Remove the "xmlns='foo'" attribute, as this effectively tells HC "this is not HTML".

       
      • Remi Rosenthal

        Remi Rosenthal - 2020-07-06

        Hi Scott,

        Thanks for clarifying - everything seems to work alright now.

        Remi

         
        • Scott Wilson

          Scott Wilson - 2020-07-06

          That's great! I'm glad I could help.

           
    • Scott Wilson

      Scott Wilson - 2020-07-06

      There is a stray escaped quote in the H3 tag name - I'm assuming this is what you are testing for. HC will pick this up and clean it if it thinks this is a HTML tag and not 'foreign markup'.

       

Log in to post a comment.