Menu

#175 HTMLCleaner genereates invalid attribute names from bad HTML

v2.20
closed-fixed
nobody
None
5
2017-04-26
2016-05-06
No

When using setNamespacesAware = false,
the following HTML will be not cleaned properly, causing a DOMSerializer to fail processing the cleaned TagNode.

Input:

<div class="fisterpo" style="style=" margin:="" 50px;="" margin-left:30px;"="">

Output (using PrettyXMLSerializer):

<div class="fisterpo" style="style=" margin="" px="px" 30px="30px">

which contains a attribute beginning with a digit, which is not valid XML Syntax.

Discussion

1 2 > >> (Page 1 of 2)
  • Scott Wilson

    Scott Wilson - 2016-08-17
    • status: open --> open-accepted
     
  • Scott Wilson

    Scott Wilson - 2016-08-17

    Thanks Philipp, I'll see if I can fix this for the next release. I think the only way to handle this is to arbitrarily prefix the attributes when output using the XML Serializer, e.g.

    <div class="fisterpo" style="style=" margin="" px="px" hcgenerated30px="30px">
    
     
  • Scott Wilson

    Scott Wilson - 2016-10-19
    • Group: v2.17 --> v2.18
     
  • Scott Wilson

    Scott Wilson - 2017-02-06
    • Group: v2.18 --> v2.19
     
  • Scott Wilson

    Scott Wilson - 2017-02-07
    • Group: v2.19 --> v2.20
     
  • legrass

    legrass - 2017-04-24

    I have another example taken from a real website. The (admittedly terrible) html is

    <div #000000;="" 1px="" border-top:solid="" clear:both;="" 18px;="" line-height:="" 15px;text-align:center;="" padding-top:="" margin-top:25px;="" margin-right:="" 20px;="" margin-left:="" 10px;="" style="font-size:">

    and inspecting the dom in firefox, the attributes are:

    #000000;="", 1px="", border-top:solid="", clear:both;="", 18px;="", line-height:="", 15px;text-align:center;="", padding-top:="", margin-top:25px;="", margin-right:="", 20px;="", margin-left:="", 10px;="", style="font-size:"

    while HC reports:

    style=font-size:, px=px, margin-left=margin-left, margin-right=margin-right, 25px=25px, padding-top=padding-top, center=center, line-height=line-height, both=both, solid=solid

     
  • Scott Wilson

    Scott Wilson - 2017-04-24

    Yes, the HTML5 attribute rules are less strict than those for XML, so we should allow them:

    "Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), U+003E GREATER-THAN SIGN (>), U+002F SOLIDUS (/), and U+003D EQUALS SIGN (=) characters, the control characters, and any characters that are not defined by Unicode. In the HTML syntax, attribute names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that are an ASCII case-insensitive match for the attribute's name."

     
  • Scott Wilson

    Scott Wilson - 2017-04-24

    In XML, the limits are different:

    [4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
    [4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
    [5] Name ::= NameStartChar (NameChar)*

     
  • Scott Wilson

    Scott Wilson - 2017-04-24

    OK, so I have the following test case set:

    For the input:

    <p 1="1" x='y'>text</p>
    

    The HTML serializer output is:

    <p 1="1" x="y">text</p>
    

    The XML, Dom and JDom output is:

    <p hc-generated-1="1" x="y">text</p>
    

    Look sensible?

    I now need to make sure we haven't gone too far in the other direction and allow invalid HTML attribute names.

     
  • legrass

    legrass - 2017-04-24

    looks fine, probably useful to make the prefix 'hc-generated-' configurable as property.

     
  • Scott Wilson

    Scott Wilson - 2017-04-24

    OK, so perhaps a "PrefixInvalidXmlAttributeNames" property with a default of "hc-generated-"

    Where if its set to "", the behaviour is to omit the attribute entirely.

     
  • legrass

    legrass - 2017-04-24

    Probably it needs a boolean property to enable/disable it. I'm using the DOMSerializer setting strictErrorChecking=false to allow invalid names, and it works. So maybe a specific boolean property prefixInvalidAttributeNames and a string for the prefix to use is more flexible becausestrictErrorChecking=false can be still used if wanted.

     
  • legrass

    legrass - 2017-04-24

    and I agrre that prefixInvalidAttributeNames can be true by default

     
  • Scott Wilson

    Scott Wilson - 2017-04-24

    I've checked in some code changes for this with the following behaviour.

    Given the input:

    <div ban;ana="true">
    

    the default output for XML is:

    <div hc-generated-banana="true">
    

    if prefixInvalidAttributeNames is false:

    <div>
    

    if prefixInvalidAttributeNames is true and InvalidXmlAttributeNamePrefix = "":

    <div banana="true">
    
     
  • legrass

    legrass - 2017-04-25

    From your last example, I wonder if you can always clean invalid names like in your example ban;ana to banana ? What about <p 1="1"> ?

    Thinking it through, maybe the most flexible way for a user would be to allow to pass to a serializer constructor a function<string, string=""> that transforms an attribute name into whatever the user wants. By default, if not overridden by the user, this function does what we said above.</string,>

    something like:

    if (!isXMLValid(attrName)) 
        attrName = invalidAttrNameFunction.sanitize(attrName);
    if(!attrName ==null) //null means drop it entirely
        node.setAttribute(attrName, attrValue)
    

    This gives the flexibility to, for instance, drop only some attributes, clean differently others, prefix differently some others and so on. Or just use the default if one doesn't bother.
    I understand that probably is not in line with the current way properties are dealt with, so please feel free to not like it :)

     
  • Scott Wilson

    Scott Wilson - 2017-04-25

    I was wondering too whether the default should be always to try to fix attribute names rather than scrap invalid ones. The question is whether to show that's happened (using the prefix) or do it quietly.

    Note that it would currently try to change "ban;ana" to "banana" for XML not HTML given that its not actually invalid for HTML5 (according to WHATWG). Maybe that should also be configurable - maybe a "strict attribute names" option for HTML.

    With the example you gave, for

    <p 1="1">
    

    The current HTML serialization would be the same, as thats still valid HTML (well, its not in the spec, but its not forbidden). For XML or DOM it has to be:

    <p hc-generated-1="1">
    

    (Or whatever the prefix is.)

    For attribute transformations, there is already the input transform function, so I'd rather not mix that in with serializers.

     
  • legrass

    legrass - 2017-04-25

    I see, seems fine. Only one last thing, I'd still leave the possibility to configure such that nothing is touched. While this will produce invalid attribute names, setting Document.stricterrorchecking=false will enable to have a Document with untouched names. This is my current use case (getting a Document from html5 pages and query via xpath saxon).

     
  • Scott Wilson

    Scott Wilson - 2017-04-25

    OK - so maybe just a FixInvalidAttributeNames property with default of true that applies to all serializers (even HTML), with the false behaviour to allow invalid attributes and the true behaviour to try to coerce attributes into a suitable form, or to drop them if non-fixable, and to use the configured prefix (if set) to indicate where attribute names have been modified. With the caveat that setting it to false will allow for a potentially invalid DOM to be generated.

     

    Last edit: Scott Wilson 2017-04-25
  • legrass

    legrass - 2017-04-25

    I think this is it! It covers all the options. thanks!

     
  • Scott Wilson

    Scott Wilson - 2017-04-25

    Wonderful! Though to be consistent with other properties I'll make it AllowInvalidNames with a default of false :)

     
  • Scott Wilson

    Scott Wilson - 2017-04-25

    OK, we're about there.

    AllowInvalidAttributeNames is false by default.

    For all serializers, this means that attribute names are coerced into valid XML attribute names where possible, or omitted where this is not possible. Coerced attribute names are prefixed with the value of the InvalidXmlAttributeNamePrefix property, "hc-generated-" by default.

    If AllowInvalidAttributeNames is true, then all attribute names are untouched, with the exception of:

    1. Characters in an attribute name that are invalid in attribute identifiers according to the WHATWG document processing rules (apos, solidus, greater than etc) result in the attribute name being omitted.
    2. JDom doesn't permit invalid attribute names, therefore any invalid attribute names are omitted by JDomSerializer.

    And the caveat that DomSerializer sets Document.stricterrorchecking=false when AllowInvalidAttributeNames is true to prevent throwing errors when creating a Document, therefore the output of DomSerializer may be invalid XML.

    One final thing before this goes into the next release:

    I know most users expect HC to "do its best" with relatively little configuration. With that in mind, should the default be not to use a prefix to indicate where it has modified attributes to make them valid, so that it just tries to silently fix issues?

     
  • legrass

    legrass - 2017-04-26

    I agree, probably default not to use a prefix is clearer and if you like, just expose a prefix property for one to use (in this case would be "" by default).

    A side question: I couldn't find a way to set Document.stricterrorchecking=false if not creating my own custom DomSerializer where I can access document before is built. Is there a property that I miss?

     
  • Scott Wilson

    Scott Wilson - 2017-04-26

    Excellent, I'll implement it that way.

    On the stricterrorchecking question - no, this is not exposed as a property. It probably makes more sense as a constructor argument for the DomSerializer as it doesn't apply elsewhere.

     
  • legrass

    legrass - 2017-04-26

    makes sense, thanks

     
  • Scott Wilson

    Scott Wilson - 2017-04-26

    I've opened another issue to track that - its #186.

     
1 2 > >> (Page 1 of 2)

Log in to post a comment.

MongoDB Logo MongoDB