HtmlCleaner / Bugs / #175 HTMLCleaner genereates invalid attribute names from bad HTML

Scott Wilson - 2016-08-17

status: open --> open-accepted
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2016-08-17

Thanks Philipp, I'll see if I can fix this for the next release. I think the only way to handle this is to arbitrarily prefix the attributes when output using the XML Serializer, e.g.

<div class="fisterpo" style="style=" margin="" px="px" hcgenerated30px="30px">
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2016-10-19

Group: v2.17 --> v2.18
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2017-02-06

Group: v2.18 --> v2.19
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2017-02-07

Group: v2.19 --> v2.20
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

legrass - 2017-04-24

I have another example taken from a real website. The (admittedly terrible) html is

<div #000000;="" 1px="" border-top:solid="" clear:both;="" 18px;="" line-height:="" 15px;text-align:center;="" padding-top:="" margin-top:25px;="" margin-right:="" 20px;="" margin-left:="" 10px;="" style="font-size:">

and inspecting the dom in firefox, the attributes are:

#000000;="", 1px="", border-top:solid="", clear:both;="", 18px;="", line-height:="", 15px;text-align:center;="", padding-top:="", margin-top:25px;="", margin-right:="", 20px;="", margin-left:="", 10px;="", style="font-size:"

while HC reports:

style=font-size:, px=px, margin-left=margin-left, margin-right=margin-right, 25px=25px, padding-top=padding-top, center=center, line-height=line-height, both=both, solid=solid

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2017-04-24

Yes, the HTML5 attribute rules are less strict than those for XML, so we should allow them:

"Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), U+003E GREATER-THAN SIGN (>), U+002F SOLIDUS (/), and U+003D EQUALS SIGN (=) characters, the control characters, and any characters that are not defined by Unicode. In the HTML syntax, attribute names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that are an ASCII case-insensitive match for the attribute's name."

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2017-04-24

In XML, the limits are different:

[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5] Name ::= NameStartChar (NameChar)*

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2017-04-24

OK, so I have the following test case set:

For the input:

<p 1="1" x='y'>text</p>

The HTML serializer output is:

<p 1="1" x="y">text</p>

The XML, Dom and JDom output is:

<p hc-generated-1="1" x="y">text</p>

Look sensible?

I now need to make sure we haven't gone too far in the other direction and allow invalid HTML attribute names.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

legrass - 2017-04-24

looks fine, probably useful to make the prefix 'hc-generated-' configurable as property.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2017-04-24

OK, so perhaps a "PrefixInvalidXmlAttributeNames" property with a default of "hc-generated-"

Where if its set to "", the behaviour is to omit the attribute entirely.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

legrass - 2017-04-24

Probably it needs a boolean property to enable/disable it. I'm using the DOMSerializer setting strictErrorChecking=false to allow invalid names, and it works. So maybe a specific boolean property prefixInvalidAttributeNames and a string for the prefix to use is more flexible becausestrictErrorChecking=false can be still used if wanted.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

legrass - 2017-04-24

and I agrre that prefixInvalidAttributeNames can be true by default

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2017-04-24

I've checked in some code changes for this with the following behaviour.

Given the input:

<div ban;ana="true">

the default output for XML is:

<div hc-generated-banana="true">

if prefixInvalidAttributeNames is false:

<div>

if prefixInvalidAttributeNames is true and InvalidXmlAttributeNamePrefix = "":

<div banana="true">
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

legrass - 2017-04-25

From your last example, I wonder if you can always clean invalid names like in your example ban;ana to banana ? What about <p 1="1"> ?

Thinking it through, maybe the most flexible way for a user would be to allow to pass to a serializer constructor a function<string, string=""> that transforms an attribute name into whatever the user wants. By default, if not overridden by the user, this function does what we said above.</string,>

something like:

if (!isXMLValid(attrName)) attrName = invalidAttrNameFunction.sanitize(attrName); if(!attrName ==null) //null means drop it entirely node.setAttribute(attrName, attrValue)

This gives the flexibility to, for instance, drop only some attributes, clean differently others, prefix differently some others and so on. Or just use the default if one doesn't bother.
I understand that probably is not in line with the current way properties are dealt with, so please feel free to not like it :)
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2017-04-25

I was wondering too whether the default should be always to try to fix attribute names rather than scrap invalid ones. The question is whether to show that's happened (using the prefix) or do it quietly.

Note that it would currently try to change "ban;ana" to "banana" for XML not HTML given that its not actually invalid for HTML5 (according to WHATWG). Maybe that should also be configurable - maybe a "strict attribute names" option for HTML.

With the example you gave, for

<p 1="1">

The current HTML serialization would be the same, as thats still valid HTML (well, its not in the spec, but its not forbidden). For XML or DOM it has to be:

<p hc-generated-1="1">

(Or whatever the prefix is.)

For attribute transformations, there is already the input transform function, so I'd rather not mix that in with serializers.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

legrass - 2017-04-25

I see, seems fine. Only one last thing, I'd still leave the possibility to configure such that nothing is touched. While this will produce invalid attribute names, setting Document.stricterrorchecking=false will enable to have a Document with untouched names. This is my current use case (getting a Document from html5 pages and query via xpath saxon).

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2017-04-25

OK - so maybe just a FixInvalidAttributeNames property with default of true that applies to all serializers (even HTML), with the false behaviour to allow invalid attributes and the true behaviour to try to coerce attributes into a suitable form, or to drop them if non-fixable, and to use the configured prefix (if set) to indicate where attribute names have been modified. With the caveat that setting it to false will allow for a potentially invalid DOM to be generated.

Last edit: Scott Wilson 2017-04-25

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

legrass - 2017-04-25

I think this is it! It covers all the options. thanks!

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2017-04-25

Wonderful! Though to be consistent with other properties I'll make it AllowInvalidNames with a default of false :)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2017-04-25

OK, we're about there.

AllowInvalidAttributeNames is false by default.

For all serializers, this means that attribute names are coerced into valid XML attribute names where possible, or omitted where this is not possible. Coerced attribute names are prefixed with the value of the InvalidXmlAttributeNamePrefix property, "hc-generated-" by default.

If AllowInvalidAttributeNames is true, then all attribute names are untouched, with the exception of:

Characters in an attribute name that are invalid in attribute identifiers according to the WHATWG document processing rules (apos, solidus, greater than etc) result in the attribute name being omitted.

JDom doesn't permit invalid attribute names, therefore any invalid attribute names are omitted by JDomSerializer.

And the caveat that DomSerializer sets Document.stricterrorchecking=false when AllowInvalidAttributeNames is true to prevent throwing errors when creating a Document, therefore the output of DomSerializer may be invalid XML.

One final thing before this goes into the next release:

I know most users expect HC to "do its best" with relatively little configuration. With that in mind, should the default be not to use a prefix to indicate where it has modified attributes to make them valid, so that it just tries to silently fix issues?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

legrass - 2017-04-26

I agree, probably default not to use a prefix is clearer and if you like, just expose a prefix property for one to use (in this case would be "" by default).

A side question: I couldn't find a way to set Document.stricterrorchecking=false if not creating my own custom DomSerializer where I can access document before is built. Is there a property that I miss?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2017-04-26

Excellent, I'll implement it that way.

On the stricterrorchecking question - no, this is not exposed as a property. It probably makes more sense as a constructor argument for the DomSerializer as it doesn't apply elsewhere.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

legrass - 2017-04-26

makes sense, thanks

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2017-04-26

I've opened another issue to track that - its #186.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

HTMLCleaner genereates invalid attribute names from bad HTML

Group

Searches

Help

#175 HTMLCleaner genereates invalid attribute names from bad HTML

Discussion