When using setNamespacesAware = false,
the following HTML will be not cleaned properly, causing a DOMSerializer to fail processing the cleaned TagNode.
Input:
<div class="fisterpo" style="style=" margin:="" 50px;="" margin-left:30px;"="">
Output (using PrettyXMLSerializer):
<div class="fisterpo" style="style=" margin="" px="px" 30px="30px">
which contains a attribute beginning with a digit, which is not valid XML Syntax.
Thanks Philipp, I'll see if I can fix this for the next release. I think the only way to handle this is to arbitrarily prefix the attributes when output using the XML Serializer, e.g.
I have another example taken from a real website. The (admittedly terrible) html is
<div #000000;="" 1px="" border-top:solid="" clear:both;="" 18px;="" line-height:="" 15px;text-align:center;="" padding-top:="" margin-top:25px;="" margin-right:="" 20px;="" margin-left:="" 10px;="" style="font-size:">and inspecting the dom in firefox, the attributes are:
#000000;="", 1px="", border-top:solid="", clear:both;="", 18px;="", line-height:="", 15px;text-align:center;="", padding-top:="", margin-top:25px;="", margin-right:="", 20px;="", margin-left:="", 10px;="", style="font-size:"while HC reports:
style=font-size:, px=px, margin-left=margin-left, margin-right=margin-right, 25px=25px, padding-top=padding-top, center=center, line-height=line-height, both=both, solid=solidYes, the HTML5 attribute rules are less strict than those for XML, so we should allow them:
"Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), U+003E GREATER-THAN SIGN (>), U+002F SOLIDUS (/), and U+003D EQUALS SIGN (=) characters, the control characters, and any characters that are not defined by Unicode. In the HTML syntax, attribute names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that are an ASCII case-insensitive match for the attribute's name."
In XML, the limits are different:
[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5] Name ::= NameStartChar (NameChar)*
OK, so I have the following test case set:
For the input:
The HTML serializer output is:
The XML, Dom and JDom output is:
Look sensible?
I now need to make sure we haven't gone too far in the other direction and allow invalid HTML attribute names.
looks fine, probably useful to make the prefix 'hc-generated-' configurable as property.
OK, so perhaps a "PrefixInvalidXmlAttributeNames" property with a default of "hc-generated-"
Where if its set to "", the behaviour is to omit the attribute entirely.
Probably it needs a boolean property to enable/disable it. I'm using the DOMSerializer setting
strictErrorChecking=falseto allow invalid names, and it works. So maybe a specific boolean propertyprefixInvalidAttributeNamesand a string for the prefix to use is more flexible becausestrictErrorChecking=falsecan be still used if wanted.and I agrre that
prefixInvalidAttributeNamescan betrueby defaultI've checked in some code changes for this with the following behaviour.
Given the input:
the default output for XML is:
if
prefixInvalidAttributeNamesisfalse:if
prefixInvalidAttributeNamesistrueandInvalidXmlAttributeNamePrefix= "":From your last example, I wonder if you can always clean invalid names like in your example
ban;anatobanana? What about<p 1="1">?Thinking it through, maybe the most flexible way for a user would be to allow to pass to a serializer constructor a function<string, string=""> that transforms an attribute name into whatever the user wants. By default, if not overridden by the user, this function does what we said above.</string,>
something like:
This gives the flexibility to, for instance, drop only some attributes, clean differently others, prefix differently some others and so on. Or just use the default if one doesn't bother.
I understand that probably is not in line with the current way properties are dealt with, so please feel free to not like it :)
I was wondering too whether the default should be always to try to fix attribute names rather than scrap invalid ones. The question is whether to show that's happened (using the prefix) or do it quietly.
Note that it would currently try to change "ban;ana" to "banana" for XML not HTML given that its not actually invalid for HTML5 (according to WHATWG). Maybe that should also be configurable - maybe a "strict attribute names" option for HTML.
With the example you gave, for
The current HTML serialization would be the same, as thats still valid HTML (well, its not in the spec, but its not forbidden). For XML or DOM it has to be:
(Or whatever the prefix is.)
For attribute transformations, there is already the input transform function, so I'd rather not mix that in with serializers.
I see, seems fine. Only one last thing, I'd still leave the possibility to configure such that nothing is touched. While this will produce invalid attribute names, setting
Document.stricterrorchecking=falsewill enable to have a Document with untouched names. This is my current use case (getting a Document from html5 pages and query via xpath saxon).OK - so maybe just a
FixInvalidAttributeNamesproperty with default oftruethat applies to all serializers (even HTML), with thefalsebehaviour to allow invalid attributes and thetruebehaviour to try to coerce attributes into a suitable form, or to drop them if non-fixable, and to use the configured prefix (if set) to indicate where attribute names have been modified. With the caveat that setting it tofalsewill allow for a potentially invalid DOM to be generated.Last edit: Scott Wilson 2017-04-25
I think this is it! It covers all the options. thanks!
Wonderful! Though to be consistent with other properties I'll make it
AllowInvalidNameswith a default offalse:)OK, we're about there.
AllowInvalidAttributeNamesisfalseby default.For all serializers, this means that attribute names are coerced into valid XML attribute names where possible, or omitted where this is not possible. Coerced attribute names are prefixed with the value of the
InvalidXmlAttributeNamePrefixproperty, "hc-generated-" by default.If
AllowInvalidAttributeNamesistrue, then all attribute names are untouched, with the exception of:And the caveat that DomSerializer sets
Document.stricterrorchecking=falsewhenAllowInvalidAttributeNamesistrueto prevent throwing errors when creating a Document, therefore the output of DomSerializer may be invalid XML.One final thing before this goes into the next release:
I know most users expect HC to "do its best" with relatively little configuration. With that in mind, should the default be not to use a prefix to indicate where it has modified attributes to make them valid, so that it just tries to silently fix issues?
I agree, probably default not to use a prefix is clearer and if you like, just expose a prefix property for one to use (in this case would be "" by default).
A side question: I couldn't find a way to set
Document.stricterrorchecking=falseif not creating my own custom DomSerializer where I can access document before is built. Is there a property that I miss?Excellent, I'll implement it that way.
On the
stricterrorcheckingquestion - no, this is not exposed as a property. It probably makes more sense as a constructor argument for the DomSerializer as it doesn't apply elsewhere.makes sense, thanks
I've opened another issue to track that - its #186.