It's worth noting that with props.setOmitUnknownTags(true) set, the issue does not occur.
StringIndexOutOfBoundsException in sanitizeXmlIdentifier
makes sense, thanks
I agree; not using a prefix by default is probably clearer, and if you like, you could just expose a prefix property for anyone who wants one (it would be "" by default). A side question: I couldn't find a way to set Document.strictErrorChecking=false other than by creating my own custom DomSerializer, where I can access the document before it is built. Is there a property that I'm missing?
I think this is it! It covers all the options. thanks!
I see, that seems fine. One last thing: I'd still leave the possibility to configure it so that nothing is touched. While this will produce invalid attribute names, setting Document.strictErrorChecking=false makes it possible to get a Document with untouched names. This is my current use case (getting a Document from HTML5 pages and querying it via XPath with Saxon).
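To make the strict-error-checking point concrete, here is a minimal sketch that uses only the JDK's built-in DOM implementation, not HtmlCleaner's DomSerializer. The class name StrictCheckDemo and its helper method are made up for illustration, and the behavior depends on the DOM implementation in use; the DOM spec leaves the result undefined when checking is off.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; not part of HtmlCleaner.
class StrictCheckDemo {

    // Creates a <p> element in a fresh Document and tries to set an attribute
    // with the given name. Returns true if the DOM accepted the name.
    static boolean acceptsAttributeName(String attrName, boolean strict) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
        // Standard DOM Level 3 switch; true is the default.
        doc.setStrictErrorChecking(strict);

        Element p = doc.createElement("p");
        doc.appendChild(p);
        try {
            p.setAttribute(attrName, "1");
            return true;
        } catch (DOMException e) {
            // With strict checking on, an invalid name is rejected here.
            return false;
        }
    }
}
```

With the JDK's default implementation I would expect acceptsAttributeName("1", true) to be rejected and acceptsAttributeName("1", false) to go through, which is the "nothing is touched" behavior described above. Serializing such a Document back to XML would still produce invalid markup.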
From your last example, I wonder whether you can always clean invalid names the way your example turns ban;ana into banana. What about <p 1="1">? Thinking it through, maybe the most flexible option for a user would be to allow passing a Function<String, String> to a serializer constructor, which transforms an attribute name into whatever the user wants. By default, if not overridden by the user, this function does what we said above. Something like: if (!isXMLValid(attrName)) attrName = invalidAttrNameFunction.sanitize(attrName);...
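A rough sketch of what that pluggable function could look like, just to pin the idea down. Everything here is hypothetical (AttrNameSanitizer, defaultSanitizer, apply, and the "_" prefix in the usage note are names I made up, not HtmlCleaner API), and the validity check is a simplified approximation of the XML Name production rather than the full character ranges.

```java
import java.util.function.Function;

// Hypothetical sketch; none of these names exist in HtmlCleaner.
class AttrNameSanitizer {

    // Simplified stand-in for XML NameStartChar (the real production
    // covers specific Unicode ranges).
    static boolean isNameStart(char c) {
        return Character.isLetter(c) || c == '_' || c == ':';
    }

    // Simplified stand-in for XML NameChar.
    static boolean isNameChar(char c) {
        return isNameStart(c) || Character.isDigit(c) || c == '-' || c == '.';
    }

    static boolean isXmlValid(String name) {
        if (name == null || name.isEmpty() || !isNameStart(name.charAt(0))) {
            return false;
        }
        for (int i = 1; i < name.length(); i++) {
            if (!isNameChar(name.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Assumed default strategy: drop characters that cannot appear in a name,
    // then prepend the prefix if the result is still not a valid name
    // (for example "1", or an empty string).
    static Function<String, String> defaultSanitizer(String prefix) {
        return name -> {
            StringBuilder sb = new StringBuilder();
            for (char c : name.toCharArray()) {
                if (isNameChar(c)) {
                    sb.append(c);
                }
            }
            String cleaned = sb.toString();
            return isXmlValid(cleaned) ? cleaned : prefix + cleaned;
        };
    }

    // Valid names pass through untouched; invalid ones go to the user's function.
    static String apply(String attrName, Function<String, String> sanitizer) {
        return isXmlValid(attrName) ? attrName : sanitizer.apply(attrName);
    }
}
```

With defaultSanitizer("_"), ban;ana becomes banana and 1 becomes _1. Passing Function.identity() leaves every name as it is, which would cover the "touch nothing" case. Note that an empty prefix cannot repair a name like 1, so the prefix only helps if it is itself a valid name start.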
And I agree that prefixInvalidAttributeNames can be true by default.