I have an HTML document with a tag that carries a style attribute. It seems that after cleaning, the style attribute is split into multiple attributes (even with setAllowMultiWordAttributes set to true), one of which is -.25pt=-.25pt.
The problem is that calling createDOM on the resulting TagNode raises the following exception:
Exception in thread "main" org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified.
at org.apache.xerces.dom.CoreDocumentImpl.createAttribute(Unknown Source)
at org.apache.xerces.dom.ElementImpl.setAttribute(Unknown Source)
at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:208)
at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
at org.htmlcleaner.DomSerializer.createDOM(DomSerializer.java:136)
I think this happens because, per the XML specification, an attribute name cannot start with the - character.
Any help appreciated.
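The suspected cause can be checked in isolation. The following is a minimal sketch that uses only the JDK's built-in DOM implementation, not HtmlCleaner; the class name AttributeNameCheck and the helper isAcceptedByDom are made up for illustration. It assumes the JDK's default DOM implementation validates attribute names the same way as the Xerces build in the stack trace.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of HtmlCleaner.
class AttributeNameCheck {

    // Returns true if the DOM implementation accepts the attribute name,
    // false if it rejects it with INVALID_CHARACTER_ERR.
    static boolean isAcceptedByDom(String name) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
        Element span = doc.createElement("span");
        try {
            // Same call that appears in the stack trace (ElementImpl.setAttribute).
            span.setAttribute(name, name);
            return true;
        } catch (DOMException e) {
            if (e.code == DOMException.INVALID_CHARACTER_ERR) {
                return false;
            }
            throw e; // any other DOM error is unexpected here
        }
    }

    public static void main(String[] args) {
        System.out.println("style accepted:  " + isAcceptedByDom("style"));
        System.out.println("-.25pt accepted: " + isAcceptedByDom("-.25pt"));
    }
}
```

If the assumption holds, style is accepted and -.25pt is rejected, which matches the exception raised from DomSerializer.createSubnodes.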
Diego - have you tried using the latest version of HtmlCleaner? I've tried this with the current Trunk version and I get this:
    @Test
    public void multiWord() throws IOException, ParserConfigurationException {
        // The inner quotes around Arial are HTML-escaped so the style value stays one attribute
        String initial = "<span style=\"font-size:19.5pt;font-family:&quot;Arial&quot;,sans-serif;color:#41637E;letter-spacing:-.25pt\">New story request<o:p></o:p></span>";
        cleaner.getProperties().setAllowMultiWordAttributes(true);
        DomSerializer ser = new DomSerializer(cleaner.getProperties());
        Document doc = ser.createDOM(cleaner.clean(initial));
        // Print every attribute found on the first span element
        for (int i = 0; i < doc.getDocumentElement().getElementsByTagName("span").item(0).getAttributes().getLength(); i++) {
            System.out.println(doc.getDocumentElement().getElementsByTagName("span").item(0).getAttributes().item(i));
        }
    }
Results in:
Hi Scott, thank you for the answer.
My problem was that I was unescaping the string before processing which resulted in:
which I think is not a valid HTML string.
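The effect of unescaping before cleaning can be sketched without HtmlCleaner. The example below is only an illustration: the class UnescapeDemo, the constant ESCAPED, and the helper styleValue are hypothetical, the markup is a shortened version of the span from the test above, and the scan to the next double quote is a deliberately simplified stand-in for an HTML parser, not HtmlCleaner's actual algorithm.

```java
// Hypothetical demo class, not part of HtmlCleaner.
class UnescapeDemo {

    // Shortened form of the span from the thread, with the inner quotes escaped.
    static final String ESCAPED =
        "<span style=\"font-family:&quot;Arial&quot;,sans-serif;letter-spacing:-.25pt\">x</span>";

    // Simplified attribute scan: the style value runs up to the next double quote.
    static String styleValue(String html) {
        String marker = "style=\"";
        int markerPos = html.indexOf(marker);
        if (markerPos < 0) {
            throw new IllegalArgumentException("no style attribute found");
        }
        int start = markerPos + marker.length();
        int end = html.indexOf('"', start);
        if (end < 0) {
            throw new IllegalArgumentException("unterminated style attribute");
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        // Unescaping first turns &quot; into raw quotes inside the attribute value.
        String unescaped = ESCAPED.replace("&quot;", "\"");
        System.out.println("escaped:   " + styleValue(ESCAPED));
        System.out.println("unescaped: " + styleValue(unescaped));
    }
}
```

With the escaped input the whole declaration list stays inside the style value. After unescaping, the value ends at the first raw quote, and the remainder (including -.25pt) is left outside it, where a lenient parser may read it as further attributes.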
Last edit: Diego Bardari 2015-05-04
Aha, that makes sense