
Attributes split and parse error

Help
2015-04-27
2015-05-04
  • Diego Bardari

    Diego Bardari - 2015-04-27

    I have an HTML document with the following tag:

    <span style="font-size:19.5pt;font-family:&quot;Arial&quot;,sans-serif;color:#41637E;letter-spacing:-.25pt">New story request<o:p></o:p></span>
    

    It seems that after cleaning, the style attribute is split into multiple attributes (even with setAllowMultiWordAttributes set to true), one of which is -.25pt=-.25pt.

    The problem is that calling createDOM on the resulting TagNode raises the following exception:

    Exception in thread "main" org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified. 
    at org.apache.xerces.dom.CoreDocumentImpl.createAttribute(Unknown Source)
    at org.apache.xerces.dom.ElementImpl.setAttribute(Unknown Source)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:208)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:220)
    at org.htmlcleaner.DomSerializer.createDOM(DomSerializer.java:136)
    

    I think this happens because, per the XML specification, an attribute name cannot start with the - character.
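
    A minimal sketch of that restriction, assuming the DOM implementation bundled with the JDK rather than the Xerces build in the stack trace above. The class and method names are made up for illustration. It tries to set an attribute with a given name and reports whether the DOM rejects it with INVALID_CHARACTER_ERR:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of HtmlCleaner.
class AttributeNameCheck {

    // Returns true if the DOM accepts `name` as an attribute name,
    // false if it is rejected with INVALID_CHARACTER_ERR.
    public static boolean isAcceptedAttributeName(String name)
            throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element span = doc.createElement("span");
        try {
            span.setAttribute(name, name);
            return true;
        } catch (DOMException e) {
            if (e.code == DOMException.INVALID_CHARACTER_ERR) {
                return false;
            }
            throw e; // any other DOM error is unexpected here
        }
    }

    public static void main(String[] args) throws ParserConfigurationException {
        System.out.println("style  accepted: " + isAcceptedAttributeName("style"));
        System.out.println("-.25pt accepted: " + isAcceptedAttributeName("-.25pt"));
    }
}
```

    A check like this could be run over the attributes of a TagNode before calling createDOM, to find out which name triggers the exception.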

    Any help appreciated.

     
  • Scott Wilson

    Scott Wilson - 2015-05-01

    Diego - have you tried using the latest version of HtmlCleaner? I've tried this with the current Trunk version and I get this:

        @Test
        public void multiWord() throws IOException, ParserConfigurationException {
            String initial = "<span style=\"font-size:19.5pt;font-family:&quot;Arial&quot;,sans-serif;color:#41637E;letter-spacing:-.25pt\">New story request<o:p></o:p></span>";
            cleaner.getProperties().setAllowMultiWordAttributes(true);
            DomSerializer ser = new DomSerializer(cleaner.getProperties());
            Document doc = ser.createDOM(cleaner.clean(initial));
            // Print every attribute of the first <span> in the cleaned document.
            NamedNodeMap attributes = doc.getDocumentElement().getElementsByTagName("span").item(0).getAttributes();
            for (int i = 0; i < attributes.getLength(); i++) {
                System.out.println(attributes.item(i));
            }
        }
    

    Results in:

    style="font-size:19.5pt;font-family:&quot;Arial&quot;,sans-serif;color:#41637E;letter-spacing:-.25pt"
    
     
  • Diego Bardari

    Diego Bardari - 2015-05-04

    Hi Scott, thank you for the answer.

    My problem was that I was unescaping the string before processing which resulted in:

    <span style="font-size:19.5pt;font-family:"Arial",sans-serif;color:#41637E;letter-spacing:-.25pt">New story request<o:p></o:p></span>
    

    which I think is not a valid HTML string.
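
    To illustrate why the unescaped form breaks, here is a rough sketch. It is not how HtmlCleaner tokenizes attributes; the helper below is a deliberately naive stand-in (a hypothetical name) that reads a style value up to the next double quote. With &quot; left in place the whole value survives; once it is unescaped, the value ends right before "Arial" and the rest is left over to be parsed as further attributes:

```java
// Hypothetical demo class; a naive stand-in for an attribute parser.
class UnescapeDemo {

    // Returns the text between style=" and the next double quote,
    // or null if the input has no style attribute.
    public static String firstQuotedStyleValue(String html) {
        String marker = "style=\"";
        int start = html.indexOf(marker);
        if (start < 0) {
            return null;
        }
        start += marker.length();
        int end = html.indexOf('"', start);
        return end < 0 ? html.substring(start) : html.substring(start, end);
    }

    public static void main(String[] args) {
        String escaped = "<span style=\"font-family:&quot;Arial&quot;,sans-serif;"
                + "letter-spacing:-.25pt\">New story request</span>";
        // Unescaping before cleaning turns &quot; into a literal double quote.
        String unescaped = escaped.replace("&quot;", "\"");

        System.out.println("escaped:   " + firstQuotedStyleValue(escaped));
        System.out.println("unescaped: " + firstQuotedStyleValue(unescaped));
    }
}
```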

     

    Last edit: Diego Bardari 2015-05-04
  • Scott Wilson

    Scott Wilson - 2015-05-04

    Aha, that makes sense

     
