INVALID_CHARACTER_ERR when converting to a W3C Document

Help
fancy
2014-04-18
2014-04-19
  • fancy

    fancy - 2014-04-18

    When I clean http://internal.dbw.cn/system/2011/09/20/053405762.shtml
    it throws:
    Exception in thread "main" org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified.
    at org.apache.xerces.dom.CoreDocumentImpl.createAttribute(Unknown Source)
    at org.apache.xerces.dom.ElementImpl.setAttribute(Unknown Source)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:180)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:192)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:192)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:192)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:192)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:192)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:192)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:192)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:192)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:192)
    at org.htmlcleaner.DomSerializer.createSubnodes(DomSerializer.java:192)
    at org.htmlcleaner.DomSerializer.createDOM(DomSerializer.java:108)

    My code:
    TagNode tagNode = new HtmlCleaner().clean(content);
    CleanerProperties cp = new CleanerProperties();
    cp.setRecognizeUnicodeChars(true);
    document = new DomSerializer(cp).createDOM(tagNode);

     
  • Scott Wilson

    Scott Wilson - 2014-04-19

    Hi fancy,

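    The problem is caused by reading the document with the wrong charset, so some of the Chinese characters are misinterpreted as markup (I guess some of their bytes land on the code points used for angle brackets in UTF-8). The trace shows the failure in createAttribute, which means an attribute name ended up containing a character that is not legal in an XML name. If you pass the charset explicitly when cleaning, it works OK: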

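        // Clean the raw stream as GB2312 instead of relying on the default charset.
        HtmlCleaner cleaner = new HtmlCleaner();
        URL url = new URL("http://internal.dbw.cn/system/2011/09/20/053405762.shtml");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        InputStream in = conn.getInputStream();
        TagNode node = cleaner.clean(in, "GB2312");
        // Reuse the cleaner's own properties for serialization.
        DomSerializer ser = new DomSerializer(cleaner.getProperties());
        Document document = ser.createDOM(node);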
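
    To see why the charset matters, here is a minimal sketch using only the JDK (no HtmlCleaner involved). The class name CharsetMismatchDemo and the sample string are made up for illustration, and it assumes the JDK at hand ships the GB2312 charset, which standard full JDK builds normally do. It only shows that the same bytes decode to different characters under different charsets; it does not reproduce the exact failure on that page.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of HtmlCleaner.
final class CharsetMismatchDemo {

    /** Decodes raw bytes using the named charset. */
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        // Bytes as a GB2312-encoded page would deliver them.
        byte[] raw = original.getBytes(Charset.forName("GB2312"));

        String right = decode(raw, "GB2312");     // correct charset
        String wrong = decode(raw, "ISO-8859-1"); // wrong charset

        System.out.println("round trip intact: " + right.equals(original));
        System.out.println("misread intact:    " + wrong.equals(original));
        System.out.println(raw.length + " bytes became " + wrong.length()
                + " unrelated characters when misread");
    }
}
```

    Once the text has been decoded with the wrong charset, the parser sees characters the author never wrote, and some of them can end up in places (such as attribute names) where XML does not allow them.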
    

    You can also set the charset on the HtmlCleaner itself, e.g.:

        cleaner.getProperties().setCharset("GB2312");
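
    If you would rather not hard-code "GB2312", one option is to take the charset from the HTTP Content-Type header and fall back to a default when the server does not send one. The helper below is only a sketch under that assumption: ContentTypeCharset and charsetOf are hypothetical names, not part of HtmlCleaner, and it uses nothing beyond the JDK.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper, not part of HtmlCleaner.
final class ContentTypeCharset {

    /**
     * Returns the charset named in a Content-Type header value such as
     * "text/html; charset=GB2312", or the fallback when the header is
     * missing, has no charset parameter, or names an unusable charset.
     */
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Strip the parameter name and any surrounding quotes.
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    if (!name.isEmpty() && Charset.isSupported(name)) {
                        return name;
                    }
                } catch (IllegalArgumentException e) {
                    // Syntactically illegal charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

    With the connection from the snippet above, this could be used as cleaner.clean(in, ContentTypeCharset.charsetOf(conn.getContentType(), "GB2312")). Many pages declare their charset only in a meta tag rather than in the header, in which case the fallback is what gets used.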
    
     
