
#3 remove invalid namespaces from XML fragment

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2009-09-14
Created: 2009-09-12
Creator: Sam Hough
Private: No

I'm using the SAXParser to clean up HTML fragments provided by the user (cut and pasted into an editor, complete with hideous Word rubbish) and pass them on, via SAX, to Xalan.

I've tried various combinations of properties, but I can't get it to remove invalid namespaces or create synthetic declarations (are these allowed on non-root elements?) using the Purifier filter.

What set of features/properties should I be using? At the moment I'm using a ContentHandler to strip them, but this seems like overkill.
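For readers following along, the ContentHandler workaround mentioned above might look roughly like the sketch below. It is not the attached unit test or anything shipped with nekohtml: the class name `PrefixStrippingFilter` and the `clean` helper are hypothetical, and to stay self-contained the demo feeds the filter from the JDK's own SAX parser with namespace processing switched off (so the input must already be well-formed), where the real setup would put the nekohtml SAXParser in front. The filter strips the prefix from element names and drops prefixed attributes and `xmlns` declarations.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: strips element prefixes and drops prefixed attributes. */
class PrefixStrippingFilter extends XMLFilterImpl {

    public PrefixStrippingFilter(XMLReader parent) {
        super(parent);
    }

    // "o:p" -> "p"; names without a colon are returned unchanged.
    private static String stripPrefix(String qName) {
        return qName.substring(qName.indexOf(':') + 1);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        AttributesImpl kept = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            String name = atts.getQName(i);
            // Drop prefixed attributes (e.g. m:bogus, xmlns:o) and plain xmlns declarations.
            if (name.indexOf(':') < 0 && !name.equals("xmlns")) {
                kept.addAttribute("", name, name, atts.getType(i), atts.getValue(i));
            }
        }
        String name = stripPrefix(qName);
        super.startElement("", name, name, kept);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        String name = stripPrefix(qName);
        super.endElement("", name, name);
    }

    /** Demo helper (assumption: input is well-formed apart from undeclared prefixes). */
    public static String clean(String fragment) throws Exception {
        final StringBuilder out = new StringBuilder();
        // Minimal serializer for the demo only: it does not escape special characters.
        DefaultHandler serializer = new DefaultHandler() {
            private boolean tagOpen; // start tag written, '>' not yet emitted

            private void closeTag() {
                if (tagOpen) {
                    out.append('>');
                    tagOpen = false;
                }
            }

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                closeTag();
                out.append('<').append(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    out.append(' ').append(atts.getQName(i))
                       .append("=\"").append(atts.getValue(i)).append('"');
                }
                tagOpen = true;
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (tagOpen) {
                    out.append("/>"); // element had no content
                    tagOpen = false;
                } else {
                    out.append("</").append(qName).append('>');
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (length == 0) {
                    return;
                }
                closeTag();
                out.append(ch, start, length);
            }
        };

        SAXParserFactory factory = SAXParserFactory.newInstance();
        // With namespace processing on, an undeclared prefix is a fatal parse error.
        factory.setNamespaceAware(false);
        XMLReader filter = new PrefixStrippingFilter(factory.newSAXParser().getXMLReader());
        filter.setContentHandler(serializer);
        filter.parse(new InputSource(new StringReader(fragment)));
        return out.toString();
    }
}
```

In a real pipeline the filter's content handler would be whatever feeds Xalan (for instance a TransformerHandler) rather than the toy serializer, and whether nekohtml reports prefixed names in `qName` the same way depends on its namespace settings, so treat the prefix handling here as an assumption to verify.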

I'm using nekohtml 1.9.13, Java 1.6.0_0-b11 and Xerces 2.6.2.

Many thanks. nekohtml has been cleaning HTML for me for years. I'm just being fussy now ;)

Cheers

Sam

Discussion

  • Marc Guillemot - 2009-09-14
    • status: open --> pending
     
  • Marc Guillemot - 2009-09-14

    Can you provide an example of your dirty HTML code, as well as what you expect as the cleaned result?

     
  • Sam Hough - 2009-09-14

    My unit test for this

     
  • Sam Hough - 2009-09-14
    • status: pending --> open
     
  • Sam Hough - 2009-09-14

    I've attached a unit test for this. I'm happy to make it match the nekohtml coding conventions and reattach it as a patch if useful.

    Anyway, the not very consistent tests I have are:

    public void testElementWithNamespace() throws Exception {
        doTest("<strong><o:p></o:p>X</strong>", "<strong><p/>X</strong>");
    }

    public void testRemoveAttributeWithNamespace() throws Exception {
        doTest("<p m:bogus=\"x\">fred</p>", "<p>fred</p>");
    }

    Mainly I just want to avoid Xalan blowing up due to undeclared namespaces.

    Many thanks

    Sam

     
