I'm using the SAXParser to clean up HTML fragments supplied by users (cut and pasted into an editor, complete with hideous Word rubbish) and pass the result on, via SAX, to Xalan.
I've tried various combinations of properties, but I can't get it to remove invalid namespaces or create synthetic declarations (are these allowed on non-root elements?) using the Purifier filter.
What set of features/properties should I be using? At the moment I'm using a ContentHandler to strip them but this seems overkill.
I'm using NekoHTML 1.9.13, Java 1.6.0_0-b11 and Xerces 2.6.2.
Many thanks. nekohtml has been cleaning HTML for me for years. I'm just being fussy now ;)
Cheers
Sam
Can you provide an example of your dirty html code as well as what you expect as cleaned result?
My unit test for this
I've attached a unit test for this. I'm happy to make it match the nekohtml coding conventions and reattach it as a patch if that's useful.
Anyway, the (not very consistent) tests I have are:
    public void testElementWithNamespace() throws Exception {
        doTest("<strong><o:p></o:p>X</strong>", "<strong><p/>X</strong>");
    }

    public void testRemoveAttibuteWithNamespace() throws Exception {
        doTest("<p m:bogus=\"x\">fred</p>", "<p>fred</p>");
    }
Mainly I just want to avoid Xalan blowing up due to undeclared namespaces.
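For anyone wanting to reproduce the behaviour the two tests describe, it can be sketched with a plain SAX filter. This is only a minimal sketch under assumptions, not the ContentHandler actually in use and not a NekoHTML feature: the names PrefixStrippingFilter, Printer and clean are hypothetical, it uses only the standard org.xml.sax.helpers.XMLFilterImpl, and the JDK's non-namespace-aware XML parser stands in for NekoHTML so that the example runs on its own.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: strips prefixes from element names and drops
// every prefixed attribute (including xmlns and xmlns:* declarations).
class PrefixStrippingFilter extends XMLFilterImpl {

    // "o:p" -> "p"; names without a colon are returned unchanged.
    private static String stripPrefix(String qName) {
        return qName.substring(qName.indexOf(':') + 1);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        AttributesImpl kept = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            String attName = atts.getQName(i);
            if (attName.indexOf(':') < 0 && !attName.equals("xmlns")) {
                kept.addAttribute("", attName, attName, atts.getType(i), atts.getValue(i));
            }
        }
        String name = stripPrefix(qName);
        super.startElement("", name, name, kept);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        String name = stripPrefix(qName);
        super.endElement("", name, name);
    }

    // Minimal serializer, only here to make the result visible; not a full XML writer.
    static final class Printer extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        private static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                out.append(' ').append(atts.getQName(i)).append("=\"")
                   .append(escape(atts.getValue(i))).append('"');
            }
            out.append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(escape(new String(ch, start, length)));
        }
    }

    // Parses a well-formed fragment without namespace processing and returns the filtered markup.
    static String clean(String fragment) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(false); // assumption: prefixes arrive as part of the qName
        XMLReader parser = factory.newSAXParser().getXMLReader();
        PrefixStrippingFilter filter = new PrefixStrippingFilter();
        filter.setParent(parser);
        Printer printer = new Printer();
        filter.setContentHandler(printer);
        filter.parse(new InputSource(new StringReader(fragment)));
        return printer.out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(clean("<strong><o:p></o:p>X</strong>"));
        System.out.println(clean("<p m:bogus=\"x\">fred</p>"));
    }
}
```

The sketch's serializer writes empty elements as a start and end tag pair rather than `<p/>`, which is equivalent XML. Dropping the prefix outright means `o:p` turns into a plain `p`, matching the first test; if that collision is not wanted, the filter could skip such elements instead of renaming them.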
Many thanks
Sam