CyberNeko HTML Parser / Support Requests / #3 remove invalid namespaces from XML fragment

#3 remove invalid namespaces from XML fragment

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2009-09-14

Created: 2009-09-12

Creator: Sam Hough

Private: No

I'm using the SAXParser to clean up HTML fragments provided by users (cut and pasted into an editor, complete with hideous Word rubbish) and pass the result on, via SAX, to Xalan.

I've tried various combinations of properties, but I can't get it to remove invalid namespaces or create synthetic declarations (are these allowed on non-root elements?) using the Purifier filter.

What set of features/properties should I be using? At the moment I'm using a ContentHandler to strip them but this seems overkill.
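For illustration, a stripping handler of the kind described above might look roughly like the sketch below. This is not the attached code: the class name `PrefixStrippingFilter` and the helper `clean` are hypothetical, and it uses only JDK SAX classes. It assumes that undeclared prefixes reach the filter with an empty namespace URI, which is how the JDK parser reports them when namespace processing is off; whether nekohtml reports them the same way is an assumption to verify.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter: drops prefixed elements and attributes that carry no namespace URI. */
class PrefixStrippingFilter extends XMLFilterImpl {

    // True when the name has a prefix but no namespace URI was reported for it.
    private static boolean undeclared(String uri, String qName) {
        return qName.indexOf(':') >= 0 && (uri == null || uri.length() == 0);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (undeclared(uri, qName)) {
            return; // drop the tag itself; its children are still forwarded
        }
        // Copy across only the attributes that are not undeclared-prefixed.
        AttributesImpl kept = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            if (!undeclared(atts.getURI(i), atts.getQName(i))) {
                kept.addAttribute(atts.getURI(i), atts.getLocalName(i), atts.getQName(i),
                        atts.getType(i), atts.getValue(i));
            }
        }
        super.startElement(uri, localName, qName, kept);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (!undeclared(uri, qName)) {
            super.endElement(uri, localName, qName);
        }
    }

    /** Demo helper: parses well-formed XML without namespace processing and re-serialises it. */
    static String clean(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(false); // prefixed names then arrive with an empty URI
        XMLReader parser = factory.newSAXParser().getXMLReader();
        PrefixStrippingFilter filter = new PrefixStrippingFilter();
        filter.setParent(parser);
        final StringBuilder out = new StringBuilder();
        // Minimal serialiser for the demo only (no escaping of special characters).
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(q);
                for (int i = 0; i < a.getLength(); i++) {
                    out.append(' ').append(a.getQName(i))
                       .append("=\"").append(a.getValue(i)).append('"');
                }
                out.append('>');
            }

            @Override
            public void endElement(String u, String l, String q) {
                out.append("</").append(q).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return out.toString();
    }
}
```

In a real pipeline the filter's parent would be the nekohtml SAXParser and its content handler would be whatever feeds Xalan. Note that this sketch strips every prefixed name with an empty URI, including `xml:lang` and `xmlns:*` attributes when namespace processing is off, so it would need refining before real use.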

I'm using nekohtml 1.9.13, Java 1.6.0_0-b11 and Xerces 2.6.2.

Many thanks. nekohtml has been cleaning HTML for me for years. I'm just being fussy now ;)

Cheers

Sam

Discussion

Marc Guillemot - 2009-09-14

status: open --> pending

Marc Guillemot - 2009-09-14

Can you provide an example of your dirty HTML, as well as the cleaned result you expect?

Sam Hough - 2009-09-14

My unit test for this

NamespaceRemovalTest.java

Sam Hough - 2009-09-14

status: pending --> open

Sam Hough - 2009-09-14

I've attached a unit test for this. I'm happy to make it match the nekohtml coding conventions and reattach it as a patch if that would be useful.

Anyway, the not very consistent tests I have are:
    public void testElementWithNamespace() throws Exception {
        doTest("<o:p></o:p>X", "X");
    }

    public void testRemoveAttibuteWithNamespace() throws Exception {
        doTest("fred", "fred");
    }

Mainly I just want to avoid Xalan blowing up due to undeclared namespaces.

Many thanks

Sam
