#136 Purifier filter corrupts document structure [with a "fix"]

Milestone: 1.9.15
Status: open
Owner: nobody
Labels: scanner (58)
Priority: 5
Updated: 2019-11-28
Created: 2012-03-06
Creator: Anonymous
Private: No

A <br> tag disappears from the resulting XHTML when the Purifier filter is used.

Here is the parsing code:

import java.io.IOException;
import java.io.StringReader;

import org.apache.xerces.xni.parser.XMLDocumentFilter;
import org.cyberneko.html.filters.NamespaceBinder;
import org.cyberneko.html.filters.Purifier;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public static Document parseHtml(String html) throws IOException, SAXException {
    // create HTML parser
    DOMParser domParser = new DOMParser();

    domParser.setFeature("http://cyberneko.org/html/features/balance-tags", true);
    domParser.setFeature("http://xml.org/sax/features/namespaces", true);
    domParser.setFeature("http://cyberneko.org/html/features/scanner/normalize-attrs", true);

    // filter chain: bind namespaces first, then purify names
    XMLDocumentFilter[] filters = new XMLDocumentFilter[] { new NamespaceBinder(), new Purifier() };
    domParser.setProperty("http://cyberneko.org/html/properties/filters", filters);
    domParser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
    domParser.setProperty("http://cyberneko.org/html/properties/names/attrs", "no-change");
    domParser.setProperty("http://cyberneko.org/html/properties/namespaces-uri", "http://www.w3.org/1999/xhtml");

    domParser.parse(new InputSource(new StringReader(html)));

    return domParser.getDocument();
}
====================================

Here is the parsing result:

I have to grab some lunch now, but I'm open the rest of the afternoon. What time works for you?==== Add your comments to the task above this line (WRIKE_TID=NDU0Mzg0NDoxMjkyNjA= ) ====<BR>
The task &ldquo;Review activity mockup for design &amp; UI&lt;https://www.wrike.com/open.htm?id=4543844&gt;&rdquo; was discussed by Quang Tang:<BR>
Comment was added:<BR>
maybe you can give me a quick run through of how you see things working... ?<BR>
To check the current and previous state of the task , click the following link: https://www.wrike.com/open.htm?id=4543844<BR>
==== Add your comments to the task below this line (WRIKE_TID=NDU0Mzg0NDoxMjkyNjA= ) ====<BR>

===================================

The first <br> has disappeared.

Here is the fix. Use this subclass in place of Purifier in the filter array:

private static class PurifierFixed extends Purifier {

    @Override
    protected String purifyName(String name, boolean localpart) {
        String purified = super.purifyName(name, localpart);

        if (purified == null) {
            return null;
        }

        if (purified.equals(name)) {
            // Name is unchanged: return the object that was passed in, not a new
            // equal String. Returning a new object triggers some bug inside NekoHTML.
            return name;
        } else {
            return purified;
        }
    }
}
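
The fix relies on the difference between string equality and object identity in Java. The following is a minimal, self-contained sketch of that idea only. It does not use NekoHTML: the class IdentityPreservingName and the methods purify and keepOriginalIfEqual are hypothetical stand-ins, and the assumption that NekoHTML somewhere depends on the identity of the name object is a guess based on the comment in the fix, not something confirmed here.

```java
public class IdentityPreservingName {

    /**
     * Hypothetical stand-in for a purifier: replaces every character that is
     * not a letter or digit with '_'. It always builds a new String, even
     * when nothing had to be replaced.
     */
    static String purify(String name) {
        if (name == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            sb.append(Character.isLetterOrDigit(c) ? c : '_');
        }
        return sb.toString();
    }

    /**
     * Same logic as the override in the fix: if purification changed nothing,
     * hand back the original object instead of an equal copy.
     */
    static String keepOriginalIfEqual(String original, String purified) {
        if (purified == null) {
            return null;
        }
        return purified.equals(original) ? original : purified;
    }

    public static void main(String[] args) {
        String name = "br";
        String purified = purify(name);

        System.out.println(purified.equals(name)); // true: same characters
        System.out.println(purified == name);      // false: different object
        System.out.println(keepOriginalIfEqual(name, purified) == name); // true
    }
}
```

Any code that compares names with == rather than equals() would treat the unwrapped purified copy as a different name, which may be why returning the original object avoids the problem.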

Discussion

  • Anonymous - 2012-03-06
    • summary: Purifier filter corrupts document structure --> Purifier filter corrupts document structure (with a fix)

    Last edit: Anonymous 2015-01-12
  • Anonymous - 2012-03-06
    • summary: Purifier filter corrupts document structure (with a fix) --> Purifier filter corrupts document structure [with a "fix"]

    Last edit: Anonymous 2015-01-12
  • Radu Coravu - 2013-10-15

    I had much the same problem, and the suggested fix worked for me. Thanks for posting it.
    I'm curious why nobody has worked on this issue, which makes NekoHTML quite unusable when you want to specify filters.

  • Niklas Therning - 2019-11-28
    Post awaiting moderation.