Here is the parsing code:
public static Document parseHtml(String html) throws IOException, SAXException {
    // create HTML parser
    DOMParser domParser = new DOMParser();
    domParser.setFeature("http://cyberneko.org/html/features/balance-tags", true);
    domParser.setFeature("http://xml.org/sax/features/namespaces", true);
    domParser.setFeature("http://cyberneko.org/html/features/scanner/normalize-attrs", true);
    XMLDocumentFilter[] filters = new XMLDocumentFilter[] { new NamespaceBinder(), new Purifier() };
    domParser.setProperty("http://cyberneko.org/html/properties/filters", filters);
    domParser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
    domParser.setProperty("http://cyberneko.org/html/properties/names/attrs", "no-change");
    domParser.setProperty("http://cyberneko.org/html/properties/namespaces-uri", "http://www.w3.org/1999/xhtml");
    domParser.parse(new InputSource(new StringReader(html)));
    return domParser.getDocument();
}
====================================
Here is the parsing result:
I have to grab some lunch now, but I'm open the rest of the afternoon. What time works for you?==== Add your comments to the task above this line (WRIKE_TID=NDU0Mzg0NDoxMjkyNjA= ) ====<BR>
The task “Review activity mockup for design & UI<https://www.wrike.com/open.htm?id=4543844>” was discussed by Quang Tang:<BR>
Comment was added:<BR>
maybe you can give me a quick run through of how you see things working... ?<BR>
To check the current and previous state of the task , click the following link: https://www.wrike.com/open.htm?id=4543844<BR>
==== Add your comments to the task below this line (WRIKE_TID=NDU0Mzg0NDoxMjkyNjA= ) ====<BR>
===================================
The first <br> has disappeared from the output.
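One way to confirm a dropped element like this is to compare the number of <br> tags in the raw input with the number of br elements in the parsed DOM. The following is only a sketch: the class name BrCountCheck and its methods are made up for illustration, it assumes the parser lowercases element names and puts them in the XHTML namespace (as configured in the parsing code), and parseWellFormed uses the JDK's XML parser purely as a stand-in so the check can be tried without NekoHTML.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class BrCountCheck {
    // Assumption: the namespace URI set through the "namespaces-uri" property.
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    private static final Pattern BR_TAG =
            Pattern.compile("<br\\s*/?>", Pattern.CASE_INSENSITIVE);

    // Counts <br>, <BR>, <br/> and <br /> in the raw markup.
    static int countBrInSource(String html) {
        Matcher m = BR_TAG.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Counts br elements in the XHTML namespace of an already parsed document.
    static int countBrInDom(Document doc) {
        return doc.getElementsByTagNameNS(XHTML_NS, "br").getLength();
    }

    // Demonstration helper only: parses well-formed XHTML with the JDK parser.
    // In the real check the Document would come from parseHtml(html) instead.
    static Document parseWellFormed(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }
}
```

If countBrInSource(html) is larger than countBrInDom(parseHtml(html)), at least one <br> was lost during parsing.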
Here is the fix:
private static class PurifierFixed extends Purifier {
    @Override
    protected String purifyName(String name, boolean localpart) {
        String purified = super.purifyName(name, localpart);
        if (purified == null) {
            return null;
        }
        if (purified.equals(name)) {
            // Name is unchanged: return the object that was passed in rather than
            // a new one, otherwise it triggers a bug inside NekoHTML.
            return name;
        } else {
            return purified;
        }
    }
}
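The fix hinges on object identity rather than string equality. The snippet below is not NekoHTML code; it is a minimal standalone sketch with made-up names (NameIdentityDemo, purifyAlwaysCopy, purifyKeepIdentity, sameReference), assuming that some downstream code compares names by reference. That assumption is not confirmed by this thread, but it would explain why handing back a freshly built String for an unchanged name causes trouble.

```java
public class NameIdentityDemo {
    // Hypothetical stand-in for a purifier that always rebuilds the name,
    // so the result is a new String object even when the text is unchanged.
    static String purifyAlwaysCopy(String name) {
        return new StringBuilder(name).toString();
    }

    // Same idea as the fix: hand back the original reference when the
    // purified text equals the input, otherwise return the new value.
    static String purifyKeepIdentity(String name) {
        String purified = purifyAlwaysCopy(name);
        return purified.equals(name) ? name : purified;
    }

    // Hypothetical downstream check that compares by reference (==),
    // not by content (equals).
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String name = "br";
        System.out.println(sameReference(name, purifyAlwaysCopy(name)));
        System.out.println(sameReference(name, purifyKeepIdentity(name)));
    }
}
```

Both methods return text that is equal to the input; only the second preserves the reference, which is the property the fix relies on.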
Last edit: Anonymous 2015-01-08
Last edit: Anonymous 2015-01-12
Last edit: Anonymous 2015-01-12
I had much the same problem, and the suggested fix worked for me. Thanks for posting it.
I'm curious why nobody has worked on this issue, which makes NEKO quite unusable when you want to specify filters.