I am keeping a single DOMFragmentParser per thread in my web app, like this:
public static final ThreadLocal<DOMFragmentParser> FRAGMENT_PARSER = new ThreadLocal<DOMFragmentParser>();
....
DOMFragmentParser parser = FRAGMENT_PARSER.get();
if (parser == null) {
parser = new DOMFragmentParser();
FRAGMENT_PARSER.set(parser);
}
parser.setFeature("http://cyberneko.org/html/features/balance-tags", true);
parser.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", !contentTrim.contains("<html"));
// Specifies whether a self closing <iframe/> tag should be allowed or not. When set to true the parser won't look for a corresponding closing </iframe> tag.
parser.setFeature("http://cyberneko.org/html/features/scanner/allow-selfclosing-iframe", false);
parser.setProperty("http://cyberneko.org/html/properties/default-encoding", Strings.UTF_8);
I have a unit test that parses:
<html><head><title>Test</title></head>
<body>
<form id="polls_form_1" class="wp-polls-form" action="/dates-and-currency/" method="post">
<p style="display: none;"><input type="hidden" name="poll_id" value="1" /></p>
</form>
</body></html>
About half the time, the parser drops the FORM node but keeps the P and INPUT elements in the fragment. I can't reproduce this on my development machine, but it happens in about 50% of runs of the full nightly test. My guess is that it has something to do with reusing the parser object, and that some error state left over from a previous test is causing it. Is there a way to reuse the same config or parser safely?
This is fixed in https://github.com/HtmlUnit/htmlunit-neko
There was a problem in the reset implementation that leads to this behavior when a form is not closed in the parsed HTML.
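To illustrate the kind of bug described here, below is a minimal toy model in plain Java. It is not NekoHTML's actual code: `ToyParser`, its `formOpen` flag, and the `resetClearsForm` switch are all hypothetical names invented for this sketch. It assumes the parser tracks whether a form is currently open (nested forms are dropped, their children kept) and that an earlier document left its form unclosed.

```java
import java.util.ArrayList;
import java.util.List;

public class StaleStateDemo {

    /** Toy stand-in for a reusable parser; NOT NekoHTML's real implementation. */
    static class ToyParser {
        private boolean formOpen;               // state carried across tags
        private final boolean resetClearsForm;  // hypothetical switch: buggy vs. fixed reset

        ToyParser(boolean resetClearsForm) {
            this.resetClearsForm = resetClearsForm;
        }

        /** Runs at the start of every parse, like a parser's reset hook. */
        private void reset() {
            if (resetClearsForm) {
                formOpen = false;
            }
        }

        /** Takes tag names such as "form", "p", "/form" and returns the elements kept. */
        List<String> parse(String... tags) {
            reset();
            List<String> kept = new ArrayList<>();
            for (String tag : tags) {
                if (tag.equals("form")) {
                    if (formOpen) {
                        continue;               // treated as a nested form: dropped, children kept
                    }
                    formOpen = true;
                } else if (tag.equals("/form")) {
                    formOpen = false;
                    continue;                   // closing tags are not recorded
                }
                kept.add(tag);
            }
            return kept;
        }
    }

    /** Parses a document with an unclosed form, then the document from the question. */
    static List<String> parseAfterUnclosedForm(boolean resetClearsForm) {
        ToyParser parser = new ToyParser(resetClearsForm);
        parser.parse("form", "input");          // earlier document never closes its form
        return parser.parse("form", "p", "input", "/form");
    }

    public static void main(String[] args) {
        System.out.println(parseAfterUnclosedForm(false)); // buggy reset: [p, input]
        System.out.println(parseAfterUnclosedForm(true));  // fixed reset: [form, p, input]
    }
}
```

With the incomplete reset, the second parse loses FORM but keeps P and INPUT, which matches the symptom in the question. It would also explain why the failure depends on which test happened to run earlier on the same thread.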