Neko does not correctly parse some closing tags that contain whitespace before the end bracket.
For example, assume a document contains a style tag with whitespace in the closing tag before the end bracket, as follows (a full document exhibiting this problem will be uploaded with the bug):
</style
>
After processing with Neko, the closing "style" tag with whitespace is still included in the output. However, new closing "style", "body" and "html" tags will be added, even if the closing "body" and "html" tags were already present. This happens even if the "balance-tags" feature is disabled.
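For reference, a minimal input showing the pattern might look like the sketch below. This is not the attached file: the document content, the class name WhitespaceCloseTagSample and the helper findWhitespaceCloseTags are made up for illustration. The helper only scans raw text for closing tags with whitespace before the end bracket, which can help locate affected tags in a larger document. It does not involve Neko at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceCloseTagSample {

    // Hypothetical minimal document (an assumption, NOT the file attached to the bug).
    // The closing style tag has a line break between the tag name and '>'.
    static final String SAMPLE =
        "<html>\n"
        + "<head>\n"
        + "<style type=\"text/css\">\n"
        + "body { color: black; }\n"
        + "</style\n"
        + ">\n"
        + "</head>\n"
        + "<body>\n"
        + "<p>text</p>\n"
        + "</body>\n"
        + "</html>\n";

    // Matches a closing tag whose name is followed by one or more
    // whitespace characters before the end bracket.
    private static final Pattern CLOSE_TAG_WITH_WS =
        Pattern.compile("</([A-Za-z][A-Za-z0-9]*)\\s+>");

    // Returns the names of all closing tags that contain such whitespace.
    static List<String> findWhitespaceCloseTags(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = CLOSE_TAG_WITH_WS.matcher(html);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Lists the affected tag names found in the sample document.
        System.out.println(findWhitespaceCloseTags(SAMPLE));
    }
}
```

The SAMPLE string can be wrapped in a java.io.ByteArrayInputStream or written to a file and used as parser input in place of the attachment.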
To reproduce:
// open an InputStream to the attached file
org.apache.xerces.parsers.DOMParser parser = new org.apache.xerces.parsers.DOMParser( new org.cyberneko.html.HTMLConfiguration() );
parser.setFeature( "http://cyberneko.org/html/features/balance-tags", false );
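// DOMParser.parse() takes an InputSource (or a system ID string), not a raw InputStream
parser.parse( new org.xml.sax.InputSource( fileInputStream ) );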
// dump the parser output
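The dump step is not spelled out above. One possible way to do it, sketched here with only JDK classes, is an identity transform that serializes a DOM node to a string. The class DomDump and its methods are hypothetical names, and the stand-in document is built by hand so the sketch runs without Neko on the classpath. In the actual reproduction, the node passed in would be parser.getDocument().

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class DomDump {

    // Serializes a DOM node to a string using an identity transform.
    static String dump(Node node) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("could not serialize DOM", e);
        }
    }

    // Stand-in DOM (an assumption for illustration): a text node that holds
    // markup-like characters, similar to markup the parser treated as text.
    static Document standInDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = doc.createElement("HTML");
            Element body = doc.createElement("BODY");
            body.setTextContent("</style >");
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("could not create DOM", e);
        }
    }

    public static void main(String[] args) {
        // In the reproduction, replace standInDocument() with parser.getDocument().
        System.out.println(dump(standInDocument()));
    }
}
```

Because the stand-in text node contains a "<" character, the serialized output shows it as an entity escape, which is the same kind of escaping described for markup that ends up being treated as character data.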
In some cases, further processing causes all HTML after the questionable closing tag to be interpreted as "characters" by the Xerces SAX engine, so the markup characters are converted into entity escapes, such as "&lt;".
An easy way to observe this behaviour is to add the following to the appropriate locations in the test file, with namespace processing turned on:
<head xmlns="http://www.w3.org/1999/xhtml">
<meta http-equiv="Content-Type" content="text/html; charset=ISO-2022-JP">
</head>
File to use for reproducing bad handling of close tags containing whitespace
see https://github.com/HtmlUnit/htmlunit-neko/issues/146
Will be fixed in htmlunit-neko 4.12