CyberNeko HTML Parser / Bugs

#115 Neko incorrectly parses tags with whitespace in the tag

Milestone: 1.9.7

Status: open

Owner: nobody

Labels: scanner (58)

Priority: 5

Updated: 2025-04-30

Created: 2010-07-20

Creator: Thomas Park

Private: No

Neko does not correctly parse some tags that contain whitespace in the closing tag.

For example, assume a document contains a style tag with whitespace in the closing tag before the end bracket, as follows (a full document exhibiting this problem will be uploaded with the bug):

</style >

After processing with Neko, the closing "style" tag with whitespace is still included in the output. In addition, new closing "style", "body" and "html" tags are added, even if closing "body" and "html" tags were already present in the input. This happens even if the "balance-tags" feature is disabled.
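
For reference, the HTML syntax allows whitespace between the tag name and the closing bracket of an end tag, so "</style >" should be handled like "</style>". The following is a minimal sketch of that tolerant matching in plain Java. The class EndTagSketch and the method endTagName are hypothetical names chosen for illustration; this is not Neko's scanner code.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not Neko code: recognizes end tags with trailing whitespace. */
public class EndTagSketch {

    // "</", a tag name, optional whitespace (including newlines), then ">".
    private static final Pattern END_TAG =
            Pattern.compile("</([A-Za-z][A-Za-z0-9]*)\\s*>");

    /** Returns the lower-cased tag name if the input is exactly one end tag, else null. */
    public static String endTagName(String markup) {
        Matcher m = END_TAG.matcher(markup);
        return m.matches() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        System.out.println(endTagName("</style>"));
        System.out.println(endTagName("</style >"));
        System.out.println(endTagName("</STYLE\n>"));
        // Whitespace directly after "</" is not accepted by this sketch.
        System.out.println(endTagName("</ style>"));
    }
}
```

A scanner that applies this kind of rule would report a single "style" end element for both spellings, rather than passing the whitespace variant through and synthesizing extra closing tags.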

To reproduce:

// open an InputStream to the attached file (neko_style_bug.html)
java.io.InputStream fileInputStream = new java.io.FileInputStream( "neko_style_bug.html" );
org.apache.xerces.parsers.DOMParser parser = new org.apache.xerces.parsers.DOMParser( new org.cyberneko.html.HTMLConfiguration() );
parser.setFeature( "http://cyberneko.org/html/features/balance-tags", false );
// DOMParser.parse() takes an InputSource or a system ID, not a raw InputStream
parser.parse( new org.xml.sax.InputSource( fileInputStream ) );
// dump the parser output, e.g. by serializing parser.getDocument()

In some cases, further processing can cause any HTML after the questionable closing tag to be reported as "characters" by the Xerces SAX engine, so that all of the markup characters in it are converted into entity escapes, such as "&lt;".
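
The escaping itself is ordinary serializer behaviour once markup has been reported as character data. A minimal sketch using only the JDK's DOM and Transformer APIs (no Neko or Xerces parser involved; the class and method names are hypothetical) shows a text node holding "</body></html>" being written out with "<" replaced by "&lt;":

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical demo: markup stored as character data is escaped on output. */
public class TextEscapeSketch {

    /** Wraps the given string in a text node under a <div> and serializes it as XML. */
    public static String serializeAsText(String characters) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element wrapper = doc.createElement("div");
            // The string becomes character data, not elements.
            wrapper.appendChild(doc.createTextNode(characters));
            doc.appendChild(wrapper);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializeAsText("</body></html>"));
    }
}
```

If the dumped parser output shows escaped tags like this after the "style" element, the trailing markup was delivered as characters rather than as elements.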

An easy way to observe this behaviour is to add the following to the appropriate locations in the test file, with namespace processing turned on:

Discussion

Thomas Park - 2010-07-20

File to use for reproducing bad handling of close tags containing whitespace

neko_style_bug.html


RBRi - 2025-04-30

see https://github.com/HtmlUnit/htmlunit-neko/issues/146

Will be fixed in htmlunit-neko 4.12

