#256 Unwrapped inline content means invalid XHTML is generated

Status: open

Owner: nobody

Labels: Tidy functionality (138)

Priority: 5

Updated: 2012-10-08

Created: 2011-11-02

Creator: Hel

Private: No

Using jtidy.parseDOM with setXHTML(true) and setEncloseBlockText(true) does not cause inline content to be properly wrapped, and hence W3C validation fails.

Example HTML 1 (generates valid XHTML)
"Text Inline content" -> "

Text Inline content

Example HTML 2 (generates invalid XHTML)
"Inline content" -> "Inline content"

There is code within src/main/java/org/w3c/tidy/ParserImpl.java that performs this wrapping, but it has been commented out due to bug report 1403105: java.lang.StackOverflowError in Tidy.parseDOM(). Uncommenting this block of code seems to produce correctly wrapped XHTML in most situations, but unfortunately the stack overflow error still occurs if the HTML mentioned in report 1403105 is supplied. Is there any way this can be reinstated without causing the stack overflow?

Discussion

Hel - 2011-11-02

Update: This is not an XHTML-specific problem. Incorrectly wrapped content also fails HTML 4.01 Strict validation. It seems a shame to lose this important functionality because of what seems to be quite an obscure bug (1403105).

Hel - 2011-11-02

Even with the mentioned code reinstated, this only resolves wrapping of inline content within a blockquote, for example, and not at the top level within the body element.

For example:

"ssss
Inline content
"
generates XHTML:
"ssss

Inline content

"

Note the initial does not get wrapped with a p element. If I place some text in front of it, however, it does get wrapped.

For example:

"xxxx ssss
Inline content
"
generates XHTML:
"
xxxx ssss

Inline content

"

Adding code very similar to the TEXT_NODE encloseBodyText processing (about line 799 within ParserImpl.java) for an inline element (at about line 934) seems to result in inline content within the body being properly wrapped, though it hasn't had extensive testing and there may be a better way.

That is,

            if (node.type == Node.START_TAG || node.type == Node.START_END_TAG)
            {
                // Inline element found directly inside the body element
                if ((node.tag.model & Dict.CM_INLINE) != 0)
                {
                    if (lexer.configuration.encloseBodyText)
                    {
                        Node para;

                        // Push the token back so the inferred p element parses it as its own content
                        lexer.ungetToken();
                        para = lexer.inferredTag("p");
                        body.insertNodeAtEnd(para);
                        parseTag(lexer, para, mode);
                        mode = Lexer.MIXED_CONTENT;
                        continue;
                    }
                }

               ...
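
To make the intended behaviour easier to discuss, here is a minimal, self-contained sketch of the wrapping rule. It is not JTidy code and does not touch ParserImpl: the class InlineWrapSketch, the method wrapBodyContent, the flat string-token model, and the short list of inline element names are all assumptions made up for this illustration. It only models the idea that text or an inline start tag seen at the top level of the body opens an inferred p, which is closed again when a block-level tag arrives or the input ends.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the wrapping rule discussed above. Hypothetical, NOT JTidy code. */
final class InlineWrapSketch {

    /** Assumed inline element names, for this sketch only. */
    private static final List<String> INLINE =
            Arrays.asList("span", "em", "strong", "a", "b", "i");

    /** Strips the angle brackets and slash from a simple tag token like "<span>". */
    private static String tagName(String token) {
        return token.replaceAll("[</>]", "");
    }

    /**
     * Walks a flat list of body-level tokens (tags without attributes, or text)
     * and, when encloseBodyText is true, wraps runs of text and inline elements
     * that sit directly in the body in an inferred p element.
     */
    static String wrapBodyContent(List<String> tokens, boolean encloseBodyText) {
        StringBuilder out = new StringBuilder();
        boolean inPara = false; // true while an inferred <p> is open
        int blockDepth = 0;     // nesting depth of block-level elements

        for (String token : tokens) {
            boolean endTag = token.startsWith("</");
            boolean startTag = token.startsWith("<") && !endTag;
            boolean isTag = startTag || endTag;
            boolean blockTag = isTag && !INLINE.contains(tagName(token));

            if (blockTag) {
                // A block-level tag ends any inferred paragraph.
                if (inPara) {
                    out.append("</p>");
                    inPara = false;
                }
                blockDepth += startTag ? 1 : -1;
            } else if (blockDepth == 0 && encloseBodyText && !inPara) {
                // Text or an inline tag directly in the body: open an inferred <p>.
                out.append("<p>");
                inPara = true;
            }
            out.append(token);
        }
        if (inPara) {
            out.append("</p>");
        }
        return out.toString();
    }
}
```

In this model, an input that starts with an inline element is treated the same way as one that starts with text, which is the behaviour the change to ParserImpl.java is trying to achieve. Content nested inside a block element is passed through untouched here, so the sketch says nothing about the blockquote case or about the stack overflow in report 1403105.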
