[Jtidy-devel] [ jtidy-Bugs-3432258 ] Unwrapped inline content means invalid XHTML is generated


Bugs item #3432258, was opened at 2011-11-02 12:56
Message generated for change (Comment added) made by helsom
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=113153&aid=3432258&group_id=13153

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Tidy functionality
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Hel (helsom)
Assigned to: Nobody/Anonymous (nobody)
Summary: Unwrapped inline content means invalid XHTML is generated

Initial Comment:
Using jtidy.parseDOM with setXHTML(true) and setEncloseBlockText(true) does not wrap inline content properly, so the output fails W3C validation.

Example HTML 1 (generates valid XHTML)
"Text <em>Inline content</em>" -> "<p>Text <em>Inline content</em></p>"

Example HTML 2 (generates invalid XHTML)
"<em>Inline content</em>" -> "<em>Inline content</em>"

There is code within src/main/java/org/w3c/tidy/ParserImpl.java that performs this wrapping, but it has been commented out because of bug report 1403105: java.lang.StackOverflowError in Tidy.parseDOM(). Uncommenting this block seems to produce correctly wrapped XHTML in most situations, but the stack overflow error still occurs if the HTML mentioned in report 1403105 is supplied. Is there any way this can be reinstated without causing the stack overflow?

----------------------------------------------------------------------

>Comment By: Hel (helsom)
Date: 2011-11-02 16:51

Message:
Adding code very similar to the TEXT_NODE encloseBodyText processing (about
line 799 within ParserImpl.java) for an inline element (at about line 934)
seems to result in inline content within the body being properly wrapped,
though it hasn't had extensive testing and there may be a better way.

That is,

                if (node.type == Node.START_TAG || node.type == Node.START_END_TAG)
                {
                    // an inline element found directly inside body
                    if ((node.tag.model & Dict.CM_INLINE) != 0)
                    {
                        if (lexer.configuration.encloseBodyText)
                        {
                            Node para;

                            // push the tag back and parse it inside an inferred <p>
                            lexer.ungetToken();
                            para = lexer.inferredTag("p");
                            body.insertNodeAtEnd(para);
                            parseTag(lexer, para, mode);
                            mode = Lexer.MIXED_CONTENT;
                            continue;
                        }
                    }

                   ...
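
To make the intended behaviour easier to discuss, here is a small
stand-alone model of the wrapping rule. It is only a sketch and does not
use any JTidy classes: EncloseBodySketch, Kind, Item and wrap are
hypothetical names invented for this illustration, and the body is
simplified to a flat list of top-level children. With
inlineOpensParagraph set to false it mimics what is described in this
report (only text starts a paragraph); with true it mimics the effect the
change above is hoped to have.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model, not JTidy code.
public class EncloseBodySketch {

    public enum Kind { TEXT, INLINE, BLOCK }

    // One top-level child of body: its kind and the markup it serialises to.
    public static final class Item {
        final Kind kind;
        final String markup;

        public Item(Kind kind, String markup) {
            this.kind = kind;
            this.markup = markup;
        }
    }

    // Wraps runs of text/inline children in <p>. A block child always
    // closes an open paragraph. An inline child only opens a paragraph
    // when inlineOpensParagraph is true.
    public static String wrap(List<Item> body, boolean inlineOpensParagraph) {
        StringBuilder out = new StringBuilder();
        boolean inPara = false;
        for (Item item : body) {
            if (item.kind == Kind.BLOCK) {
                if (inPara) {
                    out.append("</p>");
                    inPara = false;
                }
                out.append(item.markup);
                continue;
            }
            boolean opens = item.kind == Kind.TEXT
                    || (item.kind == Kind.INLINE && inlineOpensParagraph);
            if (!inPara && opens) {
                out.append("<p>");
                inPara = true;
            }
            out.append(item.markup);
        }
        if (inPara) {
            out.append("</p>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Item> lone = Arrays.asList(
                new Item(Kind.INLINE, "<em>Inline content</em>"));
        System.out.println(wrap(lone, false)); // <em>Inline content</em>
        System.out.println(wrap(lone, true));  // <p><em>Inline content</em></p>
    }
}
```

In this model a leading text node behaves the same with either setting,
which matches Example HTML 1 in the initial comment; only the case of a
leading inline element differs.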

----------------------------------------------------------------------

Comment By: Hel (helsom)
Date: 2011-11-02 15:33

Message:
Even with the commented-out code reinstated, wrapping is only fixed for
inline content inside a block element such as a blockquote, and not for
inline content at the top level of the body element.

For Example:

"<em>ssss</em> <blockquote><em>Inline content</em></blockquote>"
generates xhtml:
"<em>ssss</em> <blockquote> <p><em>Inline content</em></p> </blockquote>"

Note the initial <em> does not get wrapped with a p element. If I place
some text in front of it, however, it does get wrapped.

For Example:

"xxxx <em>ssss</em> <blockquote><em>Inline content</em></blockquote>"
generates xhtml:
"<p>xxxx <em>ssss</em></p> <blockquote> <p><em>Inline content</em></p>
</blockquote>"

----------------------------------------------------------------------

Comment By: Hel (helsom)
Date: 2011-11-02 14:08

Message:
Update: This is not an XHTML-specific problem. Incorrectly wrapped content
also fails HTML 4.01 Strict validation. It seems a shame to lose this
important functionality because of what seems to be quite an obscure bug
(1403105).

----------------------------------------------------------------------
