#256 Unwrapped inline content means invalid XHTML is generated

open
nobody
5
2012-10-08
2011-11-02
Hel
No

Using jtidy.parseDOM with setXHTML(true) and setEncloseBlockText(true) does not cause inline content to be wrapped properly, so W3C validation fails.

Example HTML 1 (generates valid XHTML)
"Text Inline content" -> "

Text Inline content

"

Example HTML 2 (generates invalid XHTML)
"Inline content" -> "Inline content"

There is code in src/main/java/org/w3c/tidy/ParserImpl.java that performs this wrapping, but it has been commented out because of bug report 1403105: java.lang.StackOverflowError in Tidy.parseDOM(). Uncommenting this block seems to produce correctly wrapped XHTML in most situations, but unfortunately the stack overflow still occurs if the HTML from report 1403105 is supplied. Is there any way this code can be reinstated without causing the stack overflow?

Discussion

  • Hel
    2011-11-02

    Update: This is not an XHTML-specific problem. Incorrectly wrapped content also fails HTML 4.01 Strict validation. It seems a shame to lose this important functionality because of what seems to be quite an obscure bug (1403105).

     
  • Hel
    2011-11-02

    Even with the code mentioned above reinstated, this only fixes the wrapping of inline content nested inside a block element such as a blockquote, not inline content at the top level of the body element.

    For Example:

    "ssss

    Inline content
    "
    generates xhtml:
    "ssss

    Inline content

    "

    Note that the initial inline element does not get wrapped in a p element. If I place some text in front of it, however, it does get wrapped.

    For Example:

    "xxxx

    ssss
    Inline content
    "
    generates xhtml:
    "

    xxxx ssss

    Inline content

    "

     
  • Hel
    2011-11-02

    Adding code very similar to the TEXT_NODE encloseBodyText processing (about line 799 within ParserImpl.java) for an inline element (at about line 934) seems to result in inline content within the body being properly wrapped, though it hasn't had extensive testing and there may be a better way.

    That is,

        if (node.type == Node.START_TAG || node.type == Node.START_END_TAG)
        {
            // An inline element was found directly inside body.
            if ((node.tag.model & Dict.CM_INLINE) != 0)
            {
                if (lexer.configuration.encloseBodyText)
                {
                    // Push the token back, open an inferred <p>, and let
                    // parseTag read the inline content into that paragraph.
                    lexer.ungetToken();
                    Node para = lexer.inferredTag("p");
                    body.insertNodeAtEnd(para);
                    parseTag(lexer, para, mode);
                    mode = Lexer.MIXED_CONTENT;
                    continue;
                }
            }

            ...
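
To make the intended result concrete, here is a small standalone sketch of the wrapping rule, separate from JTidy itself. It does not use or reproduce any JTidy code: EncloseSketch, its Node class, and encloseBodyContent are hypothetical names invented for this illustration, and the b element is only a stand-in for whatever inline markup the examples above used. The idea is simply that each run of consecutive text or inline children of body ends up inside one inferred p, while block children are left where they are.

```java
import java.util.ArrayList;
import java.util.List;

public class EncloseSketch {

    /** Minimal stand-in for a parsed node; this is not JTidy's Node class. */
    static final class Node {
        final String name;       // element name, or null for a text node
        final String value;      // character data, used by text nodes only
        final boolean isInline;  // true for text nodes and inline elements
        final List<Node> children = new ArrayList<>();

        private Node(String name, String value, boolean isInline) {
            this.name = name;
            this.value = value;
            this.isInline = isInline;
        }

        static Node text(String value) {
            return new Node(null, value, true);
        }

        static Node inline(String name, Node... kids) {
            return element(name, true, kids);
        }

        static Node block(String name, Node... kids) {
            return element(name, false, kids);
        }

        private static Node element(String name, boolean isInline, Node... kids) {
            Node node = new Node(name, null, isInline);
            for (Node kid : kids) {
                node.children.add(kid);
            }
            return node;
        }

        /** Serializes the subtree; no escaping, for illustration only. */
        String render() {
            if (name == null) {
                return value;
            }
            StringBuilder out = new StringBuilder("<" + name + ">");
            for (Node child : children) {
                out.append(child.render());
            }
            return out.append("</" + name + ">").toString();
        }
    }

    /** Returns a new body in which each run of inline children shares one p. */
    static Node encloseBodyContent(Node body) {
        Node result = Node.block("body");
        Node para = null; // the currently open inferred paragraph, if any
        for (Node child : body.children) {
            if (child.isInline) {
                if (para == null) {
                    para = Node.block("p");
                    result.children.add(para);
                }
                para.children.add(child);
            } else {
                para = null; // a block element closes the open paragraph
                result.children.add(child);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Node body = Node.block("body",
                Node.inline("b", Node.text("ssss")),
                Node.block("blockquote", Node.block("p", Node.text("Inline content"))));
        // prints <body><p><b>ssss</b></p><blockquote><p>Inline content</p></blockquote></body>
        System.out.println(encloseBodyContent(body).render());
    }
}
```

With this rule, a leading inline element is wrapped whether or not text precedes it, which is the behaviour the ParserImpl change above is trying to achieve. The sketch works on an already built tree, whereas the parser has to do the same thing while reading tokens, so it says nothing about the stack overflow from report 1403105.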