Parser doesn't find CDATA tag

Larry K
2012-10-30
2013-01-03
  • Larry K

    Larry K - 2012-10-30

    This isn't a big deal for me, but I noticed that the Jericho parser does not seem to find CDATA tags. I used the example text in the documentation for StartTag. Below is a short JUnit test that demonstrates the problem:

    package com.msh.g2.annotator;

    import net.htmlparser.jericho.Element;
    import net.htmlparser.jericho.EndTag;
    import net.htmlparser.jericho.Source;
    import net.htmlparser.jericho.StartTag;
    import net.htmlparser.jericho.Tag;

    import org.apache.log4j.Logger;
    import org.junit.Before;
    import org.junit.Test;

    public class JerichoTests {
        private static final Logger logger = Logger.getLogger(JerichoTests.class);
        private static final String CDATA_TEXT = "<script type=\"text/javascript\">" +
            "//<![CDATA]</script>";

        @Before
        public void setUp() throws Exception {
        }

        @Test
        public void cdataTest() {
            Source htmlSource = new Source(CDATA_TEXT);
            for (Element element : htmlSource.getAllElements()) {
                StartTag startTag = element.getStartTag();
                EndTag endTag = element.getEndTag();
                logTag(startTag);
                logTag(endTag);
            }
        }

        private void logTag(Tag tag) {
            if (tag != null) {
                logger.info("****");
                logger.info("Tag name:" + tag.getName());
                logger.info("Tag type:" + tag.getTagType());
                logger.info("Tag begin:" + tag.getBegin());
                logger.info("Tag end:" + tag.getEnd());
            }
        }
    }

     
  • Martin Jericho

    Martin Jericho - 2012-10-30

    Thanks for your feedback Larry.

    This is actually quite a complex topic. The behaviour of the parser is correct for HTML4 and HTML5 documents, but incorrect for XHTML documents. The documentation of the TagType.isValidPosition method contains the details, but I'll include the relevant bits here:

    The HTML 4 DTD defines script element content as a special type of CDATA. The XHTML DTD changed it to PCDATA, meaning that HTML elements should be parsed inside script elements if they are not escaped by comments or an explicit CDATA section. The HTML 5 parsing rules reversed this again, making it closer to the original HTML 4 rules. Because this parser is designed to facilitate parsing HTML rather than XHTML, it treats script element content as implicit CDATA, consistent with HTML 4 and HTML 5.

    According to the HTML 4.01 specification section 6.2, the first occurrence of the character sequence "</" terminates the special handling of CDATA within SCRIPT and STYLE elements. This library however only terminates the CDATA handling of SCRIPT element content when the character sequence "</script" is detected, in line with the behaviour of the major browsers and with HTML 5 script element parsing rules.
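    To illustrate the difference between the two termination rules, here is a minimal standalone sketch. It does not use the Jericho API or reproduce its implementation; the class and method names are hypothetical, and the case-insensitive match on "</script" is an assumption made for the example.

```java
// Hypothetical helper, not part of Jericho: compares where script content
// would end under the two rules described above.
final class ScriptContentEnd {

    // HTML 4.01 section 6.2 rule: the first "</" ends the CDATA content.
    static int endAtFirstEndTagOpen(String html, int contentStart) {
        int i = html.indexOf("</", contentStart);
        return i < 0 ? html.length() : i;
    }

    // Browser-style rule: only "</script" ends the content.
    // Case-insensitive matching is an assumption of this sketch.
    static int endAtClosingScript(String html, int contentStart) {
        String marker = "</script";
        for (int i = Math.max(contentStart, 0); i + marker.length() <= html.length(); i++) {
            if (html.regionMatches(true, i, marker, 0, marker.length())) {
                return i;
            }
        }
        return html.length(); // no terminator found: content runs to the end
    }

    public static void main(String[] args) {
        String html = "<script>var s = \"</b>\"; f();</script>";
        int contentStart = "<script>".length();
        // Under the HTML 4.01 rule the content is cut short inside the string literal.
        System.out.println(html.substring(contentStart, endAtFirstEndTagOpen(html, contentStart)));
        // Under the "</script" rule the whole script body is kept.
        System.out.println(html.substring(contentStart, endAtClosingScript(html, contentStart)));
    }
}
```

    With the strict HTML 4.01 rule, a string literal such as "</b>" inside the script would end the script content early, which is why matching on "</script" is the more practical choice.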

    Note that the implicit treatment of SCRIPT element content as CDATA also prevents the recognition of comments and explicit CDATA sections inside script elements. All major browsers used to recognise comments inside script elements regardless, which is relevant if the script element contains a JavaScript string literal "</script", which would terminate the script element unless it was enclosed in a comment. Versions 3.0 to 3.2 of this parser therefore also recognised comments inside script elements in a full sequential parse to maintain compatibility with the major browsers. The latest versions of Gecko and WebKit browsers now correctly ignore comments inside script elements, so as of version 3.3 this parser has also reverted to the correct behaviour.

    Although STYLE elements should theoretically be treated in the same way as SCRIPT elements, the syntax of Cascading Style Sheets (CSS) does not contain any constructs that could be misinterpreted as HTML tags, so there is virtually no need to perform any special checks in this case.

    I'll be releasing v3.3 shortly. 
    Regards, 
    Martin

     
