
HTML Parsing Problem (Bug maybe?)

Developer
kury
2007-10-04
2013-05-20
  • kury

    kury - 2007-10-04

    Using TinyXML, if I load this file (the dashed lines are not part of it):
    ------------------------------------------------------------------------------------------------------------------------
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <HTML LANG=EN>
        <HEAD>
            <TITLE>some title</TITLE>
            <LINK type="text/css" rel="stylesheet" href="http://link2.css">
        </HEAD>
        <BODY bgcolor="#FFFFFF">
            <TABLE border="0" cellspacing="0" cellpadding="0" align="center" summary=" " >
                <TR align=center valign=top bgcolor="#FFFFFF">
                    <TD colspan=2>
                        <a href="http://link.com" TARGET="_top">
                            <IMG src="http://link.com/pic.jpg" border="0">
                        </a>
                    </TD>
                </TR>
            </table>
            <H1 align=center>heading</h1>
        </BODY>
    </HTML>
    ------------------------------------------------------------------------------------------------------------------------

    and then print it, I get:
    ------------------------------------------------------------------------------------------------------------------------
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <HTML LANG="EN">
        <HEAD>
            <TITLE>some title</TITLE>
            <LINK type="text/css" rel="stylesheet" href="http://link2.css" />
        </HEAD>
    </HTML>
    ------------------------------------------------------------------------------------------------------------------------

    When loading the file I do get the message doc.ErrorDesc() = 'Error reading end tag.', and it seems to be caused by the <LINK> tag. As soon as I remove it, I get:
    ------------------------------------------------------------------------------------------------------------------------
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <HTML LANG="EN">
        <HEAD>
            <TITLE>some title</TITLE>
        </HEAD>
        <BODY bgcolor="#FFFFFF">
            <TABLE border="0" cellspacing="0" cellpadding="0" align="center" summary=" ">
                <TR align="center" valign="top" bgcolor="#FFFFFF">
                    <TD colspan="2">
                        <a href="http://link.com" TARGET="_top">
                            <IMG src="http://link.com/pic.jpg" border="0" />
                        </a>
                    </TD>
                </TR>
            </TABLE>
        </BODY>
    </HTML>
    ------------------------------------------------------------------------------------------------------------------------

    Can anyone explain what's going on here and/or how to fix it?

    Thanks

     
    • Nicola Civran

      Nicola Civran - 2007-10-04

      HTML is not XML: the LINK tag lacks a closing tag (i.e. </LINK>).

       
    • kury

      kury - 2007-10-04

      I thought HTML was a subset of XML. Is this not true?

      Either way, is there a way to get this behavior out of TinyXML without modifying the HTML? (I don't have control over it)

      Thanks

       
      • Nicola Civran

        Nicola Civran - 2007-10-04

        1) It's a long story, but in short: in HTML some closing tags are OPTIONAL, whereas in XML all closing tags are MANDATORY. So your HTML is not well-formed XML.

        2) Not to my knowledge.
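
The well-formedness point above can be checked with any strict XML parser. The following is a minimal sketch that uses Python's standard-library xml.etree.ElementTree as a stand-in for TinyXML (an assumption made only so the example is self-contained; it is not TinyXML's API). The helper name is_well_formed and the shortened fragments are illustrative, adapted from the HEAD section of the original post.

```python
import xml.etree.ElementTree as ET

# HTML-style fragment: LINK has no closing tag, as in the original post.
html_style = (
    '<HEAD><TITLE>some title</TITLE>'
    '<LINK type="text/css" rel="stylesheet" href="http://link2.css">'
    '</HEAD>'
)

# XML-style fragment: the same element written as self-closing.
xml_style = (
    '<HEAD><TITLE>some title</TITLE>'
    '<LINK type="text/css" rel="stylesheet" href="http://link2.css" />'
    '</HEAD>'
)


def is_well_formed(text):
    """Return True if a strict XML parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser reaches </HEAD> while LINK is still open and gives up.
        return False


print(is_well_formed(html_style))  # False
print(is_well_formed(xml_style))   # True
```

The first fragment is rejected because the parser finds </HEAD> while <LINK> is still open, which is the same kind of mismatch behind the 'Error reading end tag.' message reported above.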

         
    • kury

      kury - 2007-10-04

      Any recommendations for an HTML parser, or for a different XML parser with these capabilities?

      Thanks

       
