
HTML within tags

Developer
2004-08-04
2004-08-06
  • steven smethurst

    Hello

    I am parsing the output from a database.
    One of the fields contains HTML data.
    When I try to read the data in that node, I only get up to the first tag.

    for example:
    <node>
       <path>index</path>
       <htmldata>hello world this is a test of a bug in tinyXML <BR/>This part will never get shown
       </htmldata>
    </node>

    When I print out the htmldata node I get "hello world this is a test of a bug in tinyXML"

    instead of what I expected:
    "hello world this is a test of a bug in tinyXML <BR/>This part will never get shown"

    I searched through the forums but did not find anyone else with the same problem.

    Is this a bug?
    Are HTML tags inside nodes not supported by TinyXML?
    Or am I doing something else wrong?

    Thank you for your time

     
    • Lee Thomason

      Lee Thomason - 2004-08-04

      At first I thought that wasn't valid XML...but I think it is. Odd looking, but syntactically correct.

      Please file a bug!
      Thanks, lee

       
    • Ellers

      Ellers - 2004-08-05

      Works fine for me. Running the program below this is the output I get (it has indenting but sourceforge doesn't display it):

      Loaded OK. Now, traverse...
      Document
      Element Named [node]
        Element Named [path]
         Text [index]
        Element Named [htmldata]
         Text [hello world this is a test of a bug in tinyXML]
         Element Named [BR]
         Text [This part will never get shown]
      Press any key to continue

      Within the element "htmldata" you have 3 nodes: a text node, an element node, and another text node.

      Are you sure you're traversing the loaded document correctly?

      #include <cstdio>   // printf
      #include <cstring>  // strlen
      #include "tinyxml.h"

      void traverse( TiXmlNode * pNode, int indent = 0 )
      {
          static const char* indentString = "                           ";
          static int indentStringLen = strlen( indentString );

          printf( "%s", &indentString[(indent<indentStringLen)?indentStringLen-indent:0] );

          switch ( pNode->Type( )) {
          case TiXmlNode::DOCUMENT:
              printf( "Document" );
              break;
          case TiXmlNode::ELEMENT:
              printf( "Element" );
              printf( " Named [%s]", pNode->ToElement( )->Value( ));
              break;
          case TiXmlNode::COMMENT:
              printf( "Comment" );
              break;
          case TiXmlNode::UNKNOWN:
              printf( "Unknown" );
              break;
          case TiXmlNode::TEXT:
              printf( "Text" );
              printf( " [" );
              printf( "%s", pNode->ToText( )->Value( )); // never pass document text as the format string
              printf( "]" );
              break;
          case TiXmlNode::DECLARATION:
              printf( "Declaration" );
              break;
          default:
              printf( "UNRECOGNISED TYPE" );
              break;
          }
          printf( "\n" );

          for ( TiXmlNode* pChild = pNode->FirstChild( ); 0 != pChild; pChild = pChild->NextSibling( )) {

              traverse( pChild, indent+1 );
          }
      }

      void test2( )
      {
          TiXmlDocument doc("html_prob.xml");
          bool bLoad=doc.LoadFile();

          if(! bLoad) {
              printf( "FAILED to load\n" );
              return;
          }
          printf( "Loaded OK. Now, traverse...\n" );

          traverse( &doc );
      }

      int main(int argc, char* argv[])
      {
          test2( );
          return 0;
      }
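The three-node layout described above can be modeled without TinyXML. The sketch below is only an illustration: `Node`, `firstTextOnly`, `allText` and `makeHtmlData` are hypothetical names, not part of TinyXML. It shows why reading just the first text child stops at the `<BR/>`, while visiting every child reaches the rest of the text.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM node; this is not TinyXML's API.
struct Node {
    enum Kind { Element, Text } kind;
    std::string value;            // tag name for Element, characters for Text
    std::vector<Node> children;   // left empty for Text nodes
};

// Roughly what the original post did: read only the first text child.
std::string firstTextOnly(const Node& element) {
    for (const Node& child : element.children) {
        if (child.kind == Node::Text) return child.value;
    }
    return "";
}

// Visit every descendant and join all text, skipping the tags themselves.
std::string allText(const Node& node) {
    if (node.kind == Node::Text) return node.value;
    std::string out;
    for (const Node& child : node.children) out += allText(child);
    return out;
}

// The <htmldata> element from the thread, built by hand: text, <BR/>, text.
Node makeHtmlData() {
    Node br{Node::Element, "BR", {}};
    return Node{Node::Element, "htmldata", {
        Node{Node::Text, "hello world this is a test of a bug in tinyXML", {}},
        br,
        Node{Node::Text, "This part will never get shown", {}},
    }};
}
```

With this model, `firstTextOnly(makeHtmlData())` stops at the first text node, and `allText(makeHtmlData())` returns both pieces of text but drops the `<BR/>` tag, so neither is the same as getting the markup back.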

       
      • steven smethurst

        When I print out the htmldata node I only get the start of the node, up until the first xhtml tag <br/>.

        This is shown by my test and by yours:

        "Text [hello world this is a test of a bug in tinyXML]
        Element Named [BR]
        Text [This part will never get shown]"

        There is some Text
        Then a Element Named BR
        Then there is more Text

        What I expected to happen is that it would print all the data between the start and end tags of htmldata:

        "Text [hello world this is a test of a bug in tinyXML <BR/> this part will never get shown]"

        In other words this is what I was thinking
        === XML file ===
        <node>
        <record>
            <si_>1058</si_>
            <notes>Something</notes>
        </record>
        </node>
        === ===

        Getting the value of the node record would give me
        Text [<si_>1058</si_>\n<notes>Something</notes>]

        But that is wrong, and I was mistaken in my understanding of TinyXML

        I guess my question should have been
        How do I get the text and sub tags beneath a node as text?

        For example:
        How could I get this as output if I had a cursor on the record node?
        <si_>1058</si_>\n<notes>Something</notes>

        Thank you for your time.

         
    • Ellers

      Ellers - 2004-08-05

      While I'm thinking of it, "embedding" html within xml can be troublesome. Browsers (and people that write html) tend to write html in its original malform-able way. If the html is strict xhtml then you'll be ok, but...

      it may be easier/safer to put the whole html into the xml as one text string, with "&lt;" in place of each "<" (and likewise "&gt;" and "&amp;"). The end-user should never know the difference and you won't have to worry about the html corrupting your xml stream.
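A minimal sketch of that escaping step, for whoever produces the XML. The function name `escapeXml` is made up for this example; it replaces the five characters that XML predefines entities for, so the html travels as one plain text node.

```cpp
#include <string>

// Replace the characters that are special in XML with entity references,
// so that markup inside `in` is carried as plain text.
std::string escapeXml(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
        case '&':  out += "&amp;";  break;
        case '<':  out += "&lt;";   break;
        case '>':  out += "&gt;";   break;
        case '"':  out += "&quot;"; break;
        case '\'': out += "&apos;"; break;
        default:   out += c;        break;
        }
    }
    return out;
}
```

For example, `escapeXml("a <BR/> b")` gives `a &lt;BR/&gt; b`. An XML parser turns the entities back into the original characters when the text node is read.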

       
      • steven smethurst

        I agree it would be a lot safer if I COULD change all the preexisting tags to "&lt;",
        but that's not the case.

        I get this output from a database in XML form.
        I do not have control over how it is formed.

        I do know that the output will be xhtml strict though, so I do not have to worry about malformed tags.

         
        • Ellers

          Ellers - 2004-08-06

          If you're stuck with the embedding approach then your best (only?) option is the way I described: a traversal of the relevant parts of the tree, converting the DOM back to an xml char string/stream. If you're expecting huge amounts of data you could look into using a SAX parser. You'll still need to be careful with your coding, but you'll basically do

          - read nodes until start of html block
          - during html just accumulate the text
          - at the end of the html section go back to parsing as normal.
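The three steps above can be sketched as an event handler. This is only an illustration: the class `HtmlBlockCollector` and its callback names are hypothetical, and a real SAX library would call its own equivalents of them. Attributes are ignored to keep it short.

```cpp
#include <string>

// Hypothetical SAX-style handler: collects everything inside one target
// element (e.g. "htmldata") as a string and ignores the rest of the document.
class HtmlBlockCollector {
public:
    explicit HtmlBlockCollector(const std::string& target) : target_(target) {}

    void startElement(const std::string& name) {
        if (depth_ > 0) {              // inside the block: keep the tag as text
            html_ += "<" + name + ">";
            ++depth_;
        } else if (name == target_) {  // start of the html block
            html_.clear();
            depth_ = 1;
        }
    }

    void endElement(const std::string& name) {
        if (depth_ == 0) return;       // outside the block: parse as normal
        if (--depth_ > 0) html_ += "</" + name + ">";
        // depth_ == 0 here means the block itself just closed
    }

    void characters(const std::string& text) {
        if (depth_ > 0) html_ += text; // accumulate only while inside the block
    }

    const std::string& html() const { return html_; }

private:
    std::string target_;
    std::string html_;
    int depth_ = 0;                    // 0 = outside the block
};
```

The depth counter is what lets the handler tell the closing tag of the block from closing tags nested inside it. An empty element such as `<BR/>` arrives as a start event followed by an end event, so it comes out as `<BR></BR>`.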

          Personally I prefer DOM, not SAX, but it's partly personal preference, and partly problem domain...

           
    • Ellers

      Ellers - 2004-08-06

      TinyXml is a DOM processing XML parser. It is doing exactly what it should as a DOM processor - building a tree of nodes that exactly describe the XML document you provided.

      The 'record' element contains:
      - an element named 'si_' which itself contains one text node
      - an element called 'notes' which also contains one text node.

      What I think you are asking is how to write code to:
      - "beginning at node 'record', traverse the tree and CONVERT all contained nodes into a std::string (or stream) re-writing the DOM into character XML data"

      This can be done by modifying my traverse() function from above.

      I haven't compiled this, but it will be something like:

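      #include <string>
      #include "tinyxml.h"

      // Appends pNode and everything beneath it to s as XML text.
      // Attributes are not written, and text is appended as stored,
      // so '<' and '&' inside text nodes are not re-escaped.
      void traverseAppend( TiXmlNode * pNode, std::string & s )
      {
          switch ( pNode->Type( )) {

          case TiXmlNode::DOCUMENT:
              // NOP
              break;
          case TiXmlNode::ELEMENT:
              s.append( "<" );
              s.append( pNode->ToElement( )->Value( ));
              s.append( ">" );
              break;
          case TiXmlNode::COMMENT:
              s.append( "<!--" );
              s.append( pNode->ToComment()->Value());
              s.append( "-->" );
              break;
          case TiXmlNode::UNKNOWN:
              //printf( "Unknown" );
              break;
          case TiXmlNode::TEXT:
              s.append( pNode->ToText( )->Value( ));
              break;
          case TiXmlNode::DECLARATION:
              //printf( "Declaration" );
              break;
          default:
              //printf( "UNRECOGNISED TYPE" );
              break;
          }

          for ( TiXmlNode* pChild = pNode->FirstChild( ); 0 != pChild; pChild = pChild->NextSibling( ))
          {
              traverseAppend( pChild, s );
          }

          // Write the closing tag once all children have been appended.
          if ( pNode->Type( ) == TiXmlNode::ELEMENT ) {
              s.append( "</" );
              s.append( pNode->ToElement( )->Value( ));
              s.append( ">" );
          }
      }

      // recordNodeHandle is assumed to be a TiXmlNode* pointing at <record>.
      // Start from its children so the <record> tag itself is left out.
      std::string astring;
      for ( TiXmlNode* pChild = recordNodeHandle->FirstChild( ); 0 != pChild; pChild = pChild->NextSibling( ))
      {
          traverseAppend( pChild, astring );
      }
      // astring is now "<si_>1058</si_><notes>Something</notes>"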

      You should understand that this is not a TinyXml thing, it is an *XML* thing, and it will be the same problem with any DOM processing library. With a SAX parser it will be slightly different, but essentially the same problem. True, MSXML or something like that might provide a convenience method like

      astring = recordNode->asXml()

      which does the traversal exactly as I've shown, writing the node into a string.

      Personally, I don't like embedding one XML grammar in another because of these problems which keep coming up. I read some articles about embedding xhtml in xml some time ago for work and the general consensus seemed to be either:

      - put the xhtml in as a single text node, escaping special chars, e.g. &lt;html&gt; etc
      or
      - MIME/Base64 encode it (simple and effective)
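For the Base64 option, a minimal encoder sketch using the standard alphabet with `=` padding. The function name `base64Encode` is made up for this example, and MIME line wrapping is left out; a tested library routine would normally be preferable if one is available.

```cpp
#include <string>

// Encode arbitrary bytes as Base64 (standard alphabet, '=' padding, no line breaks).
std::string base64Encode(const std::string& in) {
    static const char table[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((in.size() + 2) / 3) * 4);

    std::size_t i = 0;
    // Full groups: 3 input bytes become 4 output characters.
    while (i + 2 < in.size()) {
        unsigned n = (static_cast<unsigned char>(in[i]) << 16)
                   | (static_cast<unsigned char>(in[i + 1]) << 8)
                   |  static_cast<unsigned char>(in[i + 2]);
        out += table[(n >> 18) & 63];
        out += table[(n >> 12) & 63];
        out += table[(n >> 6) & 63];
        out += table[n & 63];
        i += 3;
    }

    // Final partial group: 1 or 2 leftover bytes, padded with '='.
    std::size_t rest = in.size() - i;
    if (rest == 1) {
        unsigned n = static_cast<unsigned char>(in[i]) << 16;
        out += table[(n >> 18) & 63];
        out += table[(n >> 12) & 63];
        out += "==";
    } else if (rest == 2) {
        unsigned n = (static_cast<unsigned char>(in[i]) << 16)
                   | (static_cast<unsigned char>(in[i + 1]) << 8);
        out += table[(n >> 18) & 63];
        out += table[(n >> 12) & 63];
        out += table[(n >> 6) & 63];
        out += '=';
    }
    return out;
}
```

The output uses only letters, digits, `+`, `/` and `=`, none of which are special in XML, so it can be placed in a text node without further escaping. The cost is that the payload grows by about a third and is no longer human-readable.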

      HTH

       
