TinyXML / Discussion / Developer: HTML with in tags

steven smethurst - 2004-08-04

Hello

I am parsing the output from a database.
One of the fields contains HTML data.
When I try to read the data in that node, I only get the text up to the first tag.

for example:
<node>
   <path>index</path>
   <htmldata>hello world this is a test of a bug in tinyXML <BR/>This part will never get shown
   </htmldata>
</node>

When I print out the htmldata node I get "hello world this is a test of a bug in tinyXML"

instead of what I expected:
"hello world this is a test of a bug in tinyXML <BR/>This part will never get shown"

I searched through the forums but did not find anyone else with the same problem.

Is this a bug?
Are HTML tags inside nodes not supported by TinyXML?
Am I doing something else wrong?

Thank you for your time

- Lee Thomason - 2004-08-04
  
  At first I thought that wasn't valid XML...but I think it is. Odd looking, but syntactically correct.
  
  Please file a bug!
  Thanks, lee
  
  - steven smethurst - 2004-08-05
    
    Done, and I included an example of the problem:
    
    http://sourceforge.net/tracker/index.php?func=detail&aid=1003662&group_id=13559&atid=113559
    
- Ellers - 2004-08-05
  
  Works fine for me. Running the program below, this is the output I get (it has indenting, but SourceForge doesn't display it):
  
  Loaded OK. Now, traverse...
  Document
  Element Named [node]
  Element Named [path]
     Text [index]
  Element Named [htmldata]
     Text [hello world this is a test of a bug in tinyXML]
     Element Named [BR]
     Text [This part will never get shown]
  Press any key to continue
  
  Within the element "htmldata" you have 3 nodes: a text node, an element node, and another text node.
  
  Are you sure you're traversing the loaded document correctly?
  
  #include <cstdio>   // printf
  #include <cstring>  // strlen
  #include "tinyxml.h"
  
  void traverse( TiXmlNode * pNode, int indent = 0 )
  {
      static const char* indentString = "                           ";
      static const int indentStringLen = (int) strlen( indentString );
  
      printf( "%s", &indentString[(indent<indentStringLen)?indentStringLen-indent:0] );
  
      switch ( pNode->Type( )) {
      case TiXmlNode::DOCUMENT:
          printf( "Document" );
          break;
      case TiXmlNode::ELEMENT:
          printf( "Element" );
          printf( " Named [%s]", pNode->ToElement( )->Value( ));
          break;
      case TiXmlNode::COMMENT:
          printf( "Comment" );
          break;
      case TiXmlNode::UNKNOWN:
          printf( "Unknown" );
          break;
      case TiXmlNode::TEXT:
          printf( "Text" );
          printf( " [" );
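          // Pass the text as an argument, never as the format string:
          // a '%' in the document would otherwise be misread by printf.
          printf( "%s", pNode->ToText( )->Value( ));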
          printf( "]" );
          break;
      case TiXmlNode::DECLARATION:
          printf( "Declaration" );
          break;
      default:
          printf( "UNRECOGNISED TYPE" );
          break;
      }
      printf( "\n" );
  
      for ( TiXmlNode* pChild = pNode->FirstChild( ); 0 != pChild; pChild = pChild->NextSibling( )) {
  
          traverse( pChild, indent+1 );
      }
  }
  
  void test2( )
  {
      TiXmlDocument doc("html_prob.xml");
      bool bLoad=doc.LoadFile();
  
      if(! bLoad) {
          printf( "FAILED to load\n" );
          return;
      }
      printf( "Loaded OK. Now, traverse...\n" );
  
      traverse( &doc );
  }
  
  int main(int argc, char* argv[])
  {
      test2( );
      return 0;
  }
  
  - steven smethurst - 2004-08-05
    
    When I print out the html_data node I only get the start of the node, up to the first XHTML tag <br/>.
    
    This is shown by my test and by yours:
    
    "Text [hello world this is a test of a bug in tinyXML]
    Element Named [BR]
    Text [This part will never get shown]"
    
    There is some Text,
    then an Element named BR,
    then more Text.
    
    What I expected was that it would print all the data between the start and end tags of html_data:
    
    "Text [hello world this is a test of a bug in tinyXML <BR/> this part will never get shown]"
    
    In other words this is what I was thinking
    === XML file ===
    <node>
    <record>
    <si_>1058</si_>
    <notes>Something</notes>
    </record>
    </node>
    === ===
    
    Getting the value of the record node would give me
    Text [<si_>1058</si_>\n<notes>Something</notes>]
    
    But that is wrong, and I was mistaken in my understanding of TinyXML.
    
    I guess my question should have been:
    How do I get the text and sub-tags beneath a node as text?
    
    For example:
    How could I get this as output if I had a cursor on the record node?
    <si_>1058</si_>\n<notes>Something</notes>
    
    Thank you for your time.
    
- Ellers - 2004-08-05
  
  While I'm thinking of it, "embedding" HTML within XML can be troublesome. Browsers (and people who write HTML) tend to write HTML in its original malform-able way. If the HTML is strict XHTML then you'll be OK, but...
  
  it may be easier/safer to put the whole HTML into the XML as one text string, with "&lt;" instead of the opening tag character, and so on. The end user should never know the difference, and you won't have to worry about the HTML corrupting your XML stream.
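  
  A minimal sketch of that escaping step, for whoever produces the XML. The function name escapeXml is made up for illustration and is not part of TinyXML; it simply replaces the five characters that XML predefines entities for.

```cpp
#include <string>

// Hypothetical helper (not a TinyXML function): escape the characters
// that are special in XML so an HTML fragment can be stored as a single
// text node instead of being parsed as child elements.
std::string escapeXml( const std::string& in )
{
    std::string out;
    out.reserve( in.size() );
    for ( std::string::size_type i = 0; i < in.size(); ++i ) {
        switch ( in[i] ) {
        case '&':  out += "&amp;";  break;
        case '<':  out += "&lt;";   break;
        case '>':  out += "&gt;";   break;
        case '"':  out += "&quot;"; break;
        case '\'': out += "&apos;"; break;
        default:   out += in[i];    break;
        }
    }
    return out;
}
```

  For example, escapeXml( "a <BR/> b" ) gives "a &lt;BR/&gt; b". An XML parser turns the entities back into the original characters when it reads the text node, so the reader gets the HTML back as one string.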
  
  - steven smethurst - 2004-08-05
    
    I agree it would be a lot safer if I COULD change all the preexisting tags to "&lt;",
    but that's not the case.
    
    I get this output from a database in XML form.
    I do not have control over how it is formed.
    
    I do know that the output will be XHTML strict though, so I do not have to worry about malformed tags.
    
    - Ellers - 2004-08-06
      
      If you're stuck with the embedding approach then your best (only?) option is the way I described: a traversal of the relevant parts of the tree, converting the DOM back to an XML char string/stream. If you're expecting huge amounts of data you could look into using a SAX parser. You'll still need to be careful with your coding, but you'll basically do
      
      - read nodes until start of html block
      - during html just accumulate the text
      - at the end of the html section go back to parsing as normal.
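      
      A very rough sketch of the "just accumulate the text" idea, using plain string searching rather than a real SAX parser. The function name extractInner is hypothetical. It assumes the opening tag has no attributes, that the element is not nested inside another element of the same name, and that the tag text does not appear inside a comment or CDATA section.

```cpp
#include <string>

// Hypothetical helper: copy the raw characters between <name> and
// </name> into 'out'. Returns false if either tag is missing.
// Assumptions: no attributes on the opening tag, no nesting of the
// same element name, no comments/CDATA containing the tag text.
bool extractInner( const std::string& xml, const std::string& name, std::string& out )
{
    const std::string openTag  = "<" + name + ">";
    const std::string closeTag = "</" + name + ">";

    // Read until the start of the html block...
    std::string::size_type start = xml.find( openTag );
    if ( start == std::string::npos ) {
        return false;
    }
    start += openTag.size();

    // ...then accumulate everything up to the end of the section.
    const std::string::size_type end = xml.find( closeTag, start );
    if ( end == std::string::npos ) {
        return false;
    }
    out.assign( xml, start, end - start );
    return true;
}
```

      With the input "<node><htmldata>hello <BR/>world</htmldata></node>" and the name "htmldata", out ends up as "hello <BR/>world". A real SAX parser would do the same bookkeeping from its start-element and end-element callbacks instead of searching the raw string.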
      
      Personally I prefer DOM, not SAX, but it's partly personal preference, and partly problem domain...
      
- Ellers - 2004-08-06
  
  TinyXml is a DOM processing XML parser. It is doing exactly what it should as a DOM processor - building a tree of nodes that exactly describe the XML document you provided.
  
  The 'record' element contains:
  - an element named 'si_' which itself contains one text node
  - an element named 'notes' which also contains one text node.
  
  What I think you are asking is how to write code to:
  - "beginning at node 'record', traverse the tree and CONVERT all contained nodes into a std::string (or stream) re-writing the DOM into character XML data"
  
  This can be done by modifying my traverse() function from above.
  
  I haven't compiled this, but it will be something like:
  
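  #include <string>
  #include "tinyxml.h"
  
  // Appends pNode and everything below it to s as XML text.
  // Limitations: attributes are dropped and text is not re-escaped.
  void traverseAppend( TiXmlNode * pNode, std::string & s )
  {
      switch ( pNode->Type( )) {
  
      case TiXmlNode::DOCUMENT:
          // NOP
          break;
      case TiXmlNode::ELEMENT:
          s.append( "<" );
          s.append( pNode->ToElement( )->Value( ));
          s.append( ">" );
          break;
      case TiXmlNode::COMMENT:
          // comments are not copied
          break;
      case TiXmlNode::UNKNOWN:
          //printf( "Unknown" );
          break;
      case TiXmlNode::TEXT:
          s.append( pNode->ToText( )->Value( ));
          break;
      case TiXmlNode::DECLARATION:
          //printf( "Declaration" );
          break;
      default:
          //printf( "UNRECOGNISED TYPE" );
          break;
      }
  
      for ( TiXmlNode* pChild = pNode->FirstChild( ); 0 != pChild; pChild = pChild->NextSibling( ))
      {
          traverseAppend( pChild, s );
      }
  
      // Close the element once all of its children have been written.
      if ( pNode->Type( ) == TiXmlNode::ELEMENT ) {
          s.append( "</" );
          s.append( pNode->ToElement( )->Value( ));
          s.append( ">" );
      }
  }
  
  // recordNode is assumed to be a TiXmlNode* pointing at the 'record' element.
  // Start from its children so the <record> tag itself is not written.
  std::string astring;
  for ( TiXmlNode* pChild = recordNode->FirstChild( ); 0 != pChild; pChild = pChild->NextSibling( ))
  {
      traverseAppend( pChild, astring );
  }
  // astring is now "<si_>1058</si_><notes>Something</notes>"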
  
  You should understand that this is not a TinyXml thing, it is an *XML* thing, and it will be the same problem with any DOM processing library. With a SAX parser it will be slightly different, but essentially the same problem. True, MSXML or something like that might provide a convenience method like
  
  astring = recordNode->asXml()
  
  which does the traversal exactly as I've shown, writing the node into a string.
  
  Personally, I don't like embedding one XML grammar in another because of these problems which keep coming up. I read some articles about embedding xhtml in xml some time ago for work and the general consensus seemed to be either:
  
  - put the XHTML in as a single text node, escaping special characters such as < and >
  or
  - MIME/Base64 encode it (simple and effective)
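  
  A minimal sketch of the Base64 option, for the side that writes the XML. The function name base64Encode is made up for illustration and is not part of TinyXML. It uses the standard Base64 alphabet with '=' padding and does no line wrapping; the reader would need a matching decoder, which is not shown here.

```cpp
#include <string>

// Read one byte of the input as an unsigned value.
static unsigned long byteAt( const std::string& s, std::string::size_type i )
{
    return static_cast<unsigned long>( static_cast<unsigned char>( s[i] ) );
}

// Hypothetical helper: Base64-encode 'in' (standard alphabet, '=' padding,
// no line breaks) so it can be stored safely in a single XML text node.
std::string base64Encode( const std::string& in )
{
    static const char table[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    std::string out;
    out.reserve( ( ( in.size() + 2 ) / 3 ) * 4 );

    // Every full group of 3 input bytes becomes 4 output characters.
    std::string::size_type i = 0;
    while ( i + 2 < in.size() ) {
        const unsigned long n = ( byteAt( in, i ) << 16 )
                              | ( byteAt( in, i + 1 ) << 8 )
                              |   byteAt( in, i + 2 );
        out += table[( n >> 18 ) & 63];
        out += table[( n >> 12 ) & 63];
        out += table[( n >> 6 ) & 63];
        out += table[n & 63];
        i += 3;
    }

    // One or two leftover bytes are padded with '='.
    const std::string::size_type rest = in.size() - i;
    if ( rest == 1 ) {
        const unsigned long n = byteAt( in, i ) << 16;
        out += table[( n >> 18 ) & 63];
        out += table[( n >> 12 ) & 63];
        out += "==";
    } else if ( rest == 2 ) {
        const unsigned long n = ( byteAt( in, i ) << 16 ) | ( byteAt( in, i + 1 ) << 8 );
        out += table[( n >> 18 ) & 63];
        out += table[( n >> 12 ) & 63];
        out += table[( n >> 6 ) & 63];
        out += '=';
    }
    return out;
}
```

  The output contains only letters, digits, '+', '/' and '=', none of which are special in XML, so it needs no further escaping. The cost is that the stored data grows by about a third and is no longer human-readable.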
  
  HTH
  