From: Carsten N. <car...@gm...> - 2010-04-13 20:27:13
Attachments:
cdata_preserve_newline.patch
Hello,

attached patch modifies the tokenizer to preserve newlines in cdata. The way I understand the XML spec, that is what is intended there. The tests still pass, but cdata may now contain some trailing whitespace - not sure if that is a problem though [1].

FWIW my use case was that we had some shader code stored in an XML file that I wanted to read with cppdom, but since the newlines got removed from things like:

  <fragment_program>
  #ifdef HAS_NORMAL_MAP
  // read normal from normal map texture
  #endif
  </fragment_program>

the preprocessor conditionals stopped working correctly. An alternative would be to support the "<![CDATA[" "]]>" construct, but that seemed to require a better understanding of the interaction between tokenizer and parser.

Cheers,
Carsten

[1] The whitespace comes from the indentation of </some_tag>:

  <root>
    <some_tag>
      cdata here
    </some_tag>
  </root>

i.e. the cdata string is: "cdata_here\n ".
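To make the failure mode above concrete, here is a minimal Python sketch. It does not use CppDOM and does not reproduce its tokenizer; `collapse_whitespace` and `directive_lines` are hypothetical helpers that only illustrate what happens to line-oriented preprocessor directives when every run of whitespace, newlines included, is folded into a single space.

```python
# Shader text as it would appear between the <fragment_program> tags.
SHADER = (
    "#ifdef HAS_NORMAL_MAP\n"
    "// read normal from normal map texture\n"
    "#endif\n"
)


def collapse_whitespace(text):
    """Fold every run of whitespace (newlines included) into one space.

    This is an assumed stand-in for a tokenizer that drops newlines,
    not CppDOM's actual code.
    """
    return " ".join(text.split())


def directive_lines(text):
    """Return the lines a C-style preprocessor would treat as directives."""
    return [line.strip() for line in text.splitlines()
            if line.lstrip().startswith("#")]


# With newlines preserved there are two separate directives.
print(directive_lines(SHADER))

# After collapsing, everything sits on one line, so "#endif" ends up
# behind the "//" comment and is no longer a directive of its own.
print(directive_lines(collapse_whitespace(SHADER)))
```

With the newlines kept, `#ifdef` and `#endif` stay on their own lines; after collapsing, only one line starting with `#` remains and the `#endif` is hidden inside the comment.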
From: Patrick H. <pat...@gm...> - 2010-04-14 13:37:59
Attachments:
signature.asc
Carsten Neumann wrote:
> Hello,
>
> attached patch modifies the tokenizer to preserve newlines in cdata. The
> way I understand the XML spec, that is what is intended there. The tests
> still pass, but cdata may now contain some trailing whitespace - not
> sure if that is a problem though [1].
> FWIW my use case was that we had some shader code stored in an XML file
> that I wanted to read with cppdom, but since the newlines got removed
> from things like:
>
> <fragment_program>
> #ifdef HAS_NORMAL_MAP
> // read normal from normal map texture
> #endif
> </fragment_program>
>
> the preprocessor conditionals stopped working correctly. An alternative
> would be to support the "<![CDATA[" "]]>" construct, but that seemed to
> require a better understanding of the interaction between tokenizer and
> parser.
>
> Cheers,
> Carsten
>
> [1] The whitespace comes from the indentation of </some_tag>:
>
> <root>
>   <some_tag>
>     cdata here
>   </some_tag>
> </root>
>
> i.e. the cdata string is: "cdata_here\n ".

I committed your change, but this brings up a point in CppDOM that confuses me. I think that the CppDOM "cdata" concept is really the DOM text node. I am not sure if XML CDATA is even supported by CppDOM, but I suppose that I could write a test to determine that. The difference is between the following:

  <fragment_program>
  #ifdef HAS_NORMAL_MAP
  // read normal from normal map texture
  #endif
  </fragment_program>

versus:

  <fragment_program>
  <![CDATA[#ifdef HAS_NORMAL_MAP
  // read normal from normal map texture
  #endif
  ]]>
  </fragment_program>

Perhaps there does not need to be a distinction as far as user-level code is concerned. What makes me wonder is that actual DOM implementations do distinguish between the two in their APIs. JDOM, however, has Element.getText(), which returns what they describe as "the concatenation of all Text and CDATA nodes returned by getContent()." Note that JDOM has the class type org.jdom.CDATA which is a subclass of org.jdom.Text. CppDOM identifies nodes as being of the "cdata" type, but I think that, in CppDOM terms, this means that the node contains a sequence of characters.

I guess as long as CppDOM can properly parse both a CDATA node and a text node in the input XML, my concerns don't matter. It may really just boil down to a terminology issue.

-Patrick

--
Patrick L. Hartling
Senior Software Engineer, Priority 5
http://www.priority5.com/
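As an illustration of the point about DOM APIs keeping text and CDATA apart, here is a small sketch using Python's standard `xml.dom.minidom` as a stand-in DOM implementation. It is neither CppDOM nor JDOM; the sample document and the `get_text` helper are made up for this example, with `get_text` loosely imitating the "concatenation of all Text and CDATA nodes" behavior quoted above.

```python
from xml.dom import Node
from xml.dom.minidom import Text, parseString

# Hypothetical input mixing an ordinary text node and a CDATA section.
XML = "<fragment_program>plain text <![CDATA[a < b && c]]></fragment_program>"


def get_text(element):
    """Concatenate the direct Text and CDATA children of an element.

    A rough analogue of the JDOM Element.getText() description quoted
    in the message above; not a port of JDOM's implementation.
    """
    wanted = (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    return "".join(child.data for child in element.childNodes
                   if child.nodeType in wanted)


root = parseString(XML).documentElement

# The DOM keeps the two kinds of character data as distinct node types...
for child in root.childNodes:
    print(child.nodeName, repr(child.data))

# ...but, as with org.jdom.CDATA and org.jdom.Text, minidom's CDATA
# section class is a subclass of its Text class.
print(isinstance(root.childNodes[1], Text))

# User-level code that only wants the characters can ignore the difference.
print(get_text(root))
```

The helper shows why the distinction may not matter to user-level code: once the two node kinds are concatenated, the caller just sees a sequence of characters.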
From: Carsten N. <car...@gm...> - 2010-04-15 21:58:40
Hello Patrick,

Patrick Hartling wrote:
> I committed your change,

Thanks!

> but this brings up a point in CppDOM that confuses
> me. I think that the CppDOM "cdata" concept is really the DOM text node. I
> am not sure if XML CDATA is even supported by CppDOM, but I suppose that I
> could write a test to determine that.

Hm, I had looked at <http://www.w3schools.com/xml/xml_cdata.asp> and from there it seems there is a distinction between parsed character data (PCDATA), which is what you get when using:

  <some_tag>this is parsed character data</some_tag>

and unparsed character data (CDATA), which looks like:

  <some_tag><![CDATA[unparsed character data here]]></some_tag>

The difference seems to be that PCDATA is seen by the XML parser, which looks for more tags in it, so that nesting tags works, while CDATA seems to turn the parser mostly off until the "]]>" sequence is seen.

> Perhaps there does not need to be a distinction as far as user-level code is
> concerned. What makes me wonder is that actual DOM implementations do
> distinguish between the two in their APIs. JDOM, however, has
> Element.getText(), which returns what they describe as "the concatenation of
> all Text and CDATA nodes returned by getContent()." Note that JDOM has the
> class type org.jdom.CDATA which is a subclass of org.jdom.Text. CppDOM
> identifies nodes as being of the "cdata" type, but I think that, in CppDOM
> terms, this means that the node contains a sequence of characters.

Hm, I wonder if the distinction is only made so that reading in a file and writing it out again becomes an identity operation - i.e. so that the writer can put in the "<![CDATA[", "]]>" sequences?

> I guess as long as CppDOM can properly parse both a CDATA node and a text
> node in the input XML, my concerns don't matter. It may really just boil
> down to a terminology issue.

I think that is the case here.

Cheers,
Carsten
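The PCDATA/CDATA difference and the round-trip idea discussed above can be tried out with a short sketch. It uses Python's standard `xml.dom.minidom` purely as an example parser and writer, so it shows how one DOM implementation behaves and says nothing about CppDOM; the sample documents and the helpers `child_names` and `round_trip` are hypothetical.

```python
from xml.dom.minidom import parseString

# The same characters, once as parsed character data and once in a
# CDATA section. Both documents are invented for this illustration.
PCDATA_DOC = "<some_tag>text with <b>nested</b> tag</some_tag>"
CDATA_DOC = "<some_tag><![CDATA[text with <b>nested</b> tag]]></some_tag>"


def child_names(xml_text):
    """Return the node names of the root element's direct children."""
    root = parseString(xml_text).documentElement
    return [child.nodeName for child in root.childNodes]


def round_trip(xml_text):
    """Parse a document and serialize its root element again."""
    return parseString(xml_text).documentElement.toxml()


# In PCDATA the parser keeps looking for markup, so <b> becomes an element.
print(child_names(PCDATA_DOC))

# Inside the CDATA section the same characters stay plain character data.
print(child_names(CDATA_DOC))

# Because the node type is remembered, the writer can restore the
# "<![CDATA[" ... "]]>" markers when the tree is written out again.
print(round_trip(CDATA_DOC) == CDATA_DOC)
```

In this implementation, keeping a separate CDATA node type is what lets the serializer emit the markers again, which is consistent with the round-trip explanation suggested above, though it does not show that this is the only reason DOM APIs make the distinction.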