Re: Entity parsing
From: Sam B. <bs...@us...> - 2002-05-23 09:53:38
Hi Peter,

Always happy to have new people on board giving us a hand! ;o)

You seem to have some trouble with the entity handling in Php.XPath. I'm not quite sure I understand the problem, but please read the following extract taken from Php.XPath's _translateAmpersand() function. I think it's a good basis for understanding the issue.

/**
 * Translate all ampersands to their literal entity '&amp;' and back.
 *
 * I wasn't aware of this problem at first, but it's important to understand why we do this.
 * First you must know:
 *  a) PHP's XML parser *translates* all entities to the equivalent char. E.g. &lt; is returned as '<'.
 *  b) PHP's XML parser (in V 4.1.0) has problems with most *literal* entities! The only ones that are
 *     recognized are &amp;, &lt;, &gt; and &quot;. *ALL* others (like &nbsp;, &copy; a.s.o.) cause an
 *     XML_ERROR_UNDEFINED_ENTITY error. I reported this as a bug at http://bugs.php.net/bug.php?id=15092
 *     (It turned out not to be a 'real' bug, but one of those nice W3C-spec things).
 *
 * Forget point b) for now; it's just for info, because the way we solve a) will also solve b).
 *
 * THE PROBLEM
 * To understand the problem, here is a sample:
 * Given is the following XML: "<AAA> &lt; &nbsp; &gt; </AAA>"
 *   Try to parse it and PHP's XML parser will fail with an XML_ERROR_UNDEFINED_ENTITY because of
 *   the unknown literal entity '&nbsp;'. (The numeric equivalent '&#160;' would work though.)
 * The next try is to use the numeric equivalent 160 for '&nbsp;', thus "<AAA> &lt; &#160; &gt; </AAA>"
 *   The data we receive in the tag <AAA> is " <   > " (with character 160 in the middle). So we get
 *   the *translated entities* and NOT the 3 entities &lt; &#160; &gt;. Thus, we will not even notice
 *   that there were entities at all!
 *   In *most* cases we're not able to tell if the data was given as an entity or as a 'normal' char.
 *   E.g. when receiving a quote or a single space we're not able to tell if it was given as a 'normal'
 *   char or as &nbsp; or &quot;. Thus we lose the entity information of the XML data!
 *
 * THE SOLUTION
 * The better solution is to keep the data 'as is' by replacing each '&' with '&amp;' before parsing begins.
 * E.g. taking the original input from above, this would result in "<AAA> &amp;lt; &amp;nbsp; &amp;gt; </AAA>"
 * The data we receive now for the tag <AAA> is " &lt; &nbsp; &gt; ", and that's what we want.
 *
 * The bad thing is that a global replace will also replace data in sections that are NOT translated by
 * the PHP XML parser. That is comments (<!-- -->), PI sections (stuff between <? ?>) and CDATA blocks too.
 * So all data coming from those sections must be reversed. This is done during the XML parse phase.
 * So:
 *  a) Replace all '&' in the XML source.
 *  b) All data that is not char-data, or that is in a CDATA block, has to be reversed during the
 *     XML parse phase.
 */

--
Sam Blum <bs...@us...>

===========================
For the most recent version of PHP.XPath and an archive of this list visit:
http://sourceforge.net/projects/phpxpath
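
For illustration, here is a minimal sketch of steps a) and b) from the comment above. It is not the actual Php.XPath code: the function names escapeAmpersands(), restoreAmpersands() and firstElementText() are made up for this example, and it assumes PHP's standard xml extension (xml_parser_create(), xml_parse_into_struct()) is available.

```php
<?php
// Hypothetical helpers for illustration only; not taken from Php.XPath.

/** Step a): escape every '&' so the parser hands entities back untranslated. */
function escapeAmpersands(string $xml): string
{
    return str_replace('&', '&amp;', $xml);
}

/** Step b): undo the escaping for text the parser does not translate
 *  (comments, processing instructions, CDATA blocks). */
function restoreAmpersands(string $text): string
{
    return str_replace('&amp;', '&', $text);
}

/** Parse $xml and return the character data of its first element. */
function firstElementText(string $xml): string
{
    $parser = xml_parser_create();
    $values = [];
    xml_parse_into_struct($parser, $xml, $values);
    return $values[0]['value'] ?? '';
}

// The sample from the comment: '&nbsp;' is not a predefined XML entity.
$source = '<AAA> &lt; &nbsp; &gt; </AAA>';

// After escaping, the parser only ever sees '&amp;', which it knows.
$text = firstElementText(escapeAmpersands($source));
echo $text, "\n";

// Text taken from a comment was escaped too and has to be reversed.
echo restoreAmpersands('<!-- a &amp; b -->'), "\n";
```

The global str_replace() is deliberately naive, which is exactly why step b) exists: it is simpler to escape everything up front and reverse the few untranslated sections than to locate comments, PIs and CDATA blocks before parsing.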