Thread: Entity parsing
Brought to you by:
bs_php,
nigelswinson
From: Peter R. <php...@pe...> - 2002-05-23 09:09:49
|
doesn't seem to work any more. Looking through the class, it looks like the problem is to do with the translateAmpersand function, though I tried disabling this function and that didn't fix the problem, so there must be more to it. The documentation for this seems to be saying that the parser should be prevented from, er, parsing, which seems odd to me - if you don't want entities parsed, why use them? - but if this is necessary, can it be made switchable please: yes I want entities parsed, or no I don't. I don't use entities extensively but they are a modestly useful function. |
From: Sam B. <bs...@us...> - 2002-05-23 09:53:38
|
Hi Peter,

Always happy to have new people on board giving us a hand! ;o)

You seem to have some trouble with the entity handling in Php.XPath. I'm not quite sure I understand the problem, but please read the following extract taken from Php.XPath's _translateAmpersand() function. I think it's a good basis for understanding the issue.

/**
 * Translate all ampersands to their literal entity '&amp;' and back.
 *
 * I wasn't aware of this problem at first but it's important to understand why we do this.
 * At first you must know:
 * a) PHP's XML parser *translates* all entities to the equivalent char. E.g. &lt; is returned as '<'
 * b) PHP's XML parser (in V 4.1.0) has problems with most *literal* entities! The only ones that are
 *    recognized are &amp;, &lt;, &gt; and &quot;. *ALL* others (like &copy; a.s.o.) cause an
 *    XML_ERROR_UNDEFINED_ENTITY error. I reported this as a bug at http://bugs.php.net/bug.php?id=15092
 *    (It turned out not to be a 'real' bug, but one of those nice W3C-spec things).
 *
 * Forget position b) now. It's just for info, because the way we will solve a) will also solve b).
 *
 * THE PROBLEM
 * To understand the problem, here is a sample:
 * Given is the following XML: "<AAA> &lt; &nbsp; &gt; </AAA>"
 *   Try to parse it and PHP's XML parser will fail with an XML_ERROR_UNDEFINED_ENTITY because of
 *   the unknown literal entity '&nbsp;'. (The numeric equivalent '&#160;' would work though.)
 * Next try is to use the numeric equivalent 160 for '&nbsp;', thus "<AAA> &lt; &#160; &gt; </AAA>"
 *   The data we receive in the tag <AAA> is " <   > ". So we get the *translated entities* and
 *   NOT the 3 entities &lt; &#160; &gt;. Thus, we will not even notice that there were entities at all!
 *   In *most* cases we're not able to tell if the data was given as an entity or as a 'normal' char.
 *   E.g. when receiving a quote or a single space we're not able to tell if it was given as a 'normal' char
 *   or as &#160; or &quot;. Thus we lose the entity information of the XML data!
 *
 * THE SOLUTION
 * The better solution is to keep the data 'as is' by replacing the '&' before parsing begins.
 * E.g. taking the original input from above, this would result in "<AAA> &amp;lt; &amp;nbsp; &amp;gt; </AAA>"
 * The data we receive now for the tag <AAA> is " &lt; &nbsp; &gt; ", and that's what we want.
 *
 * The bad thing is that a global replace will also replace data in sections that are NOT translated by the
 * PHP XML parser. That is comments (<!-- -->), PI-sections (stuff between <? ?>) and CDATA blocks too.
 * So all data coming from those sections must be reversed. This is done during the XML parse phase.
 * So:
 * a) Replacement of all '&' in the XML source.
 * b) All data that is not char-data or is in a CDATA block has to be reversed during the XML parse phase.
 *
 */

-- Sam Blum <bs...@us...>
===========================
For the most recent version of PHP.XPath and an archive of this list visit:
http://sourceforge.net/projects/phpxpath |
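The problem and the two-step workaround described in the comment can be tried directly against expat. What follows is a minimal sketch, not Php.XPath's actual code: it uses Python's xml.parsers.expat binding to the expat library instead of PHP, and the helper names (parse_text, parse_comments, protect_ampersands) are invented for illustration.

```python
import xml.parsers.expat as expat


def parse_text(xml):
    """Return all character data expat reports for the document."""
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml, True)
    return "".join(chunks)


def parse_comments(xml):
    """Return the body of every comment expat reports."""
    comments = []
    parser = expat.ParserCreate()
    parser.CommentHandler = comments.append
    parser.Parse(xml, True)
    return comments


def protect_ampersands(xml):
    """Step a): escape every '&' so the parser leaves entities alone."""
    return xml.replace("&", "&amp;")


# Numeric entities parse, but the data arrives already translated.
translated = parse_text("<AAA> &lt; &#160; &gt; </AAA>")

# The literal entity &nbsp; is undefined without a DTD, so expat rejects it.
try:
    parse_text("<AAA> &lt; &nbsp; &gt; </AAA>")
    error_message = None
except expat.ExpatError as err:
    error_message = expat.errors.messages[err.code]

# After step a) the entities come through untouched.
preserved = parse_text(protect_ampersands("<AAA> &lt; &nbsp; &gt; </AAA>"))

# Step b): the parser does not translate entities inside comments, so the
# escaping added in step a) shows up in the data and must be reversed.
raw_comment = parse_comments(protect_ampersands("<AAA><!-- R & D --></AAA>"))[0]
restored_comment = raw_comment.replace("&amp;", "&")
```

Here translated holds the characters rather than the entities, error_message is expat's "undefined entity" text, and preserved keeps the three entities as written. The comment case shows why a blanket replace needs the reversal step.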
From: Peter R. <php...@pe...> - 2002-05-24 08:26:54
|
Yes, I read that, but I don't see why all this is necessary or desirable. If you don't define an entity, how is the parser to know what to do with it? And if you don't want the parser to parse an entity, why use an entity at all? It's a pity that expat, as a non-validating parser, ignores external DTDs/PEs but, even so, entities have their uses.

To take a simple example: if you define the following at the front of your XML file

<!ENTITY ourname "Perfect Programmers">

then you can reference &ourname; in the rest of the file (such as "We are proud of &ourname;'s unrivaled modesty.") and, if the name changes at some point, you only need to change your entity definition and the parser will automatically reflect the change in all the references. expat parses this without any problem; phpxpath used to, but it looks like it no longer does.

This is a simple example, but you can do clever things with entities if you preprocess the XML file to create them dynamically: a sort of templating system.

(btw, there are actually 5 XML entities: apos is missing from the list :-) |
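The entity example above can be reproduced with a few lines against expat. This is a hedged sketch in Python (xml.parsers.expat), not PHP; document_for and parse_text are hypothetical helpers, the second name "Peerless Programmers" is invented to show a changed definition, and the declaration is wrapped in a DOCTYPE internal subset because that is where an XML parser accepts it.

```python
import xml.parsers.expat as expat


def document_for(name):
    """Build a small document whose internal DTD subset defines &ourname;."""
    return (
        '<!DOCTYPE doc [<!ENTITY ourname "%s">]>'
        "<doc>We are proud of &ourname;'s unrivaled modesty.</doc>" % name
    )


def parse_text(xml):
    """Return all character data expat reports for the document."""
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml, True)
    return "".join(chunks)


# Only the entity definition differs between the two documents.
before = parse_text(document_for("Perfect Programmers"))
after = parse_text(document_for("Peerless Programmers"))
```

Both results contain the expanded name and no trace of the &ourname; reference, which is the behaviour being asked for here.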
From: Sam B. <bs...@us...> - 2002-05-24 09:43:41
|
Hi Peter,

> To take a simple example: if you define the following at the front of your
> xml file <!ENTITY ourname "Perfect Programmers"> then you can
> reference &ourname; in the rest of the file

Yes, I know about that feature. But it has two sides. Assume we let expat parse the entities in the sample below:

<!ENTITY ourname "Perfect Programmers&gt;">
&lt;&ourname;

If we then generated a file with one of the export functions of Php.XPath, the result would be:

<!ENTITY ourname "Perfect Programmers>">
<Perfect Programmers>

All defined entities (plus the 5 XML entities) have been replaced, and any changes to <!ENTITY ourname "Perfect Programmers"> in the exported file would have *no effect* any more! Even worse, the &lt; and &gt; have been replaced too! After the XML has been parsed (and maybe modified by one of the included DOM functions), you wouldn't want this to happen when exporting to an XML file.

Note that the expat parser 'silently' replaces the 5 XML entities plus all the entities that are defined by <!ENTITY ...>. I see no simple solution to this problem, and therefore I believe the way we handle it, by keeping the entities 'as is', is the best we can do.

Regards
-- Sam Blum <bs...@us...> |
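The round-trip problem described above can be illustrated with expat itself. This is only a sketch under stated assumptions: it is written in Python (xml.parsers.expat) rather than PHP, naive_export is a hypothetical stand-in for an exporter that writes parsed text back unchanged (it is not one of Php.XPath's export functions), and the sample is wrapped in a DOCTYPE and a <doc> element so that it parses.

```python
import xml.parsers.expat as expat

SOURCE = (
    '<!DOCTYPE doc [<!ENTITY ourname "Perfect Programmers&gt;">]>'
    "<doc>&lt;&ourname;</doc>"
)


def parse_text(xml):
    """Return all character data expat reports for the document."""
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml, True)
    return "".join(chunks)


def naive_export(text):
    """Hypothetical exporter: writes the parsed text back without escaping."""
    return "<doc>" + text + "</doc>"


# expat hands over the expanded text; every entity reference is gone.
text = parse_text(SOURCE)
exported = naive_export(text)

# The exported file no longer refers to &ourname;, and because &lt; and
# &gt; were translated too, it is not even well-formed any more.
try:
    parse_text(exported)
    reparse_ok = True
except expat.ExpatError:
    reparse_ok = False
```

A real exporter would re-escape < and > on output, which restores well-formedness but still cannot bring back the &ourname; reference.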
From: Peter R. <php...@pe...> - 2002-05-24 10:31:09
|
On Friday 24 May 2002 10:44, Sam Blum wrote:
> Assuming we would let expat parse the entity in the sample below
> <!ENTITY ourname "Perfect Programmers&gt;">
> &lt;&ourname;
>
> and would generate a file with one of the export-functions of Php.XPath the
> result would be:
> <!ENTITY ourname "Perfect Programmers>">
> <Perfect Programmers>
>
> All defined entities (plus the 5 xml entities) have been replaced and any
> changes to <!ENTITY ourname "Perfect Programmers"> in the exported file
> would have *no effect* any more! Even worse the &lt; and &gt; have been
> replaced too!

OK, so this means that if you are importing/parsing in order to change/rewrite the file, then you don't want entities parsed (treat & as &amp;); if you are outputting to some other medium such as HTML, then you do. So . . . can we have it switchable please, by passing a y/n parameter to the import functions? |
From: Peter R. <php...@pe...> - 2002-05-24 12:48:33
|
I've just been reading up a bit further on expat, and it seems that if you do not set a character data handler but let the data be handled by the default handler, this has the effect of not parsing/expanding internal entities. I just tested this briefly in PHP, and it's true. Might this be the way forward (of course, it's perfectly possible it might screw something else up!)? Only set a character data handler if you want the entities parsed?

It also seems expat can after all handle external DTDs and PEs, but you have to enable this when you compile it. I see the latest versions of PHP now have handlers for namespaces, so perhaps they will get around to PEs eventually. |
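The handler-based switch suggested here can be sketched against expat as follows. This is an illustration in Python (xml.parsers.expat), not a test of PHP's behaviour, and read_text with its expand_entities flag is a hypothetical function, not part of Php.XPath. In Python's binding, assigning DefaultHandler is what turns off expansion of internally defined entities; the unexpanded references are passed to that handler instead.

```python
import xml.parsers.expat as expat

DOCUMENT = (
    '<!DOCTYPE doc [<!ENTITY ourname "Perfect Programmers">]>'
    "<doc>We are proud of &ourname;'s unrivaled modesty.</doc>"
)


def read_text(xml, expand_entities):
    """Collect the text inside the root element, with or without expansion."""
    chunks = []
    depth = [0]

    def start(name, attrs):
        depth[0] += 1

    def end(name):
        depth[0] -= 1

    def collect(data):
        # Skip anything reported outside the root element (the DOCTYPE).
        if depth[0] > 0:
            chunks.append(data)

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    if expand_entities:
        # Character data handler only: expat expands &ourname; itself.
        parser.CharacterDataHandler = collect
    else:
        # Default handler only: text and entity references arrive verbatim.
        parser.DefaultHandler = collect
    parser.Parse(xml, True)
    return "".join(chunks)


expanded = read_text(DOCUMENT, True)
verbatim = read_text(DOCUMENT, False)
```

With the flag set, the text contains "Perfect Programmers"; without it, the &ourname; reference survives exactly as written in the source. Whether this interacts badly with other handlers in a larger parser is not explored here.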