From: Ed A. <ed...@me...> - 2003-02-18 20:36:48
On Mon, 17 Feb 2003, Axel Thimm wrote:

>the tv_grab_de sources seem to have a bug in their generator
>(http://www.szing.at/xmltv/). Instead of generating "&whatever;" for
>html entries, it generates " und whatever;".

It's always best to give an example when reporting things like this.
Anyway, one example is <http://yasd.cc/xmltv/tv_20030222.xml.gz>,
which when gunzipped contains the line:

    <title lang="de">und Auml;gypten und das obere Niltal</title>

which should surely be Ägypten, the first letter being capital
A-umlaut.

>Note that "und" is the German word for "and", so probably some
>variable called "und" contains the ampersand, but the variable isn't
>referenced correctly.

No, I think I know the explanation. The upstream listings source is
using the & character and, because this cannot be included raw in XML,
it is converted to 'und', which means 'and'. It should instead be
converted to the entity &amp; but perhaps there were other quoting
problems and just getting rid of the & altogether was the easiest
thing to do.

ISTR that I once did this for tv_grab_uk, doing 's/&/ and /g', because
trying to write out a correct &amp; entity with the XML libraries was
too confusing. I may even have suggested this kludge to the chap who
provides tv_grab_de's listings.

>Unfortunately there is no address to contact at
>http://www.szing.at/xmltv/.

It is Gottfried Szing a.k.a. 'Goofy' (who is mentioned in the
tv_grab_de manual page); I have cc'd him on this message.

>Should this be fixed in xmltv (s/ und ([a-zA-Z]*;)/\&$/g)?

Maybe, but I think it is better to see if G.S. can fix it at his end.

>BTW is such html code (&xxx;) in the xmltv output well defined?

Ah - good point, I had forgotten that. No, an entity like &Aauml; will
not be allowed. Try nsgmls to confirm this:

    nsgmls:test.xml:6:12:E: general entity "Aauml" not defined and no default entity

So that could be the reason to get rid of the & character globally.
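To make the quoting issue concrete, here is a minimal sketch in Python (used purely for illustration; the grabbers themselves are Perl). The helper name escape_title and the sample title "Tom & Jerry" are made up for this example and do not come from the listings data. It shows a raw & being written as &amp;, a parser turning it back into &, and an HTML-only entity being rejected, much as nsgmls does.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def escape_title(text):
    """Hypothetical helper: wrap text in a <title> element, escaping & < >."""
    return '<title lang="de">%s</title>' % escape(text)

# Illustrative title, not taken from the real listings.
xml_line = escape_title("Tom & Jerry")
print(xml_line)

# A conforming parser turns &amp; back into a plain ampersand.
parsed = ET.fromstring(xml_line).text
print(parsed)

# An HTML-only entity such as &Auml; is not predefined in XML, so a
# parser with no DTD declaring it rejects the document.
try:
    ET.fromstring('<title lang="de">&Auml;gypten</title>')
    undefined_entity_rejected = False
except ET.ParseError:
    undefined_entity_rejected = True
print(undefined_entity_rejected)
```

With escaping done this way there should be no need to replace & with a word such as 'und' or 'and'.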
My guess is that the upstream data source from which the XML files are
generated is in HTML, or uses HTML entities. The generator should use
HTML::Entities if it is written in Perl, or the equivalent library in
other languages, to decode the entities into Latin-1 characters.

Goofy, do you think you can implement this?

-- 
Ed Avis <ed...@me...>
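As a rough sketch of the decoding step suggested above, here is the same idea in Python, whose standard html.unescape plays a role similar to Perl's HTML::Entities. The function name html_to_xml_text is hypothetical, and this assumes the upstream text really does contain HTML entities; it is not the actual tv_grab_de generator.

```python
import html
from xml.sax.saxutils import escape

def html_to_xml_text(s):
    """Hypothetical helper: decode HTML entities, then escape for XML.

    Decoding first turns &Auml; into the real character; escaping
    afterwards re-quotes only the characters XML cares about (& < >).
    """
    decoded = html.unescape(s)
    return escape(decoded)

title = html_to_xml_text("&Auml;gypten und das obere Niltal")
# Shown as Latin-1 bytes so the output is unambiguous on any terminal.
print(title.encode("latin-1"))

# A genuine ampersand survives the round trip as a proper XML entity.
ampersand = html_to_xml_text("Tom &amp; Jerry")
print(ampersand)
```

The order matters: escaping before decoding would turn &Auml; into &amp;Auml; and the umlaut would never be recovered. The output file would also need an encoding declaration that matches the bytes actually written.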