#316 HTML codes changed by tv_cat tv_grep tv_sort

tv_cat (2)

I have an xmltv.xml with HTML codes like &#nnn or &acute, and tv_cat and the other tools change these codes to &nnn; or ´
When the "&" stands alone, it is not changed.

Sorry for my bad English.


  • Nick Morrott

    • labels: --> tv_cat
  • Karl Dietz

    Can you post an example?
    The xmltv format doesn't support the usual HTML entities and I don't know a good reason to change that.

  • Karl Dietz

    same root cause as #1101376
    the bugs seem to be rooted in changing behaviour of XML::Twig which still
    has quite a bunch of related open bugs over at CPAN (no updates for >3
    years now)
    see: https://rt.cpan.org/Public/Dist/Display.html?Name=XML-Twig

    I don't see how we can work around them without requiring a newer
    XML::Twig or ditching XML::Twig in favour of some other library.

  • Geoff

    > I have a xmltv.xml with html codes like &#nnn or &acute and the tv_cat
    > and other tools change this code by &nnn; or ´


    This is easily explained by the trite "XML is not HTML" ;-) You are trying to use HTML entities in an XML file.

    If you run your source file through tv_validate_file you will get an error

       parser error : Entity 'acute' not defined

    Similarly if you try to open the file in a browser:

       XML Parsing Error: undefined entity

    The only predefined entities in XML are &quot; &amp; &apos; &lt; &gt; (that is: " & ' < >).

    To use anything else it must be defined in the DTD. In other words the DTD would need an

       <!ENTITY acute "&#180;">

    for every html entity you might possibly use.

    (see http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent )
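
    As a rough illustration (in Python with the standard xml.etree.ElementTree parser, not XML::Twig), the sketch below parses the same content twice: once with no DTD, and once with the entity declared in an internal DTD subset. The <tv>/<title> document is a made-up minimal example, not a real XMLTV file.

```python
import xml.etree.ElementTree as ET

# No DTD: "acute" is not one of the five predefined XML entities,
# so a conforming parser is expected to reject the document.
undefined = "<tv><title>&acute;</title></tv>"
try:
    ET.fromstring(undefined)
    undefined_parsed = True
except ET.ParseError:
    undefined_parsed = False

# Same content, but the entity is declared in an internal DTD subset
# (hypothetical minimal document, for illustration only).
declared = (
    '<!DOCTYPE tv [<!ENTITY acute "&#180;">]>'
    "<tv><title>&acute;</title></tv>"
)
title_text = ET.fromstring(declared).find("title").text

print(undefined_parsed)   # whether the undeclared entity was accepted
print(repr(title_text))   # the text after entity expansion
```

    The point is only that the entity has to be declared somewhere the parser can see it; whether the XMLTV DTD should carry such declarations is a separate question.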

    Now, since you haven't defined &acute; as an entity, Twig sees it as just a string of characters: &-a-c-u-t-e-; And since "&" is a reserved character, it converts this character to the predefined entity &amp;. Hence you then have "&amp;acute;" as you've seen.
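
    The same escaping rule can be seen with any XML writer. A minimal sketch using Python's xml.sax.saxutils.escape (an analogy only; this is not the code path XML::Twig uses):

```python
from xml.sax.saxutils import escape

# The text is treated as plain characters, so the reserved "&"
# is replaced by the predefined entity &amp; on output.
raw_text = "&acute;"
escaped = escape(raw_text)

print(escaped)
```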

    This isn't a bug; it's what xml twig/writer is supposed to do!

    It should be possible to use numerical character references such as &#nnn; but this obviously depends on what has been coded into the library module. It seems XML::Twig doesn't know about &#180; and so (wrongly) behaves as above.
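
    For comparison, here is a small sketch of how a numeric character reference is normally handled, again using Python's xml.etree.ElementTree rather than XML::Twig. The <title> element is just an example.

```python
import xml.etree.ElementTree as ET

# A numeric character reference needs no DTD declaration;
# the parser resolves it to the character itself (U+00B4).
element = ET.fromstring("<title>&#180;</title>")
resolved = element.text

# Writing the element back out as a Unicode string keeps the character.
round_trip = ET.tostring(element, encoding="unicode")

print(repr(resolved))
print(round_trip)
```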

    Why not simply use the character equivalent of the HTML entity you are trying to use? E.g. 0xB4 for ISO-8859-1, or C2 B4 for UTF-8. There is no need to use HTML entities at all in an XML file.
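
    One way to do that before running the tv_* tools is to pre-process the file. The following is only a sketch in Python: the function name html_entities_to_chars and the sample string are hypothetical, and it assumes the named entities are the HTML 4 ones listed in Python's html.entities.name2codepoint. The five predefined XML entities and any unknown names are left untouched.

```python
import re
from html.entities import name2codepoint

# Entities that XML itself predefines; these must stay escaped.
XML_PREDEFINED = {"quot", "amp", "apos", "lt", "gt"}
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def html_entities_to_chars(text):
    """Replace named HTML entities with the characters they stand for."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names alone
        return chr(name2codepoint[name])
    return _ENTITY_RE.sub(replace, text)

sample = "<title>Caf&eacute; &amp; bar &acute;</title>"
cleaned = html_entities_to_chars(sample)

# The byte values mentioned above for the acute accent (U+00B4).
latin1_bytes = "\u00b4".encode("iso-8859-1")
utf8_bytes = "\u00b4".encode("utf-8")

print(cleaned)
print(latin1_bytes, utf8_bytes)
```

    The output file then has to be saved in the encoding named in its XML declaration.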

  • Geoff

    • status: open --> closed-wont-fix
    • Group: --> none