#440 tv_imdb: character encoding of xml input not followed

Group: 0.5.65
Status: closed-fixed
Labels: tv_imdb (7)
Priority: 5
Updated: 2014-04-19
Created: 2011-06-21
Creator: user1024
Private: No

I have an XMLTV file with UTF-8 encoding. When I applied tv_imdb to this file, an actor tag was added, but the actor's name was not encoded in UTF-8. Firefox gives an "XML Parsing Error: not well-formed" page when I try to open the output XMLTV file.

Attached is a workaround patch which, when applied, generates an output file that opens without error in Firefox. This patch is intended to illustrate the problem, not to provide a general solution. I expect that the same issue applies to other tags, not just the actor tag.

I am unsure at which stage the issue should be addressed. I imagine the encoding could be changed when the IMDb database files are created, when text is read from the database file, or when the text is inserted into the output structure. Or is it a problem with the IMDb database files themselves?

Discussion

  • user1024

    user1024 - 2011-06-21

    Workaround patch for character encoding

     
  • Nick Morrott

    Nick Morrott - 2011-06-22

    Can you provide details on the grabber (name and version) you used to produce the XMLTV data you are feeding to tv_imdb?

    If the XMLTV data is not being created as valid UTF-8 by the grabber, it should ideally be fixed there first.

     
  • user1024

    user1024 - 2011-06-22

    Thanks for looking into this.

    Here are the details of the grabber used.
    ~$ tv_grab_uk_rt --version
    XMLTV module version 0.5.59
    This is tv_grab_uk_rt version 1.331, 2010/11/19 11:31:00

    The file output from the grabber appears to have the correct utf-8 encoding and will open without error in Firefox.

    When I pipe the output from the grabber into tv_imdb, the new file has additional tags added, as expected. However, this new file does not open successfully in Firefox. One of the additional tags contains an actor's name which appears to be incorrectly encoded. The error in Firefox points to a character in the actor's name.

    A snippet of the xmltv file output from tv_imdb (with the patch applied) is given below. The error in Firefox points to the "Curd Jürgens" actor tag.

    <programme start="20110529110000 +0100" stop="20110529125500 +0100" channel="filmfour.channel4.com">
    <title>The Enemy Below</title>
    <desc lang="en">This is one of the best submarine movies...</desc>
    <credits>
    <director>Dick Powell</director>
    <actor>Robert Mitchum</actor>
    <actor>Curd Jürgens</actor>
    <actor>David Hedison</actor>
    ...

     
  • Karl Dietz

    Karl Dietz - 2011-07-14
    • summary: Character encoding of xml input not followed --> tv_imdb: character encoding of xml input not followed
     
  • Karl Dietz

    Karl Dietz - 2011-07-14

    Nick, it seems this has nothing to do with your grabber (and we're now checking for quite a few garbled UTF-8 characters in the nightly tester).
    User1024, it seems tv_imdb could use some love with regard to characters outside the ASCII set. Would you mind giving it a try? tv_validate_file should find all instances of strange stuff in place of proper UTF-8.
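
    As a rough illustration of what "finding strange stuff instead of proper UTF-8" means, the Python sketch below reports which lines of a byte string fail strict UTF-8 decoding. It is not how tv_validate_file works (that tool is part of XMLTV and written in Perl); the function name `find_invalid_utf8_lines` and the sample data are made up for this example.

```python
def find_invalid_utf8_lines(data: bytes) -> list:
    """Return the 1-based numbers of lines that are not valid UTF-8."""
    bad = []
    # Splitting on the newline byte is safe: 0x0A never occurs inside
    # a multi-byte UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Hypothetical sample: line 2 carries ISO-8859-1 bytes, line 3 proper UTF-8.
sample = (
    b"<actor>Robert Mitchum</actor>\n"
    + "<actor>Curd Jürgens</actor>\n".encode("iso-8859-1")
    + "<actor>Curd Jürgens</actor>\n".encode("utf-8")
)

print(find_invalid_utf8_lines(sample))
```

    Only the line whose "ü" was written as a single ISO-8859-1 byte is reported; the correctly encoded copy passes.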

     
  • Geoff

    Geoff - 2014-04-15

    This is because XMLTV::Writer doesn't do any encoding on output (an oversight IMO, and one which causes problems elsewhere, e.g. it is why tv_cat won't combine files with different encodings). It assumes the data are already suitably encoded and just outputs what it's got.

    By default the IMDb file will be read as iso-8859-1, so if you add this to a utf-8 xml file you are going to get wrong chars output, as you've found.
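
    The mismatch can be reproduced outside XMLTV. The following Python sketch (tv_imdb itself is Perl, so this is an illustration only, not the actual code or the attached patch) builds a small document declared as UTF-8, once with the actor's name as raw ISO-8859-1 bytes and once after re-encoding those bytes to UTF-8. The helper names `build` and `is_well_formed` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Bytes as they would come out of a file read as ISO-8859-1 (assumption
# for this sketch: the name contains a single non-ASCII character).
raw_name = "Curd Jürgens".encode("iso-8859-1")

HEAD = b'<?xml version="1.0" encoding="UTF-8"?>\n<credits>'
TAIL = b"</credits>"


def build(actor_bytes: bytes) -> bytes:
    """Wrap the given bytes in an <actor> element of a UTF-8 document."""
    return HEAD + b"<actor>" + actor_bytes + b"</actor>" + TAIL


def is_well_formed(document: bytes) -> bool:
    """Return True if the XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Inserting the bytes unchanged mixes two encodings in one file.
broken = build(raw_name)

# Decoding from the source encoding and encoding to the target one
# keeps the whole document in UTF-8.
fixed = build(raw_name.decode("iso-8859-1").encode("utf-8"))

print(is_well_formed(broken), is_well_formed(fixed))
```

    The first document should be rejected by the parser, much as Firefox rejects the tv_imdb output, while the re-encoded one parses cleanly.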

     
    Simplified test case:

    Run the attached files through tv_imdb and try to open the output.

    tv_imdb --imdbdir /IMDB test_8859.in.xml > test_8859.out.xml
    tv_imdb --imdbdir /IMDB test_utf8.in.xml > test_utf8.out.xml

    test_utf8.out.xml will report XML Parsing Error on the u-umlaut.

     
    If your XML is UTF-8 (or anything other than ISO-8859-1) then I expect you will also get problems when matching film titles, for the same reason.

    e.g. "À la carte (2001)" in a utf8 xml file won't match the IMDb database (iso-8859-1).
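
    A minimal Python sketch of why such a title fails to match, assuming the comparison is effectively done on undecoded bytes (an assumption made for illustration; the actual matching code in tv_imdb is Perl and is not shown in this ticket):

```python
title = "À la carte (2001)"

# The same title as stored in the two sources.
from_xml = title.encode("utf-8")        # UTF-8 XMLTV file: "À" is 2 bytes
from_imdb = title.encode("iso-8859-1")  # IMDb data: "À" is 1 byte

# Comparing the raw bytes fails even though the titles are identical.
bytes_match = from_xml == from_imdb

# Decoding each side with its own encoding first makes them comparable.
text_match = from_xml.decode("utf-8") == from_imdb.decode("iso-8859-1")

print(bytes_match, text_match)
```

    Titles made up only of ASCII characters are unaffected, since both encodings represent those characters with the same bytes.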

     
    Try this fix.

    1) Copy the file Encode.pm into XMLTV/Data/Recursive/ (you will have to create this dir) within your XMLTV directory (e.g. /usr/lib/perl5/site_perl/5.8.8/XMLTV/Data/Recursive/Encode.pm ).
    (DO NOT overwrite your existing Encode.pm in the std Perl library!)

    2) Patch tv_imdb as attached.

    3) Re-run the test cases.

    Works for me... Please let me know how you get on.

     
    Last edit: Geoff 2014-04-15
  • Geoff

    Geoff - 2014-04-19
    • status: open --> closed-fixed
    • Group: --> 0.5.65
     
  • Geoff

    Geoff - 2014-04-19

    Fixed in tv_imdb version 1.27

     
