From: SourceForge.net <no...@so...> - 2010-04-28 16:52:45
Bugs item #2989519, was opened at 2010-04-19 20:28
Message generated for change (Comment added) made by duncanwebb
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=446895&aid=2989519&group_id=46652

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: helpers
Group: 1.9.x
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: J.R. Oldroyd (jroldroyd)
Assigned to: Nobody/Anonymous (nobody)
Summary: UTF-8 problem when TV.xml contains 8-bit chars

Initial Comment:
I'm following up on issue freevo-Bugs-2953821 which the system has auto-closed since I was too slow to respond. Sorry about the long delay.

Your suggestion that I need to set LOCALE in the config file doesn't help, I'm afraid.

I am downloading from SchedulesDirect. I have LOCALE set to iso-8859-1 (also tried iso-8859-15 and utf-8 and also leaving it unset). In all cases, the downloaded TV.xml is always iso-8859-1 and the freevo webserver crashes with the error shown in issue 2953821 whenever someone searches for programs that contain 8-bit characters in their descriptions or titles.

I also see the example TV.xml embedded in src/tv/xmltv.py... that example is also iso-8859-1 encoding.

Where is the code that is supposed to convert the download to utf-8? It doesn't seem to be being invoked here so maybe I am missing some additional config?

In the absence of conversion of the download, the patch I uploaded to the other issue does make it work with an 8859-1 download.

----------------------------------------------------------------------

>Comment By: Duncan Webb (duncanwebb)
Date: 2010-04-28 18:52

Message:
Hi JR,

The files TV.xml and tv_grab_na_dd_ZPzR.tmp are both encoded as UTF8 but TV.xml says it is encoded as iso-8859-1. So you will get the wrong characters displayed.

You can prove this by changing the encoding line in TV.xml to utf-8.
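
A minimal sketch of another way to check for that mismatch, assuming Python 3 and a hypothetical helper named check_xml_encoding (it is not part of freevo or XMLTV). It reads the encoding named in the XML declaration and tests whether the raw bytes decode as UTF-8. This is only a heuristic: ISO-8859-1 text containing 8-bit characters is very rarely valid UTF-8, so a file that declares iso-8859-1, contains 8-bit bytes, and still decodes as UTF-8 is most likely mislabelled.

```python
import re

# Hypothetical helper, not part of freevo: compare the encoding an XML file
# declares with what its bytes actually look like.
DECL_RE = re.compile(rb"<\?xml[^>]*encoding=[\"']([A-Za-z0-9._-]+)[\"']")


def check_xml_encoding(data):
    """Return (declared_encoding, is_valid_utf8, has_8bit) for raw XML bytes."""
    match = DECL_RE.search(data[:200])
    declared = match.group(1).decode("ascii").lower() if match else None
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    has_8bit = any(byte > 0x7F for byte in data)
    return declared, valid_utf8, has_8bit


# Two made-up samples with the same declaration but different real encodings.
xml_text = '<?xml version="1.0" encoding="ISO-8859-1"?><title>Caf\u00e9</title>'
sample_utf8 = xml_text.encode("utf-8")
sample_latin1 = xml_text.encode("iso-8859-1")

print(check_xml_encoding(sample_utf8))    # ('iso-8859-1', True, True)
print(check_xml_encoding(sample_latin1))  # ('iso-8859-1', False, False is not expected; see below)
```

The first sample reports a declared iso-8859-1 with bytes that are valid UTF-8, which is the mislabelled case. The second reports ('iso-8859-1', False, True), meaning the bytes really are not UTF-8. For a real file, the bytes could be read with open(path, "rb").read() and passed to the same function.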
Then you will need to remove the pickle file in the cache directory, run freevo and see what is displayed. The pickle file should be regenerated.

The next question is what is causing the file to have the wrong encoding. You can try to run the tv_grab command from the command line and see if this is correct. If it is then the problem is in the tv_grab command.

Looking forward to your next instalment.

----------------------------------------------------------------------

Comment By: J.R. Oldroyd (jroldroyd)
Date: 2010-04-28 18:04

Message:
Quick extra note... turns out tv_grab_na_dd is a Perl script... and it explicitly codes ISO-8859-1 in its call to XMLTV::Writer() which means it is explicitly writing the output in 8859-1 which is why TV.xml is in 8859-1. This is not reconfigurable without changing the tv_grab_na_dd Perl script - no environment variable or argument will override this.

I'm trying to see what happens in freevo if freevo's LOCALE is not set in local_conf.py and why, if I do set it, I start getting channel errors in channels.py. I do not see where LOCALE receives a default value, if indeed it does.

For me, if LOCALE is not set, using the patch makes it all work perfectly and it's a solution I am happy with. But if I set LOCALE as you suggested, it then can't find TV channels when it goes to start a recording.

I see that the TV channels list is generated using the TV_CHANNELS_COMPARE sort routines in config.py... the xmltv parser there does use the LOCALE setting to parse the channels' display_names. Now, none of my channels have any 8-bit characters in the names, but I have re-written TV_CHANNELS_COMPARE to handle the "N-N" channel number format of US broadcast ATSC channels. This is not new...
I've had this for a couple of years now and it works fine if LOCALE is not set in local_conf.py, and anyway, this change is just to the lambda function to first sort the channel number (before the "-") and then sort the sub-channel number (after the "-") and this doesn't use LOCALE at all.

For some reason, though, it is breaking when LOCALE is set. I haven't quite yet figured out why this may be... does the "-" encode differently when LOCALE gets set?

Here's my lambda function for sorting channel numbers of the form "2-1", "2-2", "4-1", "5-1", "5-2", "5-3" etc:

TV_CHANNELS_COMPARE = lambda a, b: cmp(int(a[2][0:a[2].find('-')]), int(b[2][0:b[2].find('-')])) or cmp(int(a[2][a[2].find('-')+1:len(a[2])]), int(b[2][b[2].find('-')+1:len(b[2])]))

Do I have to call encoding() on the '-' here...?

I'm really not sure, though, how much further this really needs to be investigated since the other patch does work.

----------------------------------------------------------------------

Comment By: J.R. Oldroyd (jroldroyd)
Date: 2010-04-28 16:26

Message:
OK... I have put last night's files at these links:

http://opal.com/jr/freevodebug/TV.xml
http://opal.com/jr/freevodebug/TV.xml.pickled
http://opal.com/jr/freevodebug/tv_grab-523.log

These were loaded with LOCALE commented out in local_conf.py (therefore defaulting to iso-8859-15, I think).

If you look in the two TV.xml files and search for "Jacques" or "Bonnie" or "Chadian" or "Jazz" you will see that the words next to those (i.e., "Pépin" or "Erbé" or "Hissène Habré" or "Café") are encoded in iso-8859-1 in the TV.xml file but the same program has the same words encoded as utf-8 in the TV.xml.pickled file.

WITHOUT my patch, any attempt to display this program will crash the webserver with the coding error I mentioned before. WITH the patch, it displays just fine.
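
The difference between the two byte forms can be reproduced in a few lines. This is only a sketch under stated assumptions: it is written for Python 3, the function name decode_program_text is made up for illustration, and the fallback logic is not the patch attached to this ticket. It shows that the iso-8859-1 bytes for one of the titles above are not valid UTF-8, and that decoding with the encoding the file declares recovers the same text.

```python
# Python 3 sketch; names are illustrative and not taken from freevo or the patch.
title = "Jacques P\u00e9pin"

latin1_bytes = title.encode("iso-8859-1")  # byte form reported for TV.xml
utf8_bytes = title.encode("utf-8")         # byte form reported for TV.xml.pickled


def decode_program_text(raw, declared="iso-8859-1"):
    """Try UTF-8 first, then fall back to the encoding the XML file declares."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(declared)


print(latin1_bytes)  # b'Jacques P\xe9pin'
print(utf8_bytes)    # b'Jacques P\xc3\xa9pin'
print(decode_program_text(latin1_bytes) == decode_program_text(utf8_bytes))  # True
```

Calling latin1_bytes.decode("utf-8") directly raises UnicodeDecodeError, which is the same class of failure as a consumer that assumes UTF-8 being handed iso-8859-1 data.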
I have also placed the tv_grab_na_dd temporary file here:

http://opal.com/jr/freevodebug/tv_grab_na_dd_ZPzR.tmp

That is the raw data that is downloaded from the SchedulesDirect server. You will see here that this IS utf-8. However, tv_grab_na_dd converts the downloaded data to iso-8859-1 before delivering it to freevo for storage in TV.xml. I can replicate this behavior by running tv_grab_na_dd by hand and see the utf-8 temp download file, but I get iso-8859-1 delivered to stdout.

I.e., I have:

    SchedulesDirect download    utf-8
    tv_grab_na_dd output        iso-8859-1
    TV.xml                      iso-8859-1
    TV.xml.pickled              utf-8

and it looks like the freevo webserver expects utf-8 but is seeing the iso-8859-1 and barfing without that patch of mine.

Are you expecting TV.xml to be in utf-8 also? If so, my problem could be the tv_grab_na_dd configuration?

That said, since TV.xml does contain an xml encoding="ISO-8859-1" tag, I'd have thought that the application reading it should be able to handle the (re-)conversion. My patch adds that reconversion. Admittedly I reconvert using iso-8859-1 explicitly... perhaps my patch should look up the encoding specified in the file's encoding tag and reconvert using that?

BTW, if you want to email me instead of loading tickets like this, you can make my email address by swapping the first two parts of those URLs.

----------------------------------------------------------------------

Comment By: Duncan Webb (duncanwebb)
Date: 2010-04-27 17:17

Message:
Hi J.R.

It might be useful to attach the TV.xml that you have grabbed plus the logs that are generated.

TV.xml should be parsed by elementtree and converted to Unicode (not quite the same as UTF8). We don't generally store the files in Unicode but they should be stored in UTF8. The format is stored in sitecustomize.py.

I don't think that this is a general problem as it works for me.

----------------------------------------------------------------------

Comment By: J.R. Oldroyd (jroldroyd)
Date: 2010-04-20 02:19

Message:
Hmm... this problem seems to be more subtle than just setting LOCALE.

Having removed my patch, and set LOCALE as I mentioned earlier, and re-run tv_grab to rebuild all affected files, I now get problems with the recordserver being unable to find a channel at the time it is to start recording:

    channels.py (169): Cannot find tuner channel "###" in the TV channel listing

This is puzzling, because I don't see any 8-bit chars anywhere in the channel listings.

But, restore the patch, comment out the LOCALE setting again, and it works without problems again... showing the 8-bit program titles/sub-titles/descs OK and finding channels when it is to record.

I can spend some more time looking at this one and seeing why the channel selection no longer works when LOCALE is set... or since I see that TV.xml.pickled is always UTF-8 regardless of the setting of LOCALE, perhaps we can just accept the patch since it fixes the problem?

If you do want the patch, I'm re-attaching it here... this version is slightly updated from the one in the earlier ticket because I had missed out conversion of the title and only done sub-title and descr before.

----------------------------------------------------------------------

Comment By: J.R. Oldroyd (jroldroyd)
Date: 2010-04-19 21:05

Message:
Even more interesting...

After having converted the download from utf-8 to iso-8859-1, the next phase "caching data...?" converts it BACK to utf-8 and stores that in TV.xml.pickled. So we have a double conversion going on here.

And...

Setting LOCALE to "utf-8" does, in fact, make it work. I am sure I did try that earlier, but perhaps I did so while my earlier patch was still applied. Sorry about the confusion.

Anyway, slight change needed to your suggested fix, but it does now work without the patch being needed.

----------------------------------------------------------------------

Comment By: J.R. Oldroyd (jroldroyd)
Date: 2010-04-19 20:56

Message:
Interesting...

The raw SchedulesDirect download, stored in /tmp/tv_grab_na_dd_TcrN.tmp during the "freevo tv_grab" "loading data" phase IS in utf-8. The "Writing schedule" phase then converts it to ISO-8859-1.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=446895&aid=2989519&group_id=46652