Due to a recent change to telkku.com, listings in Finland are broken.
getting list of channels: 0% [ ]could not fetch http://www.telkku.com/telkku?tila=knvt&kan=149, error: 404 Not Found, aborting
Yup, looks like it. I'll have a look at the changes today.
URL format has definitely changed, e.g. URL for TV2 page for today is:
One piece of bad news: telkku.com has renumbered the channels, so a simple update of tv_grab_fi will not be enough. You'll have to update your configuration or run the XMLTV configure step again.
There were more changes required than I anticipated, but it is working again:
- URL format changed
- channel list HTML code changed
- program info HTML code changed
- handle UTF-8 encoded web pages
- make sure to convert Unicode characters not compatible with XMLTV's ISO-8859-1 encoding (currently U+2013, U+2019, U+201D)
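
The last item above can be illustrated with a minimal sketch. It is written in Python for brevity, not in the grabber's Perl, and the replacement characters and the function name `to_latin1_safe` are assumptions made for illustration; the script itself may map these code points differently.

```python
# Hypothetical mapping for the three code points named above.
# The replacements are illustrative guesses, not taken from tv_grab_fi.
REPLACEMENTS = {
    "\u2013": "-",   # EN DASH
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
}


def to_latin1_safe(text: str) -> str:
    """Return text that can be encoded as ISO-8859-1 without errors."""
    cleaned = text.translate(str.maketrans(REPLACEMENTS))
    # Any character still outside ISO-8859-1 becomes "?" instead of raising.
    return cleaned.encode("iso-8859-1", errors="replace").decode("iso-8859-1")
```

Characters that ISO-8859-1 already covers, such as "ä" and "ö", pass through unchanged; only the listed code points are rewritten.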
I can't add files to this bug report, so it has to be via pastebin:
Updated script: <http://pastebin.com/Qfyr9iDh>
Diffs for review: <http://pastebin.com/MmLXCVDg>
Ville: I could commit the changes directly from my CVS repository if I got write access.
Both of your pastebin links give a 502 Bad Gateway error...
It seems http://pastebin.com is down. Please try again later.
I have uploaded the files to an alternative service:
I hope this works better.
Stefan, thanks again for the swift response; I'll try out your patch and commit the changes as soon as pastebin starts working again. I'll also see that you get write access to cvs asap (I don't have permission to give write access myself but will contact the admins about it).
The data contains duplicate programs; it seems the site lists the program at the day boundary on both days.
Please verify and leave out either the first or last program on each day. (as the exact times vary between channels there's no other logic I could come up with quickly)
Here is a small diff that makes tv_grab_fi output utf-8 instead of latin1 now that the source data is also in utf-8
Now that the source data is in utf-8 the tv_grab_fi should pass the data through as is without hacking it to iso-8859-1
Ahh, I was wondering if XMLTV supports UTF-8. Thanks.
Here is the updated version: <http://filebin.ca/mpsmxv/tv_grab_fi>
- applied UTF-8 patch with fixes:
* reverted decode_utf8 removal, as XMLTV only seems to read octets, even if the web page is UTF-8 encoded
* set utf-8 encoding on output XML file
- As we no longer tidy the program data, we now have to process the config file as a UTF-8 encoded file (example: "70's show" on "TV Viisi/The Voice"); otherwise the user won't be able to specify a "series description".
- Drop the last program entry from each page, as telkku.com now repeats the last show of the day at the start of the next day. This avoids duplicate programme entries.
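
The duplicate handling in the last item can be sketched roughly as follows. This is Python, not the grabber's Perl, and `merge_days` and the `(start, title)` tuples are hypothetical names and shapes chosen only for illustration.

```python
def merge_days(pages):
    """Concatenate per-day listings, dropping the final entry of each page.

    `pages` is a list of per-day lists of (start, title) tuples in date
    order. This assumes every page ends with the show that opens the
    following page, as described above.
    """
    merged = []
    for page in pages:
        # The last entry reappears as the first entry of the next page.
        merged.extend(page[:-1])
    return merged
```

One side effect of this approach is that the final show on the last fetched day is dropped as well, since there is no following page to supply it.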
there is one small issue left with the encoding (as found by
The output gets encoded as utf-8 only when writing directly to a file (--output tv.xml) but not when outputting to STDOUT and redirecting to a file (> tv.xml)
The tester is complaining about iso-8859-1 encoding in the redirected output when it's expecting utf-8 encoding.
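
The general shape of a fix is to set the encoding on the output stream itself, so that a file and STDOUT behave the same. A minimal sketch of that idea in Python (not the Perl used by tv_grab_fi; `write_listing` is a hypothetical helper, and the title is written unescaped to keep the example short):

```python
import io


def write_listing(binary_stream, title):
    """Write a tiny XML fragment as UTF-8 to any binary stream."""
    # Wrap the byte stream explicitly so the encoding never depends on
    # whether the destination is a file or a redirected STDOUT.
    out = io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")
    out.write('<?xml version="1.0" encoding="utf-8"?>\n')
    out.write(f"<title>{title}</title>\n")
    out.flush()
    out.detach()  # leave the underlying stream open for the caller
```

The same function can be given `sys.stdout.buffer` or a file opened in `"wb"` mode, so both output paths produce identical bytes.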
*sigh* I had been wondering about the encoding handling of the default XMLTV::Writer output. But as I didn't see any "wide character" warnings in my tests any more, I thought XML::Writer would set the file encoding automatically. Apparently MythTV uses the --output option; otherwise I would have seen those warnings during my test runs.
Well it's easy to fix.
I'm getting programs listed one day too late when the first program in the scraped page is from the day before. Example: http://telkku.com/channel/list/1/20101204 . Changing tv_grab_fi line 529: "pop(@data) if @data;" that removes the last program to "shift(@data) if @data;" fixes the problem, but possibly causes others.
@nobody: this is another problem which is tracked in bug #3125542. Your solution proposal isn't correct either.
@va1210: please close this bug.