[xmltv-commit] CVS: xmltv/grab/uk_rt tv_grab_uk_rt.in, 1.300, 1.301 utf8_fixups, 1.67, 1.68

Update of /cvsroot/xmltv/xmltv/grab/uk_rt
In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv8469/grab/uk_rt

Modified Files:
	tv_grab_uk_rt.in utf8_fixups 
Log Message:
Handle mis-encoded characters in range [C2][80-9F] that have just started appearing in the source data.

Index: tv_grab_uk_rt.in
===================================================================
RCS file: /cvsroot/xmltv/xmltv/grab/uk_rt/tv_grab_uk_rt.in,v
retrieving revision 1.300
retrieving revision 1.301
diff -C2 -d -r1.300 -r1.301
*** tv_grab_uk_rt.in	31 Jul 2010 09:10:12 -0000	1.300
--- tv_grab_uk_rt.in	31 Jul 2010 22:07:13 -0000	1.301
***************
*** 1876,1881 ****
          # once in source data to be able to construct a suitable fixup.
          #
          t("    Looking for Unicode Replacement Character...");
!         if ( /\xEF\xBF\xBD/ || /\xC3\xAF\xC2\xBF\xC2\xBD/ ) {
  
              if (%utf8_fixups) {
--- 1876,1888 ----
          # once in source data to be able to construct a suitable fixup.
          #
+         # Mis-encoded characters in range [C2][80-9F]
+         # ===========================================
+         #
+         # Single characters that are seen in the source data as bytes in the
+         # range [C2][80-9F] that UTF-8 decode as non-printing characters
+         # instead of their intended character.
+         #
          t("    Looking for Unicode Replacement Character...");
!         if ( /\xEF\xBF\xBD/ || /\xC3\xAF\xC2\xBF\xC2\xBD/ || /\xC2[\x80-\x9F]/ ) {
  
              if (%utf8_fixups) {

Index: utf8_fixups
===================================================================
RCS file: /cvsroot/xmltv/xmltv/grab/uk_rt/utf8_fixups,v
retrieving revision 1.67
retrieving revision 1.68
diff -C2 -d -r1.67 -r1.68
*** utf8_fixups	28 Jul 2010 18:46:16 -0000	1.67
--- utf8_fixups	31 Jul 2010 22:07:13 -0000	1.68
***************
*** 6,10 ****
  # frequently seen in the source data from the Radio Times.
  #
! # When the grabber is run with --debug a summary of listings files processed
  # containing unhandled mis-encoded characters is created. In order to create a
  # fixup, download the source data file from the Radio Times and examine the
--- 6,10 ----
  # frequently seen in the source data from the Radio Times.
  #
! # When the grabber is run with --debug, a summary of listings files processed
  # containing unhandled mis-encoded characters is created. In order to create a
  # fixup, download the source data file from the Radio Times and examine the
***************
*** 12,20 ****
  # replacement characters should be.
  #
! # ** From analysis of the bad and replacement bytes, it is getting more obvious
! # that it may be possible to handle the majority of the mis-encoded characters
! # in the grabber directly when certain byte sequences are found **
! #
! # The file is split into four sections:
  #
  # 1) Mis-encoded characters in range [C3][A0-AF][C2][80-BF][C2][80-BF]
--- 12,16 ----
  # replacement characters should be.
  #
! # The file is split into five sections:
  #
  # 1) Mis-encoded characters in range [C3][A0-AF][C2][80-BF][C2][80-BF]
***************
*** 22,29 ****
  # 3) Mis-encoded single characters represented with [EF][BF][BD] bytes
  # 4) Mis-encoded single characters represented with [C3][AF][C2][BF][C2][BD] bytes
  #
! # Where possible, only the affected bytes are included in the fixup to allow
! # them to be applied to other listings data in addition to that which required
! # the original fixup.
  #
  # Each entry comprises two pipe-separated fields:
--- 18,26 ----
  # 3) Mis-encoded single characters represented with [EF][BF][BD] bytes
  # 4) Mis-encoded single characters represented with [C3][AF][C2][BF][C2][BD] bytes
+ # 5) Mis-encoded single characters in range [C2][80-9F]
  #
! # As of XMLTV 0.5.57, the first two types of fixup are now handled automatically
! # in the grabber. Their fixups will be kept in this file until at least XMLTV
! # 0.5.58 for the benefit of users of older versions.
  #
  # Each entry comprises two pipe-separated fields:
***************
*** 337,338 ****
--- 334,346 ----
  # "vérité"
  \x76\xC3\xAF\xC2\xBF\xC2\xBD\x72\x69\x74\xC3\xAF\xC2\xBF\xC2\xBD|\x76\xC3\xA9\x72\x69\x74\xC3\xA9
+ #
+ #
+ # 5) UTF-8 characters seen in the raw data as bytes in the range [C2][80-9F]
+ # which should represent a single printing character but UTF-8 decode as 
+ # non-printing control characters.
+ #
+ #
+ # "'"
+ \xC2\x91|\x27
+ # "'"
+ \xC2\x92|\x27
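
The new section 5 entries follow the file's two-field, pipe-separated format (bad bytes, then replacement bytes). The following is a minimal sketch of how such entries could be parsed and applied to a byte string. It is not the grabber's actual code: the helper names unescape_bytes, parse_fixups and apply_fixups, and the sample title, are hypothetical and used only for illustration. The detection regex is the one added to tv_grab_uk_rt.in in this commit.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a field written with \xNN escapes into raw bytes.
sub unescape_bytes {
    my ($field) = @_;
    $field =~ s/\\x([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $field;
}

# Hypothetical helper: parse "bad|good" entries, skipping comments and blank lines.
sub parse_fixups {
    my (@lines) = @_;
    my %fixups;
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;
        my ($bad, $good) = split /\|/, $line, 2;
        next unless defined $good;
        $fixups{ unescape_bytes($bad) } = unescape_bytes($good);
    }
    return %fixups;
}

# Hypothetical helper: apply every fixup, longest bad sequence first, so that
# a short sequence cannot clobber part of a longer one.
sub apply_fixups {
    my ($bytes, %fixups) = @_;
    for my $bad (sort { length($b) <=> length($a) || $a cmp $b } keys %fixups) {
        $bytes =~ s/\Q$bad\E/$fixups{$bad}/g;
    }
    return $bytes;
}

# The two section 5 entries added in revision 1.68 of utf8_fixups.
my %fixups = parse_fixups(
    '# "\'"',
    '\xC2\x91|\x27',
    '\xC2\x92|\x27',
);

# Assumed sample input: a title containing the mis-encoded bytes C2 92.
my $raw = "Britain\xC2\x92s Best";

# Same byte-range test as the pattern added to tv_grab_uk_rt.in.
my $needs_fixup = $raw =~ /\xC2[\x80-\x9F]/ ? 1 : 0;

my $fixed = apply_fixups($raw, %fixups);
print "$fixed\n";
```

The sketch works on raw bytes rather than decoded characters, since the bytes C2 80 to C2 9F are valid UTF-8 that decodes to non-printing C1 control characters and so would not be flagged as malformed by a decoder.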
