From: Nick M. <kno...@us...> - 2010-07-31 22:07:22
Update of /cvsroot/xmltv/xmltv/grab/uk_rt
In directory sfp-cvsdas-3.v30.ch3.sourceforge.com:/tmp/cvs-serv8469/grab/uk_rt

Modified Files:
tv_grab_uk_rt.in utf8_fixups

Log Message:
Handle mis-encoded characters in range [C2][80-9F] that have just started appearing in the source data.

Index: tv_grab_uk_rt.in
===================================================================
RCS file: /cvsroot/xmltv/xmltv/grab/uk_rt/tv_grab_uk_rt.in,v
retrieving revision 1.300
retrieving revision 1.301
diff -C2 -d -r1.300 -r1.301
*** tv_grab_uk_rt.in 31 Jul 2010 09:10:12 -0000 1.300
--- tv_grab_uk_rt.in 31 Jul 2010 22:07:13 -0000 1.301
***************
*** 1876,1881 ****
  # once in source data to be able to construct a suitable fixup.
  #
  t(" Looking for Unicode Replacement Character...");
! if ( /\xEF\xBF\xBD/ || /\xC3\xAF\xC2\xBF\xC2\xBD/ ) {
  if (%utf8_fixups) {
--- 1876,1888 ----
  # once in source data to be able to construct a suitable fixup.
  #
+ # Mis-encoded characters in range [C2][80-9F]
+ # ===========================================
+ #
+ # Single characters that are seen in the source data as bytes in the
+ # range [C2][80-9F] that UTF-8 decode as non-printing characters
+ # instead of their intended character.
+ #
  t(" Looking for Unicode Replacement Character...");
! if ( /\xEF\xBF\xBD/ || /\xC3\xAF\xC2\xBF\xC2\xBD/ || /\xC2[\x80-\x9F]/ ) {
  if (%utf8_fixups) {

Index: utf8_fixups
===================================================================
RCS file: /cvsroot/xmltv/xmltv/grab/uk_rt/utf8_fixups,v
retrieving revision 1.67
retrieving revision 1.68
diff -C2 -d -r1.67 -r1.68
*** utf8_fixups 28 Jul 2010 18:46:16 -0000 1.67
--- utf8_fixups 31 Jul 2010 22:07:13 -0000 1.68
***************
*** 6,10 ****
  # frequently seen in the source data from the Radio Times.
  #
! # When the grabber is run with --debug a summary of listings files processed
  # containing unhandled mis-encoded characters is created. In order to create a
  # fixup, download the source data file from the Radio Times and examine the
--- 6,10 ----
  # frequently seen in the source data from the Radio Times.
  #
! # When the grabber is run with --debug, a summary of listings files processed
  # containing unhandled mis-encoded characters is created. In order to create a
  # fixup, download the source data file from the Radio Times and examine the
***************
*** 12,20 ****
  # replacement characters should be.
  #
! # ** From analysis of the bad and replacement bytes, it is getting more obvious
! # that it may be possible to handle the majority of the mis-encoded characters
! # in the grabber directly when certain byte sequences are found **
! #
! # The file is split into four sections:
  #
  # 1) Mis-encoded characters in range [C3][A0-AF][C2][80-BF][C2][80-BF]
--- 12,16 ----
  # replacement characters should be.
  #
! # The file is split into five sections:
  #
  # 1) Mis-encoded characters in range [C3][A0-AF][C2][80-BF][C2][80-BF]
***************
*** 22,29 ****
  # 3) Mis-encoded single characters represented with [EF][BF][BD] bytes
  # 4) Mis-encoded single characters represented with [C3][AF][C2][BF][C2][BD] bytes
  #
! # Where possible, only the affected bytes are included in the fixup to allow
! # them to be applied to other listings data in addition to that which required
! # the original fixup.
  #
  # Each entry comprises two pipe-separated fields:
--- 18,26 ----
  # 3) Mis-encoded single characters represented with [EF][BF][BD] bytes
  # 4) Mis-encoded single characters represented with [C3][AF][C2][BF][C2][BD] bytes
+ # 5) Mis-encoded single characters in range [C2][80-9F]
  #
! # As of XMLTV 0.5.57, the first two types of fixup are now handled automatically
! # in the grabber. Their fixups will be kept in this file until at least XMLTV
! # 0.5.58 for the benefit of users of older versions.
  #
  # Each entry comprises two pipe-separated fields:
***************
*** 337,338 ****
--- 334,346 ----
  # "vérité"
  \x76\xC3\xAF\xC2\xBF\xC2\xBD\x72\x69\x74\xC3\xAF\xC2\xBF\xC2\xBD|\x76\xC3\xA9\x72\x69\x74\xC3\xA9
+ #
+ #
+ # 5) UTF-8 characters seen in the raw data as bytes in the range [C2][80-9F]
+ # which should represent a single printing character but UTF-8 decode as
+ # non-printing control characters.
+ #
+ #
+ # "'"
+ \xC2\x91|\x27
+ # "'"
+ \xC2\x92|\x27
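
For readers who want to see what the new section 5 entries do to the data, here is a minimal sketch. It is written in Python rather than the grabber's Perl so that it stands alone, and it is not the grabber's actual code. The names FIXUP_LINES, C1_PATTERN, unescape, load_fixups and apply_fixups are hypothetical, as is the sample input. The only things taken from the commit are the two "bad|good" entries and the byte test /\xC2[\x80-\x9F]/.

```python
import re

# Hypothetical stand-in for the two entries this commit adds to utf8_fixups.
FIXUP_LINES = [
    "# \"'\"",
    r"\xC2\x91|\x27",
    "# \"'\"",
    r"\xC2\x92|\x27",
]

# Same byte test the patched grabber adds: C2 followed by a byte in 80-9F.
C1_PATTERN = re.compile(rb"\xC2[\x80-\x9F]")


def unescape(field: str) -> bytes:
    """Turn a field made of \\xNN escapes (and literal ASCII) into raw bytes."""
    text = re.sub(
        r"\\x([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        field,
    )
    # Every character is now below U+0100, so latin-1 maps it 1:1 to a byte.
    return text.encode("latin-1")


def load_fixups(lines):
    """Parse pipe-separated 'bad|good' entries, skipping comments and blanks."""
    fixups = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        bad, good = line.split("|", 1)
        fixups[unescape(bad)] = unescape(good)
    return fixups


def apply_fixups(data: bytes, fixups) -> bytes:
    """Replace known bad byte sequences, but only if a suspect pair is present."""
    if C1_PATTERN.search(data):
        for bad, good in fixups.items():
            data = data.replace(bad, good)
    return data


fixups = load_fixups(FIXUP_LINES)
raw = b"It\xC2\x92s a \xC2\x91live\xC2\x92 show"  # made-up sample title
fixed = apply_fixups(raw, fixups)
print(fixed.decode("utf-8"))  # prints: It's a 'live' show
```

Without a fixup, the pairs C2 91 and C2 92 decode to U+0091 and U+0092, which are non-printing control characters. That is why the grabber now flags any [C2][80-9F] pair it has no fixup for.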