Re: [cx-oracle-users] Antw: Bad conversion of a unicode value?
Brought to you by:
atuining
From: Michael S. <ms...@co...> - 2007-11-28 11:13:17
|
matilda matilda schrieb: >>>> Michael Schlenker <ms...@co...> 27.11.2007 10:49 >>> >> SQL> select asciistr(unicp), unicp from nls_testing; >> >> ASCIISTR(UNICP) UNICP >> -------------------------------------------------- -------- >> \20AC =E2=82=AC >> \2456 =C2=BF >> >> SQL> commit; >> Transaktion mit COMMIT abgeschlossen. >> >> As told, i get the first one correct, but the second does >> appear to not reach python intact. >=20 > Hi Michael, >=20 > just a short information in between. At the moment I can only > investigate > this problem in spare time. That's the reason that it will last some > time > anyway. >=20 > What I did: I created a short Embedded SQL C program to get the values > from the database. Interestingly I get the same byte sequences as I > get > them from python. My conclusion: There must be something with the > internal character set conversion between DB backend and DB client. >=20 > I also don't know at the moment how much the client environment (Linux > for me) > influences the results. E.g. I get "rubbish" for both values stored in > your > test tables. >=20 > If you have new informations please let us know. Anthony is interested > in=20 > improving the i18n and character set stuff in cx_Oracle. At the moment > he > seems a little bit busy. :-) >=20 I created a little test program to see where it starts to break, see end = of the message. Basically i find that many codepoints >=3D 128 (0x80)(aka everything outs= ide 7 Bit ASCII) are broken and don't make it through unharmed, but not all of them. The ones that make it through look suspiciously like CP1252..., which is = the Database Encoding (MSWINxxxx), but not the Database National Encoding. The first block is "0080..00FF; Latin-1 Supplement" from the unicode char= acter database, the second one is the upper part of "Latin Extended-A", and then one char= from "Latin Extended-B", then some from punctuation and some from currency symbols. So i think cx_Oracle somehow mistakenly converts to database encoding som= ewhere in between. Michael Here is the full list of Non ASCII that gets through unbroken: Returned: ASCIISTR( =7F ) UTF8( =7F ) UNICODE( u'\x7f' ) Returned: ASCIISTR( \0081 ) UTF8( =C3=82=C2=81 ) UNICODE( u'\x81' ) Returned: ASCIISTR( \008D ) UTF8( =C3=82=C2=8D ) UNICODE( u'\x8d' ) Returned: ASCIISTR( \008F ) UTF8( =C3=82=C2=8F ) UNICODE( u'\x8f' ) Returned: ASCIISTR( \0090 ) UTF8( =C3=82=C2=90 ) UNICODE( u'\x90' ) Returned: ASCIISTR( \009D ) UTF8( =C3=82=C2=9D ) UNICODE( u'\x9d' ) Returned: ASCIISTR( \00A0 ) UTF8( =C3=82 ) UNICODE( u'\xa0' ) Returned: ASCIISTR( \00A1 ) UTF8( =C3=82=C2=A1 ) UNICODE( u'\xa1' ) Returned: ASCIISTR( \00A2 ) UTF8( =C3=82=C2=A2 ) UNICODE( u'\xa2' ) Returned: ASCIISTR( \00A3 ) UTF8( =C3=82=C2=A3 ) UNICODE( u'\xa3' ) Returned: ASCIISTR( \00A4 ) UTF8( =C3=82=C2=A4 ) UNICODE( u'\xa4' ) Returned: ASCIISTR( \00A5 ) UTF8( =C3=82=C2=A5 ) UNICODE( u'\xa5' ) Returned: ASCIISTR( \00A6 ) UTF8( =C3=82=C2=A6 ) UNICODE( u'\xa6' ) Returned: ASCIISTR( \00A7 ) UTF8( =C3=82=C2=A7 ) UNICODE( u'\xa7' ) Returned: ASCIISTR( \00A8 ) UTF8( =C3=82=C2=A8 ) UNICODE( u'\xa8' ) Returned: ASCIISTR( \00A9 ) UTF8( =C3=82=C2=A9 ) UNICODE( u'\xa9' ) Returned: ASCIISTR( \00AA ) UTF8( =C3=82=C2=AA ) UNICODE( u'\xaa' ) Returned: ASCIISTR( \00AB ) UTF8( =C3=82=C2=AB ) UNICODE( u'\xab' ) Returned: ASCIISTR( \00AC ) UTF8( =C3=82=C2=AC ) UNICODE( u'\xac' ) Returned: ASCIISTR( \00AD ) UTF8( =C3=82=C2=AD ) UNICODE( u'\xad' ) Returned: ASCIISTR( \00AE ) UTF8( =C3=82=C2=AE ) UNICODE( u'\xae' ) Returned: ASCIISTR( \00AF ) UTF8( =C3=82=C2=AF ) UNICODE( u'\xaf' ) Returned: ASCIISTR( \00B0 ) UTF8( =C3=82=C2=B0 ) UNICODE( u'\xb0' ) Returned: ASCIISTR( \00B1 ) UTF8( =C3=82=C2=B1 ) UNICODE( u'\xb1' ) Returned: ASCIISTR( \00B2 ) UTF8( =C3=82=C2=B2 ) UNICODE( u'\xb2' ) Returned: ASCIISTR( \00B3 ) UTF8( =C3=82=C2=B3 ) UNICODE( u'\xb3' ) Returned: ASCIISTR( \00B4 ) UTF8( =C3=82=C2=B4 ) UNICODE( u'\xb4' ) Returned: ASCIISTR( \00B5 ) UTF8( =C3=82=C2=B5 ) UNICODE( u'\xb5' ) Returned: ASCIISTR( \00B6 ) UTF8( =C3=82=C2=B6 ) UNICODE( u'\xb6' ) Returned: ASCIISTR( \00B7 ) UTF8( =C3=82=C2=B7 ) UNICODE( u'\xb7' ) Returned: ASCIISTR( \00B8 ) UTF8( =C3=82=C2=B8 ) UNICODE( u'\xb8' ) Returned: ASCIISTR( \00B9 ) UTF8( =C3=82=C2=B9 ) UNICODE( u'\xb9' ) Returned: ASCIISTR( \00BA ) UTF8( =C3=82=C2=BA ) UNICODE( u'\xba' ) Returned: ASCIISTR( \00BB ) UTF8( =C3=82=C2=BB ) UNICODE( u'\xbb' ) Returned: ASCIISTR( \00BC ) UTF8( =C3=82=C2=BC ) UNICODE( u'\xbc' ) Returned: ASCIISTR( \00BD ) UTF8( =C3=82=C2=BD ) UNICODE( u'\xbd' ) Returned: ASCIISTR( \00BE ) UTF8( =C3=82=C2=BE ) UNICODE( u'\xbe' ) Returned: ASCIISTR( \00BF ) UTF8( =C3=82=C2=BF ) UNICODE( u'\xbf' ) Returned: ASCIISTR( \00C0 ) UTF8( =C3=83=E2=82=AC ) UNICODE( u'\xc0' ) Returned: ASCIISTR( \00C1 ) UTF8( =C3=83=C2=81 ) UNICODE( u'\xc1' ) Returned: ASCIISTR( \00C2 ) UTF8( =C3=83=E2=80=9A ) UNICODE( u'\xc2' ) Returned: ASCIISTR( \00C3 ) UTF8( =C3=83=C6=92 ) UNICODE( u'\xc3' ) Returned: ASCIISTR( \00C4 ) UTF8( =C3=83=E2=80=9E ) UNICODE( u'\xc4' ) Returned: ASCIISTR( \00C5 ) UTF8( =C3=83=E2=80=A6 ) UNICODE( u'\xc5' ) Returned: ASCIISTR( \00C6 ) UTF8( =C3=83=E2=80=A0 ) UNICODE( u'\xc6' ) Returned: ASCIISTR( \00C7 ) UTF8( =C3=83=E2=80=A1 ) UNICODE( u'\xc7' ) Returned: ASCIISTR( \00C8 ) UTF8( =C3=83=CB=86 ) UNICODE( u'\xc8' ) Returned: ASCIISTR( \00C9 ) UTF8( =C3=83=E2=80=B0 ) UNICODE( u'\xc9' ) Returned: ASCIISTR( \00CA ) UTF8( =C3=83=C5=A0 ) UNICODE( u'\xca' ) Returned: ASCIISTR( \00CB ) UTF8( =C3=83=E2=80=B9 ) UNICODE( u'\xcb' ) Returned: ASCIISTR( \00CC ) UTF8( =C3=83=C5=92 ) UNICODE( u'\xcc' ) Returned: ASCIISTR( \00CD ) UTF8( =C3=83=C2=8D ) UNICODE( u'\xcd' ) Returned: ASCIISTR( \00CE ) UTF8( =C3=83=C5=BD ) UNICODE( u'\xce' ) Returned: ASCIISTR( \00CF ) UTF8( =C3=83=C2=8F ) UNICODE( u'\xcf' ) Returned: ASCIISTR( \00D0 ) UTF8( =C3=83=C2=90 ) UNICODE( u'\xd0' ) Returned: ASCIISTR( \00D1 ) UTF8( =C3=83=E2=80=98 ) UNICODE( u'\xd1' ) Returned: ASCIISTR( \00D2 ) UTF8( =C3=83=E2=80=99 ) UNICODE( u'\xd2' ) Returned: ASCIISTR( \00D3 ) UTF8( =C3=83=E2=80=9C ) UNICODE( u'\xd3' ) Returned: ASCIISTR( \00D4 ) UTF8( =C3=83=E2=80=9D ) UNICODE( u'\xd4' ) Returned: ASCIISTR( \00D5 ) UTF8( =C3=83=E2=80=A2 ) UNICODE( u'\xd5' ) Returned: ASCIISTR( \00D6 ) UTF8( =C3=83=E2=80=93 ) UNICODE( u'\xd6' ) Returned: ASCIISTR( \00D7 ) UTF8( =C3=83=E2=80=94 ) UNICODE( u'\xd7' ) Returned: ASCIISTR( \00D8 ) UTF8( =C3=83=CB=9C ) UNICODE( u'\xd8' ) Returned: ASCIISTR( \00D9 ) UTF8( =C3=83=E2=84=A2 ) UNICODE( u'\xd9' ) Returned: ASCIISTR( \00DA ) UTF8( =C3=83=C5=A1 ) UNICODE( u'\xda' ) Returned: ASCIISTR( \00DB ) UTF8( =C3=83=E2=80=BA ) UNICODE( u'\xdb' ) Returned: ASCIISTR( \00DC ) UTF8( =C3=83=C5=93 ) UNICODE( u'\xdc' ) Returned: ASCIISTR( \00DD ) UTF8( =C3=83=C2=9D ) UNICODE( u'\xdd' ) Returned: ASCIISTR( \00DE ) UTF8( =C3=83=C5=BE ) UNICODE( u'\xde' ) Returned: ASCIISTR( \00DF ) UTF8( =C3=83=C5=B8 ) UNICODE( u'\xdf' ) Returned: ASCIISTR( \00E0 ) UTF8( =C3=83 ) UNICODE( u'\xe0' ) Returned: ASCIISTR( \00E1 ) UTF8( =C3=83=C2=A1 ) UNICODE( u'\xe1' ) Returned: ASCIISTR( \00E2 ) UTF8( =C3=83=C2=A2 ) UNICODE( u'\xe2' ) Returned: ASCIISTR( \00E3 ) UTF8( =C3=83=C2=A3 ) UNICODE( u'\xe3' ) Returned: ASCIISTR( \00E4 ) UTF8( =C3=83=C2=A4 ) UNICODE( u'\xe4' ) Returned: ASCIISTR( \00E5 ) UTF8( =C3=83=C2=A5 ) UNICODE( u'\xe5' ) Returned: ASCIISTR( \00E6 ) UTF8( =C3=83=C2=A6 ) UNICODE( u'\xe6' ) Returned: ASCIISTR( \00E7 ) UTF8( =C3=83=C2=A7 ) UNICODE( u'\xe7' ) Returned: ASCIISTR( \00E8 ) UTF8( =C3=83=C2=A8 ) UNICODE( u'\xe8' ) Returned: ASCIISTR( \00E9 ) UTF8( =C3=83=C2=A9 ) UNICODE( u'\xe9' ) Returned: ASCIISTR( \00EA ) UTF8( =C3=83=C2=AA ) UNICODE( u'\xea' ) Returned: ASCIISTR( \00EB ) UTF8( =C3=83=C2=AB ) UNICODE( u'\xeb' ) Returned: ASCIISTR( \00EC ) UTF8( =C3=83=C2=AC ) UNICODE( u'\xec' ) Returned: ASCIISTR( \00ED ) UTF8( =C3=83=C2=AD ) UNICODE( u'\xed' ) Returned: ASCIISTR( \00EE ) UTF8( =C3=83=C2=AE ) UNICODE( u'\xee' ) Returned: ASCIISTR( \00EF ) UTF8( =C3=83=C2=AF ) UNICODE( u'\xef' ) Returned: ASCIISTR( \00F0 ) UTF8( =C3=83=C2=B0 ) UNICODE( u'\xf0' ) Returned: ASCIISTR( \00F1 ) UTF8( =C3=83=C2=B1 ) UNICODE( u'\xf1' ) Returned: ASCIISTR( \00F2 ) UTF8( =C3=83=C2=B2 ) UNICODE( u'\xf2' ) Returned: ASCIISTR( \00F3 ) UTF8( =C3=83=C2=B3 ) UNICODE( u'\xf3' ) Returned: ASCIISTR( \00F4 ) UTF8( =C3=83=C2=B4 ) UNICODE( u'\xf4' ) Returned: ASCIISTR( \00F5 ) UTF8( =C3=83=C2=B5 ) UNICODE( u'\xf5' ) Returned: ASCIISTR( \00F6 ) UTF8( =C3=83=C2=B6 ) UNICODE( u'\xf6' ) Returned: ASCIISTR( \00F7 ) UTF8( =C3=83=C2=B7 ) UNICODE( u'\xf7' ) Returned: ASCIISTR( \00F8 ) UTF8( =C3=83=C2=B8 ) UNICODE( u'\xf8' ) Returned: ASCIISTR( \00F9 ) UTF8( =C3=83=C2=B9 ) UNICODE( u'\xf9' ) Returned: ASCIISTR( \00FA ) UTF8( =C3=83=C2=BA ) UNICODE( u'\xfa' ) Returned: ASCIISTR( \00FB ) UTF8( =C3=83=C2=BB ) UNICODE( u'\xfb' ) Returned: ASCIISTR( \00FC ) UTF8( =C3=83=C2=BC ) UNICODE( u'\xfc' ) Returned: ASCIISTR( \00FD ) UTF8( =C3=83=C2=BD ) UNICODE( u'\xfd' ) Returned: ASCIISTR( \00FE ) UTF8( =C3=83=C2=BE ) UNICODE( u'\xfe' ) Returned: ASCIISTR( \00FF ) UTF8( =C3=83=C2=BF ) UNICODE( u'\xff' ) Returned: ASCIISTR( \0152 ) UTF8( =C3=85=E2=80=99 ) UNICODE( u'\u0152' ) Returned: ASCIISTR( \0153 ) UTF8( =C3=85=E2=80=9C ) UNICODE( u'\u0153' ) Returned: ASCIISTR( \0160 ) UTF8( =C3=85 ) UNICODE( u'\u0160' ) Returned: ASCIISTR( \0161 ) UTF8( =C3=85=C2=A1 ) UNICODE( u'\u0161' ) Returned: ASCIISTR( \0178 ) UTF8( =C3=85=C2=B8 ) UNICODE( u'\u0178' ) Returned: ASCIISTR( \017D ) UTF8( =C3=85=C2=BD ) UNICODE( u'\u017d' ) Returned: ASCIISTR( \017E ) UTF8( =C3=85=C2=BE ) UNICODE( u'\u017e' ) Returned: ASCIISTR( \0192 ) UTF8( =C3=86=E2=80=99 ) UNICODE( u'\u0192' ) Returned: ASCIISTR( \02C6 ) UTF8( =C3=8B=E2=80=A0 ) UNICODE( u'\u02c6' ) Returned: ASCIISTR( \02DC ) UTF8( =C3=8B=C5=93 ) UNICODE( u'\u02dc' ) Returned: ASCIISTR( \2013 ) UTF8( =C3=A2=E2=82=AC=E2=80=9C ) UNICODE( u'\= u2013' ) Returned: ASCIISTR( \2014 ) UTF8( =C3=A2=E2=82=AC=E2=80=9D ) UNICODE( u'\= u2014' ) Returned: ASCIISTR( \2018 ) UTF8( =C3=A2=E2=82=AC=CB=9C ) UNICODE( u'\u20= 18' ) Returned: ASCIISTR( \2019 ) UTF8( =C3=A2=E2=82=AC=E2=84=A2 ) UNICODE( u'\= u2019' ) Returned: ASCIISTR( \201A ) UTF8( =C3=A2=E2=82=AC=C5=A1 ) UNICODE( u'\u20= 1a' ) Returned: ASCIISTR( \201C ) UTF8( =C3=A2=E2=82=AC=C5=93 ) UNICODE( u'\u20= 1c' ) Returned: ASCIISTR( \201D ) UTF8( =C3=A2=E2=82=AC=C2=9D ) UNICODE( u'\u20= 1d' ) Returned: ASCIISTR( \201E ) UTF8( =C3=A2=E2=82=AC=C5=BE ) UNICODE( u'\u20= 1e' ) Returned: ASCIISTR( \2020 ) UTF8( =C3=A2=E2=82=AC ) UNICODE( u'\u2020' )= Returned: ASCIISTR( \2021 ) UTF8( =C3=A2=E2=82=AC=C2=A1 ) UNICODE( u'\u20= 21' ) Returned: ASCIISTR( \2022 ) UTF8( =C3=A2=E2=82=AC=C2=A2 ) UNICODE( u'\u20= 22' ) Returned: ASCIISTR( \2026 ) UTF8( =C3=A2=E2=82=AC=C2=A6 ) UNICODE( u'\u20= 26' ) Returned: ASCIISTR( \2030 ) UTF8( =C3=A2=E2=82=AC=C2=B0 ) UNICODE( u'\u20= 30' ) Returned: ASCIISTR( \2039 ) UTF8( =C3=A2=E2=82=AC=C2=B9 ) UNICODE( u'\u20= 39' ) Returned: ASCIISTR( \203A ) UTF8( =C3=A2=E2=82=AC=C2=BA ) UNICODE( u'\u20= 3a' ) Returned: ASCIISTR( \20AC ) UTF8( =C3=A2=E2=80=9A=C2=AC ) UNICODE( u'\u20= ac' ) Returned: ASCIISTR( \2122 ) UTF8( =C3=A2=E2=80=9E=C2=A2 ) UNICODE( u'\u21= 22' ) import os # path to oci.dll os.environ['PATH'] =3D 'c:\\instantclient10_1' # path to tnsnames.ora os.environ['TNS_ADMIN'] =3D 'U:\\' os.environ['NLS_LANG'] =3D 'GERMAN_GERMANY.AL32UTF8' import cx_Oracle conn =3D cx_Oracle.connect("scott/tiger@DEV") # iterate over the Unicode BMP cur =3D conn.cursor() stmt =3D "SELECT ASCIISTR(nchr(:codepoint)),nchr(:codepoint) FROM DUAL" cur.prepare(stmt) fail =3D 0 # loop from 0 to the start of the surrogate pairs for i in xrange(55295): c =3D unichr(i) cutf8 =3D c.encode('utf-8') cur.execute(None,{'codepoint':i}) asciiret, utfret=3D cur.fetchone() cret =3D utfret.decode('utf-8') if cret !=3D c: print "Difference for codepoint: %x" % i print "Returned: ASCIISTR(",asciiret,")\tUTF8(",utfret,")\tUNICO= DE(",repr(cret),")" print "Expected: \t\t\tUTF8(",cutf8,")\tUNICODE(",repr(c),")" fail +=3D 1 if fail > 50: break |