Thread: [cx-oracle-users] Bad conversion of a unicode value?
Brought to you by:
atuining
From: Michael S. <ms...@co...> - 2007-11-26 17:44:08
|
Hi,

I'm trying to use Unicode with cx_Oracle but get some strange results; maybe someone can enlighten me as to what goes wrong.

The setup:
  cx_Oracle.version = 4.3.3
  OCI from instantclient10_1 (oci.dll Version 10.01.0000.0004)
  NLS_LANG set to GERMAN_GERMANY.AL32UTF8 before loading cx_Oracle
  Windows XP SP 2

From sqlplus:

SQL> select * from nls_database_parameters;

PARAMETER                      VALUE
------------------------------ ----------------------------------------
NLS_LANGUAGE                   AMERICAN
NLS_TERRITORY                  AMERICA
NLS_CURRENCY                   $
NLS_ISO_CURRENCY               AMERICA
NLS_NUMERIC_CHARACTERS         .,
NLS_CHARACTERSET               WE8MSWIN1252
NLS_CALENDAR                   GREGORIAN
NLS_DATE_FORMAT                DD-MON-RR
NLS_DATE_LANGUAGE              AMERICAN
NLS_SORT                       BINARY
NLS_TIME_FORMAT                HH.MI.SSXFF AM
NLS_TIMESTAMP_FORMAT           DD-MON-RR HH.MI.SSXFF AM
NLS_TIME_TZ_FORMAT             HH.MI.SSXFF AM TZR
NLS_TIMESTAMP_TZ_FORMAT        DD-MON-RR HH.MI.SSXFF AM TZR
NLS_DUAL_CURRENCY              $
NLS_COMP                       BINARY
NLS_LENGTH_SEMANTICS           BYTE
NLS_NCHAR_CONV_EXCP            FALSE
NLS_NCHAR_CHARACTERSET         AL16UTF16
NLS_RDBMS_VERSION              10.2.0.3.0

I created a table with sqlplus with some Unicode characters in it:

CREATE TABLE nls_testing ( unicp NVARCHAR(10));
INSERT INTO nls_testing (nchar(8365)); -- euro symbol uses 2 utf-8 bytes
INSERT INTO nls_testing (nchar(9305)); -- some cjk ideogram uses 3 utf-8 bytes
COMMIT;

A select shows all is fine in the db; I see the correct Unicode codepoints.

SQL> SELECT asciistr(unicp) FROM nls_testing;

ASCIISTR(UNICP)
--------------------------------------------------------------------------------
\20AC
\2456

Now I tried to select the data with cx_Oracle from a Python shell (2.5.1, Windows):

>>> import cx_Oracle
>>> cx_Oracle.version
'4.3.3'
>>> conn = cx_Oracle.connect('...')
>>> cur = conn.cursor()
>>> conn.encoding
'UTF-8'
>>> conn.nencoding
'UTF-8'
>>> cur.execute('select unicp, asciistr(unicp) from nls_testing')
[<cx_Oracle.STRING with value None>, <cx_Oracle.STRING with value None>]
>>> rows = cur.fetchall()
>>> print rows
[('\xe2\x82\xac', '\\20AC'), ('\xc2\xbf', '\\2456')]
>>> for row in rows:
...     print repr(unicode(row[0].decode('utf-8')))
...
u'\u20ac'
u'\xbf'

So I get the euro symbol correctly delivered and decoded from UTF-8, but I get trash for the Chinese character. The expected UTF-8 hex value for Unicode codepoint 0x2456 is 'e29196'.

Any idea what goes wrong here?

Michael

--
Michael Schlenker
Software Engineer

CONTACT Software GmbH           Tel.:   +49 (421) 20153-80
Wiener Straße 1-3               Fax:    +49 (421) 20153-41
28359 Bremen
http://www.contact.de/          E-Mail: ms...@co...

Sitz der Gesellschaft: Bremen | Geschäftsführer: Karl Heinz Zachries
Eingetragen im Handelsregister des Amtsgerichts Bremen unter HRB 13215
|
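The expected UTF-8 byte sequences for the two test codepoints can be double-checked outside the database with plain Python. This is a minimal sketch (Python 3 syntax; the thread itself uses Python 2):

```python
# Expected UTF-8 encodings for the two test codepoints:
# U+20AC (euro sign) and U+2456 (the second test character).
euro = "\u20ac".encode("utf-8")
other = "\u2456".encode("utf-8")

print(euro.hex())   # e282ac
print(other.hex())  # e29196
```

This confirms the values quoted in the mail: the euro sign encodes to 2 or 3 UTF-8 bytes depending on the codepoint, and 0x2456 must come back as the byte sequence e2 91 96 if nothing mangles it in transit.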
From: matilda m. <ma...@gr...> - 2007-11-27 08:28:19
|
>>> Michael Schlenker <ms...@co...> 26.11.2007 18:24 >>>
>
> I created a Table with sqlplus with some unicode characters in it:
>
> CREATE TABLE nls_testing ( unicp NVARCHAR(10));
> INSERT INTO nls_testing (nchar(8365)); -- euro symbol uses 2 utf-8 bytes
> INSERT INTO nls_testing (nchar(9305)); -- some cjk ideogram uses 3 utf-8 bytes
> COMMIT;

Hi Michael,

I just wanted to reproduce your problem, but:

1) CREATE TABLE nls_testing ( unicp NVARCHAR(10));
   doesn't work.

2) a) INSERT INTO nls_testing (nchar(8365))
   b) INSERT INTO nls_testing (nchar(9305))
   are syntactically wrong.

Decimal 8365 = Hex 20AD and NOT 20AC
Decimal 9305 = Hex 2459 and NOT 2456

Can you check this, so that we can have a look at it?

Best regards
Andreas Mock
|
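The decimal/hex mismatch pointed out above is easy to verify with any Python interpreter; a quick check:

```python
# The codepoints the first mail actually inserted vs. the ones it expected:
print(hex(8365))        # 0x20ad  (not U+20AC, the euro sign)
print(hex(9305))        # 0x2459  (not U+2456)

# The intended codepoints correspond to these decimal values:
print(int("20AC", 16))  # 8364
print(int("2456", 16))  # 9302
```

So the original INSERT statements were off by one (8365 instead of 8364) and by three (9305 instead of 9302), which is why the codepoints shown by ASCIISTR did not match.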
From: Michael S. <ms...@co...> - 2007-11-28 15:31:15
Attachments:
cx_Oracle.diff
|
Michael Schlenker schrieb:
> matilda matilda schrieb:
>>>>> Michael Schlenker <ms...@co...> 28.11.2007 12:13 >>>
>>> So i think cx_Oracle somehow mistakenly converts to database encoding somewhere in between.
>> Hi Michael,
>>
>> I'm not sure if this is a problem of cx_Oracle. That's the reason I wrote this little
>> Embedded C Program for that. On the C level I only get byte sequences anyway. So, I'm
>> sure that nothing but the Oracle client library is in between (hopefully ;-)).
>>

Okay, I got my test to work after patching cx_Oracle a little bit.

From taking a closer look at the code, Unicode support is at best to be described as 'rudimentary'; lots of fine points are still missing in there.

With that patch (against the cx_Oracle-4.3.3.zip file) at least my test runs through cleanly when I set the right environment. One surely can do better...

Michael

--
Michael Schlenker
Software Engineer

CONTACT Software GmbH           Tel.:   +49 (421) 20153-80
Wiener Straße 1-3               Fax:    +49 (421) 20153-41
28359 Bremen
http://www.contact.de/          E-Mail: ms...@co...

Sitz der Gesellschaft: Bremen | Geschäftsführer: Karl Heinz Zachries
Eingetragen im Handelsregister des Amtsgerichts Bremen unter HRB 13215
|
From: Anthony T. <ant...@gm...> - 2007-11-29 15:46:20
|
Well, imagine my surprise to return after a short trip and find a dozen messages, including patches! :-)

As I've said before, I'm happy to accept patches and I'll certainly look at this one. I have some questions that would help me understand further what you are trying to accomplish. I'm only slowly beginning to get my head around Unicode -- the concept is very simple but the implementations are many and varied... :-(

1) You used OCIEnvNlsCreate() but then specified 0 for both the charset and the ncharset, which implies that you're not gaining anything by using it? Did you try this just using the normal OCIEnvCreate()?

2) You set OCI_ATTR_CHARSET_FORM but I am already doing that in (for example) StringVar.c. Did you remove it from that location? Are there more changes that you haven't specified?

Perhaps it will all become clear when I look at it further. Hopefully I'll get a chance to look at it today yet. What would be very helpful is a set of SQL statements and Python code that clearly demonstrates the problem, where applying the patch clearly removes that problem. I might be able to piece things together from the maelstrom of e-mails sent in the past couple of days, but one clear summarizing e-mail would be very helpful. :-) I'll then use that information as the basis for a test case as well.

On Nov 28, 2007 8:31 AM, Michael Schlenker <ms...@co...> wrote:
> Michael Schlenker schrieb:
> > matilda matilda schrieb:
> >>>>> Michael Schlenker <ms...@co...> 28.11.2007 12:13 >>>
> >>> So i think cx_Oracle somehow mistakenly converts to database encoding somewhere in between.
> >> Hi Michael,
> >>
> >> I'm not sure if this is a problem of cx_Oracle. That's the reason I wrote this little
> >> Embedded C Program for that. On the C level I only get byte sequences anyway. So, I'm
> >> sure that nothing but the Oracle client library is in between (hopefully ;-)).
> >>
> Okay, i got my test to work after patching cx_Oracle a little bit.
>
> From taking a closer look at the code Unicode support is at best to be described as
> 'rudimentary', lots of fine points still missing in there.
>
> With that patch (against the cx_Oracle-4.3.3.zip file) at least my test runs through
> cleanly when i set the right environment. One surely can do better....
>
> Michael
>
> --
> Michael Schlenker
> Software Engineer
>
> CONTACT Software GmbH           Tel.:   +49 (421) 20153-80
> Wiener Straße 1-3               Fax:    +49 (421) 20153-41
> 28359 Bremen
> http://www.contact.de/          E-Mail: ms...@co...
>
> Sitz der Gesellschaft: Bremen | Geschäftsführer: Karl Heinz Zachries
> Eingetragen im Handelsregister des Amtsgerichts Bremen unter HRB 13215
>
> diff -u5 cx_Oracle-4.3.3/Environment.c cx_Oracle-4.3.3_patched/Environment.c
> --- cx_Oracle-4.3.3/Environment.c       2007-10-02 05:46:58.000000000 +0200
> +++ cx_Oracle-4.3.3_patched/Environment.c       2007-11-28 15:26:15.339053700 +0100
> @@ -83,12 +83,12 @@
>      mode = OCI_OBJECT;
>      if (threaded)
>          mode |= OCI_THREADED;
>
>      // create the environment handle
> -    status = OCIEnvCreate(&environment->handle, mode, NULL, NULL,
> -            NULL, NULL, 0, NULL);
> +    status = OCIEnvNlsCreate(&environment->handle, mode, NULL, NULL,
> +            NULL, NULL, 0, NULL, 0, 0);
>      if (!environment->handle) {
>          Py_DECREF(environment);
>          PyErr_SetString(PyExc_RuntimeError,
>                  "Unable to acquire Oracle environment handle");
>          return NULL;
> Only in cx_Oracle-4.3.3_patched: Environment.c~
> diff -u5 cx_Oracle-4.3.3/Variable.c cx_Oracle-4.3.3_patched/Variable.c
> --- cx_Oracle-4.3.3/Variable.c  2007-10-02 05:46:58.000000000 +0200
> +++ cx_Oracle-4.3.3_patched/Variable.c  2007-11-28 16:27:31.292026100 +0100
> @@ -445,10 +445,11 @@
>          case SQLT_LNG:
>              return &vt_LongString;
>          case SQLT_AFC:
>              return &vt_FixedChar;
>          case SQLT_CHR:
> +        case SQLT_VCS:
>              if (charsetForm == SQLCS_NCHAR)
>                  return &vt_NationalCharString;
>              return &vt_String;
>          case SQLT_RDD:
>              return &vt_Rowid;
> @@ -669,11 +670,11 @@
>      if (Environment_CheckForError(cursor->environment, status,
>              "Variable_Define(): data type") < 0)
>          return NULL;
>
>      // retrieve character set form of the parameter
> -    if (dataType != SQLT_CHR && dataType != SQLT_CLOB) {
> +    if (dataType != SQLT_CHR && dataType != SQLT_VCS && dataType != SQLT_CLOB) {
>          charsetForm = SQLCS_IMPLICIT;
>      } else {
>          status = OCIAttrGet(param, OCI_HTYPE_DESCRIBE, (dvoid*) &charsetForm,
>                  0, OCI_ATTR_CHARSET_FORM, cursor->environment->errorHandle);
>          if (Environment_CheckForError(cursor->environment, status,
> @@ -736,10 +737,20 @@
>              "Variable_Define(): define") < 0) {
>          Py_DECREF(var);
>          return NULL;
>      }
>
> +    status = OCIAttrSet((void *) var->defineHandle,
> +            (ub4) OCI_HTYPE_DEFINE, (void *) &var->type->charsetForm,
> +            (ub4) 0, (ub4) OCI_ATTR_CHARSET_FORM,
> +            cursor->environment->errorHandle);
> +    if (Environment_CheckForError(var->environment, status,
> +            "Variable_Define(): attr_charset_form") < 0) {
> +        Py_DECREF(var);
> +        return NULL;
> +    }
> +
>      // call the procedure to set values after define
>      if (var->type->postDefineProc) {
>          if ((*var->type->postDefineProc)(var) < 0) {
>              Py_DECREF(var);
>              return NULL;
> @@ -831,11 +842,11 @@
>              "Variable_InternalBind(): set charset form") < 0)
>          return -1;
>      }
>
>      // set the max data size for strings
> -    if ((var->type == &vt_String || var->type == &vt_FixedChar)
> +    if ((var->type == &vt_String || var->type == &vt_FixedChar || var->type == &vt_NationalCharString)
>              && var->maxLength > var->type->elementLength) {
>          status = OCIAttrSet(var->bindHandle, OCI_HTYPE_BIND,
>                  (dvoid*) &var->type->elementLength, 0, OCI_ATTR_MAXDATA_SIZE,
>                  var->environment->errorHandle);
>          if (Environment_CheckForError(var->environment, status,
>
> -------------------------------------------------------------------------
> SF.Net email is sponsored by: The Future of Linux Business White Paper
> from Novell. From the desktop to the data center, Linux is going
> mainstream. Let it simplify your IT future.
> http://altfarm.mediaplex.com/ad/ck/8857-50307-18918-4
> _______________________________________________
> cx-oracle-users mailing list
> cx-...@li...
> https://lists.sourceforge.net/lists/listinfo/cx-oracle-users
>
>
|
From: matilda m. <ma...@gr...> - 2007-11-28 16:16:46
|
>>> Michael Schlenker <ms...@co...> 28.11.2007 16:31 >>>
> Michael Schlenker schrieb:
> Okay, i got my test to work after patching cx_Oracle a little bit.

Anthony will be happy to hear that. ;-) Anthony: Are you still here?

> From taking a closer look at the code Unicode support is at best to be described as
> 'rudimentary', lots of fine points still missing in there.

I'm sure Anthony will agree. Especially with the upcoming Py3000 there will be many questions to answer regarding byte-streams, unicode-streams, character set conversion (implicit/explicit), and character representation.

See the change history for when Anthony started to focus on character set conversion.

Amaury Forgeot d'Arc, who also gives valuable input, is probably also interested in that topic, since he speaks and writes a language with many special characters.

> With that patch (against the cx_Oracle-4.3.3.zip file) at least my test runs through
> cleanly when i set the right environment. One surely can do better....

Can you easily enhance the tests in the test directory to unit test the charset conversion cases?

Best regards
Andreas
|
From: Anthony T. <ant...@gm...> - 2007-11-29 15:48:54
|
On Nov 28, 2007 9:16 AM, matilda matilda <ma...@gr...> wrote:
> >>> Michael Schlenker <ms...@co...> 28.11.2007 16:31 >>>
> > Michael Schlenker schrieb:
> > Okay, i got my test to work after patching cx_Oracle a little bit.
>
> Anthony will be happy to hear that. ;-) Anthony: Are you still here?

I am. Just got back from a short trip and a little busy. I promise to get to this as soon as I get a chance.

> > From taking a closer look at the code Unicode support is at best to be described as
> > 'rudimentary', lots of fine points still missing in there.
>
> I'm sure Anthony will agree. Especially with the upcoming Py3000 there will
> be many questions to answer regarding byte-streams, unicode-streams, character set
> conversion (implicit/explicit), character representation.

Yes, I'll agree. My Unicode skills are only slowly ramping up. :-) It doesn't help that there are so many conflicting implementations to confuse me, too. Any pointers, patches, etc. are highly welcome. I'd really like to get this Unicode thing licked properly instead of the partially working code that is there now.

Anthony
|
From: Michael S. <ms...@co...> - 2007-11-28 16:28:10
|
matilda matilda schrieb:
>>>> Michael Schlenker <ms...@co...> 28.11.2007 16:31 >>>
>
>> From taking a closer look at the code Unicode support is at best to be described as
>> 'rudimentary', lots of fine points still missing in there.
>
> I'm sure Anthony will agree. Especially with the upcoming Py3000 there will
> be many questions to answer regarding byte-streams, unicode-streams, character set
> conversion (implicit/explicit), character representation.

I thought Py3k was supposed to solve those questions ;-)...

I usually write Tcl, where Unicode just works and you don't constantly step into the deep morass that the Python string/unicode dichotomy is. Basically it uses a model very close to the one now used by Py3k, but got it stable some years ago (around Tcl 8.2).

For cx_Oracle I would like to see maximum ease of use, e.g. expunge the evil NLS_LANG dependency, as it should never ever be needed if you're using Python (case 1: non-Unicode db encoding -> let cx_Oracle convert internally to the Python system encoding or to unicode, and be happy; case 2: Unicode db encoding -> let cx_Oracle just eat Python Unicode strings and spit them out). There are some pitfalls for literal SQL along the way, but I think you can write the interface in a way that you don't have to worry about all this stupid encoding conversion stuff forced upon us by Oracle, Python and other forces...

> See the change history to see when Anthony started to focus on character set
> conversion.
>
> Amaury Forgeot d'Arc who also gives valuable input is probably also interested
> in that topic while speaking and writing a language with many special characters.

Probably, but the main benefits come when you need to cross borders between Oracle's ISO charsets (e.g. Polish and French in one database).

>> With that patch (against the cx_Oracle-4.3.3.zip file) at least my test runs through
>> cleanly when i set the right environment. One surely can do better....
>
> Can you easily enhance the tests in the test directory to unit test the
> charset conversion cases?

Haven't looked there yet. But maybe I can rip something from my nosetest files when done.

Basically I first need to get permission to invest more time in these issues with cx_Oracle; it could happen that my manager decides to go a different route... :-(

Michael

--
Michael Schlenker
Software Engineer

CONTACT Software GmbH           Tel.:   +49 (421) 20153-80
Wiener Straße 1-3               Fax:    +49 (421) 20153-41
28359 Bremen
http://www.contact.de/          E-Mail: ms...@co...

Sitz der Gesellschaft: Bremen | Geschäftsführer: Karl Heinz Zachries
Eingetragen im Handelsregister des Amtsgerichts Bremen unter HRB 13215
|
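The conversion model sketched in the message above (let the driver hand back bytes in one known encoding, and convert exactly once at the Python boundary) can be illustrated without a database. This is a minimal sketch in Python 3 syntax; the `db_encoding` value and the raw fetch bytes are hypothetical stand-ins for what a UTF-8-configured client connection would deliver:

```python
# Hypothetical: raw bytes as a UTF-8 client fetch would return them.
# In the proposed model this value would come from connection.encoding
# instead of the process-global NLS_LANG environment variable.
db_encoding = "utf-8"
raw = b"\xe2\x82\xac"  # euro sign, encoded as UTF-8

# One explicit conversion at the Python boundary, nothing implicit:
text = raw.decode(db_encoding)
print(repr(text))  # '€'
```

The point of the design is that the application never sees bytes in an unknown encoding: either the driver decodes for you, or it accepts and returns Unicode strings directly.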
From: Amaury F. d'A. <ama...@gm...> - 2007-11-28 16:33:51
|
Michael Schlenker wrote:
> > I'm sure Anthony will agree. Especially with the upcoming Py3000 there will
> > be many questions to answer regarding byte-streams, unicode-streams, character set
> > conversion (implicit/explicit), character representation.
>
> I thought Py3k should solve those questions ;-)...

It won't. Instead, it will force you to correctly answer all those questions. Hopefully it will come with a guide of "best practices" in this area.

--
Amaury Forgeot d'Arc
|
From: Amaury F. d'A. <ama...@gm...> - 2007-11-28 16:32:08
Attachments:
unicode2.diff
|
matilda matilda wrote:
> >>> Michael Schlenker <ms...@co...> 28.11.2007 16:31 >>>
> > Michael Schlenker schrieb:
> > Okay, i got my test to work after patching cx_Oracle a little bit.
>
> Anthony will be happy to hear that. ;-) Anthony: Are you still here?
>
> > From taking a closer look at the code Unicode support is at best to be described as
> > 'rudimentary', lots of fine points still missing in there.
>
> I'm sure Anthony will agree. Especially with the upcoming Py3000 there will
> be many questions to answer regarding byte-streams, unicode-streams, character set
> conversion (implicit/explicit), character representation.
>
> See the change history to see when Anthony started to focus on character set
> conversion.
>
> Amaury Forgeot d'Arc who also gives valuable input is probably also interested
> in that topic while speaking and writing a language with many special characters.

I indeed proposed a patch to support Unicode one year ago. It was against version 4.2.1; I attach it again in the hope it can be useful. Here are the comments I sent at the time:

The visible changes are:
- a new cx_Oracle.UNICODE variable type, which accepts and returns unicode values.
- NVARCHAR columns will return this variable type.
- a new test file: UnicodeVar.py covers the most common cases.

There are some caveats:
- I based my work on the new vt_NationalCharString type (and its charsetForm member). As there are no unit tests for it, I may have broken something there. Same for the new NCLOB type.
- Automatic conversion between string and unicode could be supported (it should be easy to add, by using connection.encoding).
- I suspect it will only work if Python is compiled in UCS2 mode (== sizeof(Py_UNICODE) must be 2).

--
Amaury Forgeot d'Arc
|
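The UCS2 caveat mentioned above can be probed from Python itself: on Python 2, `sys.maxunicode` distinguished narrow (UCS2, 0xFFFF) builds from wide (UCS4, 0x10FFFF) builds, and since PEP 393 (Python 3.3) the distinction no longer exists. A quick check, in Python 3 syntax:

```python
import sys

# Narrow (UCS2) Python 2 builds reported 0xffff here; wide (UCS4)
# builds reported 0x10ffff. On Python 3.3+ this is always 0x10ffff.
print(hex(sys.maxunicode))
```

On a narrow build, codepoints above U+FFFF were represented as surrogate pairs, which is exactly the case a 2-byte `Py_UNICODE` assumption in C extension code has to worry about.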
From: Michael S. <ms...@co...> - 2007-11-27 09:49:30
|
matilda matilda schrieb:
>>>> Michael Schlenker <ms...@co...> 26.11.2007 18:24 >>>
>> I created a Table with sqlplus with some unicode characters in it:
>>
>> CREATE TABLE nls_testing ( unicp NVARCHAR(10));
>> INSERT INTO nls_testing (nchar(8365)); -- euro symbol uses 2 utf-8 bytes
>> INSERT INTO nls_testing (nchar(9305)); -- some cjk ideogram uses 3 utf-8 bytes
>> COMMIT;
>
> Hi Michael,
>
> I just wanted to reproduce your problem, but:
> 1) CREATE TABLE nls_testing ( unicp NVARCHAR(10));
>    doesn't work.
>
> 2) a) INSERT INTO nls_testing (nchar(8365))
>    b) INSERT INTO nls_testing (nchar(9305))
>    are syntactically wrong.
>
> Decimal 8365 = Hex 20AD and NOT 20AC
> Decimal 9305 = Hex 2459 and NOT 2456

Okay, I'll use cut and paste now; I had a crappy console before. The values in my Python example are still correct, as are the NLS settings I reported.

SQL> create table nls_testing (unicp nvarchar2(10));

Tabelle wurde angelegt.

SQL> insert into nls_testing values (nchr(8364));

1 Zeile wurde erstellt.

SQL> insert into nls_testing values (nchr(9302));

1 Zeile wurde erstellt.

SQL> select asciistr(unicp), unicp from nls_testing;

ASCIISTR(UNICP)                                    UNICP
-------------------------------------------------- --------
\20AC                                              €
\2456                                              ¿

SQL> commit;

Transaktion mit COMMIT abgeschlossen.

As said, I get the first one correct, but the second one does not appear to reach Python intact.

Michael

--
Michael Schlenker
Software Engineer

CONTACT Software GmbH           Tel.:   +49 (421) 20153-80
Wiener Straße 1-3               Fax:    +49 (421) 20153-41
28359 Bremen
http://www.contact.de/          E-Mail: ms...@co...

Sitz der Gesellschaft: Bremen | Geschäftsführer: Karl Heinz Zachries
Eingetragen im Handelsregister des Amtsgerichts Bremen unter HRB 13215
|
From: matilda m. <ma...@gr...> - 2007-11-27 11:02:30
|
>>>> Michael Schlenker <ms...@co...> 27.11.2007 10:49 >>>
>> matilda matilda schrieb:
>>>>> Michael Schlenker <ms...@co...> 26.11.2007 18:24 >>>
>>> I created a Table with sqlplus with some unicode characters in it:
>>>
>>> CREATE TABLE nls_testing ( unicp NVARCHAR(10));
>>> INSERT INTO nls_testing (nchar(8365)); -- euro symbol uses 2 utf-8 bytes
>>> INSERT INTO nls_testing (nchar(9305)); -- some cjk ideogram uses 3 utf-8 bytes
>>> COMMIT;
>
> SQL> insert into nls_testing values (nchr(8364));
> SQL> insert into nls_testing values (nchr(9302));

Hi Michael,

ok, thanks. You see, there were subtle differences between the values you talked about the first and the second time. ;-) I just wanted to make sure that we're talking about the same thing.

Now I can reproduce the problem.

Best regards
Andreas Mock
|
From: matilda m. <ma...@gr...> - 2007-11-28 10:50:27
|
>>> Michael Schlenker <msc@contact.de> 27.11.2007 10:49 >>>
>
> SQL> select asciistr(unicp), unicp from nls_testing;
>
> ASCIISTR(UNICP)                                    UNICP
> -------------------------------------------------- --------
> \20AC                                              €
> \2456                                              ¿
>
> SQL> commit;
> Transaktion mit COMMIT abgeschlossen.
>
> As told, i get the first one correct, but the second does
> appear to not reach python intact.

Hi Michael,

just a short note in between: at the moment I can only investigate this problem in my spare time, so it will take some time in any case.

What I did: I created a short Embedded SQL C program to get the values from the database. Interestingly, I get the same byte sequences as I get from Python. My conclusion: there must be something wrong with the internal character set conversion between DB backend and DB client.

I also don't know at the moment how much the client environment (Linux for me) influences the results. E.g. I get "rubbish" for both values stored in your test table.

If you have new information, please let us know. Anthony is interested in improving the i18n and character set stuff in cx_Oracle. At the moment he seems a little bit busy. :-)

Best regards
Andreas Mock
|
From: Michael S. <ms...@co...> - 2007-11-28 11:13:17
|
matilda matilda schrieb:
>>>> Michael Schlenker <ms...@co...> 27.11.2007 10:49 >>>
>> SQL> select asciistr(unicp), unicp from nls_testing;
>>
>> ASCIISTR(UNICP)                                    UNICP
>> -------------------------------------------------- --------
>> \20AC                                              €
>> \2456                                              ¿
>>
>> SQL> commit;
>> Transaktion mit COMMIT abgeschlossen.
>>
>> As told, i get the first one correct, but the second does
>> appear to not reach python intact.
>
> Hi Michael,
>
> just a short information in between. At the moment I can only investigate
> this problem in spare time. That's the reason that it will last some time
> anyway.
>
> What I did: I created a short Embedded SQL C program to get the values
> from the database. Interestingly I get the same byte sequences as I get
> them from python. My conclusion: There must be something with the
> internal character set conversion between DB backend and DB client.
>
> I also don't know at the moment how much the client environment (Linux
> for me) influences the results. E.g. I get "rubbish" for both values
> stored in your test tables.
>
> If you have new informations please let us know. Anthony is interested
> in improving the i18n and character set stuff in cx_Oracle. At the moment
> he seems a little bit busy. :-)

I created a little test program to see where it starts to break; see the end of the message.

I find that many codepoints >= 128 (0x80) (i.e. everything outside 7-bit ASCII) are broken and don't make it through unharmed, but not all of them. The ones that make it through look suspiciously like CP1252, which is the database encoding (MSWINxxxx), but not the database national encoding. The first block is "0080..00FF; Latin-1 Supplement" from the Unicode character database, the second one is the upper part of "Latin Extended-A", then one char from "Latin Extended-B", then some from punctuation and some from currency symbols.

So I think cx_Oracle somehow mistakenly converts to the database encoding somewhere in between.

Michael

Here is the full list of non-ASCII that gets through unbroken (raw returned bytes shown as \xNN escapes):

Returned: ASCIISTR( \x7f ) UTF8( \x7f ) UNICODE( u'\x7f' )
Returned: ASCIISTR( \0081 ) UTF8( \xc3\x82\xc2\x81 ) UNICODE( u'\x81' )
Returned: ASCIISTR( \008D ) UTF8( \xc3\x82\xc2\x8d ) UNICODE( u'\x8d' )
Returned: ASCIISTR( \008F ) UTF8( \xc3\x82\xc2\x8f ) UNICODE( u'\x8f' )
Returned: ASCIISTR( \0090 ) UTF8( \xc3\x82\xc2\x90 ) UNICODE( u'\x90' )
Returned: ASCIISTR( \009D ) UTF8( \xc3\x82\xc2\x9d ) UNICODE( u'\x9d' )
Returned: ASCIISTR( \00A0 ) UTF8( \xc3\x82 ) UNICODE( u'\xa0' )
Returned: ASCIISTR( \00A1 ) UTF8( \xc3\x82\xc2\xa1 ) UNICODE( u'\xa1' )
Returned: ASCIISTR( \00A2 ) UTF8( \xc3\x82\xc2\xa2 ) UNICODE( u'\xa2' )
Returned: ASCIISTR( \00A3 ) UTF8( \xc3\x82\xc2\xa3 ) UNICODE( u'\xa3' )
Returned: ASCIISTR( \00A4 ) UTF8( \xc3\x82\xc2\xa4 ) UNICODE( u'\xa4' )
Returned: ASCIISTR( \00A5 ) UTF8( \xc3\x82\xc2\xa5 ) UNICODE( u'\xa5' )
Returned: ASCIISTR( \00A6 ) UTF8( \xc3\x82\xc2\xa6 ) UNICODE( u'\xa6' )
Returned: ASCIISTR( \00A7 ) UTF8( \xc3\x82\xc2\xa7 ) UNICODE( u'\xa7' )
Returned: ASCIISTR( \00A8 ) UTF8( \xc3\x82\xc2\xa8 ) UNICODE( u'\xa8' )
Returned: ASCIISTR( \00A9 ) UTF8( \xc3\x82\xc2\xa9 ) UNICODE( u'\xa9' )
Returned: ASCIISTR( \00AA ) UTF8( \xc3\x82\xc2\xaa ) UNICODE( u'\xaa' )
Returned: ASCIISTR( \00AB ) UTF8( \xc3\x82\xc2\xab ) UNICODE( u'\xab' )
Returned: ASCIISTR( \00AC ) UTF8( \xc3\x82\xc2\xac ) UNICODE( u'\xac' )
Returned: ASCIISTR( \00AD ) UTF8( \xc3\x82\xc2\xad ) UNICODE( u'\xad' )
Returned: ASCIISTR( \00AE ) UTF8( \xc3\x82\xc2\xae ) UNICODE( u'\xae' )
Returned: ASCIISTR( \00AF ) UTF8( \xc3\x82\xc2\xaf ) UNICODE( u'\xaf' )
Returned: ASCIISTR( \00B0 ) UTF8( \xc3\x82\xc2\xb0 ) UNICODE( u'\xb0' )
Returned: ASCIISTR( \00B1 ) UTF8( \xc3\x82\xc2\xb1 ) UNICODE( u'\xb1' )
Returned: ASCIISTR( \00B2 ) UTF8( \xc3\x82\xc2\xb2 ) UNICODE( u'\xb2' )
Returned: ASCIISTR( \00B3 ) UTF8( \xc3\x82\xc2\xb3 ) UNICODE( u'\xb3' )
Returned: ASCIISTR( \00B4 ) UTF8( \xc3\x82\xc2\xb4 ) UNICODE( u'\xb4' )
Returned: ASCIISTR( \00B5 ) UTF8( \xc3\x82\xc2\xb5 ) UNICODE( u'\xb5' )
Returned: ASCIISTR( \00B6 ) UTF8( \xc3\x82\xc2\xb6 ) UNICODE( u'\xb6' )
Returned: ASCIISTR( \00B7 ) UTF8( \xc3\x82\xc2\xb7 ) UNICODE( u'\xb7' )
Returned: ASCIISTR( \00B8 ) UTF8( \xc3\x82\xc2\xb8 ) UNICODE( u'\xb8' )
Returned: ASCIISTR( \00B9 ) UTF8( \xc3\x82\xc2\xb9 ) UNICODE( u'\xb9' )
Returned: ASCIISTR( \00BA ) UTF8( \xc3\x82\xc2\xba ) UNICODE( u'\xba' )
Returned: ASCIISTR( \00BB ) UTF8( \xc3\x82\xc2\xbb ) UNICODE( u'\xbb' )
Returned: ASCIISTR( \00BC ) UTF8( \xc3\x82\xc2\xbc ) UNICODE( u'\xbc' )
Returned: ASCIISTR( \00BD ) UTF8( \xc3\x82\xc2\xbd ) UNICODE( u'\xbd' )
Returned: ASCIISTR( \00BE ) UTF8( \xc3\x82\xc2\xbe ) UNICODE( u'\xbe' )
Returned: ASCIISTR( \00BF ) UTF8( \xc3\x82\xc2\xbf ) UNICODE( u'\xbf' )
Returned: ASCIISTR( \00C0 ) UTF8( \xc3\x83\xe2\x82\xac ) UNICODE( u'\xc0' )
Returned: ASCIISTR( \00C1 ) UTF8( \xc3\x83\xc2\x81 ) UNICODE( u'\xc1' )
Returned: ASCIISTR( \00C2 ) UTF8( \xc3\x83\xe2\x80\x9a ) UNICODE( u'\xc2' )
Returned: ASCIISTR( \00C3 ) UTF8( \xc3\x83\xc6\x92 ) UNICODE( u'\xc3' )
Returned: ASCIISTR( \00C4 ) UTF8( \xc3\x83\xe2\x80\x9e ) UNICODE( u'\xc4' )
Returned: ASCIISTR( \00C5 ) UTF8( \xc3\x83\xe2\x80\xa6 ) UNICODE( u'\xc5' )
Returned: ASCIISTR( \00C6 ) UTF8( \xc3\x83\xe2\x80\xa0 ) UNICODE( u'\xc6' )
Returned: ASCIISTR( \00C7 ) UTF8( \xc3\x83\xe2\x80\xa1 ) UNICODE( u'\xc7' )
Returned: ASCIISTR( \00C8 ) UTF8( \xc3\x83\xcb\x86 ) UNICODE( u'\xc8' )
Returned: ASCIISTR( \00C9 ) UTF8( \xc3\x83\xe2\x80\xb0 ) UNICODE( u'\xc9' )
Returned: ASCIISTR( \00CA ) UTF8( \xc3\x83\xc5\xa0 ) UNICODE( u'\xca' )
Returned: ASCIISTR( \00CB ) UTF8( \xc3\x83\xe2\x80\xb9 ) UNICODE( u'\xcb' )
Returned: ASCIISTR( \00CC ) UTF8( \xc3\x83\xc5\x92 ) UNICODE( u'\xcc' )
Returned: ASCIISTR( \00CD ) UTF8( \xc3\x83\xc2\x8d ) UNICODE( u'\xcd' )
Returned: ASCIISTR( \00CE ) UTF8( \xc3\x83\xc5\xbd ) UNICODE( u'\xce' )
Returned: ASCIISTR( \00CF ) UTF8( \xc3\x83\xc2\x8f ) UNICODE( u'\xcf' )
Returned: ASCIISTR( \00D0 ) UTF8( \xc3\x83\xc2\x90 ) UNICODE( u'\xd0' )
Returned: ASCIISTR( \00D1 ) UTF8( \xc3\x83\xe2\x80\x98 ) UNICODE( u'\xd1' )
Returned: ASCIISTR( \00D2 ) UTF8( \xc3\x83\xe2\x80\x99 ) UNICODE( u'\xd2' )
Returned: ASCIISTR( \00D3 ) UTF8( \xc3\x83\xe2\x80\x9c ) UNICODE( u'\xd3' )
Returned: ASCIISTR( \00D4 ) UTF8( \xc3\x83\xe2\x80\x9d ) UNICODE( u'\xd4' )
Returned: ASCIISTR( \00D5 ) UTF8( \xc3\x83\xe2\x80\xa2 ) UNICODE( u'\xd5' )
Returned: ASCIISTR( \00D6 ) UTF8( \xc3\x83\xe2\x80\x93 ) UNICODE( u'\xd6' )
Returned: ASCIISTR( \00D7 ) UTF8( \xc3\x83\xe2\x80\x94 ) UNICODE( u'\xd7' )
Returned: ASCIISTR( \00D8 ) UTF8( \xc3\x83\xcb\x9c ) UNICODE( u'\xd8' )
Returned: ASCIISTR( \00D9 ) UTF8( \xc3\x83\xe2\x84\xa2 ) UNICODE( u'\xd9' )
Returned: ASCIISTR( \00DA ) UTF8( \xc3\x83\xc5\xa1 ) UNICODE( u'\xda' )
Returned: ASCIISTR( \00DB ) UTF8( \xc3\x83\xe2\x80\xba ) UNICODE( u'\xdb' )
Returned: ASCIISTR( \00DC ) UTF8( \xc3\x83\xc5\x93 ) UNICODE( u'\xdc' )
Returned: ASCIISTR( \00DD ) UTF8( \xc3\x83\xc2\x9d ) UNICODE( u'\xdd' )
Returned: ASCIISTR( \00DE ) UTF8( \xc3\x83\xc5\xbe ) UNICODE( u'\xde' )
Returned: ASCIISTR( \00DF ) UTF8( \xc3\x83\xc5\xb8 ) UNICODE( u'\xdf' )
Returned: ASCIISTR( \00E0 ) UTF8( \xc3\x83 ) UNICODE( u'\xe0' )
Returned: ASCIISTR( \00E1 ) UTF8( \xc3\x83\xc2\xa1 ) UNICODE( u'\xe1' )
Returned: ASCIISTR( \00E2 ) UTF8( \xc3\x83\xc2\xa2 ) UNICODE( u'\xe2' )
Returned: ASCIISTR( \00E3 ) UTF8( \xc3\x83\xc2\xa3 ) UNICODE( u'\xe3' )
Returned: ASCIISTR( \00E4 ) UTF8( \xc3\x83\xc2\xa4 ) UNICODE( u'\xe4' )
Returned: ASCIISTR( \00E5 ) UTF8( \xc3\x83\xc2\xa5 ) UNICODE( u'\xe5' )
Returned: ASCIISTR( \00E6 ) UTF8( \xc3\x83\xc2\xa6 ) UNICODE( u'\xe6' )
Returned: ASCIISTR( \00E7 ) UTF8( \xc3\x83\xc2\xa7 ) UNICODE( u'\xe7' )
Returned: ASCIISTR( \00E8 ) UTF8( \xc3\x83\xc2\xa8 ) UNICODE( u'\xe8' )
Returned: ASCIISTR( \00E9 ) UTF8( \xc3\x83\xc2\xa9 ) UNICODE( u'\xe9' )
Returned: ASCIISTR( \00EA ) UTF8( \xc3\x83\xc2\xaa ) UNICODE( u'\xea' )
Returned: ASCIISTR( \00EB ) UTF8( \xc3\x83\xc2\xab ) UNICODE( u'\xeb' )
Returned: ASCIISTR( \00EC ) UTF8( \xc3\x83\xc2\xac ) UNICODE( u'\xec' )
Returned: ASCIISTR( \00ED ) UTF8( \xc3\x83\xc2\xad ) UNICODE( u'\xed' )
Returned: ASCIISTR( \00EE ) UTF8( \xc3\x83\xc2\xae ) UNICODE( u'\xee' )
Returned: ASCIISTR( \00EF ) UTF8( \xc3\x83\xc2\xaf ) UNICODE( u'\xef' )
Returned: ASCIISTR( \00F0 ) UTF8( \xc3\x83\xc2\xb0 ) UNICODE( u'\xf0' )
Returned: ASCIISTR( \00F1 ) UTF8( \xc3\x83\xc2\xb1 ) UNICODE( u'\xf1' )
Returned: ASCIISTR( \00F2 ) UTF8( \xc3\x83\xc2\xb2 ) UNICODE( u'\xf2' )
Returned: ASCIISTR( \00F3 ) UTF8( \xc3\x83\xc2\xb3 ) UNICODE( u'\xf3' )
Returned: ASCIISTR( \00F4 ) UTF8( \xc3\x83\xc2\xb4 ) UNICODE( u'\xf4' )
Returned: ASCIISTR( \00F5 ) UTF8( \xc3\x83\xc2\xb5 ) UNICODE( u'\xf5' )
Returned: ASCIISTR( \00F6 ) UTF8( \xc3\x83\xc2\xb6 ) UNICODE( u'\xf6' )
Returned: ASCIISTR( \00F7 ) UTF8( \xc3\x83\xc2\xb7 ) UNICODE( u'\xf7' )
Returned: ASCIISTR( \00F8 ) UTF8( \xc3\x83\xc2\xb8 ) UNICODE( u'\xf8' )
Returned: ASCIISTR( \00F9 ) UTF8( \xc3\x83\xc2\xb9 ) UNICODE( u'\xf9' )
Returned: ASCIISTR( \00FA ) UTF8( \xc3\x83\xc2\xba ) UNICODE( u'\xfa' )
Returned: ASCIISTR( \00FB ) UTF8( \xc3\x83\xc2\xbb ) UNICODE( u'\xfb' )
Returned: ASCIISTR( \00FC ) UTF8( \xc3\x83\xc2\xbc ) UNICODE( u'\xfc' )
Returned: ASCIISTR( \00FD ) UTF8( \xc3\x83\xc2\xbd ) UNICODE( u'\xfd' )
Returned: ASCIISTR( \00FE ) UTF8( \xc3\x83\xc2\xbe ) UNICODE( u'\xfe' )
Returned: ASCIISTR( \00FF ) UTF8( \xc3\x83\xc2\xbf ) UNICODE( u'\xff' )
Returned: ASCIISTR( \0152 ) UTF8( \xc3\x85\xe2\x80\x99 ) UNICODE( u'\u0152' )
Returned: ASCIISTR( \0153 ) UTF8( \xc3\x85\xe2\x80\x9c ) UNICODE( u'\u0153' )
Returned: ASCIISTR( \0160 ) UTF8( \xc3\x85 ) UNICODE( u'\u0160' )
Returned: ASCIISTR( \0161 ) UTF8( \xc3\x85\xc2\xa1 ) UNICODE( u'\u0161' )
Returned: ASCIISTR( \0178 ) UTF8( \xc3\x85\xc2\xb8 ) UNICODE( u'\u0178' )
Returned: ASCIISTR( \017D ) UTF8( \xc3\x85\xc2\xbd ) UNICODE( u'\u017d' )
Returned: ASCIISTR( \017E ) UTF8( \xc3\x85\xc2\xbe ) UNICODE( u'\u017e' )
Returned: ASCIISTR( \0192 ) UTF8( \xc3\x86\xe2\x80\x99 ) UNICODE( u'\u0192' )
Returned: ASCIISTR( \02C6 ) UTF8( \xc3\x8b\xe2\x80\xa0 ) UNICODE( u'\u02c6' )
Returned: ASCIISTR( \02DC ) UTF8( \xc3\x8b\xc5\x93 ) UNICODE( u'\u02dc' )
Returned: ASCIISTR( \2013 ) UTF8( \xc3\xa2\xe2\x82\xac\xe2\x80\x9c ) UNICODE( u'\u2013' )
Returned: ASCIISTR( \2014 ) UTF8( \xc3\xa2\xe2\x82\xac\xe2\x80\x9d ) UNICODE( u'\u2014' )
Returned: ASCIISTR( \2018 ) UTF8( \xc3\xa2\xe2\x82\xac\xcb\x9c ) UNICODE( u'\u2018' )
Returned: ASCIISTR( \2019 ) UTF8( \xc3\xa2\xe2\x82\xac\xe2\x84\xa2 ) UNICODE( u'\u2019' )
Returned: ASCIISTR( \201A ) UTF8( \xc3\xa2\xe2\x82\xac\xc5\xa1 ) UNICODE( u'\u201a' )
Returned: ASCIISTR( \201C ) UTF8( \xc3\xa2\xe2\x82\xac\xc5\x93 ) UNICODE( u'\u201c' )
Returned: ASCIISTR( \201D ) UTF8( \xc3\xa2\xe2\x82\xac\xc2\x9d ) UNICODE( u'\u201d' )
Returned: ASCIISTR( \201E ) UTF8( \xc3\xa2\xe2\x82\xac\xc5\xbe ) UNICODE( u'\u201e' )
Returned: ASCIISTR( \2020 ) UTF8( \xc3\xa2\xe2\x82\xac ) UNICODE( u'\u2020' )
Returned: ASCIISTR( \2021 ) UTF8( \xc3\xa2\xe2\x82\xac\xc2\xa1 ) UNICODE( u'\u2021' )
Returned: ASCIISTR( \2022 ) UTF8( \xc3\xa2\xe2\x82\xac\xc2\xa2 ) UNICODE( u'\u2022' )
Returned: ASCIISTR( \2026 ) UTF8( \xc3\xa2\xe2\x82\xac\xc2\xa6 ) UNICODE( u'\u2026' )
Returned: ASCIISTR( \2030 ) UTF8( \xc3\xa2\xe2\x82\xac\xc2\xb0 ) UNICODE( u'\u2030' )
Returned: ASCIISTR( \2039 ) UTF8( \xc3\xa2\xe2\x82\xac\xc2\xb9 ) UNICODE( u'\u2039' )
Returned: ASCIISTR( \203A ) UTF8( \xc3\xa2\xe2\x82\xac\xc2\xba ) UNICODE( u'\u203a' )
Returned: ASCIISTR( \20AC ) UTF8( \xc3\xa2\xe2\x80\x9a\xc2\xac ) UNICODE( u'\u20ac' )
Returned: ASCIISTR( \2122 ) UTF8( \xc3\xa2\xe2\x80\x9e\xc2\xa2 ) UNICODE( u'\u2122' )

The test program:

import os

# path to oci.dll
os.environ['PATH'] = 'c:\\instantclient10_1'
# path to tnsnames.ora
os.environ['TNS_ADMIN'] = 'U:\\'
os.environ['NLS_LANG'] = 'GERMAN_GERMANY.AL32UTF8'

import cx_Oracle

conn = cx_Oracle.connect("scott/tiger@DEV")

# iterate over the Unicode BMP
cur = conn.cursor()
stmt = "SELECT ASCIISTR(nchr(:codepoint)), nchr(:codepoint) FROM DUAL"
cur.prepare(stmt)
fail = 0
# loop from 0 to the start of the surrogate pairs
for i in xrange(55295):
    c = unichr(i)
    cutf8 = c.encode('utf-8')
    cur.execute(None, {'codepoint': i})
    asciiret, utfret = cur.fetchone()
    cret = utfret.decode('utf-8')
    if cret != c:
        print "Difference for codepoint: %x" % i
        print "Returned: ASCIISTR(", asciiret, ")\tUTF8(", utfret, ")\tUNICODE(", repr(cret), ")"
        print "Expected: \t\t\tUTF8(", cutf8, ")\tUNICODE(", repr(c), ")"
        fail += 1
        if fail > 50:
            break
|
From: matilda m. <ma...@gr...> - 2007-11-28 12:07:02
|
>>> Michael Schlenker <ms...@co...> 28.11.2007 12:13 >>>
> So i think cx_Oracle somehow mistakenly converts to database encoding somewhere in between.

Hi Michael,

I'm not sure this is a problem in cx_Oracle. That's the reason I wrote this
little embedded C program for the test: on the C level I only get byte sequences
anyway, so I'm sure that nothing but the Oracle client library is in between
(hopefully ;-)).

I don't know if we have a chance to verify this by using another access method.

Best regards
Andreas Mock
|
From: Michael S. <ms...@co...> - 2007-11-28 12:32:52
|
matilda matilda schrieb:
>>>> Michael Schlenker <ms...@co...> 28.11.2007 12:13 >>>
>> So i think cx_Oracle somehow mistakenly converts to database encoding somewhere in between.
>
> Hi Michael,
>
> I'm not sure if this is a problem of cx_Oracle. That's the reason I wrote this little
> embedded C program for that. On the C level I only get byte sequences anyway. So, I'm
> sure that nothing but the Oracle client library is in between (hopefully ;-)).
>
> I don't know if we have a chance to verify this by using another access method.

I tried to do a similar thing with the Tcl Oratcl package, but it failed in a
similar way (not the same way, it produced different garbage), so I don't know
whether it's OCI that's just damn hard to get right, or some environment setup
that's broken.

Sure, only the Oracle OCI is in between, but from my cursory reading of the OCI
docs it seems you need to do at least some special handling for NCHAR values
when binding, and the OCI docs are huge... Not sure if cx_Oracle does that yet;
I hoped to escape this mess by using cx_Oracle in the first place (to replace an
ancient, non-unicode-savvy Pro*C/C++-based Oracle connector, which could only be
converted to unicode with a complete rewrite).

It might all work fine for database encoding = UTF-8, but for people stuck with
older codebases and databases that need a slow migration of only some columns,
the split-brained combination of a non-Unicode database charset and a Unicode
national charset is a must.

I hope I do not have to ditch cx_Oracle again and try to find happiness down an
ODBC or JDBC route (where none is to be found).

Michael
|
From: Michael S. <ms...@co...> - 2007-11-28 16:46:28
|
Amaury Forgeot d'Arc schrieb:
> matilda matilda wrote:
>>>>> Michael Schlenker <ms...@co...> 28.11.2007 16:31 >>>
>>> Michael Schlenker schrieb:
>>> Okay, i got my test to work after patching cx_Oracle a little bit.
>> Anthony will be happy to hear that. ;-) Anthony: Are you still here?
>>
>>> From taking a closer look at the code, Unicode support is at best to be described as
>>> 'rudimentary'; lots of fine points are still missing in there.
>> I'm sure Anthony will agree. Especially with the upcoming Py3000 there will
>> be many questions to answer regarding byte-streams, unicode-streams, character set
>> conversion (implicit/explicit), and character representation.
>>
>> See the change history to see when Anthony started to focus on character set
>> conversion.
>>
>> Amaury Forgeot d'Arc, who also gives valuable input, is probably also interested
>> in that topic, speaking and writing a language with many special characters.
>
> I indeed proposed a patch one year ago, to support unicode.
> It was against version 4.2.1; I attach it again in the hope it can be useful.

Looks good. The minimal stuff I did goes a similar way, but I didn't use UTF-16
yet, so there might be buffer problems due to UTF-8's variable length...

I'll try to use your stuff with a recent cx_Oracle if I find the time.

And yes, it will break with UCS-4 builds of Python... easy to fix, though, if
one uses AL32UTF8 instead of the UTF-16 code and converts on read. That makes
the code immune to possible big-endian vs. little-endian problems too (although
I assume those are handled by OCI for UTF-16 anyway). But surrogates and the
astral planes are treacherous ground anyway, so if the BMP works for a start,
that's nice.

Michael
|
From: Amaury F. d'A. <ama...@gm...> - 2007-11-28 17:08:40
|
Michael Schlenker wrote:
> Amaury Forgeot d'Arc schrieb:
> > matilda matilda wrote:
> >>>>> Michael Schlenker <ms...@co...> 28.11.2007 16:31 >>>
> >>> Michael Schlenker schrieb:
> >>> Okay, i got my test to work after patching cx_Oracle a little bit.
> >> Anthony will be happy to hear that. ;-) Anthony: Are you still here?
> >>
> >>> From taking a closer look at the code Unicode support is at best to be described as
> >>> 'rudimentary', lots of fine points still missing in there.
> >> I'm sure Anthony will agree. Especially with the upcoming Py3000 there will
> >> be many questions to answer regarding byte-streams, unicode-streams, character set
> >> conversion (implicit/explicit), character representation.
> >>
> >> See the change history to see when Anthony started to focus on character set
> >> conversion.
> >>
> >> Amaury Forgeot d'Arc who also gives valuable input is probably also interested
> >> in that topic while speaking and writing a language with many special characters.
> >
> > I indeed proposed a patch one year ago, to support unicode.
> > It was against version 4.2.1, I attach it again in the hope it can be useful.
>
> Looks good, the minimal stuff I did goes a similar way but I didn't use UTF-16 yet,
> so there might be buffer problems due to UTF-8's variable length...
>
> I'll try to use your stuff with a recent cx_Oracle if I find the time.
>
> And yes, it will break with UCS-4 builds of Python..., easy to fix though, if
> one uses AL32UTF8 instead of the UTF-16 code and converts on read. Makes the code
> immune to possible big-endian vs. little-endian problems too (although I assume
> those are handled by OCI for UTF-16 anyway). But surrogates and the astral plane
> are treacherous ground anyway, so if the BMP works for a start it's nice.
I'm not sure I understand everything here, but it seemed to me that the correct
way was to set the charsetId to OCI_UTF16ID, because it is completely
independent of any NLS settings (there is no other possible value, btw). The
values are expressed in a unicode-capable encoding, and this is enough.

To support UCS-4 builds, it should be enough to properly use functions like
PyUnicode_DecodeUTF16 and PyUnicode_EncodeUTF16 in StringVar_SetValue, instead
of the plain memcpy.

Sorry I don't have the time at the moment, but I am sure you can do something
with it.

--
Amaury Forgeot d'Arc
|
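In Python terms, the decode step described here (instead of a raw memcpy of the OCI buffer) looks like the following sketch, which assumes OCI hands back little-endian UTF-16:

```python
# A UTF-16 buffer as OCI_UTF16ID would deliver it for U+20AC: 2 bytes per
# BMP code point.  An explicit decode works on both narrow and wide Python
# builds, whereas a plain memcpy only matches the narrow build's internal
# 2-bytes-per-character layout.
buf = u'\u20ac'.encode('utf-16-le')
assert buf == b'\xac\x20'
assert buf.decode('utf-16-le') == u'\u20ac'
```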
From: Michael S. <ms...@co...> - 2007-11-28 17:20:54
|
Amaury Forgeot d'Arc schrieb:
> Michael Schlenker wrote:
>> Amaury Forgeot d'Arc schrieb:
>>> matilda matilda wrote:
>>>>>>> Michael Schlenker <ms...@co...> 28.11.2007 16:31 >>>
>>>>> Michael Schlenker schrieb:
>>>>> Okay, i got my test to work after patching cx_Oracle a little bit.
>>>> Anthony will be happy to hear that. ;-) Anthony: Are you still here?
>>>>
>>>>> From taking a closer look at the code Unicode support is at best to be described as
>>>>> 'rudimentary', lots of fine points still missing in there.
>>>> I'm sure Anthony will agree. Especially with the upcoming Py3000 there will
>>>> be many questions to answer regarding byte-streams, unicode-streams, character set
>>>> conversion (implicit/explicit), character representation.
>>>>
>>>> See the change history to see when Anthony started to focus on character set
>>>> conversion.
>>>>
>>>> Amaury Forgeot d'Arc who also gives valuable input is probably also interested
>>>> in that topic while speaking and writing a language with many special characters.
>>> I indeed proposed a patch one year ago, to support unicode.
>>> It was against version 4.2.1, I attach it again in the hope it can be useful.
>> Looks good, the minimal stuff I did goes a similar way but I didn't use UTF-16 yet,
>> so there might be buffer problems due to UTF-8's variable length...
>>
>> I'll try to use your stuff with a recent cx_Oracle if I find the time.
>>
>> And yes, it will break with UCS-4 builds of Python..., easy to fix though, if
>> one uses AL32UTF8 instead of the UTF-16 code and converts on read. Makes the code
>> immune to possible big-endian vs. little-endian problems too (although I assume
>> those are handled by OCI for UTF-16 anyway). But surrogates and the astral plane
>> are treacherous ground anyway, so if the BMP works for a start it's nice.
>
> I'm not sure I understand everything here, but it seemed to me that
> the correct way was to set the charsetId to OCI_UTF16ID, because it is
> completely independent of any NLS settings (there is no other possible
> value, btw).

I don't think so. The comments at least tell me that I can feed in all valid
Oracle charsets (probably after converting the string to an id via

    ub2 OCINlsCharSetNameToId(dvoid *envhp, const oratext *name);

) and that OCI_UTF16ID is just the only one that cannot be specified via
NLS_LANG. So you can choose the one you want and let OCI deal with the
conversion; OCI_UTF16ID is just the one used in the OCI docs because it is very
convenient when working with Windows wchar_t. So it's probably the recommended
value for charsetId, but not the only possible one.

> The values are expressed in a unicode-capable encoding,
> and this is enough.

Yes.

> To support UCS-4 builds, it should be enough to properly use
> functions like PyUnicode_DecodeUTF16 and PyUnicode_EncodeUTF16 in
> StringVar_SetValue, instead of the plain memcpy.

Yes.

> Sorry I don't have the time at the moment, but I am sure you can do
> something with it.

If I find the time I will, yes. Thanks again for the patch.

Michael
|
From: Anthony T. <ant...@gm...> - 2007-11-29 18:37:21
|
On Nov 28, 2007 9:28 AM, Michael Schlenker <ms...@co...> wrote:
> matilda matilda schrieb:
> >>>> Michael Schlenker <ms...@co...> 28.11.2007 16:31 >>>
> >
> >> From taking a closer look at the code Unicode support is at best to be described as
> >> 'rudimentary', lots of fine points still missing in there.
> >
> > I'm sure Anthony will agree. Especially with the upcoming Py3000 there will
> > be many questions to answer regarding byte-streams, unicode-streams, character set
> > conversion (implicit/explicit), character representation.
>
> I thought Py3k should solve those questions ;-)...
>
> I usually write Tcl, where unicode just works and you don't step into the
> deep morass that the Python string/unicode dichotomy is all the time.
> Basically it uses a model very close to the one now used by Py3k, but got it
> stable some years ago (around Tcl 8.2).
>
> For cx_Oracle i would like to see maximum ease of use, e.g. expunge the evil
> NLS_LANG dependency, as it should never ever be needed if you're using Python
> (case 1: non-unicode db encoding -> let cx_Oracle convert to the python
> system encoding internally or to unicode, and get happy; case 2: unicode db
> encoding -> let cx_Oracle just eat Python Unicode strings and spit them out).

Not everyone is going to want to switch wholesale to unicode immediately -- the
amount of pain that would cause is considerable. I would agree that long term,
that is the way to go, but some intermediate steps should probably be taken. I
am thinking that it might work to do the following:

Add a connection constructor parameter that specifies that __all__ strings
should be returned as unicode objects, not string objects. This parameter would
default to False to get the current behavior -- string objects returned (fixing
whatever is causing the problem right now). It would likely be reasonable to
return unicode for nvarchar2 data regardless of this setting, but I'd
appreciate some feedback on that.

I've also considered some setting that would allow you to specify unicode for
certain columns, but perhaps it would be better to go all unicode or the hybrid
I suggested above, and not confuse matters by adding that capability. Thoughts?

> There are some pitfalls for literal SQL in the path, but i think you can
> write the interface in a way that you don't have to worry about all this
> stupid encoding conversion stuff forced upon us by Oracle, Python and other
> forces...

What are you referring to about literal SQL in the path?

> >> With that patch (against the cx_Oracle-4.3.3.zip file) at least my test runs through
> >> cleanly when i set the right environment. One surely can do better....
> >
> > Can you easily enhance the tests in the test directory to unit test the
> > charset conversion cases?
>
> Haven't looked there yet. But maybe i can rip something from my nosetest
> files when done.

It doesn't have to be pretty, just work. I can pretty it up as long as I know
what you are trying to accomplish or can ask you questions, of course. :-)

> Basically i need to first get permission to invest more time in these issues
> with cx_Oracle; it could happen that my manager decides to go a different
> route... :-(.

Understood. Let us know.

Anthony
|
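For illustration, the difference between the two behaviours discussed here comes down to a single decode on fetch. This is only a sketch of the proposal (the driver itself is C code, and any flag name would be hypothetical); it assumes a UTF-8 client charset:

```python
# Bytes the client library receives for U+20AC under a UTF-8 client charset.
raw = b'\xe2\x82\xac'
as_string = raw                    # current behaviour: a plain byte string
as_unicode = raw.decode('utf-8')   # proposed behaviour: a unicode object
assert as_unicode == u'\u20ac'
```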
From: Jim P. <uni...@gm...> - 2007-11-30 01:17:30
|
On Nov 29, 2007 1:37 PM, Anthony Tuininga <ant...@gm...> wrote:
> Not everyone is going to want to switch wholesale to unicode
> immediately -- the amount of pain that would cause is considerable. I
> would agree that long term, that is the way to go, but some
> intermediate steps should probably be taken. I am thinking that it
> might work to do the following:
>
> Add a connection constructor parameter that specified that __all__
> strings should be returned as unicode objects, not string objects.
> This parameter would default to False to get the current behavior --
> string objects returned (fixing whatever is causing the problem right
>

I think that would be a fine solution.

> now). It would likely be reasonable to return unicode for nvarchar2
> data regardless of this setting but I'd appreciate some feedback on
> that.
>

I would be somewhat concerned that existing code might break if we change the
return type without user intervention. But that said, I don't actually have any
existing code that interfaces with NCHAR types.

> I've also considered some setting that would allow you to specify
> unicode for certain columns but perhaps it would be better to go all
> unicode or the hybrid I suggested above, and not confuse matters by
> adding that capability. Thoughts?
>

I would not need that. In my case, I'm either doing 100% Unicode or 100% ASCII.
All my legacy systems are ASCII only, and everything new is Unicode all the way.

Jim Patterson
|
From: Michael S. <ms...@co...> - 2007-12-05 17:34:24
|
Anthony Tuininga schrieb:
> Not everyone is going to want to switch wholesale to unicode
> immediately -- the amount of pain that would cause is considerable. I
> would agree that long term, that is the way to go, but some
> intermediate steps should probably be taken. I am thinking that it
> might work to do the following:
>
> Add a connection constructor parameter that specified that __all__
> strings should be returned as unicode objects, not string objects.
> This parameter would default to False to get the current behavior --
> string objects returned (fixing whatever is causing the problem right
> now). It would likely be reasonable to return unicode for nvarchar2
> data regardless of this setting but I'd appreciate some feedback on
> that.

That would be a fine thing. Always returning unicode for nvarchar2 columns is
just fine for Oracle >= 9i, but not always for older versions, where nvarchar
could be some encoding other than one of the unicode encodings.

> I've also considered some setting that would allow you to specify
> unicode for certain columns but perhaps it would be better to go all
> unicode or the hybrid I suggested above, and not confuse matters by
> adding that capability. Thoughts?

If you return unicode for all nvarchar2 columns by default, the user can always
force it with SELECT TO_NCHAR(...) FROM ..., or by using nvarchar2 as the
column type.

>> There are some pitfalls for literal SQL in the path, but i think you can write
>> the interface in a way that you don't have to worry about all this stupid encoding
>> conversion stuff forced upon us by Oracle, Python and other forces...
>
> What are you referring to about literal SQL in the path?

Oracle has the nasty habit of trashing literal unicode values in SQL when the
database encoding is non-unicode. Bind variables are fine, but literal unicode
values in SQL statements are not safe in general, because they get converted to
the database encoding while passing through the server side of OCI.

See http://download.oracle.com/docs/cd/B19306_01/server.102/b14225/ch7progrunicode.htm#i1006315

So maybe the autoquoting mentioned there is a good idea.

Michael
|
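The literal-SQL trap can be simulated without a database: push a statement containing a character outside the database charset through that charset, as the server side effectively does. This sketch uses Python's cp1252 codec as a stand-in for a WE8MSWIN1252 database encoding:

```python
# A literal CJK character (U+2456, from the thread's test table) in the
# statement text does not survive conversion to the database charset:
stmt = u"INSERT INTO nls_testing VALUES ('\u2456')"
lossy = stmt.encode('cp1252', 'replace')
assert b'?' in lossy   # the character was replaced on the way through
# A bind variable's value never travels inside the statement text, which is
# why binds are safe where literals are not.
```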
From: Anthony T. <ant...@gm...> - 2007-11-29 18:48:19
|
Yes, I just looked through my e-mail and noticed this one. Some of the patch
has already been applied, but the addition of a new type that returns and
accepts unicode I have not yet done. Recently I have been looking into this
again (in conjunction with the ceODBC module) and ran into a few difficulties,
since most Unix builds are wide unicode and not narrow. Windows, of course, is
different in this respect. Your code assumes a narrow build, so I'd have to
come up with a solution for that. I believe somewhere in this thread a solution
to that problem has been given, but I'll have to see if it really solves the
problem or not. My apologies for dropping this one for so long. You should be
more persistent and send it to me every few months. :-) If you don't, unless
the patch is simple and obvious or I actually need it, it tends to get lost.
And since Unicode is tough, I've also tended to avoid it somewhat. :-(
Hopefully I'll be able to get some time to work on this in the next few weeks.
Bug me again in January if you haven't heard anything about it before then.

On Nov 28, 2007 9:32 AM, Amaury Forgeot d'Arc <ama...@gm...> wrote:
> matilda matilda wrote:
> >>> Michael Schlenker <ms...@co...> 28.11.2007 16:31 >>>
> >Michael Schlenker schrieb:
> > Okay, i got my test to work after patching cx_Oracle a little bit.
> >
> > Anthony will be happy to hear that. ;-) Anthony: Are you still here?
> >
> > From taking a closer look at the code Unicode support is at best to be described as
> >'rudimentary', lots of fine points still missing in there.
> >
> > I'm sure Anthony will agree. Especially with the upcoming Py3000 there will
> > be many questions to answer regarding byte-streams, unicode-streams, character set
> > conversion (implicit/explicit), character representation.
> >
> > See the change history to see when Anthony started to focus on character set
> > conversion.
> >
> > Amaury Forgeot d'Arc who also gives valuable input is probably also interested
> > in that topic while speaking and writing a language with many special characters.
>
> I indeed proposed a patch one year ago, to support unicode.
> It was against version 4.2.1, I attach it again in the hope it can be useful.
>
> Here are the comments I sent at the time:
>
> The visible changes are:
> - a new cx_Oracle.UNICODE variable type, which accepts and returns
> unicode values.
> - NVARCHAR columns will return this variable type.
> - a new test file: UnicodeVar.py covers the most common cases.
>
> There are some caveats:
> - I based my work on the new vt_NationalCharString type (and its
> charsetForm member). As there are no unit tests for it, I may have
> broken something there. Same for the new NCLOB type.
> - Automatic conversion between string and unicode could be supported
> (should be easy to add, by using connection.encoding)
> - I suspect it will only work if Python is compiled in UCS2 mode.
> (== sizeof(Py_UNICODE) must be 2)
>
> --
> Amaury Forgeot d'Arc
|
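The UCS2 caveat in the patch notes can be probed from Python itself; on narrow (UCS2) builds sys.maxunicode is 0xFFFF, on wide (UCS-4) builds it is 0x10FFFF:

```python
import sys

# True on UCS2 (narrow) builds, where sizeof(Py_UNICODE) == 2 and the
# patch's assumptions hold; False on wide builds.
narrow_build = (sys.maxunicode == 0xffff)
assert sys.maxunicode in (0xffff, 0x10ffff)
```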
From: Jim P. <uni...@gm...> - 2007-11-30 01:25:55
|
On Nov 29, 2007 1:48 PM, Anthony Tuininga <ant...@gm...> wrote:
> Recently I have been
> looking into this again (in conjunction with the ceODBC module) and
> ran into a few difficulties since most Unix builds are wide unicode
> and not narrow. Windows, of course, is different in this respect. Your
> code assumes a narrow build so I'd have to come up with a solution for
> that. I believe somewhere in this thread a solution to that problem
> has been given but I'll have to see if it really solves the problem or
> not.

Using AL32UTF8 as the encoding for OCI prevents this issue. You still have to
do all the other stuff (like handling buffer requirement expansions,
conversions, ...), but the code will work on all OSes and machines.

The only downside to UTF-8 might be performance. We would be taking the Unicode
in the internal Python format and converting it to UTF-8, then passing it to
OCI, which might need to convert it to something else for storage inside
Oracle. However, the default for an Oracle 10g Unicode database is AL32UTF8, so
I'm guessing that will not be a problem for most people going forward. It looks
like Oracle is choosing UTF-8 as the top among equals, since some features like
Unicode for XML in Oracle 10g require that the database be UTF-8.

Jim P.
|
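The conversion chain described here is cheap to demonstrate; the bytes for U+2456 also match the "e29196" expected at the start of the thread:

```python
# Internal Python unicode -> UTF-8 bytes, as would be handed to OCI under
# an AL32UTF8 client charset; the round trip is lossless.
text = u'\u20ac\u2456'
wire = text.encode('utf-8')
assert wire == b'\xe2\x82\xac\xe2\x91\x96'
assert wire.decode('utf-8') == text
```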
From: Michael S. <ms...@co...> - 2007-11-30 11:31:45
|
Anthony Tuininga schrieb:
> Yes, I just looked through my e-mail and noticed this one. Some of the
> patch has already been applied but the addition of a new type that
> returns and accepts unicode I have not yet done. Recently I have been
> looking into this again (in conjunction with the ceODBC module) and
> ran into a few difficulties since most Unix builds are wide unicode
> and not narrow. Windows, of course, is different in this respect. Your
> code assumes a narrow build so I'd have to come up with a solution for
> that. I believe somewhere in this thread a solution to that problem
> has been given but I'll have to see if it really solves the problem or
> not. My apologies for dropping this one for so long. You should be
> more persistent and send it to me every few months. :-) If you don't,
> unless the patch is simple and obvious or I actually need it, it tends
> to get lost. And since Unicode is tough, I've also tended to avoid it
> somewhat. :-( Hopefully I'll be able to get some time to work on this
> in the next few weeks. Bug me again in January if you haven't heard
> anything about it before then.

Amaury's patch is better than mine, so forget my patch... But it's still not
complete. I hope I can find time to test and fix the issues, and I need to do
it on a bunch of boxes (HP-UX, AIX, Windows, Solaris, Linux, etc.), so I might
have to fix that wide vs. narrow issue anyway.

Using OCINlsCreateEnv() might be better for Python use, because you can then
handle encodings much more gracefully and isolate yourself from broken NLS_LANG
settings on the client side. But it surely depends on how twisted the usual
Python usage is, and from what I have seen up to now there is quite some insane
messing with encodings in standard strings in typical multi-encoding code,
including the popular but dangerous changing of sys.getdefaultencoding() to
something nicer than ASCII.

AFAIK Unicode with Oracle is rather easy if you have a database encoding of
Unicode (AL32UTF8), because then all conversions done by OCI are essentially
lossless. BUT if you have a database encoding like CP1252 or ISO 8859-1, the
full nastiness of Oracle hits you, as all statements are converted via the
database encoding (which is sometimes a subset of the national encoding, which
is unicode nowadays...), so all statements and results are stripped of unicode
chars while passing through OCI unless specially protected or quoted.

For new databases, choosing anything but a unicode database encoding is mostly
foolish, because life gets hard; but for legacy databases you often don't have
a choice (unless your legacy is 7-bit ASCII, where you can migrate to UTF-8
easily because it is a strict superset of ASCII).

Michael
|
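The "strict superset of ASCII" point at the end is easy to verify: any 7-bit ASCII string yields byte-for-byte identical output under both encodings, which is why an ASCII legacy database can migrate to UTF-8 without touching the stored data:

```python
# Every 7-bit ASCII string encodes to the same bytes in ASCII and UTF-8.
s = u'plain 7-bit ASCII text'
assert s.encode('ascii') == s.encode('utf-8')
```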