Re: [cx-oracle-users] Problem with internal buffer sizes
From: matilda m. <ma...@gr...> - 2007-01-24 14:22:48
Hi Anthony,

in my opinion the cx_Oracle code is not really correct when dealing with a client-side encoding that can produce more than one byte per character. I invested some hours investigating your code and the Oracle OCI documentation to get closer to the problem. Now I have some hints, and I hope they will help you to extend the code.

1) Scenario to reproduce the error: the database has to be in a character set whose characters can take two bytes in UTF-8. In most cases LATIN1 (aka WE8ISO8859P1) is fine.

Check:

select value from nls_database_parameters where parameter = 'NLS_CHARACTERSET';

2) Create the following table and row:

create table tst (data varchar2(2));
insert into tst values ('1Ä');

This should fit in the table, because 'Ä' can be represented as one byte in Latin1.

3) Take this simple Python code:

-------------8<--------------------------
#!/usr/bin/python
# -*- coding: utf8 -*-

import sys
import cx_Oracle as dbi

if __name__ == '__main__':
    con = dbi.connect('scott/tiger@twws')
    cur = con.cursor()
    cur.execute('select data from tst')
    row = cur.fetchone()
    print row
    cur.close()
    con.close()
-------------8<--------------------------

4) Now the tests:

export NLS_LANG=GERMAN_GERMANY.WE8ISO8859P1
./tst.py

Output: ('1\xc4',)

Everything is fine, because on the client side a buffer of length 2 is allocated and the OCI define is limited to 2.

---

export NLS_LANG=GERMAN_GERMANY.UTF8
./tst.py

Output:

Traceback (most recent call last):
  File "./tst.py", line 10, in ?
    row = cur.fetchone()
cx_Oracle.DatabaseError: column at array pos 0 fetched with error: 1406

The reason, as far as I can see: 2 characters = 2 bytes on the server side expand to 2 characters = 3 bytes on the client side. OCI reports that the (db-side) length of that variable is 2, so 2 bytes are allocated. The fetch is done into a 2-byte array, which is too small => error 1406 (ORA-01406: fetched column value was truncated). (Comment: with varchar2(1) it seems to work, because you add some bytes to the buffer length in any case.)

5) I have seen that you made the assumption that strings (not LOBs) have a maximum length of 4000:

StringVar.c:#define MAX_STRING_LENGTH 4000

As far as I understand the Oracle documentation, this is true for the byte representation on the db side. It is not true on the client side.

6) I changed your code to get a simple trace facility; if you would like that code, contact me. I found the following paragraph of code in Environment.c:

--------------8<---------------------
#ifdef OCI_NLS_CHARSET_MAXBYTESZ
    status = OCINlsNumericInfoGet(environment->handle,
            environment->errorHandle,
            &environment->maxBytesPerCharacter,
            OCI_NLS_CHARSET_MAXBYTESZ);
    if (Environment_CheckForError(environment, status,
            "Environment_New(): get max bytes per character") < 0) {
        Py_DECREF(environment);
        return NULL;
    }

    // acquire whether character set is fixed width
    status = OCINlsNumericInfoGet(environment->handle,
            environment->errorHandle, &environment->fixedWidth,
            OCI_NLS_CHARSET_FIXEDWIDTH);
    if (Environment_CheckForError(environment, status,
            "Environment_New(): determine if charset is fixed width") < 0) {
        Py_DECREF(environment);
        return NULL;
    }
#endif
    trace("Environment.c: maxBytesPerCharacter: %d\n",
            environment->maxBytesPerCharacter);   // that's from me
--------------8<---------------------

This value was 3 when I worked with UTF8 and only 1 when I chose LATIN1. As far as I can see, you use this value only with LOBs. Is that true? Now my hint: could you scale the allocated client-side buffer length by maxBytesPerCharacter? A sketch of what I mean follows.
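Here is a minimal, self-contained sketch of that sizing rule. It is not cx_Oracle code; the function name and parameters are mine, and maxBytesPerCharacter stands for the value read via OCINlsNumericInfoGet() above.

--------------8<---------------------
/* sketch: size the define buffer for the client character set */
#include <stdio.h>

static unsigned int DefineBufferSize(unsigned int columnSizeInChars,
                                     unsigned int maxBytesPerCharacter)
{
    /* a VARCHAR2(n) column holds up to n characters; in the client
       character set each character may need up to
       maxBytesPerCharacter bytes */
    return columnSizeInChars * maxBytesPerCharacter;
}

int main(void)
{
    /* the failing case from point 4): VARCHAR2(2) */
    printf("Latin1 client (1 byte/char):  %u bytes\n",
            DefineBufferSize(2, 1));    /* 2, as allocated today */
    printf("UTF8 client   (3 bytes/char): %u bytes\n",
            DefineBufferSize(2, 3));    /* 6, enough for '1Ä' in UTF-8 */
    return 0;
}
--------------8<---------------------

With this rule the UTF8 run from point 4) would get a 6-byte buffer instead of the 2-byte buffer that triggers ORA-01406.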
As I'm totally new to your code, I'm having a really hard time finding all the relevant places without overlooking side effects, so I can't give you a patch. Until now I have only checked the code for "defined" variables; the same problem must exist for "bind" variables. So I'm pretty sure that the following code snippet is also not really valid:

---------------
else if (PyString_Check(value)) {
    size = PyString_GET_SIZE(value);
    trace("Variable.c: size: %d\n", size);
    if (size > MAX_STRING_LENGTH)
        varType = &vt_LongString;
------------------

Scenario: I have a little table like

create table tst (data varchar2(4000));

and a program like:

------------------
#!/usr/bin/python
# -*- coding: utf8 -*-

import sys
import cx_Oracle as dbi

if __name__ == '__main__':
    con = dbi.connect('scott/tiger@twws')
    cur = con.cursor()
    val = 'Ä' * 4000
    print 'Length: ', len(val)
    con.begin()
    cur.execute('insert into tst values (:val)', (val, ))
    con.commit()
    cur.close()
    con.close()
------------------------------

The C code above then assumes in any case that I want to bind a long variable, because the UTF-8 value is 8000 bytes. Interestingly, as long as this long value is smaller than or equal to the size of the destination varchar2 column on the db side (UTF-8 client: 8000 bytes => 4000 bytes), the insert succeeds even though it is bound as a long type; Oracle seems to perform a transparent conversion. As soon as the destination is too small (LATIN1: 8000 bytes => 8000 bytes), I get an error about long variable binding, which will confuse anyone who does not know that you internally switched to long binding (a sketch of the check I have in mind follows at the end of this mail):

Traceback (most recent call last):
  File "./i.py", line 12, in ?
    cur.execute('insert into tst values (:val)', (val, ))
cx_Oracle.DatabaseError: ORA-01461: Ein LONG-Wert kann nur zur Einfügung in eine LONG-Spalte gebunden werden
(in English: a LONG value can only be bound for insert into a LONG column)

7) In any case this is no criticism of your work. But I would like to work with a UTF-8 client-side encoding as transparently as I can with Latin1. I like cx_Oracle with Python much more than the Perl DBI stuff. As long as the semantics of characters vs. bytes stay the way they are in Python 2.x (before Python 3000), the byte strings delivered by cx_Oracle to Python should reflect the correct encoding of the db client side. I hope you can correct this problem. As you can see, I am happy to invest some time to assist you.

Best regards
Andreas Mock

P.S.: Ignore all mistakes in this English text.

P.P.S.: I found the following documentation on byte expansion when using Unicode. As far as I can see, the maximum expanded buffer size can be 4000 * 4 in the worst case:

http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10749/ch7progrunicode.htm#i1006452

>>> "Anthony Tuininga" <ant...@gm...> 23.01.2007 18:59 >>>
On 1/23/07, matilda matilda <ma...@gr...> wrote:
> As far as I know, sqlplus really does double the internal buffers
> to do the communication for the following scenario:
> DB character set is Latin1 => I can store a maximum of 4000
> German umlauts in a string column. That results in 8000
> bytes on the client side if there is a multibyte encoding enabled.

Hmm, that's worth trying, I suppose. Would you be able to provide a
small test case that demonstrates the problem? Something that has 4000
characters in it which will actually result in a considerably larger
number of bytes. I can then try that and see if your assumption is
correct. If it is, then the matter is fairly simple to address.
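P.P.P.S.: To make the bind hint above concrete, here is a minimal, self-contained sketch of the cut-over check I have in mind. It is not a patch against Variable.c; the function name and parameters are mine, and maxBytesPerCharacter is again the value from Environment.c.

---------------8<---------------------
/* sketch: only fall back to LONG binding when the value cannot fit
   a VARCHAR2(4000) even after client-to-server conversion */
#include <stdio.h>

#define MAX_STRING_LENGTH 4000   /* server-side byte limit, as in StringVar.c */

static int NeedsLongBinding(unsigned int sizeInBytes,
                            unsigned int maxBytesPerCharacter)
{
    /* sizeInBytes is the client-side length (PyString_GET_SIZE);
       each character occupies at most maxBytesPerCharacter of those
       bytes, so up to MAX_STRING_LENGTH * maxBytesPerCharacter client
       bytes may still fit the server-side limit after conversion */
    return sizeInBytes > MAX_STRING_LENGTH * maxBytesPerCharacter;
}

int main(void)
{
    /* 'Ä' * 4000 on a UTF-8 client is 8000 bytes */
    printf("current rule:  %s\n",
            8000 > MAX_STRING_LENGTH ? "LONG" : "VARCHAR2");   /* LONG */
    printf("proposed rule: %s\n",
            NeedsLongBinding(8000, 3) ? "LONG" : "VARCHAR2");  /* VARCHAR2 */
    return 0;
}
---------------8<---------------------

With maxBytesPerCharacter = 3 the 8000-byte value stays below 12000 and would be bound as an ordinary string, so a value that really is too long would produce an understandable length error from Oracle instead of the confusing ORA-01461.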