Re: [cx-oracle-users] WITH_UNICODE in Python 2.x. Why ?

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

On Sat, Mar 13, 2010 at 1:54 PM, Michael Bayer <mi...@zz...> wrote:
> First, a bug report involving WITH_UNICODE mode.   I only am familiar with this option as one of my users (i.e. of SQLAlchemy) insists he must use this mode in production.
>
> cx_Oracle is generally pretty good at loudly rejecting non-unicode objects in this mode, unless you send across a string as a bind:
>
> import cx_Oracle
>
> conn = cx_Oracle.connect(
> dsn = u'(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=172.16.248.131)(PORT=1521)))(CONNECT_DATA=(SID=xe)))',
>    password=u'tiger',
>    user=u'scott'
> )
>
> cursor = conn.cursor()
>
> cursor.execute(u"select :name from dual", {'name':u'some name'})
> result = cursor.fetchall()[0][0]
> assert result == u'some name', result  # passes
>
> cursor.execute(u"select :name from dual", {'name':'some name'})
> result = cursor.fetchall()[0][0]
> assert result == u'some name', result  # does not raise an exception.  Returns: 736F6D65206E616D65
>
>
> In SQLAlchemy, we're going to ensure that a plain string never gets passed, but it still seems inconsistent and dangerous that cx_Oracle simply corrupts the data in some circumstances.  For this reason I'm planning on having SQLA issue a huge warning when the presence of this flag is detected, that they must never stray from the confines of SQLA's bind handling or fear the wrath of odd hex-encoded data being generated without their knowledge.

Ok, this is not a "bug" but a "feature". :-) In Python 3.x strings are
considered "binary" data and so Python 2.x in Unicode mode is doing
the same. That could, however, be turned off and in Unicode mode
insist that buffer objects be passed -- or the call to
cx_Oracle.Binary() be used. I'd appreciate other people's feedback on
whether this would be considered an improvement or not. I can see both
sides of the argument, though.

> Secondly, detecting that WITH_UNICODE was used is a trick !   Right now I just check that the version >= 5 and that cx_Oracle.UNICODE does not exist.   It would be nice if an explicit bit of information were available to detect this mode.

That is the only way to do so, currently. I could add a "WITH_UNICODE"
boolean flag at the module level if that would be considered helpful.

> and thirdly, *why* is this mode, in its current form, even available in Python 2.x ?   The fact that it returns Python unicodes in result sets in all cases is great, and very useful.  But the rigidity and arguable bugginess on the connect/statement/bind parameter side doesn't seem to have any clear rationale.  It would be more performant and easier on the outside world if the unicode(x) call took place within cx_oracle's native guts and not within pure-Python libraries that wish to sanitize input.

Well, the reason this mode was added is partly historical. While I was
converting the internals of cx_Oracle to be able to support Python 3.x
and Unicode I decided that the best way to do so was to add this mode
so that I could test everything in Python 2.x before even having to
consider Python 3.x and its own set of quirks. That worked out very
well and I decided to leave it in place as a convenience to those who
wish to use Unicode throughout without porting to Python 3.x. That is
also the primary reason for allowing strings to be treated as binary
data in Unicode mode. I'm not sure what you are referring to with the
"rigidity and arguable bugginess on the connect/statement/bind
parameter side" but be aware that Oracle has a "unicode" mode and in
that mode only Unicode data will be accepted. If I were to accept
strings I would have to convert them to Unicode in some fashion -- and
that implies knowing the encoding which I can't claim to know. Thus
the reason for in Unicode mode ONLY accepting Unicode data and all
strings are therefore treated as binary data. Maybe that will help you
understand the decisions made and the reasons behind some of the
"rigidity".

Note that there is another solution to getting all data returned as
Unicode if that is what is desired. You can use "normal" or
"non-Unicode" mode and simply specify a connection output type handler
that tells cx_Oracle that all strings should be returned as Unicode.
See the sample "ReturnUnicode.py" as an example. In that case,
cx_Oracle will insist upon strings for connect strings, SQL
statements, etc. -- but will accept Unicode for bind parameters and
will return Unicode in result sets. Again, this is simply due to the
way that the OCI deals with Unicode for what it terms "metadata".

Hopefully I've helped answer your questions and concerns!

Anthony