Re: [cx-oracle-users] WITH_UNICODE in Python 2.x. Why ?

Anthony Tuininga wrote:
> Well, the reason this mode was added is partly historical. While I was
> converting the internals of cx_Oracle to be able to support Python 3.x
> and Unicode I decided that the best way to do so was to add this mode
> so that I could test everything in Python 2.x before even having to
> consider Python 3.x and its own set of quirks. That worked out very
> well and I decided to leave it in place as a convenience to those who
> wish to use Unicode throughout without porting to Python 3.x.

When I learned about this flag, it seemed apparent to me that the mode was
created just to test Python 3's behavior under 2.x, and in that sense it is
appropriate that it works the way it does.

The issue I have is that one of my users feels that this mode must be used
in a production Python 2.x application. I'm not aware of what the
rationale for this would be. I'd rather not have SQLAlchemy support this
mode of operation in 2.x (in 3.x all strings are unicode by default, so it
is not an issue on that side), since it adds some conditionals to the
internals that are superfluous in almost all cases. My fear is that this
user thinks the only way to get "unicode" back in results (for their other
cx_Oracle, non-SQLAlchemy applications running in the same environment) is
through this flag, which, as you've mentioned below, is not the case.

So perhaps something helpful here would be a documentation note from
the creator of cx_Oracle stating that WITH_UNICODE mode doesn't buy you
anything in a production application that you can't get in easier ways;
it's only for a specific kind of testing.

> I'm not sure what you are referring to with the
> "rigidity and arguable bugginess on the connect/statement/bind
> parameter side"
> but be aware that Oracle has a "unicode" mode and in
> that mode only Unicode data will be accepted. If I were to accept
> strings I would have to convert them to Unicode in some fashion -- and
> that implies knowing the encoding which I can't claim to know.

It is rigid in that even the strings passed to connect() cannot be plain
Python strings, nor can the statements passed to execute() - strings that
in the vast majority of cases will be simple ASCII values.

I am not aware of any other driver with this limitation, including some
that are all-unicode, such as Pysqlite. It is standard practice for a
unicode-aware application to treat plain bytestrings as ASCII (or whatever
the interpreter's default encoding is set to). If the bytestring contains
non-ASCII bytes, an error is raised. Running the string through the
unicode() builtin with no other arguments provides this effect.
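
As a rough illustration of that convention, here is a minimal sketch of the
coercion step. The helper name coerce_to_unicode and its "encoding"
parameter are hypothetical, not part of any driver, and the sketch assumes
the interpreter's default encoding is ASCII, which is the usual case on
Python 2.x. It is written so that it also runs on Python 3.x.

```python
# Sketch only: coerce_to_unicode is a hypothetical helper, not part of
# cx_Oracle or any other driver.
text_type = type(u"")  # unicode on Python 2.x, str on Python 3.x


def coerce_to_unicode(value, encoding="ascii"):
    """Return value as a unicode object, decoding bytestrings strictly."""
    if isinstance(value, text_type):
        return value  # already unicode, pass through untouched
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2.x
        # Strict decoding raises UnicodeDecodeError on non-ASCII bytes.
        return value.decode(encoding)
    raise TypeError("expected a string, got %r" % type(value))
```

A plain ASCII statement such as b"select 1 from dual" passes through as
unicode, while a bytestring holding UTF-8 encoded text fails loudly instead
of being silently mangled.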

It is "arguably buggy" with regard to my previous test case: passing in a
plain bytestring returns a hex-encoded string in a round trip, not the
exact value you passed in. I'm not too concerned about it, but you can be
sure that others would find it surprising.

Based on my experience with Python drivers, the ideal "unicode" behavior
is that Python unicode objects are accepted for all operations, including
the connect arguments, the statement, and the bind parameter keys and
values, in which case the underlying mechanics of Oracle encoding handle
the details. Arguments passed as plain strings are decoded from the
platform's default encoding, using unicode(x) or similar, and raise
immediately if they cannot be decoded. All result strings are returned as
unicode regardless of source. Drivers which provide this behavior include
Pysqlite, pg8000, and pyodbc. These are the easiest drivers for us to
support in SQLAlchemy with respect to encoding issues.

> Note that there is another solution to getting all data returned as
> Unicode if that is what is desired. You can use "normal" or
> "non-Unicode" mode and simply specify a connection output type handler
> that tells cx_Oracle that all strings should be returned as Unicode.
> See the sample "ReturnUnicode.py" as an example. In that case,
> cx_Oracle will insist upon strings for connect strings, SQL
> statements, etc. -- but will accept Unicode for bind parameters and
> will return Unicode in result sets. Again, this is simply due to the
> way that the OCI deals with Unicode for what it terms "metadata".

This is good to know and I will look into using this method for
SQLAlchemy's default behavior with Oracle, as anything that removes the
need for us to encode/decode on the Python side improves performance for
us.
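
For reference, a rough sketch of what such an output type handler might
look like, loosely modeled on the approach described for ReturnUnicode.py.
The factory name make_unicode_output_handler and its two parameters are
hypothetical and exist only so the sketch can run without cx_Oracle or a
database; the actual hookup is shown in comments and is untested here.

```python
def make_unicode_output_handler(string_db_types, text_type):
    """Build an output type handler that fetches string columns as text_type."""
    def handler(cursor, name, default_type, size, precision, scale):
        if default_type in string_db_types:
            # Ask the cursor for a variable of the unicode type instead.
            return cursor.var(text_type, size, cursor.arraysize)
        return None  # anything else keeps the driver's default handling
    return handler

# Intended use under Python 2.x (not executed here, needs a database):
#   import cx_Oracle
#   connection = cx_Oracle.connect(...)   # connect arguments omitted
#   connection.outputtypehandler = make_unicode_output_handler(
#       (cx_Oracle.STRING, cx_Oracle.FIXED_CHAR), unicode)
```

With a handler like this installed on the connection, connect strings and
SQL statements would still be passed as plain strings, and only the fetched
string columns would come back as unicode.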

Thanks for your answers!

- mike