Re: [cx-oracle-users] WITH_UNICODE in Python 2.x. Why ?
From: Michael B. <mi...@zz...> - 2010-03-15 15:34:03
Anthony Tuininga wrote:
> Well, the reason this mode was added is partly historical. While I was
> converting the internals of cx_Oracle to be able to support Python 3.x
> and Unicode I decided that the best way to do so was to add this mode
> so that I could test everything in Python 2.x before even having to
> consider Python 3.x and its own set of quirks. That worked out very
> well and I decided to leave it in place as a convenience to those who
> wish to use Unicode throughout without porting to Python 3.x.

When I learned about this flag, it seemed apparent to me that the mode was created just to test Py3's behavior under 2.x, and in that sense it is appropriate that it work the way it does.

The issue I have is that one of my users feels that this mode must be used in a production Python 2.x application. I'm not aware of what the rationale would be for this. I'd rather not have SQLAlchemy needing to support this mode of operation in 2.x (note that in 3.x, all strings are unicode by default so it is not an issue on that side), since it adds some conditionals to the internals that are superfluous in almost all cases.

My fear is that this user thinks the only way to get "unicode" back in results (for their other cx_Oracle, non-SQLAlchemy applications running in the same environment) is through this flag, which as you've mentioned below is not the case. So perhaps something helpful here would be a documentation note from the creator of cx_Oracle that WITH_UNICODE mode doesn't buy you anything in a production application that you can't get in easier ways; it's only for a specific kind of testing.

> I'm not sure what you are referring to with the
> "rigidity and arguable bugginess on the connect/statement/bind
> parameter side"
> but be aware that Oracle has a "unicode" mode and in
> that mode only Unicode data will be accepted.
> If I were to accept
> strings I would have to convert them to Unicode in some fashion -- and
> that implies knowing the encoding which I can't claim to know.

It is rigid in that even the strings passed to connect() cannot be plain Python strings, nor can the statements passed to execute(), strings that in the overwhelmingly vast majority of cases will be simple ascii values. I am not aware of any other driver with this limitation, including some that are all-unicode, such as Pysqlite. It is standard practice for a unicode-aware application to treat plain bytestrings as ascii (or whatever default encoding the interpreter is set up with). If the bytestring contains non-ascii then it raises an error. Running the string through the unicode() builtin with no other arguments provides this effect.

It is "arguably buggy" regarding my previous test case: passing in a plain bytestring returns back a hex-encoded string in a roundtrip, and not the exact same thing you passed in. I'm not too concerned about it but you can be sure that others would find it surprising.

Based on my experience with Python drivers, the ideal "unicode" behavior includes that Python unicode objects are accepted for all operations, including the connect arguments, the statement, and the bind parameter keys and values, in which case the underlying mechanics of Oracle encoding handle the details. But arguments passed as strings are decoded from the platform's default encoding, using unicode(x) or similar, and will immediately raise if this is not the case. All result strings are returned as unicode regardless of source. Drivers which provide this behavior include Pysqlite, pg8000, and pyodbc. These are the easiest drivers for us to support in SQLAlchemy w.r.t. encoding issues.

> Note that there is another solution to getting all data returned as
> Unicode if that is what is desired.
> You can use "normal" or
> "non-Unicode" mode and simply specify a connection output type handler
> that tells cx_Oracle that all strings should be returned as Unicode.
> See the sample "ReturnUnicode.py" as an example. In that case,
> cx_Oracle will insist upon strings for connect strings, SQL
> statements, etc. -- but will accept Unicode for bind parameters and
> will return Unicode in result sets. Again, this is simply due to the
> way that the OCI deals with Unicode for what it terms "metadata".

This is good to know and I will look into using this method for SQLAlchemy's default behavior with Oracle, as anything that removes the need for us to encode/decode on the Python side improves performance for us.

thanks for your answers !

- mike
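
For reference, the output type handler approach described in the quoted text above can be sketched roughly as follows. This is only a minimal sketch and not the contents of the "ReturnUnicode.py" sample: the helper name make_unicode_handler is hypothetical, and the driver module is passed in as an argument so the snippet itself does not need cx_Oracle installed. It assumes the driver calls the handler with (cursor, name, default type, size, precision, scale) and exposes STRING and FIXED_CHAR type constants, as cx_Oracle does.

```python
def make_unicode_handler(dbapi):
    """Build an output type handler that asks for text columns as unicode.

    `dbapi` is the driver module (cx_Oracle in this thread). This helper
    is a hypothetical name used for illustration only.
    """
    # type(u"") is `unicode` on Python 2 and `str` on Python 3.
    text_type = type(u"")
    string_types = (dbapi.STRING, dbapi.FIXED_CHAR)

    def handler(cursor, name, default_type, size, precision, scale):
        if default_type in string_types:
            # Ask the driver to fetch this column into a unicode variable.
            return cursor.var(text_type, size, cursor.arraysize)
        # Returning None leaves the driver's default handling in place.
        return None

    return handler


# Assumed usage against an already opened cx_Oracle connection:
#   import cx_Oracle
#   connection.outputtypehandler = make_unicode_handler(cx_Oracle)
```

With a handler like this installed, connect strings and SQL statements stay plain strings while result strings come back as unicode, which is the split described in the quoted text.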