[cx-oracle-users] cx_Oracle enhancement proposal: unicode
From: Amaury F. <Ama...@gl...> - 2006-01-19 16:29:19
Hello,

For my application I need to add unicode support to cx_Oracle. I even volunteer to do the development (but don't hold your breath). Before that I want to know whether cx_Oracle users have ideas on how unicode should work with cx_Oracle, what should be allowed, and what would be cool to have. I am sure I missed some important points.

So here is the first cxEP (cx_Oracle Enhancement Proposal). See if it would fit your needs, and please send me all your additions, questions, remarks... And long live cx_Oracle!

Abstract
========

This cxEP describes how to add unicode support to cx_Oracle.

Current State
=============

cx_Oracle 4.1.2 does not accept unicode strings at all, and strings are always passed 'as is' to the underlying OCI functions.

Strings in cx_Oracle are found in two areas:
- system strings: SQL and PL/SQL statements, connect strings, transaction IDs...
- data strings: CHAR, VARCHAR, CLOB...

The default character set is defined by the surrounding environment (the NLS_LANG variable, the Windows registry, or system defaults). By default, all strings are assumed to be encoded in this client character set. This behaviour should be clearly advertised, and will not change.

There is also a National Character Set (NLS_NCHAR), used for NCHAR data types.

Proposal
========

The following features are proposed:
- the ability to control the character set used for strings, and to use unicode strings;
- the ability to deal with unicode and NCHAR data.

Default Character Set
=====================

Two module functions, getdefaultencoding() and setdefaultencoding(encoding), let the application control the default character set used by the Oracle client:
- getdefaultencoding() returns a python encoding name: latin_1 instead of WE8ISO8859P1.
- setdefaultencoding() accepts both python and Oracle encoding names.

Unicode strings are accepted anywhere plain strings are, and are automatically encoded with this default encoding. An option (todo: find a good name) can be used to select a different error handling scheme. The default value is 'strict'; other possible values are 'ignore' and 'replace' (the latter is Oracle's default behaviour). See the python help for `unicode.encode`. A short usage sketch is given below, after the NCHAR section.

Changing the encoding while a cursor is open is likely to return funny results.

Note that, by construction, returned strings are always encoded with the default encoding and can easily be converted into unicode.

NCHAR Data and Unicode Variables
================================

There are three new variable types, UNICODE, LONG_UNICODE and LOB_UNICODE. They hold unicode data:
- Their getvalue() methods return a unicode string.
- Their setvalue() methods accept both unicode and plain strings; the latter are decoded using getdefaultencoding().

These types are chosen when a column is described as one of the types NCHAR, NVARCHAR and NCLOB, or when building a Variable directly from a unicode value. These types are also preferred over their string counterparts when getdefaultencoding() is a multi-byte encoding. (Todo: make this an option, and give it a name. The default should be False, to keep the current behaviour.)

Note that the encoding used to represent unicode values (the National Character Set) is not relevant, as long as it can encode any unicode value. The default value on most databases, AL16UTF16, is suitable.
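To make the default character set handling more concrete, here is a rough sketch of how the two proposed module functions might be used. Nothing in it exists in cx_Oracle 4.1.2 yet: getdefaultencoding(), setdefaultencoding() and the automatic encoding of unicode binds are exactly what this proposal adds, and the connect string, table and column names are only placeholders.

    # Sketch only: the two module-level functions below are the ones
    # proposed above; they do not exist in cx_Oracle 4.1.2.
    import cx_Oracle

    print cx_Oracle.getdefaultencoding()     # e.g. 'latin_1' for WE8ISO8859P1
    cx_Oracle.setdefaultencoding("utf_8")    # a python name; an Oracle name
                                             # such as AL32UTF8 would also work

    connection = cx_Oracle.connect("scott/tiger@orcl")   # placeholder
    cursor = connection.cursor()

    # Unicode statements and bind values would be encoded automatically
    # with the default encoding before being passed to OCI.
    cursor.execute(u"insert into names (name) values (:name)",
                   name=u"Amaury Forgeot d'Arc")
    connection.commit()

    # Returned strings stay encoded with the default encoding and can be
    # decoded explicitly by the application.
    cursor.execute("select name from names")
    raw_name = cursor.fetchone()[0]
    print raw_name.decode(cx_Oracle.getdefaultencoding())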
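In the same spirit, here is a sketch of how the proposed UNICODE variable type might behave. Again, cx_Oracle.UNICODE, LONG_UNICODE and LOB_UNICODE do not exist yet; the "pupils" table is a placeholder with an NVARCHAR2 column, and the point is only to illustrate setvalue()/getvalue() and the automatic choice of these types for NCHAR/NVARCHAR/NCLOB columns.

    # Sketch only: cx_Oracle.UNICODE is the variable type proposed above,
    # not an existing type.
    import cx_Oracle

    connection = cx_Oracle.connect("scott/tiger@orcl")   # placeholder
    cursor = connection.cursor()

    # Build a unicode variable explicitly; setvalue() would also accept a
    # plain string, decoded with getdefaultencoding().
    name_var = cursor.var(cx_Oracle.UNICODE)
    name_var.setvalue(0, u"\u00e9l\u00e8ve")             # u"élève"

    cursor.execute(u"insert into pupils (name) values (:name)", name=name_var)
    connection.commit()

    # A column described as NCHAR, NVARCHAR or NCLOB would be fetched
    # through the same variable type, so the value comes back as unicode.
    cursor.execute("select name from pupils")
    print repr(cursor.fetchone()[0])                     # u'\xe9l\xe8ve'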
Open Issues
===========

- RAW, LONG RAW and BLOB datatypes are not subject to the Character Set, and are passed unchanged from Client to Server. Should they accept and convert unicode data?

--
Amaury Forgeot d'Arc
Ubix Development
www.ubitrade.com