[cx-oracle-users] cx_Oracle enhancement proposal: unicode
From: Amaury F. <Ama...@gl...> - 2006-01-19 16:29:19
Hello,

For my application I need to add unicode support to cx_Oracle. I even volunteer to do the development (but don't hold your breath). Before that I want to know whether cx_Oracle users have ideas on how unicode should work with cx_Oracle, what should be allowed, and what would be cool to have. I am sure I missed some important points.

So here is the first cxEP (cx_Oracle Enhancement Proposal). See if it would fit your needs, and please send me all your additions, questions, remarks... And long live cx_Oracle!

Abstract
========

This cxEP describes how to add unicode support to cx_Oracle.

Current State
=============

cx_Oracle 4.1.2 does not accept unicode strings at all, and strings are always passed 'as is' to the underlying OCI functions.

Strings in cx_Oracle are found in two areas:
- system strings: SQL and PL/SQL statements, connect strings, transaction IDs...
- data strings: CHAR, VARCHAR, CLOB...

The default character set is defined by the surrounding environment (the NLS_LANG variable, the Windows registry, or system defaults). By default, all strings are assumed to be encoded in this client character set. This behaviour should be clearly advertised, and will not change.

There is also a National Character Set (NLS_NCHAR), used for NCHAR data types.

Proposal
========

The following features are proposed:
- the ability to control the character set used for strings, and to use unicode strings;
- the ability to deal with unicode and NCHAR data.

Default Character Set
=====================

Two module functions, getdefaultencoding() and setdefaultencoding(encoding), let the application control the default character set used by the Oracle client:
- getdefaultencoding() returns a python encoding name: latin_1 instead of WE8ISO8859P1.
- setdefaultencoding() accepts both python and Oracle encoding names.

Unicode strings are accepted anywhere plain strings are, and are automatically encoded with this default encoding. An option (todo: find a good name) can be used to select a different error handling scheme. The default value is 'strict'; other possible values are 'ignore' and 'replace' (the latter is Oracle's default behaviour). See the python help for `unicode.encode`. A short usage sketch is given below, after the NCHAR section.

Changing the encoding while a cursor is open is likely to return funny results.

Note that, by construction, returned strings are always encoded with the default encoding and can easily be converted into unicode.

NCHAR Data and Unicode Variables
================================

There are three new variable types, UNICODE, LONG_UNICODE and LOB_UNICODE. They hold unicode data:
- Their getvalue() methods return a unicode string.
- Their setvalue() methods accept both unicode and plain strings; the latter are decoded using getdefaultencoding().

These types are chosen when a column is described as one of the types NCHAR, NVARCHAR and NCLOB, or when building a Variable directly from a unicode value. These types are also preferred over their string counterparts when getdefaultencoding() is a multi-byte encoding. (Todo: make this an option, and give it a name. The default should be False, to keep the current behaviour.)

Note that the encoding used to represent unicode values (the National Character Set) is not relevant, as long as it can encode any unicode value. The default value on most databases, AL16UTF16, is suitable.
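To make the default character set handling more concrete, here is a rough sketch of how the two proposed module functions might be used. Nothing in it exists in cx_Oracle 4.1.2 yet: getdefaultencoding(), setdefaultencoding() and the automatic encoding of unicode binds are exactly what this proposal adds, and the connect string, table and column names are only placeholders.

    # Sketch only: the two module-level functions below are the ones
    # proposed above; they do not exist in cx_Oracle 4.1.2.
    import cx_Oracle

    print cx_Oracle.getdefaultencoding()     # e.g. 'latin_1' for WE8ISO8859P1
    cx_Oracle.setdefaultencoding("utf_8")    # a python name; an Oracle name
                                             # such as AL32UTF8 would also work

    connection = cx_Oracle.connect("scott/tiger@orcl")   # placeholder
    cursor = connection.cursor()

    # Unicode statements and bind values would be encoded automatically
    # with the default encoding before being passed to OCI.
    cursor.execute(u"insert into names (name) values (:name)",
                   name=u"Amaury Forgeot d'Arc")
    connection.commit()

    # Returned strings stay encoded with the default encoding and can be
    # decoded explicitly by the application.
    cursor.execute("select name from names")
    raw_name = cursor.fetchone()[0]
    print raw_name.decode(cx_Oracle.getdefaultencoding())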
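In the same spirit, here is a sketch of how the proposed UNICODE variable type might behave. Again, cx_Oracle.UNICODE, LONG_UNICODE and LOB_UNICODE do not exist yet; the "pupils" table is a placeholder with an NVARCHAR2 column, and the point is only to illustrate setvalue()/getvalue() and the automatic choice of these types for NCHAR/NVARCHAR/NCLOB columns.

    # Sketch only: cx_Oracle.UNICODE is the variable type proposed above,
    # not an existing type.
    import cx_Oracle

    connection = cx_Oracle.connect("scott/tiger@orcl")   # placeholder
    cursor = connection.cursor()

    # Build a unicode variable explicitly; setvalue() would also accept a
    # plain string, decoded with getdefaultencoding().
    name_var = cursor.var(cx_Oracle.UNICODE)
    name_var.setvalue(0, u"\u00e9l\u00e8ve")             # u"élève"

    cursor.execute(u"insert into pupils (name) values (:name)", name=name_var)
    connection.commit()

    # A column described as NCHAR, NVARCHAR or NCLOB would be fetched
    # through the same variable type, so the value comes back as unicode.
    cursor.execute("select name from pupils")
    print repr(cursor.fetchone()[0])                     # u'\xe9l\xe8ve'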
Open Issues
===========

- RAW, LONG RAW and BLOB datatypes are not subject to the Character Set, and are passed unchanged from Client to Server. Should they accept and convert unicode data?

--
Amaury Forgeot d'Arc
Ubix Development
www.ubitrade.com