From: Stuart B. <st...@st...> - 2005-01-02 04:04:12
Ian Bicking wrote:
|> Is there actually a use case for allowing each column to have a
|> different encoding?
|
| Probably not. Though there's probably a use case for every database
| connection to have a different encoding.

The use case is for non-Unicode aware databases that want to store text
in a particular character set. I imagine the encoding would be specified
as part of the connection string. There will be an efficiency advantage
to specifying the encoding for Unicode-aware databases too, if the
database supports multiple client encodings (as PostgreSQL does),
although I suspect it wouldn't be noticeable unless SQLObject is
modified to use bound parameters.

|> I know for PostgreSQL it is simply a matter of setting the database
|> encoding to Unicode and sending everything as UTF-8 by simply encoding
|> the entire query (which takes care of other issues like Unicode column
|> names as well). The only use cases I can come up with for your
|> scenario should be using BINARY columns instead of VARCHAR - in
|> particular, since the database doesn't know the encoding you are using
|> then all your basic string operations, sorting etc. are now broken.
|
| This suggests we should do it in a way that we allow Unicode-aware
| databases to get Unicode data directly, and other databases use
| transparent encoding.
|
|> Hmm... perhaps if you need to store text in some encoding that doesn't
|> contain the ASCII character set it might be necessary, but I don't
|> know what character sets these are or if any databases actually
|> support them. I've gone through the list of encodings PostgreSQL
|> supports and they all contain the basic latin letters and can be used
|> to encode SQL statements, so I suspect this is not a requirement.
|
| That seems overly aggressive. It just feels very wrong to encode the
| entire query.

Ideally, we just throw a Unicode SQL command at the database driver
(which, for PostgreSQL, is possible with psycopg2). For psycopg 1, you
have to take care of the encoding yourself, which is simply a matter of
issuing a 'SET client_encoding TO UNICODE' and then encoding all Unicode
strings as UTF-8 <rant>(because PostgreSQL, like Java, seems to have
decided Unicode == UTF8)</rant>.

Encoding the entire query has the advantage that Unicode column names,
Unicode table names, Unicode in WHERE clauses etc. are all handled
correctly, e.g.:

    Foo.select(u"WHERE name >= '\N{LATIN CAPITAL LETTER A WITH GRAVE}'")

If we don't encode the entire query, developers have to worry about
which parts of SQLObject require ASCII-only strings and which parts
accept Unicode strings, which is really frustrating to those of us
following the recommended 'Unicode everywhere' practice. It will also
cause trouble with modern DB drivers that happily accept Unicode strings
and do the right thing, because you have no idea what encoding the
connection is set to use. It could be argued that the correct thing for
such a driver to do if it receives a non-ASCII traditional string is to
raise an exception (since the encoding is not known, it can't tell the
backend what it is).

I don't see any advantage to only encoding portions of the query.
Forcing parts of the SQL statement to remain ASCII would be needlessly
restrictive and a source of bugs (since Unicode strings are viral in
Python, you often find them cropping up in places you didn't expect
them).

Internally, we have patched SQLObject to *always* return Unicode strings
and transparently encode/decode.
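To make that concrete, the idea boils down to something like the sketch
below. This is only a minimal illustration, not our actual patch; the
class and method names are invented for the example, and it assumes the
connection knows what character set the backend expects:

    # Minimal sketch of the transparent encode/decode idea -- not the
    # real patch.  Assumes the connection knows the backend's character
    # set and that the driver wants byte strings.
    class TransparentEncoder:
        def __init__(self, encoding='utf-8'):
            self.encoding = encoding

        def to_db(self, value):
            # Unicode headed for the database is encoded to the
            # connection's character set; byte strings pass through.
            if isinstance(value, unicode):
                return value.encode(self.encoding)
            return value

        def from_db(self, value):
            # Everything coming back is decoded, so application code
            # only ever sees unicode objects.
            if isinstance(value, str):
                return value.decode(self.encoding)
            return value

With something like that wired into the column machinery, an attribute
such as a hypothetical person.name always comes back as a unicode
object, whatever bytes the backend actually stored.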
I'd say *this* (always returning Unicode) might be overly aggressive
because it is not backwards compatible (which is the reason we never
pushed this patch back upstream), but there needs to be an option to do
it this way, because 'Unicode everywhere' has been recommended practice
ever since Unicode support was first bolted onto Python. It also means
that when I'm wearing my DBA hat I don't have to worry about other
developers pissing in the pool and polluting my nice clean database with
meaningless bytestreams.

I think the following may be a good design, which maintains backwards
compatibility for people working with legacy systems or who are
ideologically opposed to working with Unicode strings. It isn't best
practice, but might be common ground. It also doesn't involve much
work ;)

1) The bulk of SQLObject doesn't care what sort of strings it sees. It
   just works with strings as Python intended.

2) At the point of issuing the cursor.execute(), the query is encoded
   into the encoding the database backend expects to see (or just passed
   through as a Unicode string if the driver supports that). For
   non-Unicode aware databases, the developer will need to specify the
   encoding when opening the connection (defaulting to ASCII).

3) When results are retrieved, they are returned as-is (traditional
   string, encoded) if the column type is StringCol, or decoded into
   Unicode if the column type is UnicodeCol. I don't think adding a
   'unicode=True' parameter to StringCol would be good, as developers
   will forget to add it and we end up with hidden bugs again.

4) Docs are updated to use UnicodeCol rather than StringCol.

Point 4 is actually important, as otherwise people will continue to use
StringCol. This will cause them trouble when they throw Unicode at it
(which stores correctly, but they get encoded strings back). A rough
sketch of what points 2 and 3 might look like in code is in the P.S.
below my sig.

--
Stuart Bishop <st...@st...>
http://www.stuartbishop.net/
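P.S. For anyone who wants to see roughly what points 2 and 3 would mean
in code, here is a minimal sketch. The class and function names are
invented for illustration and don't reflect any real SQLObject or DB-API
internals; it assumes a per-connection encoding that defaults to ASCII:

    # Rough sketch of points 2 and 3 only -- illustrative names, not a
    # real API.  Assumes a per-connection 'encoding' defaulting to ASCII.
    class EncodingConnection:
        def __init__(self, cursor, encoding='ascii',
                     driver_takes_unicode=False):
            self.cursor = cursor
            self.encoding = encoding
            self.driver_takes_unicode = driver_takes_unicode

        def execute(self, query):
            # Point 2: encode the whole query just before it reaches the
            # driver, unless the driver accepts unicode directly.
            if isinstance(query, unicode) and not self.driver_takes_unicode:
                query = query.encode(self.encoding)
            return self.cursor.execute(query)

    def value_from_database(value, is_unicode_col, encoding='ascii'):
        # Point 3: StringCol results are returned as-is (encoded byte
        # strings); UnicodeCol results are decoded into unicode objects.
        if is_unicode_col and isinstance(value, str):
            return value.decode(encoding)
        return value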