From: Stuart B. <st...@st...> - 2005-01-02 04:04:12
Ian Bicking wrote:
|> Is there actually a use case for allowing each column to have a
|> different encoding?
|
| Probably not. Though there's probably a use case for every database
| connection to have a different encoding.

The use case is for non-Unicode aware databases that want to store text
in a particular character set. I imagine the encoding would be specified
as part of the connection string. There will be an efficiency advantage
to specifying the encoding for Unicode-aware databases too, if the
database supports multiple client encodings (as PostgreSQL does),
although I suspect it wouldn't be noticeable unless SQLObject is
modified to use bound parameters.

|> I know for PostgreSQL it is simply a matter of setting the database
|> encoding to Unicode and sending everything as UTF-8 by simply encoding
|> the entire query (which takes care of other issues like Unicode column
|> names as well). The only use cases I can come up with for your
|> scenario should be using BINARY columns instead of VARCHAR - in
|> particular, since the database doesn't know the encoding you are using
|> then all your basic string operations, sorting etc. are now broken.
|
| This suggests we should do it in a way that we allow Unicode-aware
| databases to get Unicode data directly, and other databases use
| transparent encoding.
|
|> Hmm... perhaps if you need to store text in some encoding that doesn't
|> contain the ASCII character set it might be necessary, but I don't
|> know what character sets these are or if any databases actually
|> support them. I've gone through the list of encodings PostgreSQL
|> supports and they all contain the basic latin letters and can be used
|> to encode SQL statements, so I suspect this is not a requirement.
|
| That seems overly aggressive. It just feels very wrong to encode the
| entire query.

Ideally, we just throw a Unicode SQL command at the database driver
(which, for PostgreSQL, is possible with psycopg2). For psycopg 1, you
have to take care of the encoding yourself, which is simply a matter of
issuing a 'SET client_encoding TO UNICODE' and then encoding all Unicode
strings as UTF-8 <rant>(because PostgreSQL, like Java, seems to have
decided Unicode == UTF8)</rant>.

Encoding the entire query has the advantage that Unicode column names,
Unicode table names, Unicode in WHERE clauses etc. are all handled
correctly, e.g.:

    Foo.select(u"WHERE name >= '\N{LATIN CAPITAL LETTER A WITH GRAVE}'")

If we don't encode the entire query, developers have to worry about
which parts of SQLObject require ASCII-only strings and which parts
accept Unicode strings, which is really frustrating to those of us
following the recommended 'Unicode everywhere' practice. It will also
cause trouble with modern DB drivers that happily accept Unicode strings
and do the right thing, because you have no idea what encoding the
connection is set to use. It could be argued that the correct thing for
such a driver to do if it receives a non-ASCII traditional string is to
raise an exception (since the encoding is not known, it can't tell the
backend what it is).

I don't see any advantage to only encoding portions of the query.
Forcing parts of the SQL statement to remain ASCII would be needlessly
restrictive and a source of bugs (since Unicode strings are viral in
Python, you often find them cropping up in places you didn't expect
them).

Internally, we have patched SQLObject to *always* return Unicode strings
and transparently encode/decode.
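To make that concrete, the idea boils down to something like the sketch
below. This is only a minimal illustration, not our actual patch; the
class and method names are invented for the example, and it assumes the
connection knows what character set the backend expects:

    # Minimal sketch of the transparent encode/decode idea -- not the
    # real patch.  Assumes the connection knows the backend's character
    # set and that the driver wants byte strings.
    class TransparentEncoder:
        def __init__(self, encoding='utf-8'):
            self.encoding = encoding

        def to_db(self, value):
            # Unicode headed for the database is encoded to the
            # connection's character set; byte strings pass through.
            if isinstance(value, unicode):
                return value.encode(self.encoding)
            return value

        def from_db(self, value):
            # Everything coming back is decoded, so application code
            # only ever sees unicode objects.
            if isinstance(value, str):
                return value.decode(self.encoding)
            return value

With something like that wired into the column machinery, an attribute
such as a hypothetical person.name always comes back as a unicode
object, whatever bytes the backend actually stored.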
I'd say *this* (always returning Unicode) might be overly aggressive
because it is not backwards compatible (which is the reason we never
pushed this patch back upstream), but there needs to be an option to do
it this way, because 'Unicode everywhere' has been recommended practice
ever since Unicode support was first bolted onto Python. It also means
that when I'm wearing my DBA hat I don't have to worry about other
developers pissing in the pool and polluting my nice clean database with
meaningless bytestreams.

I think the following may be a good design, which maintains backwards
compatibility for people working with legacy systems or who are
ideologically opposed to working with Unicode strings. It isn't best
practice, but might be common ground. It also doesn't involve much
work ;)

1) The bulk of SQLObject doesn't care what sort of strings it sees. It
   just works with strings as Python intended.

2) At the point of issuing the cursor.execute(), the query is encoded
   into the encoding the database backend expects to see (or just passed
   through as a Unicode string if the driver supports that). For
   non-Unicode aware databases, the developer will need to specify the
   encoding when opening the connection (defaulting to ASCII).

3) When results are retrieved, they are returned as-is (traditional
   string, encoded) if the column type is StringCol, or decoded into
   Unicode if the column type is UnicodeCol. I don't think adding a
   'unicode=True' parameter to StringCol would be good, as developers
   will forget to add it and we end up with hidden bugs again.

4) Docs are updated to use UnicodeCol rather than StringCol.

Point 4 is actually important, as otherwise people will continue to use
StringCol. This will cause them trouble when they throw Unicode at it
(which stores correctly, but they get encoded strings back). A rough
sketch of what points 2 and 3 might look like in code is in the P.S.
below my sig.

--
Stuart Bishop <st...@st...>
http://www.stuartbishop.net/
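P.S. For anyone who wants to see roughly what points 2 and 3 would mean
in code, here is a minimal sketch. The class and function names are
invented for illustration and don't reflect any real SQLObject or DB-API
internals; it assumes a per-connection encoding that defaults to ASCII:

    # Rough sketch of points 2 and 3 only -- illustrative names, not a
    # real API.  Assumes a per-connection 'encoding' defaulting to ASCII.
    class EncodingConnection:
        def __init__(self, cursor, encoding='ascii',
                     driver_takes_unicode=False):
            self.cursor = cursor
            self.encoding = encoding
            self.driver_takes_unicode = driver_takes_unicode

        def execute(self, query):
            # Point 2: encode the whole query just before it reaches the
            # driver, unless the driver accepts unicode directly.
            if isinstance(query, unicode) and not self.driver_takes_unicode:
                query = query.encode(self.encoding)
            return self.cursor.execute(query)

    def value_from_database(value, is_unicode_col, encoding='ascii'):
        # Point 3: StringCol results are returned as-is (encoded byte
        # strings); UnicodeCol results are decoded into unicode objects.
        if is_unicode_col and isinstance(value, str):
            return value.decode(encoding)
        return value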