[I tried sending this to the email address I have for Andy Dustman, but it got bounced]
Hi. I'm a long-time MySQLdb user (as part of the P4DTI project). I'm
changing the way that we use unicode data, and I need to better
understand MySQLdb's treatment of it, including in some older versions
of MySQLdb.
Looking through some older MySQLdb source code, it looks as if the
'use_unicode' argument was introduced in 1.2.0.
Before that there was a single 'unicode' argument, in which one passes
the character encoding name (e.g. 'utf8').
After 1.2.0 one can change the connection character set, and specify
the character set in a 'charset' argument.
That is,
pre-1.2.0: unicode='utf8'
1.2.0: use_unicode=True
post-1.2.0: charset='utf8'
Is this right?
If my understanding is correct, there's no way to enforce the
character set for 1.2.0. Is that right? I expect I will deprecate
MySQLdb releases up to and including 1.2.0 for the P4DTI, and insist
on 1.2.1 or later for users interested in Unicode.
I'm mostly developing against MySQLdb 1.2.2 on Windows XP and FreeBSD.
Secondly:
If I have a MySQLdb connection object in my hand, the documentation
says I should be able to get at the encoding through a 'charset'
attribute. But the connection object doesn't seem to have one, and
neither does the cursor. Rooting about in the source I see that I can
say "connection.character_set_name()". Is this the approved way?
Finally:
Is metadata (e.g. names of databases, tables, columns, indices, users,
passwords) ever non-ASCII? If so, how is it encoded?
Thanks in advance, and thanks again for MySQLdb.
Nick Barnes
P4DTI Project
Ravenbrook Limited
use_unicode is a boolean; it only indicates that MySQL character data (i.e. all CHAR, VARCHAR, and TEXT columns, but not BINARY columns) are returned as Python unicode objects using the connection's character set. You need at least MySQL-4.0 (as I recall) to get the current character set value, and MySQL-4.1 to set the current character set.
charset will set the connection character set at connection-time. Setting charset automatically sets use_unicode=True, though you can override this with use_unicode=False.
You do not have to use both of these together. charset is only needed if you need to override the default connection character set, which is configured on the server side. use_unicode is needed if you want to get back Python unicode objects from your SELECT statements. You do not have to set use_unicode if you want to write unicode values into your database: these will always be converted to strings using the connection's character set encoding.
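The interaction between the two arguments described above can be modelled in a few lines. This is only a sketch of the defaulting rule as stated in this reply, not MySQLdb's actual code; the function name resolve_unicode_options and the database name in the usage comment are hypothetical.

```python
def resolve_unicode_options(charset=None, use_unicode=None):
    """Return (charset, use_unicode) using the defaulting rule described above.

    Hypothetical helper: it only models the rule, it does not talk to MySQL.
    """
    if use_unicode is None:
        # Supplying charset switches use_unicode on unless the caller
        # explicitly passes use_unicode=False.
        use_unicode = charset is not None
    return charset, bool(use_unicode)


# Hypothetical usage against a real server (not executed here):
#   import MySQLdb
#   conn = MySQLdb.connect(db="bugs", charset="utf8")
#   # SELECTs on CHAR/VARCHAR/TEXT columns should then return unicode objects.
```

Leaving both arguments unset keeps the server-configured connection character set and returns plain strings.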
I definitely recommend requiring at least 1.2.1, and 1.2.1 may have some encoding problems that are not in 1.2.2.
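A minimum-version check along these lines could enforce that recommendation in an application. It is a sketch that assumes the version is available as a tuple such as MySQLdb.version_info, for example (1, 2, 2, 'final', 0); the names MINIMUM_UNICODE_VERSION and supports_unicode_workflow are hypothetical.

```python
# Lowest MySQLdb release accepted for Unicode work, per the advice above.
MINIMUM_UNICODE_VERSION = (1, 2, 1)


def supports_unicode_workflow(version_info):
    """Return True if version_info is at least 1.2.1.

    Only the first three numeric fields are compared, so trailing fields
    such as 'final' or a serial number are ignored.
    """
    return tuple(version_info[:3]) >= MINIMUM_UNICODE_VERSION


# Hypothetical usage (requires MySQLdb to be installed):
#   import MySQLdb
#   if not supports_unicode_workflow(MySQLdb.version_info):
#       raise RuntimeError("MySQLdb 1.2.1 or later is required for Unicode")
```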
And yes, you should use connection.character_set_name() to get the current character set, and connection.set_character_set() if you need to change it for some reason, though I strongly recommend you set it at connection time with the charset parameter and not change it thereafter.
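If the character set name is needed for encoding or decoding in Python, something like the following might work. It is a sketch: the helper name python_codec_for and the alias table are assumptions, and the FakeConnection class exists only so the example runs without a server. MySQL names such as 'utf8' and 'latin1' happen to be valid Python codec aliases, but that is not guaranteed for every MySQL character set.

```python
import codecs

# Assumed alias table for MySQL names that Python's codec registry
# does not recognise; extend as needed.
MYSQL_TO_PYTHON = {"utf8mb4": "utf-8"}


def python_codec_for(connection):
    """Return the Python codec name matching the connection's character set."""
    mysql_name = connection.character_set_name()
    python_name = MYSQL_TO_PYTHON.get(mysql_name, mysql_name)
    # codecs.lookup raises LookupError if Python has no such codec.
    return codecs.lookup(python_name).name


class FakeConnection(object):
    """Stand-in for a MySQLdb connection, for illustration only."""

    def __init__(self, name):
        self._name = name

    def character_set_name(self):
        return self._name
```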
In the MySQL privilege tables, some of the columns are binary columns, i.e. they have a binary collation. These will always be returned as strings even if use_unicode=True. Otherwise I think most of the metadata is non-binary.
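Values from such binary columns can be decoded by hand where the caller knows they hold text. The sketch below assumes the bytes really are text in the given encoding, which the database does not guarantee for binary columns; the helper name to_text is hypothetical.

```python
def to_text(value, encoding):
    """Decode byte strings to unicode; pass every other value through.

    Assumes that any byte string passed in is text in `encoding`.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

Non-string values such as None or integers are returned unchanged, so the helper can be applied to every field in a row.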
Thanks very much, Andy. To clarify a little, my requirement is to be able to get *TEXT, CHAR, and VARCHAR fields into and out of Bugzilla databases (stored in MySQL) as Python unicode objects. Your recommendations and information will be very helpful for this.