Unicode/UTF8 in old versions

Help
2008-05-11
2012-09-19
  • Nick Barnes

    Nick Barnes - 2008-05-11

    [I tried sending this to the email address I have for Andy Dustman, but it got bounced]

    Hi. I'm a long time MySQLdb user (as part of the P4DTI project). I'm
    changing the way that we use unicode data, and I need to better
    understand MySQLdb's treatment of it, including in some older versions
    of MySQLdb.

    Looking through some older MySQLdb source code, it looks as if the
    'use_unicode' argument was introduced in 1.2.0.
    Before that there was a single 'unicode' argument, in which one passes
    the character encoding name (e.g. 'utf8').
    After 1.2.0 one can change the connection character set, and specify
    the character set in a 'charset' argument.

    That is,

    pre-1.2.0: unicode='utf8'
    1.2.0: use_unicode=True
    post-1.2.0: charset='utf8'

    Is this right?

    If my understanding is correct, there's no way to enforce the
    character set for 1.2.0. Is that right? I expect I will deprecate
    MySQLdb releases up to and including 1.2.0 for the P4DTI, and insist
    on 1.2.1 or later for users interested in Unicode.

    I'm mostly developing against MySQLdb 1.2.2 on Windows XP and FreeBSD.

    Secondly:

    If I have a MySQLdb connection object in my hand, the documentation
    says I should be able to get at the encoding through a 'charset'
    attribute. But the connection object doesn't seem to have one, and
    neither does the cursor. Rooting about in the source I see that I can
    say "connection.character_set_name()". Is this the approved way?

    Finally:

    Is metadata (e.g. names of databases, tables, columns, indices, users,
    passwords) ever non-ASCII? If so, how is it encoded?

    Thanks in advance, and thanks again for MySQLdb.

    Nick Barnes
    P4DTI Project
    Ravenbrook Limited

     
    • Andy Dustman

      Andy Dustman - 2008-05-13

      use_unicode is a boolean; it only indicates that MySQL character data (i.e. all CHAR, VARCHAR, and TEXT columns, but not BINARY columns) are returned as Python unicode objects using the connection's character set. You need at least MySQL-4.0 (as I recall) to get the current character set value, and MySQL-4.1 to set the current character set.

      charset will set the connection character set at connection-time. Setting charset automatically sets use_unicode=True, though you can override this with use_unicode=False.

      You do not have to use both of these together. charset is only needed if you need to override the default connection character set, which is configured on the server side. use_unicode is needed if you want to get back Python unicode objects from your SELECT statements. You do not have to set use_unicode if you want to write unicode values into your database: these will always be converted to strings in the connection's character set encoding.

      I definitely recommend requiring at least 1.2.1, and 1.2.1 may have some encoding problems that are not in 1.2.2.
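
A minimal sketch of how the advice above might be applied when opening a connection. The helper name `connect_kwargs` and the `MIN_VERSION` constant are hypothetical and not part of MySQLdb; the only MySQLdb features assumed are the `charset` and `use_unicode` connection arguments discussed in this thread and the module-level `MySQLdb.version_info` tuple.

```python
# Minimum MySQLdb release recommended in this thread.
MIN_VERSION = (1, 2, 1)


def connect_kwargs(version_info, charset="utf8", **extra):
    """Build keyword arguments for MySQLdb.connect() (hypothetical helper).

    version_info is a tuple such as MySQLdb.version_info; only the first
    three numeric fields are compared.
    """
    found = tuple(version_info[:3])
    if found < MIN_VERSION:
        raise RuntimeError(
            "MySQLdb %s is too old; need at least %s"
            % (".".join(map(str, found)), ".".join(map(str, MIN_VERSION)))
        )
    kwargs = dict(extra)
    # Set the connection character set at connection time ...
    kwargs["charset"] = charset
    # ... and ask for character columns to come back as unicode objects.
    # (charset already implies this; it is spelled out for clarity.)
    kwargs["use_unicode"] = True
    return kwargs
```

It could then be used along these lines, where the host and database names are placeholders: `MySQLdb.connect(**connect_kwargs(MySQLdb.version_info, host="localhost", db="bugs"))`.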

      And yes, you should use connection.character_set_name() to get the current character set, and connection.set_character_set() if you need to change it for some reason, though I strongly recommend you set it at connection time with the charset parameter and not change it thereafter.
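
As an illustration of the point above, the connection's character set can be checked once, right after connecting. The function `check_charset` is a hypothetical helper, not something MySQLdb provides; it relies only on `connection.character_set_name()`.

```python
def check_charset(conn, expected="utf8"):
    """Return the connection character set, or raise if it is not `expected`.

    `conn` is assumed to be a MySQLdb connection, or any object that
    offers a character_set_name() method.
    """
    actual = conn.character_set_name()
    if actual != expected:
        raise ValueError(
            "connection character set is %r, expected %r" % (actual, expected)
        )
    return actual
```

Failing early like this is one way to avoid silently reading or writing data in an unexpected encoding; whether to raise or merely log is a design choice.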

      In the MySQL privilege tables, some of the columns are binary columns, i.e. they have a binary collation. These will always be returned as strings even if use_unicode=True. Otherwise I think most of the metadata is non-binary.
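
Because values from binary columns arrive as byte strings even with use_unicode=True, calling code may want to normalise them itself. The sketch below uses a hypothetical helper `to_text`, and it assumes the bytes really are text in the connection's encoding, which is not guaranteed for arbitrary binary data.

```python
def to_text(value, encoding="utf8"):
    """Decode byte strings to unicode; pass every other value through.

    `encoding` is assumed to be a Python codec name matching the
    connection character set.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

Values that are already unicode, as well as numbers and None, are returned unchanged, so the helper can be applied to every field of a row.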

       
    • Nick Barnes

      Nick Barnes - 2008-05-13

      Thanks very much, Andy. To clarify a little, my requirement is to be able to get *TEXT, CHAR, and VARCHAR fields into and out of Bugzilla databases (stored in MySQL) as Python unicode objects. Your recommendations and information will be very helpful for this.

       
