
Opinions on unicode support?

2006-03-01
2012-09-19
  • Andy Dustman

    Andy Dustman - 2006-03-01

    I am considering making one last change to the way unicode is handled.

    Currently (1.2.x, especially 1.2.1) you can always use unicode strings as parameters, but CHAR/TEXT columns are by default returned as normal strings, unless you specify use_unicode=True (default: False) as a connection parameter. BINARY columns are always returned as strings (or at least they all will be in the upcoming 1.2.1). Additionally, the "official" way to change your character set has been to specify it in a configuration file which you read with read_default_file=PATH. Changing it just with SQL didn't work, because there is an internal variable that specifies the current character set.

    What I am considering:

    Adding a charset parameter that would set the connection character set via SET NAMES charset and automatically enable all non-BINARY CHAR/TEXT columns to be returned as unicode.

    This would imply use_unicode=True, but use_unicode=False could be used to override this and still return strings.

    If you are using a version of MySQL older than 4.1, using charset would raise NotSupportedError; however, if the default charset is the same as the one requested, no exception would be raised.
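
As a rough sketch of the rules proposed above (not MySQLdb's actual code), the decision could look like the following. The helper name plan_charset, its parameters, and the local NotSupportedError class are assumptions made for illustration; the real exception would come from the DB-API module. Written for Python 3.

```python
class NotSupportedError(Exception):
    """Local stand-in for the DB-API exception of the same name."""


def plan_charset(charset=None, use_unicode=None,
                 server_version=(5, 0), current_charset="latin1"):
    """Return (use_unicode, needs_change) under the proposed rules.

    Hypothetical helper: MySQLdb does not expose a function like this.
    """
    # An explicit use_unicode always wins; otherwise charset implies True.
    if use_unicode is None:
        use_unicode = charset is not None
    # Nothing to change if no charset was requested or it already matches.
    needs_change = charset is not None and charset != current_charset
    # Servers older than 4.1 cannot switch the connection character set.
    if needs_change and server_version < (4, 1):
        raise NotSupportedError("server is too old to set charset")
    return use_unicode, needs_change
```

Note that a pre-4.1 server only triggers the exception when a change is actually needed, which matches the "no exception if the default charset is the same" case.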

    With unicode becoming more ubiquitous in Python applications, in part due to more use of XML, which requires its use, and with it being more easily available in MySQL, it's just a matter of time before use_unicode should default to True. However, I'm not sure what that will break, so I'm inclined not to do it until 1.3.

    Let me know what you think about this as I plan to release 1.2.1 next week.

     
    • chimezie ogbuji

      chimezie ogbuji - 2006-03-01

      I agree with this change, mostly for all the reasons you stated (primarily, because unicode is becoming ubiquitous in Python applications). My only concern (not knowing much about this) is backward compatibility with MySQL itself. Otherwise, I think it's a sound policy decision given the nature of unicode usage in Python.

      Chimezie

       
      • Andy Dustman

        Andy Dustman - 2006-03-01

        The main backwards-compatibility issue with MySQL is that prior to 4.1, you could not change the character set used via the client: it had to be done in the server at start time. However, I can test both the client and server versions to determine whether or not support is there and raise an appropriate exception.

        Actually, reviewing the docs, there is an API call for setting the character set in 5.0, so I will use that and fall back to SET NAMES in 4.1.

        http://dev.mysql.com/doc/refman/5.0/en/mysql-set-character-set.html
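
The version test described in this reply might be sketched as below. The function names, the returned labels, and the assumption that availability of the API call depends on the client library version are all illustrative; the 5.0 and 4.1 thresholds are taken from the post as written. Written for Python 3.

```python
import re


def parse_version(text):
    """Turn a version string such as '4.1.18-standard' into (4, 1)."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("unrecognised version string: %r" % (text,))
    return int(match.group(1)), int(match.group(2))


def pick_charset_method(client_version, server_version):
    """Choose how to change the connection character set.

    Hypothetical helper; returns a label rather than performing the change.
    """
    client = parse_version(client_version)
    server = parse_version(server_version)
    # Before 4.1 the character set could only be set at server start.
    if client < (4, 1) or server < (4, 1):
        return "unsupported"
    # Assumption: the API call is available when the client library is 5.0+.
    if client >= (5, 0):
        return "mysql_set_character_set"
    # Otherwise fall back to issuing the SQL statement.
    return "SET NAMES"
```

A caller would raise the appropriate exception on "unsupported" and otherwise dispatch to the chosen mechanism.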

         
    • David Woods

      David Woods - 2006-03-01

      I'm in favor of anything that makes it simpler for my users, who generally are NOT technically literate. They're often challenged by having to create user names, and I gave up on trying to explain how to do a configuration file. So if you can provide a way I can get Unicode to work regardless of their server's settings, I'm for that.

      Maybe I'm in the minority, but Unicode support for embedded MySQL is a major problem, at least on Windows. I haven't found a formula that works. What you're considering is a fix to a minor annoyance. What I need is a fix to a major development roadblock. From my point of view, it's a far more important issue.

       
    • Brett Powley

      Brett Powley - 2006-03-01

      This change looks pretty reasonable; it's certainly a lot less error-prone to be able to specify in one place what encoding you want to deal with.

      A couple of thoughts:

      (1) Name of the encoding: Python uses "utf-8" whereas MySQL (incorrectly, perhaps) uses "utf8" as the name of one common encoding. It might be helpful for the API documentation to be absolutely clear on which encoding name to use.

      (2) Returning strings: I'm guessing that MySQL does the character set conversion from whatever encoding the column uses to the connection character set, and then MySQLdb decodes that from the connection character set into unicode, assuming that use_unicode=True. If use_unicode=False, then MySQLdb does no decoding, the strings will be returned in the connection character set and it will be up to the client program to decode them into unicode if necessary. Is this correct?

      Again, making sure the API documentation describes exactly what is happening might be helpful. Python encoding errors can be some of the most frustrating things to track down; I've found that understanding when encoding/decoding needs to occur is essential to using unicode successfully in Python.

       
      • Andy Dustman

        Andy Dustman - 2006-03-01

        1) Python seems to accept 'utf8' as an alias for 'utf-8' (and latin1 vs. latin-1):

        >>> u"\u3235\u3233".encode('utf8')
        '\xe3\x88\xb5\xe3\x88\xb3'
        >>> u"\u3235\u3233".encode('utf-8')
        '\xe3\x88\xb5\xe3\x88\xb3'
        >>> u"\u00e7".encode("latin1")
        '\xe7'
        >>> u"\u00e7".encode("latin-1")
        '\xe7'

        2) When use_unicode is True, MySQLdb uses .decode(charset), and when it is False, you just get the strings MySQL sends back, which are in whatever encoding happens to be in use, but can still be manually decoded. So, yes, you're correct.
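
To make both points concrete, here is a minimal sketch in Python 3, where bytes plays the role that the ordinary str plays in the session above. The helper names convert_text and same_codec are made up for illustration and are not part of MySQLdb.

```python
import codecs


def convert_text(raw, charset, use_unicode):
    """Mimic the described behaviour for one value received from the server.

    raw is the byte string as sent in the connection character set.
    """
    if use_unicode:
        # Decode from the connection character set into a unicode string.
        return raw.decode(charset)
    # No decoding: the caller gets the raw bytes and may decode them later.
    return raw


def same_codec(name_a, name_b):
    """Report whether two encoding names resolve to the same Python codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name
```

For example, same_codec("utf8", "utf-8") is true, so the MySQL-style name can be passed straight to .decode() in that case.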

         
    • Andy Dustman

      Andy Dustman - 2006-03-04

      I just figured out that I need to partially-undo a patch that prevents BINARY columns from being returned as unicode strings. This makes sense for BLOB-like columns (internally TINY_BLOB, MEDIUM_BLOB, LONG_BLOB, and BLOB) but NOT for other string-like columns (VAR_STRING, STRING). (TEXT columns are actually non-BINARY BLOB, which I guess makes them LOB...)

      Why it matters: In MySQL-4.1 and up, each column can have its own character set AND collation. As an example, in the mysql database (privilege tables), if you do SHOW CREATE TABLE user, you'll see:

      CREATE TABLE user (
      Host char(60) collate utf8_bin NOT NULL default '',
      User char(16) collate utf8_bin NOT NULL default '',
      ...

      utf8_bin is a binary collation, and this causes the API to set the BINARY flag. Thus in 1.2.1c4-6, these columns get returned as array('c') objects, which is an unpleasant result. I didn't notice this until now because my read-write unit tests compare for equality, and array('c','user') == 'user'.

      So in 1.2.1c7, these columns will be returned as unicode strings as they were before, if you set use_unicode=True or specified charset; otherwise they will be ordinary strings (str).

      I don't have time to fix this tonight, but hopefully I will by Saturday (March 4) night.
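
The intended 1.2.1c7 behaviour could be sketched roughly as follows. This is an illustration under stated assumptions, not the MySQLdb converter code: field types are represented by their names as plain strings, convert_column is a hypothetical helper, and bytes in Python 3 stands in for the ordinary str of the time.

```python
# Field type names as used in the post; plain strings stand in for constants.
BLOB_TYPES = {"TINY_BLOB", "MEDIUM_BLOB", "LONG_BLOB", "BLOB"}
STRING_TYPES = {"VAR_STRING", "STRING"}


def convert_column(raw, field_type, binary_flag, charset, use_unicode):
    """Decide what to return for one column value (hypothetical helper)."""
    if field_type in BLOB_TYPES and binary_flag:
        # True binary data (BLOB and friends): never decode.
        return raw
    if field_type in BLOB_TYPES or field_type in STRING_TYPES:
        # TEXT (non-BINARY BLOB) and CHAR/VARCHAR columns, including those
        # flagged BINARY only because of a collation such as utf8_bin.
        if use_unicode:
            return raw.decode(charset)
    # Ordinary string result when unicode was not requested.
    return raw
```

With this rule, a utf8_bin CHAR column such as User comes back as a unicode string when use_unicode is on, while a real BLOB stays undecoded.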

       
    • Andy Dustman

      Andy Dustman - 2006-03-03

      I implemented connect(..., charset=...) in 1.2.1c6. If charset is set, then it implies use_unicode=True (you can explicitly override this with use_unicode=False, though I'm not sure why you would). If the supplied charset is already the default character set on the connection, it doesn't try to change it. If it does need to change it and the server is older than 4.1, you get NotSupportedError.

      There's also a db.get_character_set_info() function exposed (MySQL-5.0 only) which returns possibly useful information (collation, size range of multibyte character sets, descriptive character set name).

      Barring any show-stopper bugs, this will be re-released as 1.2.1 on March 8.

       
