MySQLdb-2.0 roadmap and request for comments

2005-01-05
2012-09-19
  • Andy Dustman

    Andy Dustman - 2005-01-05

    Here is a preliminary roadmap for MySQLdb-2.0:

    • Full MySQL-4.1 support. Currently the only 4.1 feature not supported is the Prepared Statement API, which is a pretty big feature.

    • Parameter placeholder will need to change from format (%s) to qmark (?) to be compatible with prepared statements.

    • There may be a compatibility mode to allow format string placeholders to be used.

    • More prepared statement fallout: The conversion dictionary will probably go away or be curtailed.

    • Minimum MySQL version required will be 4.0, or possibly 4.1. This mostly depends on what the MySQL folks do. Currently both are General Availability, with 4.1 preferred. 3.23 is considered an "older release". If 4.0 is shifted to being an older release, then 4.1 might be the minimum.

    • MySQL-5.0 and 5.1 will probably be supported. The 5.x series does not seem to introduce or otherwise change anything in the C API, so it shouldn't be an issue.

    • Python-2.3 will most likely be the minimum version, but it's possible that 2.4 may be required.

    • There's not a lot of functionality to add at this point.

    • API changes should be minimal.

    • Unit testing (if I have time)

    One thing I considered doing (and haven't completely ruled out) is writing 2.0 using ctypes:

    http://sourceforge.net/projects/ctypes

    This would eliminate the entire _mysql module, and everything would be pure Python. I even went so far as to write a minimal version using ctypes (not DB-API, more like how _mysql is) and got it to return some results.

    However, I am leaning towards doing the Prepared Statement stuff in C for now, because it doesn't look like it'll be that hard to implement.
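    For reference, the ctypes approach wraps an existing shared library with no C compilation step at all. A minimal sketch of the pattern, using libc's strlen so it runs without MySQL installed (loading libmysqlclient would work the same way, just with a different library name):

```python
import ctypes
import ctypes.util

# Locate and load the C runtime; a MySQL wrapper would instead load
# something like ctypes.util.find_library("mysqlclient").
libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6")

# Declare the C signature so ctypes converts arguments correctly
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello"))  # 5
```

    The appeal for a 2.0 rewrite is that every C-level call becomes plain Python like the above, at the cost of declaring each function's signature by hand.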

    Add any additional requests or comments here.

     
    • Andy Dustman

      Andy Dustman - 2005-01-22

      Here's a summary of MySQL-4.1's character set support:

      http://dev.mysql.com/doc/mysql/en/Charset.html

      In a nutshell, there are character set settings for everything: connection, server, database, table, column. The database defaults to the server default, the table defaults to the database default, and the column defaults to the table default.

      You can change the connection character set with SQL or in an option file, unlike 4.0 and earlier, where the character set is determined at compile time for the server.

      Additionally, there are "introducers" for string literals, which serve to indicate what encoding the following literal is in:

      http://dev.mysql.com/doc/mysql/en/Charset-literal.html

      OK, now the Python side of things: unicode objects. This is an area that I am far from an expert on, but have been trying to read up on lately.

      The unicode object has a signature like this:

      unicode(string [, encoding[, errors]]) -> object

      There is a system default encoding (sys.getdefaultencoding()), and errors can be used for error control; type help(unicode) in the interpreter for more info, and also read the Python Library Reference:

      http://docs.python.org/lib/typesseq.html
      http://docs.python.org/lib/module-codecs.html
      http://www.python.org/peps/pep-0100.html

      The codecs are what are used (by name) for the encoding.

      Here's the most important thing to realize about the unicode object: The encoding that you specify is the encoding of the input string; it is NOT the encoding of the object itself. The unicode object has a special internal representation, which you don't need to be concerned with, but as I understand it, it's UTF-32, i.e. all characters are really 4 bytes.
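      In other words, the encoding argument describes the bytes you pass in, not the object you get back. A quick illustration (in modern Python terms, where str plays the role of the old unicode type and bytes the old byte string):

```python
data = b"caf\xe9"              # 0xE9 is 'é' in latin-1

# The declared encoding describes the INPUT bytes; the resulting
# str object carries no encoding of its own.
text = data.decode("latin-1")
print(text)                    # café

# The same bytes are not valid UTF-8, so the declaration matters:
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("not valid utf-8")
```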

      http://www.unicode.org/standard/WhatIsUnicode.html
      http://www.unicode.org/faq/utf_bom.html

      Furthermore, nearly all of the encodings that are available are NOT unicode encodings. The only true unicode encodings are the various UTF encodings. UTF-8 is the most space efficient, and can use as few as one byte per character; good old ASCII is essentially a subset of UTF-8.

      UTF was designed to be a replacement for all the other various encodings in use, from Latin character sets (iso-8859-*) to Asian languages (iso-2022-jp).

      http://lfw.org/text/jp.html

      The above page is from 1996, which is pre-unicode, I think; there's no mention of unicode at all.

      OK, some examples are in order. I pulled down a Japanese web page, which was encoded in Shift-JIS. You can use codecs.open to open a file to read/write unicode data:

      Python 2.4 (#1, Jan 9 2005, 14:26:36)
      [GCC 3.3.4 20040623 (Gentoo Linux 3.3.4-r1, ssp-3.3.2-2, pie-8.7.6)] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import codecs
      >>> page=codecs.open('03_lu03.htm',encoding='shift-jis').read()
      >>>

      Let's see what happens when I try to print the sucker:

      >>> print page[:100]
      Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      UnicodeEncodeError: 'ascii' codec can't encode characters in position 19-23: ordinal not in range(128)
      >>>

      Oops, you can't get there from here. However, I can convert it into another Japanese encoding:

      >>> page2=page.encode('iso-2022-jp')
      Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      UnicodeEncodeError: 'iso2022_jp' codec can't encode character u'\uff65' in position 3068: illegal multibyte sequence

      Or can I? Apparently there are sometimes problems translating between these two encodings. However, I can definitely turn it into UTF-8:

      >>> page2=page.encode('utf-8')
      >>> type(page2)
      <type 'str'>
      >>>

      Note that the resulting object is a string, not a unicode string.

      Alright, what's the point? The default character set for MySQL is latin1. This limits what you can push across or store. Now if everything you send is latin1, no problem. If you need to send anything else, you need to change the connection character set or use an introducer.

      I don't believe MySQLdb can create the introducers, because unicode objects themselves are independent of any encoding. You can put the introducers in your SQL, but MySQLdb can't automatically encode your unicode object into arbitrary encodings. It can (and does) use the connection encoding. If you need to use something else, you will probably have to call .encode(charset) on your unicode objects before passing them to .execute(), AND put an introducer in your SQL.

      If you need to use anything other than latin1, you should probably set your connection character set to utf8. That will let you pass anything. Note that the mysql database itself (with grant tables and such) uses the utf8 encoding and NOT latin1.

      One issue I haven't figured out is how to determine the character set of the result set. You can use introducers, but otherwise it's determined by the connection character set. If you use an introducer, you'll have to encode the result yourself.

      If you have any suggestions -- particularly if you know what you're doing with unicode -- this is a good place for them.

       
    • Brett Powley

      Brett Powley - 2006-03-01

      I've struggled a bit with unicode support in Python for the last month or two, so I hope I can provide some insights here :-)

      I've found that the key to using unicode successfully in Python is understanding when encoding/decoding is required. Python will (helpfully) crash your program with an encode or decode error if you get this wrong, and it can be frustrating to work it out.

      Explicit encoding/decoding

      Generally, encoding/decoding is only required on I/O (to a file, socket, database, etc). Within your Python code, Python uses 'unicode' (probably UTF-16) or ascii, and happily deals with all the messy stuff (character offsets, etc) so that you don't have to.

      When you read something from a file or socket or whatever (using open(), read()), it will come in as a "byte string", in exactly the same encoding that the file uses. It's then up to you to decode that from whatever the file uses into unicode by telling Python what it is:

      ustr = bstr.decode( 'iso-8859-1', 'replace' )

      The second parameter tells Python what to do if it can't convert a character. 'ignore' will skip it; 'replace' will substitute the Unicode replacement character (U+FFFD) when decoding, or a question mark when encoding; the default is to crash your program with a UnicodeDecodeError.

      Similarly, when you write to a file, it's up to you to encode it into something that whoever is going to read the file will understand:

      bstr = ustr.encode( 'utf-8' )
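      Both directions, with the error-handler argument, can be sketched in a few lines (modern-Python spelling, where bytes is the byte string type):

```python
raw = b"na\xefve"                  # 0xEF is 'ï' in iso-8859-1

# Decoding with a wrong declared encoding, handled three ways:
print(raw.decode("ascii", "replace"))  # U+FFFD replaces the bad byte
print(raw.decode("ascii", "ignore"))   # nave -- bad byte skipped
print(raw.decode("iso-8859-1"))        # naïve -- correct declaration

# Encoding: 'replace' substitutes '?' for unencodable characters
print("naïve".encode("ascii", "replace"))  # b'na?ve'
```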

      Implicit encoding/decoding

      Python also provides some methods in the codecs module that will do this for you:

      file = codecs.open( 'foobar.txt', encoding='iso-8859-1' )

      Note that this doesn't do anything magical; it simply calls the encode() or decode() function for you whenever you do a read() or write(). If you don't know the encoding of the file until after you've opened it and read a header or something from it, then using codecs.open() is probably a bad idea.

      Note that if you use codecs.open() and then Python can't encode or decode a character into the encoding you've specified, it will crash your program.
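      A self-contained sketch of the round trip (the filename here is just an example): raw iso-8859-1 bytes are written out, then codecs.open performs the decode step on read:

```python
import codecs
import os
import tempfile

# Write raw iso-8859-1 bytes to a scratch file
path = os.path.join(tempfile.mkdtemp(), "foobar.txt")
with open(path, "wb") as f:
    f.write(b"na\xefve\n")      # 'naïve' encoded as iso-8859-1

# codecs.open decodes transparently on every read()
f = codecs.open(path, encoding="iso-8859-1")
text = f.read()
f.close()
print(repr(text))               # 'naïve\n' -- already decoded
```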

      Output to the terminal

      When you write to a terminal (using print), Python tries to determine what the encoding of the terminal is, and then to encode the string for that terminal. Unfortunately, Python doesn't always seem to do a good job at determining what your terminal can do. This is why you saw this error above:

      >>> print page[:100] 
      Traceback (most recent call last): 
      File "<stdin>", line 1, in ? 
      UnicodeEncodeError: 'ascii' codec can't encode characters in position 19-23: ordinal not in range(128) 
      >>> 

      Python has decided that your terminal supports 'ascii', and tried to do the conversion when you've done the print. There is one way that I've found to tell Python what your terminal can do; this is by setting an environment variable when you run your program:

      LANG=en_US.UTF-8 python foobar.py

      For this reason, I've found that 'print' is generally not a good way of debugging Python programs that use unicode.

      MySQL encoding

      MySQL 4.1 allows you to specify the encoding of individual columns. You can also specify the encoding used for the connection to MySQL.

      I am not certain about this, but I am guessing that MySQL converts from the column encoding to the connection encoding whenever you read something from the database, and from the connection encoding to the column encoding whenever you write something to it.

      What this means is that when you get a string back from MySQL, it ought to be in the connection encoding. It's then up to you to decode this into unicode using the decode() method, exactly the same as if you'd read this from a file (see above).

      If you don't know what encoding your database columns use, or if they're mixed, then it would seem that the best solution is to use utf-8 as the connection encoding. It's not going to break anything, whereas using any other encoding may lose characters going to/from the database.

       
      • Andy Dustman

        Andy Dustman - 2006-03-01

        MySQL-4.1 and newer definitely gives you a lot of rope to hang yourself with, as you can set the character set of:

        • the connection
        • default for databases
        • default within a table
        • set for individual columns
        • prepend "hints" to literals

        and you may do all this with multiple character sets simultaneously. But yes, whatever MySQL is returning to you should be in the connection's character set, and I agree utf-8 ought to be the way to go in theory. However I wonder if utf-16 (or ucs2 or ucs4) may end up giving better performance when there are a lot of multi-byte characters involved, since there would be less encoding and decoding going on, and utf-8 encoding is relatively complex.

        Also note that it seems that column sizes (i.e. VARCHAR(255)) are always expressed in bytes, which makes it kind of hard when you are using a variable-length encoding like utf-8.
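        The character-count versus byte-count mismatch is easy to demonstrate:

```python
s = "naïve"                      # 5 characters
print(len(s))                    # 5 -- character count
print(len(s.encode("utf-8")))    # 6 -- 'ï' takes two bytes in utf-8
print(len(s.encode("latin-1")))  # 5 -- one byte per character
```

        So a column sized in bytes can hold fewer characters than its declared length suggests once multi-byte characters are involved.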

         
    • Andy Dustman

      Andy Dustman - 2006-03-01

      So, where are we a year later? The roadmap applies to 2.0, which I am not significantly closer to, but here are some additional comments.

      • Still no closer to prepared statement support.

      • Placeholders: Most likely you'll be able to set what style to use, and it'll adjust. Going from %s to ? is easy; the reverse direction is harder.
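      To illustrate the asymmetry, here is a deliberately naive converter in both directions. This is not MySQLdb code: a real converter would need a tokenizer to skip markers inside quoted string literals, and the reverse direction additionally has to double every literal percent sign so it survives later %-formatting:

```python
import re

def format_to_qmark(sql):
    # format style (%s) -> qmark style (?): a straight substitution
    return re.sub(r"%s", "?", sql)

def qmark_to_format(sql):
    # qmark (?) -> format (%s): existing '%' must be doubled first,
    # or it would later be misread as a format marker
    return sql.replace("%", "%%").replace("?", "%s")

print(format_to_qmark("INSERT INTO test VALUES (%s, %s)"))
# INSERT INTO test VALUES (?, ?)
print(qmark_to_format("UPDATE t SET a = ? WHERE b LIKE 'x%'"))
# UPDATE t SET a = %s WHERE b LIKE 'x%%'
```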

      • There are a couple small 5.0 API extensions that will go into 1.2.1. I looked at 5.1 a bit and I don't think there are any major additions. Since 5.1 is already beta, I will probably start testing with it immediately after 1.2.1.

      • Python-2.3 or newer will be required for 1.2.1. 2.2 was the only other old version supported for 1.2.0 and there's just too much good stuff in 2.3 like datetime and sets to justify supporting 2.2.

      • .callproc() (stored procedures) should work in 1.2.1.

      • I've written some unit tests, but haven't committed all of them yet. They'll be in 1.2.1.

      • ctypes will be in the Python-2.5 core library. However I think there are still some issues that will prevent its use instead of C.

       
    • Brett Powley

      Brett Powley - 2006-03-01

      Performance will probably depend upon whether a conversion is necessary, and this of course depends on whether the connection and column encodings are different.
      Using UTF-16 for the connection would only give better performance if the column were stored in UTF-16. If the database column encoding is UTF-8 (or anything other than UTF-16) then it is still going to have to do a conversion.
      So, choosing the 'best' encoding for the connection might mean guessing which is going to be the most common encoding used for database columns. I'm guessing that would be UTF-8. As a bonus, if Latin-1 is widely used, then conversion of that to UTF-8 is trivial too, since Latin-1 maps one-to-one onto the first 256 Unicode code points (only the ASCII half is byte-identical in UTF-8, though).
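      Why the Latin-1 case is cheap: each of its 256 byte values is simply the Unicode code point of the same number, so decoding can never fail and transcoding to UTF-8 needs no lookup table (though the upper half does grow to two bytes each):

```python
# Every latin-1 byte value decodes to the code point of the same number
all_bytes = bytes(range(256))
text = all_bytes.decode("latin-1")
print([ord(c) for c in text] == list(range(256)))  # True

# Re-encoding to utf-8 grows only the upper half (0x80-0xFF):
print(len(text.encode("utf-8")))  # 384 = 128*1 + 128*2 bytes
```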

       
    • TomLesters

      TomLesters - 2006-03-13

      Someone reported a "MySQL server has gone away" problem with MySQL 5, and there is a new MYSQL_OPT_RECONNECT parameter for mysql_options() in the MySQL C API:

      http://mail.mems-exchange.org/durusmail/quixote-users/5367/

      From http://dev.mysql.com/doc/refman/5.0/en/upgrading-from-4-1.html :

      "... The reconnect flag in the MYSQL structure is set to 0 by mysql_real_connect(). Only those client programs which did not explicitly set this flag to 0 or 1 after mysql_real_connect() experience a change. Having automatic reconnection enabled by default was considered too dangerous (due to the fact that table locks, temporary tables, user variables, and session variables are lost after reconnection)."

       
    • Mario Frasca

      Mario Frasca - 2006-07-24

      in the project ibo (http://sourceforge.net/projects/ibo) I'm using (mainly) MySQLdb, pgdb, cx_Oracle and I'm trying to hide the differences between the modules and the database engines... one point of difficulty is the treatment of datetime objects. MySQLdb was the first to offer a distinction between datetime.date and datetime.datetime... but what is the deal about datetime.timedelta?

      >>> import MySQLdb
      >>> import datetime
      >>> db = MySQLdb.connect(...)
      >>> cr = db.cursor()
      >>> cr.execute('create table test (f time)')
      0L
      >>> cr.execute('insert into test values (%s)', datetime.timedelta(1,2,3))
      1L
      >>> cr.execute('select * from test')
      1L
      >>> cr.fetchone()
      (datetime.timedelta(0, 1),)
      >>>

      am I using the wrong data type in the database? which one should it be?

      the result I would like to have is like this...

      Python 2.4.4c0 (#2, Jun 14 2006, 23:22:07)
      [GCC 4.1.2 20060613 (prerelease) (Debian 4.1.1-4)] on linux2
      Type "help", "copyright", "credits" or "license" for more information.
      >>> import pgdb, datetime
      >>> db = pgdb.connect(...)
      >>> cr = db.cursor()
      >>> cr.execute('create table test (f interval)')
      >>> cr.execute('insert into test values (%(f)s)', {'f':datetime.timedelta(1,2,3)})
      >>> cr.execute('select * from test')
      >>> cr.fetchone()
      [datetime.timedelta(1, 2, 3)]
      >>> pgdb.version
      '3.8'
      >>> pgdb.revision
      '3.8.2.5'
      >>>

      (I did have to review pgdb to achieve this...)
      thanks in advance,
      Mario Frasca

       
      • Andy Dustman

        Andy Dustman - 2006-07-24

        The issue is, MySQL's TIME columns are really timedeltas, not time-of-day.

        http://dev.mysql.com/doc/refman/5.0/en/time.html

        If you are writing a portability layer, you can override this behavior by modifying the default conversion dictionary, if that will make your life easier.

         
        • Mario Frasca

          Mario Frasca - 2006-07-24

          yes, timedelta, not time-of-day, that is exactly what I wanted, so I'm not doing anything too wrong... have a closer look at the log: I am trying to write a datetime.timedelta into the database and I get this:

          >>> cr.execute('insert into test values (%s)', datetime.timedelta(1,2,3))
          1L
          >>> cr.execute('select * from test')
          1L
          >>> cr.fetchone()
          (datetime.timedelta(0, 1),)
          >>>

          the whole situation is in the previous message.

           
          • Andy Dustman

            Andy Dustman - 2006-07-24

            By the looks of things, timedeltas larger than a day don't work right. A special converter function is needed to get it into the format MySQL expects.

             
          • Andy Dustman

            Andy Dustman - 2006-07-24

            That looks wrong. Put in a bug, please.

             
            • Mario Frasca

              Mario Frasca - 2006-07-24

              sorry, we wrote at the same time... will post a bug and the possible patch...

               
          • Mario Frasca

            Mario Frasca - 2006-07-24

            I have redefined format_TIME in times.py and it works for me... (with the limitations about second fractions)

            def format_TIME(v):
                seconds = int(v.seconds) % 60
                minutes = int(v.seconds / 60) % 60
                hours = int(v.seconds / 3600) % 24
                return '%d %d:%d:%d' % (v.days, hours, minutes, seconds)
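            The formatter can be exercised without a database; this sketch (reproducing the function so it is self-contained) assumes MySQL's 'D HH:MM:SS' TIME literal form, which accepts a leading day count, and still drops fractional seconds:

```python
import datetime

# Mario's replacement formatter, reproduced so the sketch runs alone;
# fractional seconds (timedelta.microseconds) are still dropped.
def format_TIME(v):
    seconds = int(v.seconds) % 60
    minutes = int(v.seconds / 60) % 60
    hours = int(v.seconds / 3600) % 24
    return '%d %d:%d:%d' % (v.days, hours, minutes, seconds)

print(format_TIME(datetime.timedelta(1, 2)))     # 1 0:0:2
print(format_TIME(datetime.timedelta(2, 3723)))  # 2 1:2:3
```

            The leading day count is what preserves timedeltas larger than 24 hours, which the original formatter lost.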
            
             
