Menu

Database collation issues

Help
Greg Roach
2009-02-01
2013-05-30
  • Greg Roach

    Greg Roach - 2009-02-01

    Since there have been several threads concerning this, I think some background information will help.

    A "collation" sequence is a set of rules that determine how to sort letters.  If you are english, then the rules are pretty simple.  Ignore accents and sort A-Z.

    But for other languages, there are more complicated rules.  For example, Ä can mean the same as "AA", the same as "AE" or a separate letter after Z.

    When you ask MySQL to compare two strings, it needs to convert them both to the same collation.  After all, a scandivian Ä sorts differently to a german Ä.

    By default, PGV does not specify a collation.  It just uses the database default and assumes it hasn't changed.  Thus when we create new tables (as we did for PGV4.2), PGV relies on the fact that the tables will be created with the same collation as the earlier tables.

    It doesn't really matter which collation sequence you use - as long as they are all the same.  If they are different, then one of two things will happen.  Either a lot of extra processing will be required to convert them each time you query the database, or the database will encounter a situation where it can't work out which one takes priority, and which one needs converting.

    If your collations are mis-matched, then something/someone must have changed the database default or changed the existing tables.

    To fix this, simply convert to a consistent form.  I would strongly recommend utf8_unicode_ci.  Note that you should only change from one utf8 collation to another.  You should *NOT* convert from latin to utf8.  The reason for this is a little complicated....

    By default, MySQL accepts queries and returns results in ASCII (latin collation).  However, PGV sends data in UTF8.  MySQL will convert this data from ASCII->UTF8, even though the data is already UTF8!  This can corrupt accented and non-latin characters.  You'll see this if you browse your database using phpMyAdmin.  The good news is that the reverse translation happens when you fetch data, unscrambling it.  If you try to convert the database collation while it contains data, then you'll corrupt your data, as it won't receive the symmetrical encoding/decoding.  Your only "safe" solution is to export all your data, convert all your tables, then import all your data.  I have written an impor/export utility to help with this.  It is in the Patches tracker.

    This double-encoding causes lots of problems.  It affects performance and prevents searching on accented letters.

    As some of you have noticed, PGV4.2 has a new configuration option in the site-config settings - USE_DB_COLLATION.  When selected, PGV will do a number of things.  It will specify UTF8 for communicating with the database (saving processing), it will create tables explicitly as utf8_unicode_ci (overriding any defaults), and will use the collation for searching (allowing matching of accented letters).

    But you can't simply enable this option on an existing system.  As the help text explains, you can only select it if you do so when setting up a new system.  Otherwise you will get the mis-matched conversions described above.  If you want to use it (you should), then you can use the export/import process described above.

    However, this change requires a bit of technical knowledge.  For this reason, this option is disabled by default.

     
    • Wes Groleau

      Wes Groleau - 2009-02-02

      Questions and comments:

      1. Does the script verify admin login?  If not, we'll want to delete it as soon as we're done.

      2. Instead of deleting, I merely zipped it (making it no longer a PHP script and having a different filename).  However, I forgot to sync this change to my webserver.  Consequently, when I clicked the back button, I re-imported all the tables.  Did that break anything?

      3. I know this is a for techies/own risk utility, but if it ever migrates to being a regular part of the admin tools, it probably will need internal checks to prevent stunts like I just pulled.

      4. I am concerned that some how UTF-8 is not the charset of my DB.  The MySQLAdmin defaults to Latin 1 for its UI, and even though I explicitly change to UTF-8, I see weird chars in some cells.  I specified UTF-8 when creating, but wasn't happy with the feedback.  So I explicitly told it to set each table to UTF-8.  Could that be added to this script?  Not the collation, which is already there, but the encoding.

       
      • Greg Roach

        Greg Roach - 2009-02-02

        <<1. Does the script verify admin login? If not, we'll want to delete it as soon as we're done.>>

        Yes

        <<2. Instead of deleting, I merely zipped it (making it no longer a PHP script and having a different filename). However, I forgot to sync this change to my webserver. Consequently, when I clicked the back button, I re-imported all the tables. Did that break anything?>>

        Re-importing is just like restoring from a backup.  If you'd made any edits between exporting and importing, these would have been reverting.  Otherwise no harm done.

        <<4. I am concerned that some how UTF-8 is not the charset of my DB.>>

        To be effective, you need to set UTF8 end-to-end.  You must use it for the querying, the storage, the results, etc.

        Historically, PGV never specified anything.  I guess the original authors didn't consider it.  The querying/results default to latin.  The tables/columns often default to utf8.  It's a bit of a mess - hence I'm trying to clear it up.  The changes I've made in 4.2 do this *FOR NEW INSTALLATIONS ONLY*.  It is not possible to detect what settings/encodings have already been used, so any existing database must be manually changed.

        Note also that PGV gives you the choice of UTF8 or ASCII.  (IMHO, we should never have offered this choice).  This means we have to allow for people who want everything stored/processed in ASCII.

         
    • Gerry Kroll

      Gerry Kroll - 2009-02-02

      Greg:
      We should get rid of the ability to store everything in ASCII.  I don't think it's nececessary to attempt to conserve disk space by using the more compact (but less useful) ASCII character set.

       
      • Greg Roach

        Greg Roach - 2009-02-02

        Since the conversion from iso8859 to utf8 is done on gedcom import, we'll have to combine this with the next round of database changes.  I can only see one problem....

        There are many variants of "extended ascii" - iso8859-1 to iso8859-15.

        Since someone may have used a particular variant to encode their characters (e.g. iso8859-6 for arabic or iso8859-7 for greek), then we'd need to specify the right one when converting to utf8.

        I've done something similar in another project.  That program picked some examples of extended-ascii text, and tried converting to utf8 using each of the available encodings.  It then asked the user to pick the one that "looked right".

        Should we also keep the "convert to ASCII" option on download?  If so, then for completeness we ought to allow the user to convert to any of the encodings supported by PHP.

        I'm not sure how many users this might apply to.  It could be a lot of work for potentially zero users ;-)

         
    • Anton Largiader

      Anton Largiader - 2009-02-02

      The only 'strange' characters I use are:

      è - ASCII 138
      û - ASCII 150

      There may be a few others eventually, but most likely will be western European. I type these using alt-138 or whatever, but my Swiss relatives have them on the keyboard.

      Would my DB performance be better, or would I be less likely to run into future problems, if I reinstalled with collation other than latin1-swedish-ci ?  That is what I got by default.

       
      • Greg Roach

        Greg Roach - 2009-02-02

        <<Would my DB performance be better, or would I be less likely to run into future problems, if I reinstalled with collation other than latin1-swedish-ci ? That is what I got by default.>>

        It depends on what your other defaults are.

        You've probably got a "communication" character set of latin1-swedish-ci as well.

        PGV does everything in UTF8.  For è, it would send a sequence of two bytes (one utf8 character).  But if MySQL is set to communicate in latin, it will treat these as two separate extended ascii characters.

        Now, if you just convert your tables to utf8, MySQL will convert these two "extended ascii" characters to utf8 - giving 4 bytes in total.

        You can get round this by converting your columns from latin to binary and then from binary to utf8.

        Now, if you know what encodings you have used at each step, you can do an on-line conversion.  But, for most people, the simplest option is to export all their data, convert all their tables, then import all their data (as per the patch in the tracker).

        There are too many things that can go wrong with any attempt to automate or simplify this.  I wouldn't recommend this for the average user, because it is hard to diagnose/fix remotely, and we have the potential for lots of people to lose their data.

        So, as per the instructions on the config page - only select the "use DB collation" option if you are:

        a) doing a brand new installation
        b) know how to convert your existing data

        The default option is to use the defaults.  It may not give optimal performance, but if you don't fiddle, everything should just work.

         

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.