From: Ann H. <aha...@nu...> - 2012-02-19 20:49:20
|
On Sat, Jan 21, 2012 at 12:52 PM, Mariusz Nogala <mn...@am...> wrote: Quite possibly I'm responding to a month-old message that's been answered elsewhere, but it appeared for me on Firebird-tests and belongs to Firebird-support, so I'm responding in two places. The message was a response to a bug report (CORE-3738) about the collation of some Polish characters under the default UTF-8 collation. > > I have tried the solution with CREATE COLLATION, > but it does not solve my problem. You may not have followed all the steps in creating and using a collation. > (moreover the attribute LOCALE does not work with my > Firebird 2.5.1. I don't know why). Some databases intuit a lot of information from the LOCALE of the server, including character set and collation for string data, date format, whether fractions are separated from integer parts of numbers with a "," or a "." and inversely whether hundreds are separated by a ".' or a ",". Firebird does not. And that's a good thing, in my opinion. When server determines all those things from its locale, writing multi-lingual applications becomes a nightmare. My career in databases spans the old days of "ASCII is good enough for me, it should be good enough for you" and "If God had meant upper and lower case to sort near each other, He wouldn't have put them so far apart in the ASCII code set" through "Why don't Americans care about rational sorting of upper and lower case?" and "My character set is better than your character set" to the relative peace of Unicode and UTF-8. Then from time to time there's a collation problem. We all have cultural biases - people who live at the edges of several cultures are more aware of them than others. A bias that's hard to recognize is "When I learned my alphabet (at the age of 4 or 5), I learned the one true order of characters in the world. My kindergarten teacher would never lie to me." Alas, no. Nearly every country orders words differently - non-phonetic character sets being the most confusing to those of us who learned them late in life. But even in phonetic languages with Latin character sets, there are huge differences in the "right" sorting of characters. In some Iberian langauges, "ll" (double L) sorts after "y" and nowhere near a single lower case L. And, in some cases, the same character ("a" with a circle over it in Nordic languages for one) sorts in different places in different languages. > > But I am not convinced. I believe you. You want a classic Polish collation. That requires a table - possibly quite complex - that matches each character's binary representation with a weighted version that sorts naturally in the way you expect. The weighted versions are often larger than the binary representation of the character - often two bytes, sometimes three for multi-level collations. The Unicode Standard Website probably has one - start at Unicode.org. Then build that table into a collation object that Firebird can use, then introduce it through the collations table, then use that collation name as the default for the database, or for each column that you want to have ordered that way. > I steel think that something is wrong with collate UNICODE. Unfortunately, the correct collation for Unicode is whatever the Unicode standard organization says it is, however illogical it may seem and however much it may violate your kindergarten teacher's rules. > The point is that it sorts single character texts properly and > longer texts IN A DIFFERENT WAY. The sorting order should > mainly depend on the first character in a column. Sometimes it depends on the first two. Sometimes it depends on the last accented character in the string (French is accent sensitive, but considers accents from the back forward.) > If accented characters follow the non-accented in single character texts, > so why it does not happen when sorting longer texts? > > See the example: > > ORDER BY gives: > a, ą, ab (here 'ą' follows 'a', but not 'ab' - it makes no sense...) > > I am sure it should be: > a, ab, ą (because UNICODE collation is accent sensitive) > That is the effect of a secondary ordering. In the primary ordering a and ą are equal, so the next letters are considered. Once the whole string has been compared and the two are still equal, the secondary characteristics are considered. In the secondary ordering a comes before ą. To extend your example a, ą, ab, ąb, abc, ac, ąc, The correct Polish collation may consider a and ą to be different in primary ordering, but the ordering in the Unicode collation does not. Good luck, Ann |