From: Balassa M. <fu...@ma...> - 2003-08-23 12:43:55
Try this on a database with character set WIN1250:

    SELECT 1 FROM rdb$database WHERE UPPER('abc') = UPPER('AbC')

The above query returns 1 as expected. Now, try it with strings containing
non-us characters (á, é, etc). The result is null.

BALASSA Marton
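A minimal sketch of why the comparison can fail, assuming (as later messages in this thread state) that the default UPPER only touches ASCII letters. The helper `ascii_upper` is hypothetical and is not Firebird code; the byte values 0xE1 and 0xC1 are the WIN1250 encodings of 'á' and 'Á'.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: uppercases only the ASCII range 'a'..'z'.
// Every byte >= 0x80 (e.g. WIN1250 accented letters) is left untouched.
std::string ascii_upper(std::string s) {
    for (char& c : s) {
        if (c >= 'a' && c <= 'z') {
            c = static_cast<char>(c - 'a' + 'A');
        }
    }
    return s;
}

// Mirrors the two queries from the report.
void demo_ascii_upper() {
    // Pure ASCII: both sides become "ABC", so the WHERE clause matches.
    assert(ascii_upper("abc") == ascii_upper("AbC"));

    // WIN1250: 'á' is 0xE1 and 'Á' is 0xC1. Neither byte is changed,
    // so the two "uppercased" strings still differ and no row is returned.
    const std::string lower = "\xE1" "bc";
    const std::string upper = "\xC1" "bc";
    assert(ascii_upper(lower) != ascii_upper(upper));
}
```

Under that assumption the accented bytes survive unchanged on both sides, which would explain the empty result.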
From: Dmitry Y. <di...@us...> - 2003-08-23 14:33:30
Balassa,

> Try this on a database with character set WIN1250:
>
> SELECT 1 FROM rdb$database WHERE UPPER('abc') = UPPER('AbC')
>
> The above query returns 1 as expected. Now, try it with strings containing
> non-us characters (á, é, etc). The result is null.

Use appropriate collation explicitly, e.g.:

    SELECT 1 FROM RDB$DATABASE
    WHERE UPPER(<value1> COLLATE PXW_HUN) = UPPER(<value2> COLLATE PXW_HUN)

(for Hungarian language)

Dmitry
From: Dimitry S. <SD...@to...> - 2003-08-25 06:41:44
On 23 Aug 2003 at 17:01, Peter Jacobi wrote:

>At some pre-historic time in Interbase development, the decision was
>taken to let the default collation of every (?) character only
>uppercase ASCII letters.

Perhaps the developers didn't know any language but English...

>This is currently seen as unfortunate by many, and Nickolay had used
>his commit access already to hack the fix for cp1251 into current
>builds.

And I don't know any russian developer who would blame him for this
action.

SY, Dimitry Sibiryakov.
From: Dmitry Y. <di...@us...> - 2003-08-26 15:50:41
Dimitry,

> >This is currently seen as unfortunate by many, and Nickolay had used
> >his commit access already to hack the fix for cp1251 into current
> >builds.
>
> And I don't know any russian developer who would blame him for this
> action.

If we'd have different collations for Russian, Ukrainian etc, then I'd call
this change very questionable. E.g. which collation should be considered a
default one for WIN1250? Hungarian? Slovakian? I think that's why UPPER uses
binary collation by default.

Dmitry
From: Nickolay S. <sk...@bs...> - 2003-08-26 16:23:05
Hello, Dmitry !

>> >This is currently seen as unfortunate by many, and Nickolay had used
>> >his commit access already to hack the fix for cp1251 into current
>> >builds.
>>
>> And I don't know any russian developer who would blame him for this
>> action.

> If we'd have different collations for Russian, Ukrainian etc, then I'd call
> this change very questionable. E.g. which collation should be considered a
> default one for WIN1250? Hungarian? Slovakian? I think that's why UPPER uses
> binary collation by default.

AFAIU, uppercasing is locale-insensitive operation. Case (none, uppercase,
lowercase or titlecase) is the property of character and may be changed
appropriately. Files UnicodeData.txt, SpecialCasing.txt and CaseFolding.txt
from Unicode standard clearly specify case conversion rules. This means that
default case conversion may be safely added to all Firebird character sets
as they are all have direct UNICODE mappings.

> Dmitry

--
Nickolay Samofatov
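The idea in the message above (map each byte to its Unicode code point, apply the default case mapping, map back) can be sketched roughly as follows. This is only an illustration under stated assumptions: the two tables are tiny hand-picked excerpts (four WIN1250 bytes, two simple uppercase mappings of the kind listed in UnicodeData.txt), and the names `kWin1250ToUnicode`, `kSimpleUpper`, `upper_byte` and `upper_string` are hypothetical, not Firebird identifiers.

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative excerpt of a WIN1250 -> Unicode mapping (NOT the full table).
static const std::map<unsigned char, char32_t> kWin1250ToUnicode = {
    {0xC1, 0x00C1},  // Á
    {0xE1, 0x00E1},  // á
    {0xC9, 0x00C9},  // É
    {0xE9, 0x00E9},  // é
};

// Illustrative excerpt of simple uppercase mappings (lower -> upper).
static const std::map<char32_t, char32_t> kSimpleUpper = {
    {0x00E1, 0x00C1},  // á -> Á
    {0x00E9, 0x00C9},  // é -> É
};

// Uppercase one charset byte by going through Unicode and back.
unsigned char upper_byte(unsigned char b) {
    if (b < 0x80) {  // ASCII range is identical in both encodings
        return (b >= 'a' && b <= 'z') ? static_cast<unsigned char>(b - 32) : b;
    }
    auto cp = kWin1250ToUnicode.find(b);
    if (cp == kWin1250ToUnicode.end()) return b;  // byte unknown to the excerpt
    auto up = kSimpleUpper.find(cp->second);
    if (up == kSimpleUpper.end()) return b;       // no uppercase mapping
    for (const auto& entry : kWin1250ToUnicode) { // map the result back
        if (entry.second == up->second) return entry.first;
    }
    return b;  // uppercase form is not representable in this charset
}

std::string upper_string(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(upper_byte(static_cast<unsigned char>(c)));
    }
    return s;
}

void demo_upper_via_unicode() {
    // 'á' (0xE1) and 'Á' (0xC1) now compare equal after uppercasing.
    assert(upper_string("\xE1" "bc") == upper_string("\xC1" "bc"));
}
```

A real implementation would need the complete mapping tables and a decision on what to do when the uppercase form has no byte in the target character set; the sketch simply leaves such bytes unchanged.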
From: Peter J. <pj...@wa...> - 2003-08-26 16:53:53
Hi Nickolay,

> AFAIU, uppercasing is locale-insensitive operation. Case (none, uppercase,
> lowercase or titlecase) is the property of character and may be changed
> appropriately. Files UnicodeData.txt, SpecialCasing.txt and CaseFolding.txt
> from Unicode standard clearly specify case conversion rules. This means that
> default case conversion may be safely added to all Firebird character sets
> as they are all have direct UNICODE mappings.

Generally speaking I agree with you, and changing the uppercasing to match
the default non-normative UNICODE rules, will be the next step after code
cleanup. Wait some minutes, ...err... days to see this happen.

But...

1a. There are very few cases of locale sensitive uppercasing, involving
letters like LATIN SMALL LETTER DOTLESS I and LATIN CAPITAL LETTER I WITH DOT.

1b. There are rumours about locales, where accents should go away when
uppercasing, e.g. "French Traditional" if I guess right.

2. What about deployed database with column constraints "s = UPPER(s)", and
having unfortunately some "ä" inside? Is it fair to go south on such data?

3. You cannot (now) localize UNICODE_FSS, see such gems as

    case ttype_none:
    case ttype_ascii:
    case ttype_unicode_fss:
        dest = src;
        while (len--) {
            *dest++ = UPPER7(*src);
            src++;
        }
        break;

(in jrd/intl.cpp)

Regards,
Peter Jacobi
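Point 2 above can be made concrete with a small sketch. It assumes an ISO8859_1 column value whose only lowercase letter is "ä" (byte 0xE4) and compares two hypothetical uppercasing rules: `upper_ascii_only` (today's behaviour as described in this thread) and `upper_latin1` (a Unicode-conforming rule restricted to Latin-1). None of these names come from Firebird.

```cpp
#include <cassert>
#include <string>

// Hypothetical: current default, only 'a'..'z' are uppercased.
std::string upper_ascii_only(std::string s) {
    for (char& c : s) {
        if (c >= 'a' && c <= 'z') c = static_cast<char>(c - 32);
    }
    return s;
}

// Hypothetical: Latin-1 aware rule. Bytes 0xE0..0xFE (except 0xF7, the
// division sign) have their uppercase form 0x20 positions lower.
std::string upper_latin1(std::string s) {
    for (char& c : s) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b >= 'a' && b <= 'z') {
            c = static_cast<char>(b - 32);
        } else if (b >= 0xE0 && b <= 0xFE && b != 0xF7) {
            c = static_cast<char>(b - 0x20);
        }
    }
    return s;
}

// Models a column constraint of the form  s = UPPER(s).
bool satisfies_constraint(const std::string& s,
                          std::string (*upper)(std::string)) {
    return s == upper(s);
}

void demo_constraint() {
    const std::string stored = "K\xE4" "SE";  // "KäSE" in ISO8859_1
    // Accepted when the row was inserted under the ASCII-only rule...
    assert(satisfies_constraint(stored, upper_ascii_only));
    // ...but the same stored value violates the constraint under the new rule.
    assert(!satisfies_constraint(stored, upper_latin1));
}
```

Rows like this would pass validation when written and then fail a re-check after the default rule changes, which is the compatibility risk the message raises.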
From: Claudio V. C. <cv...@us...> - 2003-08-27 11:46:08
Dmitry Yemanov wrote:
> Dimitry,
>
>>> This is currently seen as unfortunate by many, and Nickolay had used
>>> his commit access already to hack the fix for cp1251 into current
>>> builds.
>>
>> And I don't know any russian developer who would blame him for this
>> action.
>
> If we'd have different collations for Russian, Ukrainian etc, then I'd call
> this change very questionable. E.g. which collation should be considered a
> default one for WIN1250? Hungarian? Slovakian? I think that's why UPPER uses
> binary collation by default.

Yes, it's easy to default to a non-binary collation when a charset supports
only one language. To give a western example, look at WIN1252 or ISO8859_1,
what should be the non-binary default collation??? Do we select the collation
based on number of speakers for the related language, alphabetic precedence,
number of FB users for that collation, etc?

C.
From: Peter J. <pj...@wa...> - 2003-08-27 17:36:39
Hi Claudio, All,

> Yes, it's easy to default to a non-binary collation when a charset supports
> only one language. To give a western example, look at WIN1252 or ISO8859_1,
> what should be the non-binary default collation???

As I've already posted to firebird-devel, I'd prefer one of two solutions:

A) Default collation for every charset other than NONE and OCTETS has:
   A1) uppercasing conforming to default UNICODE tables
   A2) sorting according to *UNICODE* codepoint order

   A1 gives 99.98% correct behaviour without having to give a COLLATE clause
   A2 gives consistent sorting behaviour across all character sets

B) ("easy mode")
   B1) In addition to the existing behaviour (explicit CHARACTER SET and
       COLLATE clauses), the datatype NCHAR is always mapped to UNICODE
       UTF16BE and the data type is always mapped to ISO-8859-x with
       x=1..16 configurable.
   B2) Collation when no collation clause is given, is a configurable
       <language>_<country> collation.
   B3) Win32 (and maybe other) installers will set x and
       <language>_<country> according to the Regional Options found on
       the system.

I'm aware that both solutions may have compatibility problems to such a
degree that they cannot be deployed, but I wanted to state what I find
desirable.

Regards,
Peter Jacobi
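Proposal A2 can be illustrated with a rough sketch: compare strings by the Unicode code points of their characters instead of by raw bytes, so that the same text sorts the same way in every character set. The mapping tables below are two-entry excerpts (Š = U+0160, Ą = U+0104) chosen for the demonstration, and `ByteToUnicode`, `to_unicode` and `codepoint_less` are hypothetical names, not Firebird APIs.

```cpp
#include <cassert>
#include <map>
#include <string>

using ByteToUnicode = std::map<unsigned char, char32_t>;

// Illustrative excerpts only: the same two letters in two character sets.
static const ByteToUnicode kWin1250   = {{0x8A, 0x0160},   // Š
                                         {0xA5, 0x0104}};  // Ą
static const ByteToUnicode kIso8859_2 = {{0xA9, 0x0160},   // Š
                                         {0xA1, 0x0104}};  // Ą

// Decode a single-byte string into Unicode code points.
std::u32string to_unicode(const std::string& s, const ByteToUnicode& table) {
    std::u32string out;
    for (unsigned char b : s) {
        if (b < 0x80) {
            out.push_back(static_cast<char32_t>(b));  // ASCII maps to itself
        } else {
            auto it = table.find(b);
            // Bytes missing from the excerpt become U+FFFD (replacement char).
            out.push_back(it != table.end() ? it->second
                                            : static_cast<char32_t>(0xFFFD));
        }
    }
    return out;
}

// "Less than" in Unicode code point order, independent of the byte encoding.
bool codepoint_less(const std::string& a, const std::string& b,
                    const ByteToUnicode& table) {
    return to_unicode(a, table) < to_unicode(b, table);
}

void demo_codepoint_order() {
    // Raw byte order disagrees between the two character sets:
    assert(std::string("\x8A") < std::string("\xA5"));  // WIN1250:   Š before Ą
    assert(std::string("\xA1") < std::string("\xA9"));  // ISO8859_2: Ą before Š
    // Code point order agrees: Ą (U+0104) sorts before Š (U+0160) in both.
    assert(codepoint_less("\xA5", "\x8A", kWin1250));
    assert(codepoint_less("\xA1", "\xA9", kIso8859_2));
}
```

Code point order is consistent but not linguistically meaningful for any particular language, which is why the proposal still leaves explicit COLLATE clauses in place for language-specific sorting.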