|
From: Balassa M. <fu...@ma...> - 2003-08-23 12:43:55
|
Try this on a database with character set WIN1250:
SELECT 1 FROM rdb$database WHERE UPPER('abc') = UPPER('AbC')
The above query returns 1 as expected. Now, try it with strings containing
non-us characters (á, é, etc). The result is null.
BALASSA Marton
|
|
From: Dmitry Y. <di...@us...> - 2003-08-23 14:33:30
|
Balassa,
> Try this on a database with character set WIN1250:
>
> SELECT 1 FROM rdb$database WHERE UPPER('abc') = UPPER('AbC')
>
> The above query returns 1 as expected. Now, try it with
> strings containing
> non-us characters (á, é, etc). The result is null.
Use appropriate collation explicitly, e.g.:
SELECT 1
FROM RDB$DATABASE
WHERE UPPER(<value1> COLLATE PXW_HUN) = UPPER(<value2> COLLATE PXW_HUN)
(for Hungarian language)
Dmitry
|
|
From: Dimitry S. <SD...@to...> - 2003-08-25 06:41:44
|
On 23 Aug 2003 at 17:01, Peter Jacobi wrote:
>At some pre-historic time in Interbase development, the decision was
>taken to let the default collation of every (?) character only
>uppercase ASCII letters.
Perhaps the developers didn't know any language but English...
>This is currently seen as unfortunate by many, and Nickolay had used
>his commit access already to hack the fix for cp1251 into current
>builds.
And I don't know any russian developer who would blame him for this
action.
SY, Dimitry Sibiryakov.
|
|
From: Dmitry Y. <di...@us...> - 2003-08-26 15:50:41
|
Dimitry,
> >This is currently seen as unfortunate by many, and Nickolay had used
> >his commit access already to hack the fix for cp1251 into current
> >builds.
>
> And I don't know any russian developer who would blame him for this
> action.
If we'd have different collations for Russian, Ukrainian etc, then I'd call
this change very questionable. E.g. which collation should be considered a
default one for WIN1250? Hungarian? Slovakian? I think that's why UPPER uses
binary collation by default.
Dmitry
|
|
From: Nickolay S. <sk...@bs...> - 2003-08-26 16:23:05
|
Hello, Dmitry !
>> >This is currently seen as unfortunate by many, and Nickolay had used
>> >his commit access already to hack the fix for cp1251 into current
>> >builds.
>>
>> And I don't know any russian developer who would blame him for this
>> action.
> If we'd have different collations for Russian, Ukrainian etc, then I'd call
> this change very questionable. E.g. which collation should be considered a
> default one for WIN1250? Hungarian? Slovakian? I think that's why UPPER uses
> binary collation by default.
AFAIU, uppercasing is locale-insensitive operation. Case (none, uppercase,
lowercase or titlecase) is the property of character and may be changed
appropriately. Files UnicodeData.txt, SpecialCasing.txt and CaseFolding.txt
from Unicode standard clearly specify case conversion rules. This means that
default case conversion may be safely added to all Firebird character sets
as they are all have direct UNICODE mappings.
> Dmitry
--
Nickolay Samofatov
|
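Nickolay's suggestion above (uppercase a single-byte character set by going through its Unicode mapping) can be sketched roughly as follows. This is only an illustration, not Firebird code: the names kToUnicode, kSimpleUpper and upper_via_unicode are hypothetical, and the two tables are tiny hand-picked subsets of the WIN1250 mapping and of the simple uppercase mappings in UnicodeData.txt.

```cpp
#include <cassert>
#include <map>

// Hypothetical, illustrative subset of WIN1250 byte -> Unicode code point.
static const std::map<unsigned char, char32_t> kToUnicode = {
    {0xE1, 0x00E1}, // a with acute, lowercase
    {0xC1, 0x00C1}, // A with acute, uppercase
    {0xE9, 0x00E9}, // e with acute, lowercase
    {0xC9, 0x00C9}, // E with acute, uppercase
    {0x9A, 0x0161}, // s with caron, lowercase
    {0x8A, 0x0160}, // S with caron, uppercase
};

// Hypothetical subset of the simple uppercase mappings from UnicodeData.txt.
static const std::map<char32_t, char32_t> kSimpleUpper = {
    {0x00E1, 0x00C1},
    {0x00E9, 0x00C9},
    {0x0161, 0x0160},
};

// Uppercase one WIN1250 byte: map to Unicode, apply the simple uppercase
// mapping, then map back. Bytes without a known mapping pass through as-is.
unsigned char upper_via_unicode(unsigned char c) {
    if (c < 0x80)  // plain ASCII range
        return (c >= 'a' && c <= 'z') ? static_cast<unsigned char>(c - 'a' + 'A') : c;

    auto u = kToUnicode.find(c);
    if (u == kToUnicode.end()) return c;

    auto up = kSimpleUpper.find(u->second);
    if (up == kSimpleUpper.end()) return c;  // no uppercase form (or already upper)

    // Reverse lookup: find the byte that encodes the uppercase code point.
    for (const auto& kv : kToUnicode)
        if (kv.second == up->second) return kv.first;

    return c;  // uppercase form not representable in this character set
}
```

With full tables the same three steps would apply to every byte; the sketch ignores the multi-character and locale-dependent cases listed in SpecialCasing.txt.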
|
From: Peter J. <pj...@wa...> - 2003-08-26 16:53:53
|
Hi Nickolay,
> AFAIU, uppercasing is locale-insensitive operation. Case (none, uppercase,
> lowercase or titlecase) is the property of character and may be changed
> appropriately. Files UnicodeData.txt, SpecialCasing.txt and CaseFolding.txt
> from Unicode standard clearly specify case conversion rules. This means that
> default case conversion may be safely added to all Firebird character sets
> as they are all have direct UNICODE mappings.
Generally speaking I agree with you, and changing the uppercasing
to match the default non-normative UNICODE rules, will be the next
step after code cleanup. Wait some minutes, ...err... days to see
this happen.
But...
1a. There are very few cases of locale sensitive uppercasing, involving
letters like LATIN SMALL LETTER DOTLESS I and LATIN CAPITAL LETTER I WITH
DOT.
1b. There are rumours about locales, where accents should go away
when uppercasing, e.g. "French Traditional" if I guess right.
2. What about deployed database with column constraints
"s = UPPER(s)", and having unfortunately some "ä" inside? Is
it fair to go south on such data?
3. You cannot (now) localize UNICODE_FSS, see such gems as
    case ttype_none:
    case ttype_ascii:
    case ttype_unicode_fss:
        dest = src;
        while (len--) {
            *dest++ = UPPER7(*src);
            src++;
        }
        break;
(in jrd/intl.cpp)
Regards,
Peter Jacobi
|
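To make the effect of the UPPER7 loop quoted from jrd/intl.cpp concrete, here is a minimal sketch of ASCII-only uppercasing. It assumes that UPPER7 changes only the 7-bit letters a-z; the helper names upper7 and upper7_string are hypothetical stand-ins, not Firebird functions. Under that assumption, a byte such as 0xE1 (lowercase a with acute in WIN1250) is left untouched, so it never compares equal to its uppercase counterpart 0xC1.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the UPPER7 macro: uppercases only the 7-bit
// ASCII letters a-z and passes every other byte through unchanged.
inline unsigned char upper7(unsigned char c) {
    return (c >= 'a' && c <= 'z') ? static_cast<unsigned char>(c - 'a' + 'A') : c;
}

// Apply upper7 to every byte of a string in a single-byte encoding.
std::string upper7_string(const std::string& src) {
    std::string dest(src);
    for (std::size_t i = 0; i < dest.size(); ++i)
        dest[i] = static_cast<char>(upper7(static_cast<unsigned char>(dest[i])));
    return dest;
}
```

This also shows why point 2 matters: a stored value containing such a byte currently satisfies a constraint like s = UPPER(s), and would stop doing so once UPPER starts changing that byte.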
|
From: Claudio V. C. <cv...@us...> - 2003-08-27 11:46:08
|
Dmitry Yemanov wrote:
> Dimitry,
>
>>> This is currently seen as unfortunate by many, and Nickolay had used
>>> his commit access already to hack the fix for cp1251 into current
>>> builds.
>>
>> And I don't know any russian developer who would blame him for this
>> action.
>
> If we'd have different collations for Russian, Ukrainian etc, then I'd call
> this change very questionable. E.g. which collation should be considered a
> default one for WIN1250? Hungarian? Slovakian? I think that's why UPPER uses
> binary collation by default.
Yes, it's easy to default to a non-binary collation when a charset supports
only one language. To give a western example, look at WIN1252 or ISO8859_1,
what should be the non-binary default collation??? Do we select the
collation based on number of speakers for the related language, alphabetic
precedence, number of FB users for that collation, etc?
C.
|
|
From: Peter J. <pj...@wa...> - 2003-08-27 17:36:39
|
Hi Claudio, All,
> Yes, it's easy to default to a non-binary collation when a charset supports
> only one language. To give a western example, look at WIN1252 or ISO8859_1,
> what should be the non-binary default collation???
As I've already posted to firebird-devel, I'd prefer one of
two solutions:
A) Default collation for every charset other than NONE
and OCTETS has:
A1) uppercasing conforming to default UNICODE tables
A2) sorting according to *UNICODE* codepoint order
A1 gives 99.98% correct behaviour without having to
give a COLLATE clause
A2 gives consistent sorting behaviour across all character
sets
B) ("easy mode")
B1) In addition to the existing behaviour (explicit CHARACTER
SET and COLLATE clauses), the datatype NCHAR is always
mapped to UNICODE UTF16BE and the data type is always mapped
to ISO-8859-x with x=1..16 configurable.
B2) Collation when no collation clause is given, is a configurable
<language>_<country> collation.
B3) Win32 (and maybe other) installers will set x and
<language>_<country> according to the Regional Options found
on the system.
I'm aware that both solutions may have compatibility
problems to such a degree that they cannot be deployed,
but I wanted to state what I find desirable.
Regards,
Peter Jacobi
|
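Proposal A2 (default sorting by Unicode codepoint rather than by byte value) can be illustrated with a small sketch. The names kWin1250ToUnicode, to_code_point and sort_by_code_point are hypothetical, and the mapping table is a three-entry subset of WIN1250 chosen only to show a case where the two orders disagree.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Hypothetical, illustrative subset of WIN1250 byte -> Unicode code point.
static const std::map<unsigned char, char32_t> kWin1250ToUnicode = {
    {0x8A, 0x0160}, // S with caron, uppercase
    {0xC1, 0x00C1}, // A with acute, uppercase
    {0xE9, 0x00E9}, // e with acute, lowercase
};

// Map one byte to its code point; ASCII maps to itself, and bytes missing
// from the subset table map to U+FFFD (a simplification for this sketch).
char32_t to_code_point(unsigned char c) {
    if (c < 0x80) return c;
    auto it = kWin1250ToUnicode.find(c);
    return it != kWin1250ToUnicode.end() ? it->second : char32_t(0xFFFD);
}

// Sort bytes by the Unicode code point they encode, not by byte value.
std::vector<unsigned char> sort_by_code_point(std::vector<unsigned char> v) {
    std::sort(v.begin(), v.end(), [](unsigned char a, unsigned char b) {
        return to_code_point(a) < to_code_point(b);
    });
    return v;
}
```

In byte order 0x8A sorts before 0xC1, while in codepoint order U+00C1 sorts before U+0160. Because the comparison no longer depends on where a character happens to sit in a particular code page, the same characters would sort the same way in every character set, which is the consistency A2 asks for.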