From: Alexandre B. S. <ib...@th...> - 2004-06-28 22:24:18
Hi James !

James K. Lowden wrote:
> This is at least partly a function of the application.
> It's standard practice to use UPPER() to defeat case-sensitive searches.

As you know, if you use UPPER, you cannot use the index anymore. Maybe when indices based on functions are available the problem goes away, but I still have some doubts... see below.

> Perhaps a similar function "NOACCENT()" is needed for accents? I'm not
> familiar with languages whose accented characters are sometimes not
> considered distinct. (In Swedish, for example, 'a' and 'å' are two very
> different letters.) How is this solved by other implementations?

The NOACCENT function could solve the problem, but all queries would have to be changed to:

Select * from Customer where UPPER(NOACCENT(Name)) = UPPER(NOACCENT(:aValue));

If I have an index such as:

create index SK_Name on Customer (UPPER(NOACCENT(Name)));

this query will use the index, but if I just mistype the order of the functions [NOACCENT(UPPER(Name))] the query will be "natural" (unindexed) again. Not a big problem if I just adopt a consistent way of writing it... In my main application there is no problem (I just change my data classes), but in third-party tools (Crystal Reports, for example) this becomes a problem... And in every ad-hoc query, e.g. when you are doing some checking in ISQL, you have to put the UPPER(NOACCENT(something)) everywhere to get the desired result.

I think a more elegant approach would be to define it in the table definition, so nobody has to remember to put the double function everywhere. As you pointed out, this is not the default for every language, so a collation with this characteristic should be chosen by the developer when it applies to his needs.

Other implementations: I will describe MSSQL (version 6.5 is the last I have used, a really old one...). I don't think the MSSQL approach is the better one, but...
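As an aside, NOACCENT() is hypothetical; but a minimal Python sketch of what UPPER(NOACCENT(x)) would do, assuming accent stripping is implemented via Unicode NFD decomposition (my assumption, not anything from MSSQL or Firebird), could look like this:

```python
import unicodedata

def noaccent(text: str) -> str:
    """Strip accents: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def insensitive_eq(a: str, b: str) -> bool:
    """Case- and accent-insensitive equality, like UPPER(NOACCENT(x))."""
    return noaccent(a).upper() == noaccent(b).upper()

print(noaccent("São André"))             # -> Sao Andre
print(insensitive_eq("Ávila", "avila"))  # -> True
```

Note that, just as with the SQL version, a comparison done this way defeats a plain index on the column, which is exactly the problem being discussed.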
When you install the server, you choose which dictionary (I think that was the term used) you want to work with (very bad to choose a global one; in my opinion this should be at least per database, and ideally per column!). MSSQL has a similarity table, something like this:

A = a = Á = á = Ã = ã = À = à = Â = â
B = b
C = c = Ç = ç
D = d
E = e = É = é = Ê = ê

This table is used for comparisons. MSSQL states that when one uses a non-binary dictionary, queries can be about 20% slower.

The approach specified in the Unicode docs Peter pointed out is much more powerful: generating the sort key as 3 or more 2-byte values and then comparing the sort key values seems good and powerful to me (more than one char could be mapped to one char, etc.). I don't know about the performance penalty, but since this will be the developer's choice and not the default for every DB around the world, I think I can expect (or be warned about) slower performance when I define a column as case/accent insensitive.

> It seems to me the problem with using collations that don't distinguish,
> say, between accented and non-accented characters is that the applications
> are then prevented from distinguishing, every time. Language constructs
> that allow per-use overrides are more flexible.

I think the collation should be defined on the column (as it is); I could have different columns that should behave differently. But as mentioned in the paper Peter pointed out, the behaviour should be consistent with the cultural expectations of the user: even if Swedish people distinguish "a" and "å", if I (a Brazilian guy) were searching Swedish names I would expect those chars to sort close to each other, and if I just type "a" I would expect the records with "å" to be found anyway. As I understood it, this should behave consistently across any table/column, so it should be based on the connection. But I don't have experience with this kind of problem...

> --jkl

see you !
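P.S. The similarity-table idea above can be sketched in a few lines of Python. This is only an illustration of the principle (one primary weight per group of equivalent characters, compared via sort keys); the GROUPS table below is just the fragment from this mail, and the fallback for unmapped characters is my own assumption:

```python
# Hypothetical similarity table, as in the MSSQL fragment above: every
# character in a group shares one primary weight, so they compare equal.
GROUPS = [
    "AaÁáÃãÀàÂâ",
    "Bb",
    "CcÇç",
    "Dd",
    "EeÉéÊê",
]

# Map each character to the index of its group (its primary weight).
PRIMARY = {ch: weight for weight, group in enumerate(GROUPS) for ch in group}

def sort_key(text):
    """One primary weight per character; unmapped characters fall back to
    their codepoint, offset past the table so they sort after mapped ones."""
    return tuple(PRIMARY.get(ch, len(GROUPS) + ord(ch)) for ch in text)

print(sort_key("Àbç") == sort_key("abc"))  # -> True: same primary weights
```

A real implementation (the Unicode Collation Algorithm) emits several weight levels per character, so "a" and "å" are equal at the primary level but still ordered consistently at a lower level, which is exactly the behaviour described above.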
--
Alexandre Benson Smith
Development
THOR Software e Comercial Ltda.
Santo Andre - Sao Paulo - Brazil
www.thorsoftware.com.br