From: James K. L. <jkl...@sc...> - 2004-06-26 22:26:24
On Sat, 26 Jun 2004, "Carlos H. Cantu" <fb...@wa...> wrote:

JKL> It's standard practice to use UPPER() to defeat case-sensitive
JKL> searches.

> The major problem with using UPPER/NOACCENT is that it does not allow
> index usage in the search.

Sure, but that's an artifact of the implementation. Indexes are not
controlled by the SQL standard, and Firebird is free to implement them
however it wishes, in as many ways as it wishes.

--jkl
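The index-usage point above can be sketched concretely. This is an illustration using Python's sqlite3 module as a stand-in (no Firebird server is assumed here; the table and index names are invented, and SQLite's expression indexes play the role Firebird indexes would): a plain index on a column cannot serve a `WHERE UPPER(col) = ...` predicate, while an index built on the expression `UPPER(col)` can.

```python
# Sketch: why UPPER() in a WHERE clause defeats a plain index, and how an
# expression index restores indexed search. SQLite stands in for Firebird;
# names (people, ix_name, ix_name_upper) are invented for the example.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT)")
con.executemany("INSERT INTO people VALUES (?)",
                [("Carlos",), ("CARLOS",), ("ivan",)])

# Plain index on name: the optimizer cannot use it for UPPER(name) = ?
con.execute("CREATE INDEX ix_name ON people (name)")
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM people WHERE UPPER(name) = 'CARLOS'"
).fetchall()
print(plan[0][3])   # a full table SCAN despite ix_name

# Index on the expression UPPER(name): the same predicate is indexable.
con.execute("CREATE INDEX ix_name_upper ON people (UPPER(name))")
plan2 = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM people WHERE UPPER(name) = 'CARLOS'"
).fetchall()
print(plan2[0][3])  # SEARCH ... USING INDEX ix_name_upper
```

This is the same trade-off the thread discusses: the uppercased search works everywhere, but without an expression index (or an insensitive collation) it forces a full scan.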
From: CarlosMacao <CM...@su...> - 2004-06-24 10:34:12
Hi Peter,

> H) Whoever needs no-case/no-accent very badly can try my pj_colkit
> LOADABLE collation.
> http://www.jodelpeter.de/i18n/fbarch/loadable.txt
> http://www.jodelpeter.de/i18n/fbarch/

Do you know of any solution for the Linux platform?

Regards,
Carlos Macao
From: Ivan P. <Iva...@se...> - 2004-06-24 15:55:03
> A) (and very important from my POV):
> For the majority of applications, no-case/no-accent collations don't
> make sense, and the 'normal' multi-level collations would fulfill all
> requirements, if only they were used. As you mention in your parallel
> post, in STARTING WITH and LIKE 'foo%',
> STARTING WITH 'ABC' should select 'ABCDE' and 'abcde' when a
> multi-level collation is used.

Yes, but STARTING WITH would become accent-insensitive at the same time;
some may want that, others may not.

Nearly everybody (I believe) is content with the sorting capabilities
provided by multi-level collations; it is the searching capabilities that
cause trouble. This is why I said that we need more operators/functions,
not collations (e.g. a new case-insensitive STARTING, instead of changing
its current behaviour).

In other words, ordering is a property of the data, but case/accent
insensitivity should be a property of the operation on those data (at
least in some cases; of course =, <, > must match the ordering, but I see
no reason why the user should not be able to choose between the
case/accent-sensitive and -insensitive variants of CONTAINING). As Peter
suggested, a combination of expression indexes and a NOACCENT() variant
of UPPER() could satisfy such needs too.

> B) The existing multi-level collations are painfully wasteful of key
> storage space, limiting the maximally indexable buffer size to a third
> of the general limit. For this reason, triplicating them into no-case
> and no-case/no-accent variants is -at least to me- wasted effort.

Yes, and I noticed that many existing collations are in fact two-level
only, yet they still require three bytes.

> D) Only with FB 1.5.1 has it become an option to consistently use
> connection charset NONE (and be it only for the lack of a better
> solution)

(I probably already asked this, but forgot the answer:) Does this change
apply to client-server communication only, or is it consistent everywhere
(e.g. Update Tab Set Iso8859Column = NoneColumn;)?

> H) Whoever needs no-case/no-accent very badly can try my pj_colkit
> LOADABLE collation.
> http://www.jodelpeter.de/i18n/fbarch/loadable.txt
> http://www.jodelpeter.de/i18n/fbarch/

> J) As rdb$collations and rdb$character_sets are hardcoded into the
> engine's code, instead of being derived from the DLLs, it is
> unnecessarily complicated to add character sets and collations (and all
> tools will ignore them)

I tried to write some external collation drivers too (using Delphi), and
it is not very difficult, but such an approach has its drawbacks:
- more complicated installation;
- even custom collations cannot solve many requirements;
- a narrow range of external collation ids (250..254). These
  charset_id/collation_id numeric values are used directly, e.g. in the
  BLR of stored procedures and triggers, and there is no additional
  mapping (e.g. through system tables) between external and internal
  ids, so it is dangerous to assume that nobody else has taken these
  numbers for a different driver.

Ivan
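The NOACCENT() variant of UPPER() that Ivan and Peter allude to can be sketched in a few lines. This is a Python illustration of the folding semantics only (in Firebird it would be a UDF or, combined with an expression index, part of the search predicate; the function name `noaccent_upper` is invented here): decompose to NFD, drop the combining marks, then uppercase.

```python
# Sketch of a NOACCENT()-style fold: strip diacritics via Unicode NFD
# decomposition, then uppercase. noaccent_upper is an invented name for
# illustration, not an existing Firebird function.
import unicodedata

def noaccent_upper(s: str) -> str:
    decomposed = unicodedata.normalize("NFD", s)      # é -> e + U+0301
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.upper()

print(noaccent_upper("Řehoř"))  # -> "REHOR"
print(noaccent_upper("café"))   # -> "CAFE"
```

Indexing such a folded value (where expression indexes are available) would give an accent- and case-insensitive search without touching the collation itself, which is exactly the "property of the operation, not of the data" point above.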
From: Ivan P. <Iva...@se...> - 2004-06-24 15:55:03
Why the need for case-insensitive search: I have a database that collects
data from several independent sources. I have no control over their rules
(whether everything is in upper/lower/proper/whatever case); all I know is
that in most cases I have to store the data in their original form, and I
have no right to normalize them.

Why the need for accent-insensitive search:

* Sometimes accent/diacritical marks are missing from the data because
  - the user forgot to write them (spelling error);
  - the user does not know what the correct mark is, e.g. in foreign
    names;
  - the data were written in (or passed through) a system that does not
    support such a character; e.g. a west European journalist could
    hardly write my name correctly, because "r with caron" (the correct
    second letter of my last name) does not exist in iso8859_1/win1252.

* The data are stored correctly, but the person entering the query
  - may be standing at a counter with only one hand free, so using just
    the basic 26 letters reduces the required typing;
  - may be, to put it mildly, not very good at grammar;
  - may be afraid of computers, perplexed by a laptop's keyboard layout,
    used to other search engines, etc.

In many cases even Dave's case/accent-insensitive collations do not help,
because they are
- accent-insensitive only for secondary differences, while still treating
  several letters with caron as a primary difference (and such a
  collation is of little use for *search*, at least for Czech: either all
  marks should be considered, or all should be ignored);
- accent-sensitive with the CONTAINING operator (perhaps just
  fn_to_lower() should be rewritten. Is it used only by CONTAINING?).

I always prefer to have a choice. Despite wanting *-insensitive behaviour
in many cases, sometimes *-sensitive behaviour is also desirable. E.g.
CONTAINING is *always* case-insensitive (or semi-case-insensitive for
binary collations), but its case-sensitive variant is sometimes needed
too (e.g. when searching for short abbreviations like AP, which can be
part of many words), as is its diacritic-insensitive variant (for the
many reasons already mentioned). So I see good reasons to have all three
variants of CONTAINING. (But CONTAINING is the least problematic
operation, because it does not use an index and can easily be replaced by
a UDF - perhaps even more efficiently, because the right operand of
CONTAINING is currently uppercased unnecessarily again and again.)

Now I see I chose the Subject badly. People have certain requirements,
and these can be met in many ways - collations are only one of them.
However, most solutions are probably long-term, while adding a few new
collations could be done in a relatively short time.

Ivan
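The three CONTAINING variants argued for above can be sketched as plain predicates. This is a Python illustration of the intended semantics only (in Firebird they would be UDFs or new operators; all function names here are invented), with accent folding done via Unicode NFD decomposition:

```python
# Sketch of three CONTAINING variants: fully sensitive, case-insensitive
# (roughly today's CONTAINING), and case+accent-insensitive. The names
# are invented for illustration; they are not Firebird built-ins.
import unicodedata

def _fold_accents(s: str) -> str:
    # drop combining marks after NFD decomposition
    return "".join(ch for ch in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(ch))

def containing_sensitive(haystack: str, needle: str) -> bool:
    # exact case and accents - useful for abbreviations like "AP"
    return needle in haystack

def containing_nocase(haystack: str, needle: str) -> bool:
    # case-insensitive but still accent-sensitive
    return needle.upper() in haystack.upper()

def containing_noaccent(haystack: str, needle: str) -> bool:
    # case- and accent-insensitive
    return (_fold_accents(needle).upper()
            in _fold_accents(haystack).upper())

print(containing_sensitive("kAPka", "AP"))   # True: exact match of "AP"
print(containing_nocase("Řehoř", "reh"))     # False: accents still differ
print(containing_noaccent("Řehoř", "reh"))   # True: accents folded away
```

As the message notes, CONTAINING never uses an index anyway, so a UDF along these lines can stand in for the missing variants today.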