From: Ivan P. <Iva...@se...> - 2004-06-23 12:18:47
|
What is the reason that there are still no case/accent/diacritic-insensitive built-in collations in Firebird? Is it that some bugs prevent it, or that it is more difficult than I think, or that it could be done easily but it is difficult to ensure that it conforms to some norms, or that this area will be completely reworked in FB2 so doing it now would be a waste of time, or is the reason as simple as that nobody has created them yet? Ivan |
From: Peter J. <pj...@wa...> - 2004-06-23 13:02:50
|
Hi Ivan, All,

> What is the reason that there are still no case/accent/diacritic-insensitive
> built-in collations in Firebird ? Is it that some bugs prevent it, or that
> it is more difficult than I think, or that it could be easy done but
> difficult to ensure that it conforms to some norms, or that this area will
> be completely reworked in FB2 so doing it now would be wasting of time, or
> is the reason as simple as that nobody just created them ?

All of the above and then some.

A) (and very important from my POV): For the majority of applications, no-case no-accent collations don't make sense, and the 'normal' multi-level collations would fulfill all requirements, if only they were used. As you mention in your parallel post on STARTING WITH and LIKE 'foo%': STARTING WITH 'ABC' should select 'ABCDE' and 'abcde' when a multi-level collation is used.

B) The existing multi-level collations are painfully wasteful on key storage space, limiting the maximally indexable buffer size to a third of the general limit. For this reason, triplicating them into no-case and no-case / no-accent variants is, at least to me, wasted effort.

C) All future collations should better be seen and (partially) implemented as restrictions of collations on Unicode, but the Unicode support of FB is still lacking.

D) Only with FB 1.5.1 has it become an option to consistently use connection charset NONE (and be it only for lack of a better solution).

E) I am not as productive as I hoped.

F) Implementing Thai was more fun.

G) Doing some serious work on charsets/collations always looks somewhat wasted, due to the lack of feedback and the possibility that everything will be thrown out eventually and linked against ICU.

H) Whoever needs no-case/no-accent very badly can try my pj_colkit LOADABLE collation.
http://www.jodelpeter.de/i18n/fbarch/loadable.txt
http://www.jodelpeter.de/i18n/fbarch/

I) I have neither an MSVC build environment nor a working Linux, and I never managed to get MinGW working.

J) As rdb$collations and rdb$character_sets are hardcoded into the engine's code, instead of being deferred from the DLLs, it's unnecessarily complicated to add character sets and collations (and all tools will ignore them).

Regards, Peter Jacobi |
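[Editorial illustration, not part of the original message.] Point A can be sketched in a few lines of Python. This is only a toy model of a multi-level collation under stated assumptions: the names `sort_key` and `starts_with_primary` are hypothetical, the three levels are approximated with Unicode decomposition, and nothing here reflects how Firebird's collation drivers actually build index keys.

```python
import unicodedata


def sort_key(text):
    """Build a toy three-level key: (base letters, accents, case).

    Assumption: levels are approximated via NFD decomposition; this is
    not Firebird's key format.
    """
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Attach the accent to the preceding base character.
            if secondary:
                secondary[-1] += ch
            continue
        primary.append(ch.lower())
        secondary.append("")
        tertiary.append(0 if ch.islower() else 1)  # lowercase sorts first
    return ("".join(primary), tuple(secondary), tuple(tertiary))


def starts_with_primary(value, prefix):
    """Prefix match that looks at the primary level only."""
    return sort_key(value)[0].startswith(sort_key(prefix)[0])


words = ["abcde", "ABCDE", "Abd", "abc", "\u00e1bc"]
print(sorted(words, key=sort_key))
print([w for w in words if starts_with_primary(w, "ABC")])
```

With this key, sorting still distinguishes accents and case (as tie-breakers), while the primary-level prefix test for 'ABC' accepts 'ABCDE', 'abcde', 'abc' and 'ábc' but not 'Abd'. That is the behaviour point A asks of STARTING WITH.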
From: Carlos H. C. <fb...@wa...> - 2004-06-23 16:40:53
|
Peter,

PJ> For the majority of application, no-case no-accent collations don't make
PJ> sense, and the 'normal' multi-level collations would fulfill all
PJ> requirements, if only they are used. As you are mentioning in your parallel
PJ> post, in STARTING WITH and LIKE 'foo%'

Although I do not fully understand what you mean by 'multi-level collations', I tend to disagree with you... at least in Brazil, people have been requesting an accent/case-insensitive collation for a long time. Almost everyone here comes from the xBase world, where we had case-insensitive indexes, and this feature is really missed by those users.

An unofficial special Firebird binary was even created by a Brazilian guy just to implement a case-insensitive collation called PT_BR. Dave was contacted some years ago to develop an official case/accent-insensitive collation based on WIN1252 and ISO8859_1. Many people here contributed money to pay him for this, but due to lack of time (on his part) the work wasn't done.

It seems that FB suffers from a general collation/charset mess, and I think this area deserves a good cleanup.

In the results of my "Firebird Users Wish List Survey" running at http://infopoll.net/live/surveys/s25748.htm, this feature is among the top 4 most wanted features. The final results will be published next week.

[]s
Carlos
http://www.warmboot.com.br
FireBase - http://www.FireBase.com.br |
From: Ivan P. <Iva...@se...> - 2004-06-23 16:48:31
|
One of the problems with collations is that they are all-in-one: they are used for searching, sorting and uppercasing, but this functionality is accessible only in a fixed way that is not always desired or consistent. E.g. when using accent-insensitive collations, I can apply accent-insensitive search (but just for some operations, not e.g. for CONTAINING), but I will lose nice sorting.

Would it not be better to add new operators instead of adding new collations? E.g. to have case-sensitive / case-insensitive / accent-insensitive variants of CONTAINING. Or to be able to specify, for multi-level collations, how many levels are significant for the required operation (thus, depending on the collation, e.g. Starting(1) could be case/accent-insensitive, Starting(2) could be accent-sensitive/case-insensitive, etc.).

Ivan |
From: Peter J. <pj...@wa...> - 2004-06-23 17:19:59
|
Hi Carlos, All,

> Despite the fact that I do not fully understand what you mean with
> 'multi-level collations',

There is no substitute for learning. If you want to know about state-of-the-art collations, there is no substitute for starting here:
http://www.unicode.org/reports/tr10/
Dave has a somewhat gentler introduction in his CollationKit.

> I tend to disagree with you... at last in
> Brazil, people are requesting for a long time an accent/case
> insensitive collation.

For what use case? The existing multi-level collations already implement nearly everything a no-case/no-accent collation can do, except assuring uniqueness regarding columns. But I consider it a somewhat strange design to
a) not allow more than one of "RED", "Red" and "red" in a column, but
b) not care about which spelling is desired.

> Almost everyone here comes from the xBase world
> where we had case-insensive indexes and this feature is really missed
> by those users.

Hehe, a lot of users get the answer "No, you don't want to do this" for some questions on Firebird. This is not exactly good propaganda for Firebird, even when considered to be educational. So, whereas I'm somewhat tempted to react like this, ...

> Even an unoficial special Firebird binary was created by a brazilian
> guy just to implement a case-insensitive collate called PT_BR. Dave
> was contacted some years ago to develop an official case/accent
> insensitive collation based on WIN1252 and ISO8859_1. Many people here
> contributed with money to pay him for this, but due to lack time (from
> him) the work wasn't done.

Give me the requirements spec. What exact behaviour do you want for this collation?

> It seems that FB suffers a collation/charset general mess and I think
> it deserves a good cleanup in this area.

IMHO it suffers from some great design ideas and a really wonderful generality of approach, with development then having stopped when everything was 90% ready.

> In the results of my "Firebird Users Wish List Survey" running at
> http://infopoll.net/live/surveys/s25748.htm, this feature is among the
> top 4 mostly wanted features. The final results will be published next
> week.

Users don't know how they want to sort ;-)

Regards, Peter Jacobi |
From: Carlos H. C. <fb...@wa...> - 2004-06-23 19:30:42
|
Peter,

PJ> There is no substitute for leartning. If you want to know about
PJ> state-of-the-art collations, there is no substitute for starting here:
PJ> http://www.unicode.org/reports/tr10/

Thanks, I will check it.

>> I tend to disagree with you... at last in
>> Brazil, people are requesting for a long time an accent/case
>> insensitive collation.

PJ> For what use case?

Example: here we have a very common name for people, "João". This is the correct way of writing it, BUT people write it in many different ways, so we can have in the same table:

João
joão
joao
Joao
JOAO
....

The same thing happens with many other names and words (e.g. José, cachaça, etc.).

People here would like to have all the variations ordered and compared as the same, i.e.:

c = C = ç = Ç
a = A = á = Á = ã = Ã = ä = Ä = â = Â = à = À
etc...

PJ> Give me the requirements spec. What exact behaviour do you want for this
PJ> collation?

I think my previous explanation gives you a good idea of what we want/need. BTW, isn't that the default expected behavior of a case/accent-insensitive collation?

[]s
Carlos
http://www.warmboot.com.br
FireBase - http://www.FireBase.com.br |
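[Editorial illustration, not part of the original message.] The equivalence classes Carlos lists (c = C = ç = Ç, a = A = á = ... ) can be approximated in Python by stripping combining marks and folding case. This is a minimal sketch; `fold` is a hypothetical helper name, and a real Firebird collation would work on a single-byte charset such as ISO8859_1 or WIN1252 rather than on Unicode strings.

```python
import unicodedata


def fold(text):
    """Map a string to a case- and accent-insensitive comparison form."""
    # NFD splits "ã" into "a" plus a combining tilde, "ç" into "c" plus cedilla.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


names = ["Jo\u00e3o", "jo\u00e3o", "joao", "Joao", "JOAO"]
# All five spellings collapse to the same comparison form.
print({fold(name) for name in names})
```

Comparing and ordering on `fold(value)` treats every spelling as the same name. The price, as discussed elsewhere in the thread, is that the distinction between the spellings is no longer available for sorting.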
From: Alexandre B. S. <ib...@th...> - 2004-06-23 20:10:32
|
Peter Jacobi wrote:

> ...snip...
>
>For what use case?

Hi Peter/All

Is it just us Brazilians who want this??? :-) What about the Spanish, Mexicans, Bolivians, French, and speakers of all the other Latin-derived languages...

I don't generally post on this list, but this subject is very important to me...

I have a lot of customers whose data was pumped into my databases from legacy applications. These applications are either DOS-based or Windows-based, but in all of them every field is converted to upper case.

I think mixed-case text is much more readable and more elegant, and I encourage the users of my systems to use it.

But then they have a problem: if a user searches for

like "Alexandre%"

it will not match the legacy records that were imported as "ALEXANDRE"; if he mistypes "aLexandre", the same occurs. So case insensitivity is important.

Because of these DOS applications, or because lazy users don't enter the correct accents, the user now does not know whether he should search for João or Joao or JOÃO or JOAO.

Anyway, those words are the same name, and every Brazilian would recognize and read any of the above variations as the correct one, which is "João".

In commercial applications (that is what I develop) the rule is: case/accent does not matter for searching and ordering; the exception is: the case matters.

The letters 'a', 'á, 'à', 'ã', 'A', 'Á', 'À', 'Ã' should be considered the same in comparisons and sorting.

Each of my clients asks why the system cares about case (the accent part they understand, but not the case part).

I know about proxy columns, but I don't like the idea; I wish to have the choice to make a column case/accent-insensitive.

>The existing multi level collations already implement nearly everything a
>no-case/no-accent collation can do. Except assuring uniqueness
>regarding columns. But I consider it a somewhat strange design, to
>a) not allow more than one of "RED", "Red" and "red" in a column
>but
>b) not to care about which spelling is desired

I have no experience with multi-level collations; I will read the link you sent to Carlos. AFAIK, the multi-level collation will not work for "like" and "starts with". In the majority of searches for names I use "starts with". I read your suggestion of using between "something" and "somethingzz", but like is much more powerful... :-(

As I said above, in general it does not matter whether "Red" or "red" or "RED" is inserted. If I put a unique constraint or PK on this column, I must be sure that case variations are not treated as distinct when I define the field as case-insensitive; and the user will not have problems, since whether he searches for "red" or "RED" or "Red" the record will be found anyway.

>>Even an unoficial special Firebird binary was created by a brazilian
>>guy just to implement a case-insensitive collate called PT_BR. Dave
>>was contacted some years ago to develop an official case/accent
>>insensitive collation based on WIN1252 and ISO8859_1. Many people here
>>contributed with money to pay him for this, but due to lack time (from
>>him) the work wasn't done.
>
>Give me the requirements spec. What exact behaviour do you want for this
>collation?

I have contact with this guy; in the last few days he has adjusted his patch for FB 1.5. He is a member of CFLP (the Portuguese-speaking Firebird community).

His patch can do a case-insensitive/accent-insensitive search, columns with up to 250 chars can be indexed, and "like", "containing" and "starts with" work.

>>It seems that FB suffers a collation/charset general mess and I think
>>it deserves a good cleanup in this area.
>
>IMHO it suffers from some great design ideas and really wonderful
>generality of approach, but then having stopped development when
>everything was 90% ready.
>
>>In the results of my "Firebird Users Wish List Survey" running at
>>http://infopoll.net/live/surveys/s25748.htm, this feature is among the
>>top 4 mostly wanted features. The final results will be published next
>>week.
>
>Users don't know how they want to sort ;-)
>
>Regards,
>Peter Jacobi

I will be glad if I can help you to better understand this situation.

see you !

--
Alexandre Benson Smith
Development
THOR Software e Comercial Ltda.
Santo Andre - Sao Paulo - Brazil
www.thorsoftware.com.br |
From: Peter J. <pj...@wa...> - 2004-06-24 07:30:26
|
Hi Carlos,

> c = C = ç = Ç
> a = A = á = Á = ã = Ã = ä = Ä = â = Â = â = Â = à = À etc...

Please try LOADABLE_NC_NA in pjcolkit
http://www.jodelpeter.de/i18n/fbarch/index.htm

If this is what you want, it only must be merged into the CVS tree.

> I think my previous explanation gives you a good idea of what we
> want/need. BTW, isnt that a default expected behavior of a case/accent
> insensitive collation?

Not always. I'll discuss later.

Peter |
From: Carlos H. C. <fb...@wa...> - 2004-06-24 12:24:13
|
PJ> Please try LOADABLE_NC_NA in pjcolkit
PJ> http://www.jodelpeter.de/i18n/fbarch/index.htm
PJ> If this is what you want, it only must be merged into the CVS tree.

Initial tests with ISO8859_NCNA were successful! The only problem is that STARTING WITH and CONTAINING don't work with it. I guess that is the reason why Paulo (the Brazilian guy who created PT_BR) had to modify the Firebird executables to make it work with SW and CONTAINING.

So I think the collations could be merged, but together with that we would need code changes to make STARTING WITH and CONTAINING work with them.

Also, an NCNA collation for WIN1252 would be appreciated (it can be based on PXW_INTL850).

[]s
Carlos
http://www.warmboot.com.br
FireBase - http://www.FireBase.com.br |
From: Peter J. <pj...@wa...> - 2004-06-24 12:51:45
|
Hi Carlos,

"Carlos H. Cantu" <fb...@wa...> wrote:
> Initial tests with ISO8859_NCNA were sucessfull! The only problem is
> that STARTING WITH and CONTAINING doesnt work with it.

As I never fail to mention, as a workaround for STARTING WITH you can use BETWEEN 'foobar' AND 'foobarzzz'.

> I guess that is
> the reason why Paulo (the Brazilian guy who created PT_BR) had to
> modify the executables of Firebird to make it work with SW and
> CONTAINING.

The STARTING WITH and LIKE 'foo%' case is a nearly trivial patch, which was discussed some hours ago. CONTAINING and the general LIKE case would need some additional work. Also, it seems that the codebases of Firebird 1.5 and CVS HEAD have already diverged significantly, so it's a tough choice which version to address.

> So I think the collations could me merged, but together with that, we
> would need code changes to make Starting With and Containing working
> with them.

You are now at the point where you have to either:
- start working on this
- motivate someone to work on this
OR
- pay someone to work on this

> Also a collation NCNA for WIN1252 would be apreciated (it can be based
> on PXW_INTL850).

The LOADABLE collation in my pjcolkit is called LOADABLE because you can tailor it yourself to any charset and collation by creating one data file.

Regards, Peter Jacobi |
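[Editorial illustration, not part of the original message.] The BETWEEN workaround can be compared with a true prefix test in a small Python model. This is a sketch under assumptions: `ncna_key` is a hypothetical stand-in for a no-case/no-accent collation key (it is not how ISO8859_NCNA builds keys), and plain code point order is used for comparison.

```python
import unicodedata


def ncna_key(text):
    """Hypothetical no-case/no-accent key: strip accents, fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def between(value, low, high):
    """Model of: value BETWEEN low AND high under the collation key."""
    return ncna_key(low) <= ncna_key(value) <= ncna_key(high)


def starting_with(value, prefix):
    """Model of a STARTING WITH that honours the collation key."""
    return ncna_key(value).startswith(ncna_key(prefix))


rows = ["Foobar", "FOOBAR2000", "foob\u00e1s", "foobarzzzz", "foo"]
for row in rows:
    print(row, between(row, "foobar", "foobarzzz"), starting_with(row, "foobar"))
```

In this model the two tests agree on most rows, but 'foobarzzzz' starts with 'foobar' and still falls outside the BETWEEN range, because it sorts after the upper bound. That gap is why the range form is a workaround and not a replacement.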
From: Carlos H. C. <fb...@wa...> - 2004-06-24 14:31:38
|
PJ> As I never miss to mention, as a workaround for STARTING WITH PJ> you can use BETWEEN 'foobar' AND 'foobarzzz'. Sure, but a true solution is always better than workarounds ;) PJ> You are now at the point, were you have to either: PJ> - start working on this No way, since I do not code in C. PJ> - motivate someone to work on this I'm trying... PJ> OR PJ> - pay someone to work on this As I told you, some years ago there was a money collection in Brazil (done by Paulo Henrique Albanez and announced in my discussion list) to pay Dave for the work. I think this money is still here, waiting for someone. But I think the destination of the money must be decided by everyone who contributed with it... I think Paulo (PHA) or Artur Anjos can give you more details. I'm copying this message to them. []s Carlos http://www.warmboot.com.br FireBase - http://www.FireBase.com.br |
From: Artur A. <ar...@ar...> - 2004-06-25 01:47:14
|
> As I told you, some years ago there was a money collection in Brazil
> (done by Paulo Henrique Albanez and announced in my discussion list)
> to pay Dave for the work. I think this money is still here, waiting
> for someone. But I think the destination of the money must be decided
> by everyone who contributed with it... I think Paulo (PHA) or Artur
> Anjos can give you more details. I'm copying this message to them.

Here is the story:

The money was collected to pay Dave. When we finished the collection, Dave was not available to do the job. Paulo Henrique, the person whose job was just to collect the money, felt that it wasn't OK for him to keep the money, so he did the job himself. :-) A great job.

Then the problem arrived: to complete all tasks, Paulo needed to make some specific fixes for collate PT_BR that will not work with other collates. That's a Firebird limitation, as discussed here, but Paulo wasn't aware of this until the problem appeared. That's the reason the Firebird project refuses to integrate the new collate in the tree.

Source code and binaries are available from CFLP and anyone can use them. CFLP has taken on the responsibility of making code and binaries available each time the Firebird project releases a new version. It's available for FB 1 and 1.5, and we will try to make it available for the next versions.

I had a talk with Nickolay in Fulda about this subject. He explained to me the exact problem that we have to solve, as he already explained on this list:

" It is expected that at some point collations will be converted to streamed interface to work correctly with BLOBs. So unfortunately using string_to_key will not help in generic case when streamed filter templates (StartsEvaluator<>, LikeEvaluator<>, ContainsEvaluator<>, etc) are used to implement pattern matching. "

So, we divided this task into two tasks:

For collate PT_BR, we are now working on a version without the fixes, standard with Firebird code, that could be integrated in the tree. That will give you, Brazilian users, the task done. With, of course, all the limitations that all collations have in Firebird.

At CFLP, we are available to take on the task of converting collations to the streamed interface, and Nickolay already told us that he will help by giving us the directions to follow. But one thing at a time: for now, we are testing the new collate, and we don't have time to do everything.

Regarding the money: the money isn't there 'waiting for someone'. All the people that contributed decided to give Paulo the money. He really did the job. But Paulo didn't accept this situation, because the code isn't integrated in the tree (one of the targets when we did the collection). But that's his problem now: the money belongs to him, so it's up to him to decide what to do with it.

Artur |
From: Carlos H. C. <fb...@wa...> - 2004-06-25 11:17:18
|
AA> Regarding the money: the money isn't there 'waiting for someone'. All
AA> the people that contributed decided to give Paulo the money. He really
AA> did the job. But Paulo didn't accept this situation, because the code
AA> isn't integrated in the tree (one of the target's when we did the
AA> collect). But that's his problem now: the money belongs to him, so it's
AA> up to him to decide what to do with it.

Personally I think the money should be given to him, but I did not contribute to the money collection at that time, so I will not talk about that money anymore. I was not aware that a decision had already been made to give the money to Paulo. He didn't mention that to me in our latest talk. Anyway, I think it was a fair decision!

[]s
Carlos
http://www.warmboot.com.br
FireBase - http://www.FireBase.com.br |
From: Pha-Listas <li...@ph...> - 2004-06-27 23:54:10
|
But I did not accept the decision of the majority. I will probably return the money to those who contributed, and each one will do what they think best.

PHA
Nova Odessa / SP - Brazil |
From: Peter J. <pj...@wa...> - 2004-06-25 06:08:36
|
Hi Artur,

Artur Anjos <ar...@ar...> wrote:
> Then the problem arrived: to complete all tasks, Paulo need to make some
> specific fixes t collate PT_BR that will not work with other collates.
> That's a Firebird limitation, as discussed here, but Paulo wasn't aware of
> this until the time the problem appears. That's the reason that the Firebird
> project refuses to integrate the new collate in the tree.
>
> Source code and binarys are available from CFLP and any one can use it. CFLP
> has taken the responsability to make available code and binaries for each
> time the Firebird project releases a new version. It's available for Fb 1
> and 1.5, and we will try to make it available for next versions.

I'm interested in looking at the source code for this, but had difficulties finding it at CFLP. Perhaps I should take this as an incentive to learn Portuguese, but for the short term it would help if you could give me a direct download link, or send me the affected files by email.

Regards, Peter Jacobi |
From: Peter J. <pj...@wa...> - 2004-06-24 14:57:23
|
Hi Carlos, All,

> As I told you, some years ago there was a money collection in Brazil
> (done by Paulo Henrique Albanez and announced in my discussion list)
> to pay Dave for the work. I think this money is still here, waiting
> for someone. But I think the destination of the money must be decided
> by everyone who contributed with it... I think Paulo (PHA) or Artur
> Anjos can give you more details. I'm copying this message to them.

You should contact our Wise Rulers from the Firebound Foundation.

And to clarify: within my limited foresight, I'm not applying for the job. I'd prefer to do the bits that are fun to me and give advice if asked.

Regards, Peter Jacobi |
From: Peter J. <pj...@wa...> - 2004-06-24 08:37:57
|
Hi Alexandre,

> But then, they have a problem, if they search for
> like "Alexandre%"
>
> he will not macth the legacy records that was imported as "ALEXANDRE"

It is (in my opinion) a defect in the Firebird code that like "Alexandre%" (and equivalently STARTING WITH "Alexandre") doesn't work for you. For every multi-level collation it should match ALEXANDRE and alexandre and Alexandré etc.

I don't judge it to be wise to add a large number of collations to make up for a code defect which can easily be changed, if only we agree that it is a defect.

> in commercial applications (that is what I develop) the rule are:
> The case/accent does not matter on searching and ordering

Then why don't you implement one, or let one commercial programmer spend four commercially paid hours to make your commercial application work commercially? I'm doing this as a hobby of mine and I am more interested in linguistically correct sorting.

> The letters 'a', 'á, 'à', 'ã', 'A', 'Á', 'À', 'Ã' should be considered the
> same in compararions and sorting.

The most general rule I found is that 'foreign' characters should be mapped to their nearest ASCII equivalent, but that some or all of the non-ASCII characters of your own language are considered distinct. So a Polish dictionary or phone book has separate entries for
U+0141 LATIN CAPITAL LETTER L WITH STROKE
but not for
U+00D6 LATIN CAPITAL LETTER O WITH DIAERESIS
And in Denmark it's just the other way around.

If you expect users who will only want to enter ASCII characters for searching, are the same users doing the data entry? Then can you trust them to enter the non-ASCII characters correctly, or should the database better store only ASCII characters?

So the remaining use case for a very aggressive no-accent collation seems to be an application where data entry is done very carefully by users who are aware of character details, and searching is done by users who only know ASCII or are forced to use a system where it is hard to enter non-ASCII characters.

> AFAIK, the multi-level collation will not work for "like" and "starts
> with" in the majority of the search for names I use starts with, I read
> about your sugestion of using between "something" and "somethingzz", but
> like are much more powerfull... :-(
>
> As I said above in general does not matter if will insert "Red" or "red" or
> "RED", if will put an unique constraint or PK on this column, I must be sure
> that the case variations should not be considered if I define this field as
> case insensitive, and for the user, he will not have problens since if he
> searchs for "red" or "RED" or "Red" the record will be found anyway.

> I have contact with this guy, he has on the last days adjusted his patch for
> FB 1.5. he is a member of CFLP (Portuguese Spoken Firebird Community).
>
> His patch can do a case insensitive/accent insensitive search, columns with
> up to 250 chars can be indexed, the "like", "containing" and "starts with"
> works.

Fine. So you see, the Firebird INTL architecture allows easy additions specific to your needs.

The above can also be achieved using the LOADABLE collation of my pjcolkit, but as it resides in fbintl2 and not in fbintl, it is somewhat more awkward to use. (http://www.jodelpeter.de/i18n/fbarch/loadable.txt)

> I will be glad if I can help you to better understand this situation.

There is a non-technical point to consider: some aspects of collations are just tedious, stupid work. So you can expect a lack of volunteering in OSS projects. It's like the situation with fonts: there are a large number of free fonts, but almost none of them look good at small point sizes (some even look ugly at all point sizes!), because this would require a large amount of "hinting", which is very, very tedious and stupid work.

Regards, Peter Jacobi |
From: Horvath, S. <san...@fr...> - 2004-06-24 09:21:24
|
Hi All!

Please forgive me for my English!

> Hi Alexandre,
>> But then, they have a problem, if they search for
>> like "Alexandre%"
>>
>> he will not macth the legacy records that was imported as "ALEXANDRE"
> It is (in my opinion) a defect in the Firebird code, that
> like "Alexandre%" (and equivalently STARTING WITH "Alexandre")
> doesn't work for you. For every multi-level collation it should matches
> ALEXANDRE and alexandre and Alexandré etc.

Be careful with collations! The "alexandre = Alexandré" is not so simple. Well, in the Hungarian language this is not right! Firebird 1.5 has a character set (WIN1250) and a collation (PXW_HUNDC) for Hungarian, but it is a mistake.

For example: in the Hungarian alphabet there are the letters aáb or AÁB; the order is important but the case doesn't matter. Users don't want to press shift to try to find a name! Firebird solves this problem the following way: a=A, a=á, a=Á! This is not good! The order is wrong, because "á" doesn't follow "a", and it doesn't work in indexed search, but it does in UPPER().

So I ask, what is the reason that there's no really case-insensitive national support in Firebird? Maybe this is its weakest point!

BTW: I love Firebird! It's the best SQL server! Small, simple and powerful...

Sandor |
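[Editorial illustration, not part of the original message.] Sandor's point about "a" and "á" can be shown with two sort keys in Python. This is a toy sketch built only from his description: `ALPHABET` is a hypothetical alphabet in which "á" is a separate letter placed right after "a" (the real Hungarian alphabet has more letters and digraphs), and neither key reflects what PXW_HUNDC actually does.

```python
import unicodedata

# Assumption: toy alphabet with "á" as its own letter following "a".
ALPHABET = "a\u00e1bcdefghijklmnopqrstuvwxyz"
RANK = {letter: index for index, letter in enumerate(ALPHABET)}


def letter_key(text):
    """Case-insensitive key where accented letters keep their own rank."""
    composed = unicodedata.normalize("NFC", text)
    return tuple(RANK.get(ch.lower(), len(ALPHABET) + ord(ch)) for ch in composed)


def folded_key(text):
    """Key that treats a = A = á = Á, as in an aggressive no-accent collation."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


words = ["ab", "\u00c1b", "Aa", "\u00e1a", "b", "az"]
print(sorted(words, key=letter_key))   # all "a..." words before all "á..." words
print(sorted(words, key=folded_key))   # "á..." words mixed in among "a..." words
```

Both keys ignore case, so no shift key is needed to find a name. Only the first keeps every "á" word after every "a" word, which is the ordering Sandor describes as expected.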
From: Peter J. <pj...@wa...> - 2004-06-24 09:51:50
|
Hi Sandor, > Please forgive me for my english! Thanks for contributing to the discussion, there is no need to apologize for your English. If you want to invest more times into your postings, don't use it to improve your english put to add more SQL. What I mean with this semi-cryptic comment: It is sometimes hard (for me), to learn from a narrative explanation, what is really expected in different culture's sorting and searching. So in an optimal posting, you should include a complete, working example to demonstrate the difference between expected and actual behaviour. I.e. -- DDL to create the table CREATE TABLE FOO.... ... CHARACTER SET .... COLLATE -- Inserting test data INSERT INTO FOO .... -- Select statements SELECT ... WHERE ... LIKE ... ORDER And then your comments, what went wrong: - too many columns selected (which) - too few columns selected (which) - wrong order (show desired order) As non ISO-8859-1 characters seldom survive the mailing list, you should create zipped testcases and upload them to the file areas of firebird- architect or firebird-i18n > Be carefull with collations! The "alexandre =3D Alexandr=E9" is not so > simple. Yes, and this is the reason that using full multi-level collations and STARTING WITH is more powerfull and versatile than just having one, very aggressive no-case/no-accent collation. A correctly working "STARTING WITH 'alexandre'" (which can be emulated by "BETWEEN 'alexandre' AND 'alexandreZZZ'), would take differences between languages into account > Well, in hungarian language this is not right! Firebird 1.5 has a > character set (WIN1250) and a collation (PXW_HUNDC) for hungarian but > it is a mistake. The four multi-level collations which exist in Firebird, implement two rather different behaviours. 
AFAIK nobody was willing and able to
a) really explain the differences and
b) decide which of these is better, or whether both variants should be maintained.

From my shallow understanding, one variant encodes the 'common use' and the other encodes the 'librarian's use'.

> For example:
> In hun. alphabet there are letters aáb or AÁB, the order is important
> but the case is doesn't matter. Users don't want to press shift to
> try to find a name!
> Firebird solves this problem the following way: a=A, a=á, a=Á!

I don't think that this is really the case. Please provide a test case as explained above.

> This is not good! The order is wrong, because "á" doesn't follow "a"
> and it doesn't work in indexed search but in UPPER().

Please give a complete example.

> So I ask, what is the reason that there's no really case-insensitive
> national support is Firebird?

See for example my points A) ... J) in my second answer to Ivan or my ramblings on
http://www.jodelpeter.de/i18n/fbarch/

Regards,
Peter Jacobi |
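A minimal sketch of the "STARTING WITH as a range" idea from this message, assuming a simplified primary-level key (accents stripped, case folded) in place of any real Firebird collation. The names primary_key, build_index and starting_with are hypothetical, and the sorted list only stands in for an index; it illustrates why a prefix search on a key can remain a range scan, not how Firebird implements it.

```python
import bisect
import unicodedata


def primary_key(text: str) -> str:
    """Hypothetical primary-level key: base letters only, case folded."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


def build_index(values):
    """Sorted list of (primary key, original value), standing in for an index."""
    return sorted((primary_key(v), v) for v in values)


def starting_with(index, prefix):
    """Range scan over the keys, analogous to BETWEEN 'x' AND 'xZZZ'."""
    low = primary_key(prefix)
    # Assumption: U+10FFFF sorts after any ordinary continuation of the prefix.
    high = low + "\U0010FFFF"
    start = bisect.bisect_left(index, (low,))
    stop = bisect.bisect_left(index, (high,))
    return [original for _, original in index[start:stop]]


names = ["ALEXANDRE", "alexandre", "Alexandré Silva", "Alexandra", "Bruno"]
print(starting_with(build_index(names), "Alexandre"))
```

Language-specific rules (for example Hungarian treating á as a letter distinct from a) would need a different primary_key; the range scan itself would stay the same.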
From: Alexandre B. S. <ib...@th...> - 2004-06-24 21:35:35
|
Hi Peter,

Thanks for your time reading and answering...

Peter Jacobi wrote:

>It is (in my opinion) a defect in the Firebird code, that
>like "Alexandre%" (and equivalently STARTING WITH "Alexandre")
>doesn't work for you. For every multi-level collation it should match
>ALEXANDRE and alexandre and Alexandré etc.
>
>I don't judge it to be wise to add a large number of collations to make
>up for a code defect, which can easily be changed, if only we agree that
>it is a defect.
>

Agreed! I don't know whether creating distinct collations is the way to solve the problem. I don't know very well where the problem lives.

The problem I encounter is: I'd like Firebird to search/order in a case/accent-insensitive manner, and the search to work as expected with LIKE, CONTAINING, STARTING WITH, or any operator created in the future.

>Then why don't you implement one or let one commercial programmer spend
>four commercially paid hours to make your commercial application work
>commercially?
>

I have no knowledge of C and don't know how FB works internally. I have never been involved in DBMS design/coding, so I think I cannot do it by myself.

As Carlos mentioned, Paulo Henrique Albanez has made a modified version of FB that orders/searches as we Brazilians wish, and we would like it to be implemented in the main FB tree. He made some fixes, and LIKE, STARTING WITH and CONTAINING work as expected. But AFAIK the changes he made are not in accordance with the FB code rules, so they cannot be merged into the main source tree.

I think I can raise some funds if someone wishes to solve the problem. I have no idea how much this would cost; maybe what I think I can raise will be far lower than what someone would want for the job.
On the other hand, I have dealt with this for some years, so if the developers think that this issue should be addressed in FB 2.0, that's OK for me. I prefer a "good solution" that works to a "perfect solution" that will only be available in FB 5.0, but if the time/cost involved in developing a "good solution" is almost the same as for the "ideal solution", I understand perfectly and will just hold my breath for some time.

Just a side note:
Don't take my words as negative or destructive criticism. I don't have full control of the language, so I think my comments could sometimes be interpreted in the wrong way. :-)

>I'm doing this as a hobby of mine and I am more interested in
>linguistically correct sorting.
>

I noted it in your comments about the Thai implementation you did. :-)

>The most generally rule I found, is that 'foreign' characters should be
>mapped to their nearest ASCII equivalent, but that some or all of the
>non-ASCII characters of your own language are considered distinct.
>

Did understand what you meant here...

>If you expect users, who will only want to enter ASCII characters for
>searching, are the same users doing the data entry? Then can you trust them
>the enter the non-ASCII characters correctly or should the database
>better store only ASCII characters.
>

They will search with both ASCII and non-ASCII. What we wish is that these words

João
Joao
JOAO
JOÃO
jOão
joÃo
etc.

are considered the same for sorting/searching. It does not matter how the user entered the field or how he asks in the search; any one of these forms should be equal to all the others.

The database should store the data as the user typed it, although the mixed-case form with the proper accents is far more readable, elegant and correct. But if the user types "JOAO", the data should be stored as "JOAO"; when he searches for "João", the record should be returned, but as typed (without the accent and in upper case). The users should type it correctly.
>So the remaining use case for a very aggressive no-accent collation seems
>to be an application, where data entry is done very carefully by users, who
>are aware of character details, and searching by users who only know ASCII
>or are forced to use as system where it is hard to enter non-ASCII
>characters.
>

The idea is:
It does not matter if one user typed with accents and mixed case, another typed with accents in upper case, and a lot of records were pumped into the database from a DOS app that just uses upper-case ASCII; he can search in any way.

The ability to enter accented chars in mixed case makes the reports nicer to read and gives them a professional look. When you send an invoice to a customer in all caps without accents, it looks ugly.

So the ability to enter the special chars is very desirable, but it has a drawback: if there is no pattern, or if the data could contain legacy records, one has to search in every form to find a desired record.

>Fine. So you see, Firebird INTL architecture allows easy additions specific
>to your needs.
>
>The above can also be achieved using the LOADABLE collation of my pjcolkit,
>but as residing in fbintl2 and not in fbintl, it is somewhat more awkward
>to use. (http://www.jodelpeter.de/i18n/fbarch/loadable.txt)
>

I have read it and found it very useful. I am waiting for the release of the "Brazilian version" of FB 1.5 that Paulo is working on, to do some tests and decide which option to adopt.

I looked at Dave's collation kits a long time ago, but the kits were only available on Windows, and the majority of my customers use Linux. I want consistent behaviour on all of them.

>There is a non technical point to consider:
>
>Some aspects of collations are just tedious, stupid work. So you can expect
>a lack of volunteering in OSS projects.
>It's like the situation with fonts:
>There are a big number of free fonts, but almost none of them look good
>at small point sizes (some even look ugly at all point sizes!), because
>this would require a large amount of "hinting", which is a very, very
>tedious and stupid work.
>

I have this feeling too. This feature is not so important to a great number of people (English users, etc.), but I think users of the Latin-derived languages want it a lot.

>Regards,
>Peter Jacobi
>

see you !

--
Alexandre Benson Smith
Development
THOR Software e Comercial Ltda.
Santo Andre - Sao Paulo - Brazil
www.thorsoftware.com.br |
From: James K. L. <jkl...@sc...> - 2004-06-26 19:51:31
|
On Thu, 24 Jun 2004 <ib...@th...> wrote:

> They will search with both ascii/non ascii, what we wish is that these
> words João
> Joao
> JOAO
> JOÃO
> jOão
> joÃo
> etc.
...
> But if the user type
> "JOAO" the data should be stored as "JOAO", when he search for "João"
> the record should be returned, but as typed (without the accent and
> upper case). The users should type it correctly.

This is at least partly a function of the application.

It's standard practice to use UPPER() to defeat case-sensitive searches.
Perhaps a similar function, "NOACCENT()", is needed for accents? I'm not familiar with languages whose accented characters are sometimes not considered distinct. (In Swedish, for example, 'a' and 'å' are two very different letters.) How is this solved by other implementations?

It seems to me the problem with using collations that don't distinguish, say, between accented and non-accented characters is that the applications are then prevented from distinguishing, every time. Language constructs that allow per-use overrides are more flexible.

--jkl |
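NOACCENT() is only a proposal in this thread, not an existing Firebird function. As a rough Python sketch of what such a per-use override might look like, assuming that "accent" means any Unicode combining mark: the names noaccent and matches are hypothetical, and each caller decides which differences to ignore.

```python
import unicodedata


def noaccent(text: str) -> str:
    """Hypothetical NOACCENT(): drop combining marks after decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def matches(stored: str, wanted: str, ignore_accents: bool, ignore_case: bool) -> bool:
    """Per-use override: the caller chooses which differences do not count."""
    if ignore_accents:
        stored, wanted = noaccent(stored), noaccent(wanted)
    if ignore_case:
        stored, wanted = stored.upper(), wanted.upper()
    return stored == wanted


# The stored value stays exactly as typed; only the comparison is relaxed.
print(matches("JOAO", "João", ignore_accents=True, ignore_case=True))
print(matches("JOAO", "João", ignore_accents=False, ignore_case=True))
```

Note that this sketch also folds 'å' to 'a', which is exactly the behaviour a Swedish application would not want. That is an argument for making the choice per use (or per column) rather than globally.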
From: Alexandre B. S. <ib...@th...> - 2004-06-28 22:24:18
|
Hi James!

James K. Lowden wrote:

> This is at least partly a function of the application.
>
>It's standard practice to use UPPER() to defeat case-sensitive searches.
>

As you know, if you use UPPER, you cannot use the index anymore. Maybe when indices based on functions are available the problem will go away, but I still have some doubts... see below.

>Perhaps a similar function "NOACCENT()", is needed for accents? I'm not
>familiar with languages whose accented characters are sometimes not
>considered distinct. (In Swedish, for example, 'a', and 'å' are two very
>different letters.) How is this solved by other implementations?
>

The NOACCENT function could solve the problem, but all queries would have to be changed to:

Select * from Customer where UPPER(NOACCENT(Name)) = UPPER(NOACCENT(:aValue));

If I have an index such as:

create index SK_Name on Customer UPPER(NOACCENT(Name));

this query will be indexed, but if I just mistype the order of the functions [NOACCENT(UPPER(Name))] the query will be "natural" again. Not a big problem if I just adopt a common way of using them...

In my main application this is no problem (I just change my data classes), but in third-party tools (Crystal Reports, for example) it becomes a problem... And in every ad hoc query, such as when you are doing some checking in ISQL, you have to put in the UPPER(NOACCENT(something)) to get the desired result. I think a more elegant approach is to define it in the table definition, so nobody has to remember to put the double function everywhere.

As you pointed out, this is not the default for every language, so a collation with this characteristic should be chosen by the developer when it applies to his needs.

Other implementations:
I will talk about MSSQL (version 6.5 is the last I have used, a really old one...).

I don't think that the MSSQL approach is the best one, but...
When you install the server, you choose which dictionary (I think that was the term used) you want to work with (it is very bad to have to choose a global one; in my opinion this should be at least per DB, and ideally per column!).

MSSQL has a similarity table, something like this:
A = a = Á = á = Ã = ã = À = à = Â = â
B = b
C = c = Ç = ç
D = d
E = e = É = é = Ê = ê

This table is used for comparisons. MSSQL states that when one uses a non-binary dictionary, speed could be 20% slower.

The approach specified in the Unicode docs Peter pointed out is much more powerful: generating the sort key as 3 or more 2-byte values and then comparing the sort key values seems good and powerful to me (more than one char could be mapped to one char, etc.). I don't know about the performance penalty, but since this will be the developer's choice and will not be a default for every DB around the world, I think I could expect (and be warned about) slower performance when I define a column as case/accent-insensitive data.

>It seems to me the problem with using collations that don't distinguish,
>say, between accented and non-accented characters is that the applications
>are then prevented from distinguishing, every time. Language constructs
>that allow per-use overrides are more flexible.
>

I think the collation should be defined on the column (as it is); I could have different columns that should behave differently. But as mentioned in the paper pointed out by Peter, the behaviour should be consistent with the cultural aspects of the user: even if Swedish people distinguish "a" and "å", if I (a Brazilian guy) were searching for Swedish names I would expect the mentioned chars to be close to each other, and if I just type "a" I expect the records with "å" to be found anyway. As I understand it, this should behave consistently across any table/column, so it should be based on the connection. But I don't have experience with this kind of problem...

>--jkl
>

see you !
-- Alexandre Benson Smith Development THOR Software e Comercial Ltda. Santo Andre - Sao Paulo - Brazil www.thorsoftware.com.br |
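The multi-level sort key idea mentioned in this message can be sketched roughly as follows. This is neither the Unicode Collation Algorithm nor Firebird's implementation; it is a simplified illustration with a hypothetical sort_key function that assumes three levels (base letters, then accents, then case) and uses code points where a real collation would use weight tables.

```python
import unicodedata


def sort_key(text: str):
    """Hypothetical three-level key: (base letters, accents, case)."""
    base, accents, case = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Level 2: attach the accent to the letter it follows.
            if accents:
                accents[-1] += (ord(ch),)
            continue
        base.append(ch.casefold())   # level 1: letter identity
        accents.append(())           # level 2: no accent seen yet
        case.append(ch.isupper())    # level 3: lower before upper
    return ("".join(base), tuple(accents), tuple(case))


words = ["JOÃO", "Joao", "João", "JOAO", "joao"]

# All variants share the same level-1 key, so a search restricted to
# level 1 treats them as equal ...
print({sort_key(w)[0] for w in words})

# ... while the full key still gives a stable, deterministic order.
print(sorted(words, key=sort_key))
```

The key carries three components per string, which may help explain the concern raised earlier in the thread about multi-level collations being wasteful on index key storage.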
From: Peter J. <pj...@wa...> - 2004-06-25 06:28:27
|
Hi Alexandre,

Alexandre Benson Smith <ib...@th...> wrote:

> >Then why don't you implement one or let one commercial programmer spend
> >four commercially paid hours to make your commercial application work
> >commercially?

> I have no knowledge of C or knew how FB works internally, I have never been
> involved in dbms design/coding, so I think I cannot do it by my self.

But adding a nocase/noaccent collation is simple. If it is so important commercially, why hasn't one of the commercial vendors added it for their own use as a competitive advantage? Isn't capitalism working in this case?

It's straightforward C code. It's documented. Samples exist.

> Just a side note:
> Don't take my words in any negative or destructive criticism, I don't
> have full control of the language, so I think sometimes my comments
> could be interpreted in the wrong way. :-)

And I hope you are not offended by my sometimes sarcastic postings.

> Does not matter if one user typed with accent and mixed case, the other type
> with accent in upper case, and a lot of records was pumped in the database
> from a DOS app that just uses upper ASCII, he can search it in any way.
>
> The ability to enter accent chars in mixed case make the reports more
> candy to read are with a professional look, when you send an invoice to a
> customer with all caps without accent it looks ugly.

I do understand now what you want, but I'm still perplexed about the situation. In my eyes, a mixture of all-uppercase and mixed-case, accented and pure ASCII variants in the database sounds like a data integrity nightmare and not like an improvement over all-ASCII uppercase.

Regards,
Peter Jacobi |
From: Alexandre B. S. <ib...@th...> - 2004-06-25 20:39:48
|
Hi Peter,

Peter Jacobi wrote:

>Hi Alexandre,
>
>But adding a nocase/noaccent collation is simple. If it is such important
>commercially, why hasn't one of the commercial vendors added it
>for their own use as a competitive advantage? Isn't capitalism working
>in this case?
>
>It's straightforward C code. It's documented. Samples exist.
>

Paulo Henrique Albanez already did it, and made it available to everyone, as source code or in binary format.

>And I hope you are not offended by my sometimes sarcastic postings.
>

As I said, I don't have full control of English, so your sarcastic comments don't offend me; perhaps I have not even noticed them ;-)

>I do understand now what you want, but I'm still perplexed about the
>situation. In my eyes, a mixture of all-uppercase and mixed case, accented
>and pure ASCII variants in the database sounds like a data integrity
>nightmare and not as an improvement over all ASCII uppercase.
>

OK... I will try to explain again...

The ideal situation is for the data to be in the "correct format"; the correct format is Proper Case with the correct accents.

Why don't the users use the correct format?

1.) Because they are lazy and think it is easier to just press the caps lock key and type, instead of pressing the shift key now and then.
2.) Because they just mistyped the word.
3.) Because they forgot (or don't know) how to write a word correctly (there are a bunch of rules that determine whether a word should or shouldn't have an accent in Portuguese).
4.) Part of the data was imported from an old system that only allows upper-case ASCII chars.
5.) Data exchange with other systems (even old COBOL systems) where only ASCII/upper case is allowed.

The problem is that one could correct a mistyped word when one finds it, or convert ALL CAPS to mixed case as the system is being used, but it is hard to tell someone to correct every one of the legacy records just so he can be sure he can find them.
So to make the data consistent, either you convert all your old records or you continue to use only ALL CAPS ASCII, which is very ugly.

The goal is that, when searching for customer names, one could just type

Viação Jaraguá
or
Viacao Jaragua
or
VIACAO JARAGUA
or
VIAÇÃO JARAGUÁ

and any of the above forms would find all the possible variations.

It will be more user friendly if, when the user forgets some accent while searching for a record, the "system" is "smart enough" to find the accented version of the same word, and vice versa.

If the system returns "no record found", the user will do one of these two things:
1.) He is a smart user and will check whether he typed correctly; he will notice that he missed an accent, correct the word and search again.
2.) He will assume that there is no such record and will try to insert it; in the next steps he will be blocked by some unique constraint in the database. So he will think, "why did the system not know that I just forgot an accent????"

Did you get the picture now?

>Regards,
>Peter Jacobi
>

See you !

--
Alexandre Benson Smith
Development
THOR Software e Comercial Ltda.
Santo Andre - Sao Paulo - Brazil
www.thorsoftware.com.br |
From: Carlos H. C. <fb...@wa...> - 2004-06-26 20:21:48
|
JKL>It's standard practice to use UPPER() to defeat case-sensitive
JKL>searches.

The major problem with using UPPER/NOACCENT is that it does not allow index usage in the search.

[]s
Carlos
http://www.warmboot.com.br
FireBase - http://www.FireBase.com.br |