Hello,
is there any setting, or would it be possible, to make search independent of diacritical marks?
My language uses them, and sometimes I write the entry title with diacritics and sometimes without...
Now when I search for something I usually type it without diacritics, and sometimes the search returns nothing because the entry has diacritics. A diacritic-insensitive search would make this easier.
For example, if I typed "cesky" it would also return "český" with diacritics, and vice versa.
Thanks.
For this, you can search using a regular expression [1]. For example, if you open the 'Find' dialog, turn on the 'Regular expression' checkbox and search for [cč]esk[yý], both entries with 'cesky' and 'český' will match.
Best regards,
Dominik
[1] https://msdn.microsoft.com/en-us/library/az24scfc.aspx
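To see what the character-class workaround does outside the 'Find' dialog, here is a minimal C# sketch using the .NET Regex class. The class name RegexSearchDemo and the sample titles are made up for illustration; only the pattern comes from the post above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexSearchDemo
{
    // Pattern from the post: each character class lists the plain letter
    // and its accented variant, so either spelling is accepted.
    public static readonly Regex Pattern = new Regex("[cč]esk[yý]");

    public static void Main()
    {
        // Hypothetical entry titles, for illustration only.
        string[] titles = { "cesky", "český", "polski" };
        foreach (string title in titles)
        {
            Console.WriteLine("{0}: {1}", title, Pattern.IsMatch(title));
        }
    }
}
```

Note that the character classes only match precomposed letters. A title stored in decomposed form (c followed by a separate combining caron) would not match this pattern.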
Last week, I wouldn't have had an opinion on this, but as it happens, I recently had trouble finding a record because I copied text from a Web site and didn't realize the copied text included an accented character.
I realize that it might be difficult to extend the standard MS RE interpreter in this way, but I would appreciate the addition of an accent-insensitive search, because the workaround given assumes that the user is aware of all the alternative characters that might be used. (In my case, I wasn't.)
This subject is... tricky. I was considering it for my AutoTypeSearch plugin, but I'm not sure there's a good way to do it.
"c" may well equal "č", but there's a lot that is not so simple. Should "ß" equal "ss" (for German), "sz" (for Austrian), "B" (because it looks like it)? If we allow "ß" == "ss" then presumably "ö" should match "oe". But only in German. In Swedish, "ö" is regarded as a separate letter and cannot be written as "oe". I guess you could match it against "o", in that case, although the Swedes might disagree.
I'm sure with enough time and effort a big list of equality rules could be developed, but frankly I'm not sure it's worth it.
Well this is certainly not a simple matter, but I think there are some approaches that might work in most cases. I'm no expert in Unicode and character encodings so I'll probably use the wrong terms in the following paragraphs, sorry for that.
For the diacritics, one solution could be normalizing the string to its Unicode canonical decomposition (e.g. the single č character is decomposed into the character c plus the caron diacritic) and then removing all diacritic characters from the normalized string before comparing. This way ö becomes o, č becomes c and ñ becomes n. This can be easily done using some .NET Framework string methods. In cases such as the Swedish ö or the Spanish ñ, the worst case is that a few spurious matches are returned as genuine, but they would be exactly the same matches as with the regex c[aá][nñ][oó], which involves more typing. I'm aware that removing diacritics can change (or completely strip) the meaning of a word, and my mother language is one where this happens, but I think the pros outweigh the cons in such cases.
For the problem of different representations of the same letter/phoneme, like the eszett, I think this is automatically handled by the .NET Framework if the system is configured to use the correct locale and the string comparison is done properly. When the CurrentCulture is set to de-DE, "ß".Equals("ss", StringComparison.CurrentCulture) is true.
Regards
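For anyone who wants to try the decompose-and-strip idea, here is a minimal sketch. It assumes String.Normalize and CharUnicodeInfo.GetUnicodeCategory are the ".NET Framework string methods" meant above. The names DiacriticsDemo, RemoveDiacritics and ContainsIgnoringDiacritics are made up for illustration and are not part of KeePass or any plugin.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticsDemo
{
    // Hypothetical helper: decompose, drop combining marks, recompose.
    public static string RemoveDiacritics(string text)
    {
        // FormD splits e.g. "č" into "c" followed by a combining caron.
        string decomposed = text.Normalize(NormalizationForm.FormD);
        var result = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Combining diacritics are classified as non-spacing marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                result.Append(c);
            }
        }
        return result.ToString().Normalize(NormalizationForm.FormC);
    }

    // Hypothetical search helper: strip both sides, then do a plain
    // case-insensitive substring search.
    public static bool ContainsIgnoringDiacritics(string title, string query)
    {
        return RemoveDiacritics(title)
            .IndexOf(RemoveDiacritics(query), StringComparison.OrdinalIgnoreCase) >= 0;
    }

    public static void Main()
    {
        Console.WriteLine(RemoveDiacritics("český"));                         // cesky
        Console.WriteLine(ContainsIgnoringDiacritics("Český účet", "cesky")); // True
    }
}
```

Letters without a canonical decomposition, such as ß, pass through unchanged, so this sketch covers only the diacritics half of the problem and not the "ß" versus "ss" case.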
You're right, there is better support in .NET for this than I was aware of. In particular, there is a CompareOptions.IgnoreNonSpace that appears to be designed for exactly this situation. I have updated AutoTypeSearch v0.11 to use this flag. If you'd like to give it a try, it's at https://sourceforge.net/projects/autotypesearch/
You're right. Though String.Compare is not aimed at checking if two strings are equal (that's what String.Equals is for), it lets you specify not only the culture used for the comparison but also to ignore diacritics and casing in one simple call. So calling String.Compare("Grüßen", "grussen", CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase) will return 0 (meaning the two strings are considered equal), which is what the OP asked for.
Regards
Last edit: Juan Porta 2019-01-19
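String.Compare tests whole strings, while a search usually needs a substring match. A possible sketch of that, assuming CompareInfo.IndexOf with the same CompareOptions flags is acceptable: the class and method names below are made up for illustration, and this is not necessarily how AutoTypeSearch or KeePass implement it.

```csharp
using System;
using System.Globalization;

public static class IgnoreNonSpaceDemo
{
    // Hypothetical helper: culture-aware substring search that ignores
    // combining marks (diacritics) and letter case.
    public static bool Contains(string title, string query, CultureInfo culture)
    {
        CompareOptions options = CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase;
        return culture.CompareInfo.IndexOf(title, query, options) >= 0;
    }

    public static void Main()
    {
        // Sample title and query are illustrative only.
        Console.WriteLine(Contains("Český účet", "cesky", CultureInfo.InvariantCulture));
    }
}
```

How pairs like "ß" and "ss" compare may differ between cultures and runtime versions, so it is worth testing with your own data before relying on it.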