Re: [Gramps-devel] Name sorting (Collation)


On 20 Jan 2013, at 18:00, John Ralls wrote:

>
> On Jan 20, 2013, at 8:19 AM, Tim Lyons <guy...@gm...> wrote:
>
>> Summary
>> -------
>>
>> Sorting is a mess because of trying to support different platforms
>> which have different results and different bugs.
>>
>> I have changed gramps34/trunk to use ICU/PyICU so we have a well
>> defined common sort for all platforms.
>>
>> In gramps34 the code falls back to the existing code if PyICU is not
>> there - I suggest that for trunk/gramps40 we only offer PyICU.
>>
>> I know that increasing the number of dependencies is generally
>> undesirable, but it seems that incorporating PyICU/ICU should be
>> pretty simple for all packagers, and I think this is justified by the
>> number of problems that it fixes. Hope this is OK.
>>
>>
>> Details
>> -------
>>
>> I have been trying to resolve a number of bugs relating to the sort
>> order for names etc. in both the UI and in NarWeb.
>>
>> At present, sorting gives many different results, for example for bug
>> 5088 [0] see the picture below.
>>
>> http://www.gramps-project.org/bugs/file_download.php?file_id=4653&type=bug
>>
>> The correct Unicode sorting order is the one with the heavy outline
>> (E&OE).
>>
>> The problem is that
>> - the specification of the locale,
>> - the determination of the locale,
>> - the interface to call the collation sort key routines,
>> - the collation sort key routines themselves,
>> - the (mainly undocumented) specification of the collation sort key
>> routines and
>> - the bugs in the collation sort key routines
>> are all platform dependent.
>>
>> Trying to fix the bugs is trying to hit a moving target, where the
>> target is moving in the above 6 dimensions. There are multiple
>> discussions on the internet about problems and bugs in sorting (see
>> for example [5-9])
>>
>> I have therefore changed to use the International Components for
>> Unicode [1] (ICU) and PyICU, which provide a widely portable
>> implementation of the Unicode Collation Algorithm [2] including the
>> Default Unicode Collation Element Table [3] (DUCET) as the data
>> specifying the default collation order for all Unicode characters and
>> the locale specific Unicode Common Locale Data Repository [4] (CLDR).
>>
>> This means that the interface, the (well defined) specification, the
>> implementation and the bugs will be the same for all platforms. I have
>> coded ICU as the principal option, with the existing code as the
>> fall-back option if PyICU is not found.
>>
>> This needs to be tested in Windows, and ICU/PyICU included in packages.
>>
>> Note that, at present, the GUI applies strxfrm to each character,
>> rather than to the name as a whole. The whole point of the locale
>> specific collation is that the key is derived from the name as a
>> whole, so I have changed this, except for Mac [8], where it seems that
>> strxfrm gives some pretty weird results for things like Greek
>> characters [9].
>>
>> [0] http://www.gramps-project.org/bugs/view.php?id=5088
>> [1] http://site.icu-project.org/
>> [2] http://www.unicode.org/reports/tr10/
>> [3] http://www.unicode.org/Public/UCA/latest/allkeys.txt
>> [4] http://cldr.unicode.org/
>> [5] http://stackoverflow.com/questions/3412933/python-not-sorting-unicode-properly-strcoll-doesnt-help
>> [6] http://code.activestate.com/recipes/576507-sort-strings-containing-german-umlauts-in-correct-/
>> [7] http://support.apple.com/kb/ta22935?viewlocale=en_us
>> [8] Although Apple is reported as a supporter of Unicode and ICU, the
>> supplied strxfrm (part of the C library) does not seem to be any good
>> - the Finder does sort correctly, and Apple suggest using
>> localizedStandardCompare if you want to sort like the Finder. But even
>> so, Apple do not seem to say whether this would sort according to the
>> published Unicode spec, and it would be yet another platform dependency.
>> [9] http://www.gramps-project.org/bugs/view.php?id=5645
>>
>>
>> I have put this discussion under [0]
>
>
> Apple uses ICU for all localization tasks. Of course it's not in libc,
> which is an implementation of the C Standard Library, ISO/IEC
> 9899:1999. It's in CoreFoundation [1]. Naturally, though, Apple has
> wrapped it in their own API and don't provide the ICU headers, so if
> you want to access the ICU bits natively you need to at least get the
> headers and link against /usr/lib/libicucore.dylib.

Yes, I was aware of localizedStandardCompare, not sure if that is the  
same as the library you are talking about. I decided I really didn't  
want to use that on the Mac because I couldn't see any guarantee that  
it was actually the same as (or even the same spec as) the ICU.  
Whereas with using ICU, I would know exactly what I was going to get,  
and would know it was the same on all platforms.

>
> Looking over your changeset in trunk [2], it appears that you've
> handled exactly one sorting case (albeit one that handles all of the
> live lists), but there are plenty of others scattered through the
> reports and associated dialogs.

Yes, I was aware of that, just wanted to fix the principle for a  
start, and the live lists are the ones that appear in the bugs (as far  
as I can see).
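
To illustrate the principle (that the key has to come from the name as a whole, not character by character), here is a toy sketch. It does not use ICU or strxfrm at all: the weight table and the function names are made up for the example, and it models only the Czech rule that "ch" is a single letter sorting after "h".

```python
# Toy collation table, NOT real ICU/strxfrm data: Czech treats the
# digraph "ch" as one letter that sorts after "h".
LETTERS = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
WEIGHT = {letter: rank for rank, letter in enumerate(LETTERS)}
UNKNOWN = len(LETTERS)  # anything not in the table sorts last


def per_character_key(name):
    """Key built one character at a time: it can never see 'ch'."""
    return tuple(WEIGHT.get(ch, UNKNOWN) for ch in name.lower())


def whole_string_key(name):
    """Key built from the whole name, so 'ch' is weighted as one letter."""
    lowered = name.lower()
    key = []
    i = 0
    while i < len(lowered):
        if lowered.startswith("ch", i):
            key.append(WEIGHT["ch"])
            i += 2
        else:
            key.append(WEIGHT.get(lowered[i], UNKNOWN))
            i += 1
    return tuple(key)


names = ["Chalupa", "Cerny", "Hora", "Dvorak"]
by_character = sorted(names, key=per_character_key)
by_whole_name = sorted(names, key=whole_string_key)
```

With the per-character key "Chalupa" lands among the C names; with the whole-name key it moves after "Hora", which is what this toy rule asks for. A real collator handles many more cases of this kind, which is why the key has to be requested for the complete string.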

>
> Rather than testing the environment variables, call
> locale.getlocale(locale.LC_COLLATE).
> Provide a locale argument so that reports that are writing in a
> different locale can specify what locale to use.

Not sure where I would provide such an argument. Help please.

>
> Maybe the whole thing should be moved into the GrampsLocale class with
> a sort_key (self, string, locale=None) to replace
> conv_unicode_tosrtkey() and conv_str_tosrtkey().

I meant to ask you to provide a function in GrampsLocale that would  
return the LC_COLLATE value, then I will use that in cast.py in trunk.  
Not sure about moving the whole thing into GrampsLocale.
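
For discussion, a rough sketch of the shape such a function might take, written as a plain function rather than a GrampsLocale method. This is not the code in the changeset. The argument name `collation` is my assumption (I avoided `locale` only because it would shadow the Python module), as is the "en_US" default used when no collation locale is set. The PyICU calls are Collator.createInstance() and getSortKey(); the fall-back is locale.strxfrm on the whole string.

```python
import locale

try:
    from icu import Collator, Locale  # PyICU, if it is installed
    HAVE_ICU = True
except ImportError:
    HAVE_ICU = False


def sort_key(string, collation=None):
    """Return a sort key computed from `string` as a whole.

    `collation` is a hypothetical optional locale name such as "sv_SE",
    so that a report written in another language can ask for that
    language's ordering.
    """
    if HAVE_ICU:
        # Use the caller's locale if given, otherwise the LC_COLLATE
        # setting; "en_US" is only an assumed last-resort default.
        name = collation or locale.getlocale(locale.LC_COLLATE)[0] or "en_US"
        # Creating a collator on every call is wasteful; real code
        # would cache one collator per locale name.
        collator = Collator.createInstance(Locale(name))
        return collator.getSortKey(string)
    # Fall-back: strxfrm follows the process-wide LC_COLLATE setting,
    # so `collation` cannot be honoured on this path.
    return locale.strxfrm(string)


surnames = ["Smith", "Jones", "Adams"]
ordered = sorted(surnames, key=sort_key)
```

A report would then call something like sorted(names, key=lambda n: sort_key(n, "sv_SE")), and the live lists would just use sort_key with no second argument.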

>
> As long as we're bringing in ICU (assuming no one chokes on it), we
> should replace the date and number formatters with ICU-based ones, at
> least in trunk. All localization except translations, in fact. That
> moves the maintenance to someone else's problem and I expect greatly
> expands the number of supported locales. That would, of course,
> require making PyICU a required dependency.

Yes, I was going to suggest that too. As I say I think PyICU should be  
a required dependency for gramps4.x onwards. The only thing I was not  
sure about was whether there are functions in Gramps to parse dates in  
different languages? I don't think there are such functions in ICU  
(only functions to go the other way)?

Thanks,
Tim.

>
> Regards,
> John Ralls
>
>
> [1] https://developer.apple.com/library/mac/#documentation/CoreFoundation/Reference/CFLocaleRef/Reference/reference.html
> [2] https://sourceforge.net/p/gramps/code/21175/
