From: Tim L. <guy...@gm...> - 2013-01-20 18:36:43
On 20 Jan 2013, at 18:00, John Ralls wrote:

>
> On Jan 20, 2013, at 8:19 AM, Tim Lyons <guy...@gm...> wrote:
>
>> Summary
>> -------
>>
>> Sorting is a mess because of trying to support different platforms
>> which have different results and different bugs.
>>
>> I have changed gramps34/trunk to use ICU/PyICU so we have a well
>> defined common sort for all platforms.
>>
>> In gramps34 the code falls back to the existing code if PyICU is not
>> there - I suggest for trunk/gramps40, we only offer PyICU.
>>
>> I know that increasing the number of dependencies is generally
>> undesirable, but it seems that incorporating PyICU/ICU should be
>> pretty simple for all packagers, and I think this is justified by the
>> number of problems that it fixes. Hope this is OK.
>>
>>
>> Details
>> -------
>>
>> I have been trying to resolve a number of bugs relating to the sort
>> order for names etc. in both the UI and in NarWeb.
>>
>> At present, sorting gives many different results, for example for bug
>> 5088 [0] see the picture below.
>>
>> http://www.gramps-project.org/bugs/file_download.php?file_id=4653&type=bug
>>
>> The correct Unicode sorting order is the one with the heavy outline
>> (E&OE).
>>
>> The problem is that
>> - the specification of the locale,
>> - the determination of the locale,
>> - the interface to call the collation sort key routines,
>> - the collation sort key routines themselves,
>> - the (mainly undocumented) specification of the collation sort key
>>   routines and
>> - the bugs in the collation sort key routines
>> are all platform dependent.
>>
>> Trying to fix the bugs is trying to hit a moving target, where the
>> target is moving in the above 6 dimensions.
>> There are multiple discussions on the internet about problems and
>> bugs in sorting (see for example [5-9]).
>>
>> I have therefore changed to use the International Components for
>> Unicode [1] (ICU) and PyICU, which provide a widely portable
>> implementation of the Unicode Collation Algorithm [2] including the
>> Default Unicode Collation Element Table [3] (DUCET) as the data
>> specifying the default collation order for all Unicode characters and
>> the locale specific Unicode Common Locale Data Repository [4] (CLDR).
>>
>> This means that the interface, the (well defined) specification, the
>> implementation and the bugs will be the same for all platforms. I
>> have coded ICU as the principal option, with the existing code as the
>> fall-back option if PyICU is not found.
>>
>> This needs to be tested in Windows, and ICU/PyICU included in
>> packages.
>>
>> Note that, at present, the GUI applies strxfrm to each character,
>> rather than to the name as a whole. The whole point of the locale
>> specific collation is that the key is derived from the name as a
>> whole, so I have changed this, except for Mac [8], where it seems
>> that strxfrm gives some pretty weird results for things like Greek
>> characters [9].
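
To make the per-character versus whole-name point concrete, here is a minimal sketch using only the standard library. The function names are made up for illustration, and it assumes the old per-character keys were simply concatenated, which may not match the actual GUI code.

```python
import locale


def per_character_key(name):
    """Old style (assumed): transform each character on its own, then join."""
    return "".join(locale.strxfrm(ch) for ch in name)


def whole_name_key(name):
    """New style: let the C library derive one key from the whole name."""
    return locale.strxfrm(name)
```

In the plain "C" locale both keys give the same order, so the difference only shows up after locale.setlocale(locale.LC_COLLATE, ...) selects a real collation locale. There, a whole-name key lets accent and case differences count only after the base letters of the entire name have been compared, which a per-character key cannot do.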
>>
>> [0] http://www.gramps-project.org/bugs/view.php?id=5088
>> [1] http://site.icu-project.org/
>> [2] http://www.unicode.org/reports/tr10/
>> [3] http://www.unicode.org/Public/UCA/latest/allkeys.txt
>> [4] http://cldr.unicode.org/
>> [5] http://stackoverflow.com/questions/3412933/python-not-sorting-unicode-properly-strcoll-doesnt-help
>> [6] http://code.activestate.com/recipes/576507-sort-strings-containing-german-umlauts-in-correct-/
>> [7] http://support.apple.com/kb/ta22935?viewlocale=en_us
>> [8] Although Apple is reported as a supporter of Unicode and ICU, the
>> supplied strxfrm (part of the C library) does not seem to be any good
>> - the Finder does sort correctly, and Apple suggest using
>> localizedStandardCompare if you want to sort like the Finder. But
>> even so, Apple do not seem to say whether this would sort according
>> to the published Unicode spec, and it would be yet another platform
>> dependency.
>> [9] http://www.gramps-project.org/bugs/view.php?id=5645
>>
>>
>> I have put this discussion under [0]
>
>
> Apple uses ICU for all localization tasks. Of course it's not in
> libc, which is an implementation of the C Standard Library, ISO/IEC
> 9899:1999. It's in CoreFoundation [1]. Naturally, though, Apple has
> wrapped it in their own API and don't provide the ICU headers, so if
> you want to access the ICU bits natively you need to at least get
> the headers and link against /usr/lib/libicucore.dylib.

Yes, I was aware of localizedStandardCompare, not sure if that is the
same as the library you are talking about. I decided I really didn't
want to use that on the Mac because I couldn't see any guarantee that
it was actually the same as (or even the same spec as) the ICU.
Whereas with ICU, I would know exactly what I was going to get, and
would know it was the same on all platforms.
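
For illustration, the "ICU first, existing code as fall-back" arrangement could look roughly like the sketch below. This is not the code from the changeset: make_sort_key and HAVE_ICU are hypothetical names, and the fall-back is reduced to a bare locale.strxfrm call.

```python
import locale

try:
    # PyICU is treated as optional, mirroring the gramps34 fall-back.
    from icu import Collator, Locale
    HAVE_ICU = True
except ImportError:
    HAVE_ICU = False


def make_sort_key(locale_name=None):
    """Return a function that maps a whole string to a collation sort key."""
    if HAVE_ICU:
        # Same collation rules on every platform (UCA plus CLDR tailoring).
        loc = Locale(locale_name) if locale_name else Locale.getDefault()
        collator = Collator.createInstance(loc)
        return collator.getSortKey
    # Fall-back: C library collation, which differs between platforms.
    return locale.strxfrm
```

A list would then be sorted with something like sorted(names, key=make_sort_key("sv_SE")). Keys from the two branches are of different types (bytes from ICU, str from strxfrm), so they should never be mixed within one sort.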
>
> Looking over your changeset in trunk [2], it appears that you've
> handled exactly one sorting case (albeit one that handles all of
> the live lists), but there are plenty of others scattered through
> the reports and associated dialogs.

Yes, I was aware of that, just wanted to fix the principle for a
start, and the live lists are the ones that appear in the bugs (as far
as I can see).

>
> Rather than testing the environment variables, call
> locale.getlocale(locale.LC_COLLATE).
> Provide a locale argument so that reports that are writing in a
> different locale can specify what locale to use.

Not sure where I would provide such an argument. Help please.

>
> Maybe the whole thing should be moved into the GrampsLocale class
> with a sort_key (self, string, locale=None) to replace
> conv_unicode_tosrtkey() and conv_str_tosrtkey().

I meant to ask you to provide a function in GrampsLocale that would
return the LC_COLLATE value, then I will use that in cast.py in trunk.
Not sure about moving the whole thing into GrampsLocale.

>
> As long as we're bringing in ICU (assuming no one chokes on it), we
> should replace the date and number formatters with ICU-based ones,
> at least in trunk. All localization except translations, in fact.
> That moves the maintenance to someone else's problem and I expect
> greatly expands the number of supported locales. That would, of
> course, require making PyICU a required dependency.

Yes, I was going to suggest that too. As I say I think PyICU should be
a required dependency for gramps4.x onwards. The only thing I was not
sure about was whether there are functions in Gramps to parse dates in
different languages? I don't think there are such functions in ICU
(only functions to go the other way)?

Thanks,
Tim.

>
> Regards,
> John Ralls
>
>
> [1] https://developer.apple.com/library/mac/#documentation/CoreFoundation/Reference/CFLocaleRef/Reference/reference.html
> [2] https://sourceforge.net/p/gramps/code/21175/
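
As a possible shape for the sort_key idea discussed above, here is a rough sketch, not the real GrampsLocale. The class name SortLocale is hypothetical, and the keyword is called locale_name instead of locale only to avoid shadowing the standard locale module.

```python
import locale

try:
    from icu import Collator, Locale  # PyICU, if installed
except ImportError:
    Collator = Locale = None


class SortLocale:
    """Hypothetical stand-in for the collation part of GrampsLocale."""

    def __init__(self, collation=None):
        # Ask the locale module instead of reading LANG/LC_* directly.
        # getlocale() returns (language code, encoding); the code is None
        # in the plain "C" locale, so fall back to "C" explicitly.
        self.collation = collation or locale.getlocale(locale.LC_COLLATE)[0] or "C"
        self._collators = {}  # one ICU collator per locale name

    def _collator_for(self, name):
        if Collator is None:
            return None
        if name not in self._collators:
            self._collators[name] = Collator.createInstance(Locale(name))
        return self._collators[name]

    def sort_key(self, string, locale_name=None):
        """Return a key for the whole string; locale_name overrides the default."""
        collator = self._collator_for(locale_name or self.collation)
        if collator is not None:
            return collator.getSortKey(string)
        # Without PyICU the override cannot be honoured: strxfrm always
        # uses the process-wide LC_COLLATE setting.
        return locale.strxfrm(string)
```

A report writing in a different locale could then pass its own locale per call, for example sort_key(name, locale_name="sv_SE"), while the live lists keep using the default picked up in the constructor.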