From: Michael S. <sm...@xm...> - 2005-10-30 13:05:31
Mark Davis <mar...@ic...> writes:

> You can build a collation engine that would take the CLDR rules plus
> UCA data with any (Turing-complete) programming language. It would,
> however, be a fairly sizable undertaking. So my recommendation would
> be to use the existing ICU (C or J) implementation.

Well, for the Java XSLT engine case, our group has already discussed
the idea of providing users with an option to have indexes sorted
using a method that relies on the Isogen I18N Support Library that
Eliot Kimber put together -

  http://www.innodata-isogen.com/resources/tools_downloads/i18nsupport

That in turn relies on ICU4J. At this point, it only works with
Saxon 6, but that's fine, because most DocBook users are using either
Saxon or xsltproc.

For the xsltproc case, I believe the libxslt source distro already
includes source for an xsltICUSortFunction. But it seems that it is
not part of the default build; it requires compiling libxslt with a
certain build flag set. There is a note in the source:

  xsltICUSort.c: module provided by Richard Jinks to provide a sort
  function replacement using ICU, it is not included in standard due
  to the size of the ICU library

That seems to imply that if you build libxslt with support for that
xsltICUSortFunction, it must statically link the ICU libraries. Why
it would need to do that, I do not know...

As an alternative, Jirka Kosek from our project opened a libxslt bug
about a year ago requesting that libxslt "use collating provided by
underlying OS or libc". But I don't think that has actually been
implemented in libxslt yet.

  --Mike

> Michael Smith wrote:
>
> >Has anyone implemented code in XSLT (with or without EXSLT
> >extension functions) for doing collation/sorting of XML content
> >using the CLDR collation data? (That is, the data in the files in
> >the common/collation/ directory in the CLDR CVS source tree.)
> >
> >If not, is there any documentation available (for example, a
> >functional specification or design specification) that describes
> >the behavior of an application which uses that data? Or is there a
> >reference implementation of some kind?
> >
> >The context for my question is: I am one of the developers
> >involved with the DocBook Project[1], and I am interested in
> >trying to see if we can find a way to make use of the CLDR
> >collation data in the part of the DocBook XSL stylesheets code
> >that deals with generating indexes.
> >
> >The DocBook XSL stylesheets make use of locale data (we currently
> >have locale files for about 60 locales/languages). Some of that
> >locale data is sort of DocBook-specific -- for example, for
> >generating localized text for the equivalents of "Table of
> >Contents", "Chapter", "Section", etc. -- but some of it is just
> >general locale data; for example, data for generating localized
> >date strings.
> >
> >In the case of the date-string data, a while back I revised our
> >build setup so that instead of having the date-string data
> >maintained in the source for our locale files, the build now picks
> >it up from the CLDR locale files. So we and our translators no
> >longer need to maintain it separately.
> >
> >But the bigger issue we have is with generating collated indexes.
> >The DocBook XSL stylesheets automatically generate indexes based
> >on instances of indexterm markers in DocBook XML source content.
> >However, XSLT 1.0 does not itself provide a means for doing
> >locale-aware collation, so we needed to add a means for handling
> >collation of indexterms in indexes for non-English locales.
> >
> >One of the project developers, Jirka Kosek, came up with a method.
> >It is described in a paper he presented at XML 2004[2].
> >
> >However, at the time he wrote it, he was not aware of the
> >availability of the CLDR collation data, and the method he
> >developed uses data in a form that is quite different from the
> >CLDR data (our data is basically just a numbered list of
> >characters covering all characters in the locale; characters that
> >should be grouped together share the same number).
> >
> >That method is so far supported for fewer than 10 or so of the 60
> >locales we have data for. (The reason is that to get it supported
> >in a particular locale, we need to ask our translators to add the
> >data in the numbered-list form that Jirka's method requires, and
> >we have so far not done that for many locales.)
> >
> >So, I think our project and our users would be much better off if
> >we could figure out a way to replace Jirka's method with one that
> >relies instead on the CLDR collation data.
> >
> >One big limitation we have is that the DocBook XSL stylesheets are
> >meant to be a "pure XSLT" solution that allows users to generate
> >HTML and XSL-FO output using any XSLT engine they choose (whether
> >that be a C-based one such as xsltproc/libxslt, a Java-based one
> >such as Saxon 6 or Xalan, or an engine implemented in any other
> >language).
> >
> >That said, we do already make use of EXSLT extensions to XSLT 1.0
> >that are widely supported in the most common XSLT engines (for
> >example, the EXSLT node-set() function). In fact, Jirka's current
> >index-collation method makes use of an EXSLT extension function.
> >
> >So, ideally, I would hope that any replacement method we came up
> >with would still be just XSLT+EXSLT-based, except that it would
> >use the CLDR data instead of our current ad-hoc system.
> >
> >  --Mike
> >
> >[Apologies for posting here if this is not the appropriate list
> >for questions of this type. I couldn't find a specific CLDR
> >mailing list.]
> >
> >[1] http://sourceforge.net/projects/docbook
> >    The current focus of the DocBook Project is work on a set of
> >    XSLT stylesheets for transforming DocBook XML source content
> >    into HTML and XSL-FO output.
> >
> >[2] "Using XSLT for getting back-of-the-book indexes"
> >    http://www.idealliance.org/proceedings/xml04/papers/77/xslindex.html

--
Michael Smith
http://sideshowbarker.net/
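P.S. For anyone who wants to see what locale-aware collation buys you
before wiring up ICU4J: the JDK's standard java.text.Collator offers the
same kind of comparison (ICU4J's com.ibm.icu.text.Collator is richer,
with full CLDR tailorings). A minimal sketch, with made-up sample terms,
contrasting naive code-point sorting against a Collator:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class CollationSketch {
    public static void main(String[] args) {
        // Made-up sample index terms; "Éclair" exposes the problem with
        // naive sorting, since U+00C9 has a higher code point than 'z'.
        List<String> terms = new ArrayList<>(
                List.of("zebra", "apple", "Éclair", "Apple"));

        // Naive sort: String.compareTo compares raw UTF-16 code units,
        // so "Éclair" lands after "zebra".
        List<String> naive = new ArrayList<>(terms);
        Collections.sort(naive);

        // Locale-aware sort: the Collator treats the accent and case as
        // secondary/tertiary differences, so "Éclair" sorts with the Es.
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        List<String> collated = new ArrayList<>(terms);
        collated.sort(collator);

        System.out.println("naive:    " + naive);
        System.out.println("collated: " + collated);
    }
}
```

At PRIMARY strength a Collator even treats "apple" and "Apple" as equal,
which is roughly the grouping behavior an index-generation step needs.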