
nfd/Armn.js which nfkc/Armn.js depends upon

2015-04-22
2015-05-06
  • Nathan Phillip Brink

    Hi, I checked out the trunk. I created my own meta-JavaScript file
    called blah.js to try experimenting with the collation API. I want to
    see how ridiculously big the full collation set is:

    $ cat blah.js
    /*
      !depends ilib-full-inc.js
    */
    /*
      !data collation/ducet
    */
    

    When I run jsa as instructed in the tutorial linked from
    https://sourceforge.net/p/i18nlib/wiki/installation/ , I get the
    following error:

    ohnob@WIN-F5A6PNAUAJ2 ~/repos/ilib/js/src
    $ ../../bin/jsa -o blah-compiled.js blah.js
    C:\Program Files (x86)\JDK/bin/java -cp ../../java/output/debug/*:../../java/lib:../../java/lib/* com.ilib.tools.jsa.JSAssemble -o blah-compiled.js blah.js
    DEBUG main class com.ilib.tools.jsa.JSAssemble - Current directory is c:\Users\ohnob\repos\ilib\js\src
     WARN main class com.ilib.tools.jsa.JSAssemble - Warning: unrecognized argument -o blah-compiled.js blah.js. Ignoring...
    DEBUG main class com.ilib.tools.jsa.JSAssemble - Searching directory .
    DEBUG main class com.ilib.tools.jsa.JSAssemble - Searching directory .\address
    DEBUG main class com.ilib.tools.jsa.JSAssemble - Searching directory .\address\test
    DEBUG main class com.ilib.tools.jsa.JSAssemble - Searching directory .\calendar
    

    … (snipped) …

     INFO main class com.ilib.tools.jsa.JSAssemble - Reading dependencies for file .\nfkc\Armn.js
    ERROR main class com.ilib.tools.jsa.JSAssemble - Cannot write to file iliball.js
    java.lang.Exception: Cannot find file nfd/Armn.js which .\nfkc\Armn.js depends upon.
            at com.ilib.tools.jsa.JSFile.find(JSFile.java:88)
            at com.ilib.tools.jsa.JSFile.process(JSFile.java:395)
            at com.ilib.tools.jsa.JSAssemble.main(JSAssemble.java:172)
    

    When I look at nfd/Latn.js, I see

    /* WARNING: THIS IS A FILE GENERATED BY gennorm.js. DO NOT EDIT BY HAND. */
    

    and I can see that there is, for example, an nfk/Armn.js file, so I
    assume nfd/Armn.js should be able to be generated from somewhere. I just don’t know how to trigger its generation.

    I have run ant and its report is BUILD SUCCESSFUL. svn says that there
    are a bunch of unversioned files (mostly the result of ant building
    things) and that I modified bin/jsa (I had to poke at it to get it to
    interact with my java/CLASSPATH stuff right). Otherwise, I’m at revision r1755.

    Is this a bug? Am I doing something obviously wrong in how I’m trying
    to invoke jsa?

     
  • Edwin H

    Edwin H - 2015-04-22

    Hi Nathan,

    The problem with the command line is the missing -I arguments to tell it where to look for things to include and -l to give the list of locales to use when looking for locale data. Take a look at the runassemble macro definition in the build.xml as an example of how to run the tool.

    Please note that despite what the documentation says, the DUCET collation stuff is not implemented yet. As you suspected, it turned out to be really huge. I should probably edit the docs to make that clear. I don't even know if it is useful, as collation with multiple languages is probably not that common, especially in Javascript. Large collations should be done on the database side for efficiency anyways. I was planning not to implement it unless I get some urgent requests about it, and focus more on expanding the locale-specific collations instead.

    Collations for a few languages are implemented in the trunk, but you'll probably be more interested in the "collation3" branch, where a few more have been implemented, including Korean, Japanese, and Chinese. I derived the data from the Unicode CLDR tailorings, so the results should be similar to (but not exactly the same as) what ICU produces.

    Here are the sizes of the compressed collation.json files in js/data/localetemp which will give you an idea of how big the collation data is:

    -rw-rw-r-- 1 edwin edwin   2473 Jan 25 13:01 collation.json (root - see below)
    -rw-rw-r-- 1 edwin edwin   2806 Jan 25 13:01 cs/collation.json (Czech)
    -rw-rw-r-- 1 edwin edwin   2624 Jan 25 13:01 da/collation.json (Danish)
    -rw-rw-r-- 1 edwin edwin   2462 Jan 25 13:01 de/collation.json (German)
    -rw-rw-r-- 1 edwin edwin   1357 Jan 25 13:01 el/collation.json (Greek)
    -rw-rw-r-- 1 edwin edwin   2458 Jan 25 13:01 es/collation.json (Spanish)
    -rw-rw-r-- 1 edwin edwin   2487 Jan 25 13:01 et/collation.json (Estonian)
    -rw-rw-r-- 1 edwin edwin   2756 Jan 25 13:01 fi/collation.json (Finnish)
    -rw-rw-r-- 1 edwin edwin   2356 Jan 25 13:01 fo/collation.json (Faroese)
    -rw-rw-r-- 1 edwin edwin   1076 Jan 25 13:01 he/collation.json (Hebrew)
    -rw-rw-r-- 1 edwin edwin  55157 Jan 25 13:01 ja/collation.json (Japanese including the kanas)
    -rw-rw-r-- 1 edwin edwin   7204 Jan 25 13:01 ko/collation.json (Korean)
    -rw-rw-r-- 1 edwin edwin   2902 Jan 25 13:01 lt/collation.json (Lithuanian)
    -rw-rw-r-- 1 edwin edwin   3101 Jan 25 13:01 lv/collation.json (Latvian)
    -rw-rw-r-- 1 edwin edwin   2644 Jan 25 13:01 nb/collation.json (Norwegian Bokmal)
    -rw-rw-r-- 1 edwin edwin   2646 Jan 25 13:01 nn/collation.json (Norwegian Nynorsk)
    -rw-rw-r-- 1 edwin edwin   2630 Jan 25 13:01 no/collation.json (Norwegian)
    -rw-rw-r-- 1 edwin edwin   4111 Jan 25 13:01 ru/collation.json (Russian)
    -rw-rw-r-- 1 edwin edwin   2526 Jan 25 13:01 sv/collation.json (Swedish)
    -rw-rw-r-- 1 edwin edwin   3510 Jan 25 13:01 zh/collation.json (Chinese pinyin + bopomofo)
    -rw-rw-r-- 1 edwin edwin 145162 Jan 25 13:01 zh/Hans/collation.json (Chinese Simplified)
    -rw-rw-r-- 1 edwin edwin 422533 Jan 25 13:01 zh/Hant/collation.json (Chinese Traditional)
    

    The root collation supports:

    af (Afrikaans)
    ca (Catalan)
    en (English)
    eu (Basque)
    gl (Galician)
    it (Italian)
    id (Indonesian)
    ms (Malay)
    nl (Dutch)
    pt (Portuguese)
    sw (Swahili)

    Hopefully that answers your question about the data size. Let me know if you need more information.

    For anyone else out there reading this, please respond here and let us know if you need collations for any other languages currently not listed above.

     
  • Nathan Phillip Brink

    Thanks for the information.

    Honestly, I am unlikely in the foreseeable future to need anything fancier than en-US Latin collation. I just wanted a locale/collation-based string matching tool that would guarantee the same results whether run on the server (the C#-hosted JavaScript engines I could find do not even include string.localeCompare()) or on the client (where the primary client is iOS, which doesn’t have a locale-aware string.localeCompare() or Intl.Collator.compare()). Even if the client had a native locale implementation, I might still want to use the reference JS implementation from ilib to increase the chances of getting the same results on the client as on the server.

    Since the provided ilib-full.js already has pretty good support for the Latin locales, I will just go with that for now. If I ever need to, I can look at building the heavier-duty collations and dynamic data loading. If I ever need better locale coverage, my app will know at initialization time what locales need to be loaded. I think the dynamic data loading should let me distribute a single build of my app which supports all the locales without forcing the client to actually load everything. And if I go that route, I can get the prebuilt dynamically-loadable data from the collation3 branch.

    Again, thanks for the information.

     
  • Edwin H

    Edwin H - 2015-04-25

    Actually, that is good to hear. That is one of the reasons for ilib: to have the same implementation that works the same way on multiple platforms. Just remember to pass in "useNative: false" to the collator constructor so that it doesn't use the Intl object when it detects that it is available.
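A minimal sketch of the "useNative: false" advice above. It assumes the collator constructor takes an options object with `locale` and `useNative` properties; the helper name `createPortableCollator` is hypothetical, and the constructor is passed in as a parameter because how it is exposed depends on which ilib build you load.

```javascript
// Sketch only. `Collator` is whatever collator constructor your ilib build
// exposes; it is assumed to accept an options object.
function createPortableCollator(Collator, locale) {
  return new Collator({
    locale: locale,
    // Ask for ilib's own JavaScript implementation even when the platform
    // provides an Intl object, so server and client sort the same way.
    useNative: false
  });
}
```

Creating every collator through one helper like this keeps the option from being forgotten on one side of a shared client/server code base.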

    BTW - in the "refactor2" branch, I am refactoring everything to be in a CommonJS format where each class is its own module and modules explicitly require() the things they depend on. In that sort of scenario, you don't need all of ilib just to do collation, which should be a little more efficient in terms of loading time and memory usage. That branch will be merged to trunk in the next week or so and I'll start making version 11 builds out of it.

     
  • Edwin H

    Edwin H - 2015-05-02

    Just a note to let you know the collation3 branch has been merged to trunk, and all those new collations will be in build 11.0.002 coming up soon.

     
  • Serkan Durusoy

    Serkan Durusoy - 2015-05-04

    Hmm, I wish I had read this thread earlier.

    I'm trying to develop a collation package for meteorjs (a Node-based reactive platform with isomorphic JavaScript and database APIs on both client and server). My goal is to give app developers a seamless and transparent way to store sortkeys in mongodb along with their documents, working around mongodb's lack of language-aware, case-insensitive sorting.

    To be able to do that, I need only the sortkey function (or perhaps some other pieces of the collator class), and it would be much better if I had it available (along with some predefined set of locale data) on both the server and the client.

     
  • Edwin H

    Edwin H - 2015-05-04

    Wow, well if that is what you are doing, then mongodb has a lot more internationalization problems than just proper locale-sensitive collation for indexes. I talked some time ago to one of the founders of MongoDB, Eliot Horowitz, (my wife worked at Mongo for some time) and he said basically that they have much bigger things to worry about than internationalization. He said that MongoDB already supports UTF-8 and that should be good enough for most applications.

    In reality, I think most applications that are truly international just work around the lack of proper i18n in MongoDB, which is what I think you are trying to do with Meteor.

    Anyways, if you are interested, send me an email to marketing@jedlsoft.com and I can explain the list of problems I see with it that need to be solved/worked around.

    (Maybe I should do a talk at Mongo World about this?)

     
  • Serkan Durusoy

    Serkan Durusoy - 2015-05-05

    I don't have much experience with mongodb, but in the Meteor world we are currently kind of stuck with it. Instead of complaining about its shortcomings, I tend to work around them and make the best use of what's available.

    Sorting is a big frustration shared by a lot of meteor devs and after reading the open issues on mongodb's own tracker, I came across a blog post at http://derickrethans.nl/mongodb-collation.html which led me to do the same with meteor.

    After shopping around for an i18n library (including the native js libs) I found out that ilib is the only one giving me sortkeys.

    So I first wrapped ilib https://atmospherejs.com/serkandurusoy/ilib to make it available on atmosphere, meteor's package repository. (Side note: if you want to take it over and publish it officially, I'd be happy to lend a hand. It is very straightforward and you can integrate it into your build process.)

    Now I am trying to implement the solution with hooks that wrap mongodb's insert/update methods and add sortkeys to the document. I'll also be wrapping find and findOne to sort by the stored sortkey instead of the field given as the sort key.
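The stored-sortkey idea can be sketched without any database. Everything named here is hypothetical: the `_sortkey` field suffix, the helper names, and the stand-in collator, which only lowercases and is not ilib. The real hooks would call the collator's sort key function in its place.

```javascript
// Returns a copy of `doc` with a sort key stored next to `field`.
// `collator` is assumed to expose sortKey(string) -> string.
function withSortKey(doc, field, collator) {
  var copy = Object.assign({}, doc);
  copy[field + "_sortkey"] = collator.sortKey(String(doc[field]));
  return copy;
}

// Sorts by the stored key using plain binary string comparison, which is
// all a database without locale-aware collation can offer.
function sortByStoredKey(docs, field) {
  var keyField = field + "_sortkey";
  return docs.slice().sort(function (a, b) {
    if (a[keyField] < b[keyField]) return -1;
    if (a[keyField] > b[keyField]) return 1;
    return 0;
  });
}

// Stand-in collator for demonstration ONLY: case-insensitive by lowercasing.
var demoCollator = {
  sortKey: function (s) { return s.toLowerCase(); }
};

var docs = [{ name: "banana" }, { name: "Cherry" }, { name: "apple" }].map(
  function (d) { return withSortKey(d, "name", demoCollator); }
);

var sorted = sortByStoredKey(docs, "name").map(function (d) { return d.name; });
console.log(sorted); // [ 'apple', 'banana', 'Cherry' ]
```

A binary sort on `name` itself would put "Cherry" first, because uppercase letters sort before lowercase ones. Sorting on the precomputed key avoids that, at the cost of recomputing the key whenever the field changes.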

    Since meteor is isomorphic javascript which has almost the same apis on both the client and the server, what would have been great is a small javascript sortkey library that works consistently on browser/client.

    I'll email you separately about mongodb i18n issues and thank you very much for offering the help.

    A talk sounds great!

     
  • Serkan Durusoy

    Serkan Durusoy - 2015-05-06

    Edwin, in light of your most recent versions shifting towards a module based build system, do you think we'll be able to have a lightweight collator class in the browser (with dynamic ad-hoc loading of locale data) any time soon?

     
