#916 Implement Lao word breaking

Milestone: future
Status: open
Owner: nobody
Labels: Lao (1)
Priority: 5
Updated: 2013-11-26
Created: 2013-11-13
Private: No

Is it possible to add support for Lao word breaking? Lao is similar to Thai in that it does not separate words with spaces; instead, spaces work much like punctuation between phrases. ICU for Java 52.1 includes word break iterator support for Lao and Thai, and may be the best option for providing that kind of support.
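
To illustrate what "word breaking" means for a script written without spaces, here is a minimal sketch of greedy longest-match segmentation against a word list. This is not ICU's algorithm (its dictionary-based break iterator is more sophisticated), and the names `segment` and `lexicon`, as well as the two-word sample lexicon, are assumptions made for illustration only.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation (illustrative sketch, not ICU's method)."""
    max_len = max(map(len, lexicon))
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink it step by step.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop makes progress
            # even when the text contains something the lexicon does not know.
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical two-word lexicon: "language" and "Lao".
sample_lexicon = {"ພາສາ", "ລາວ"}
sample_words = segment("ພາສາລາວ", sample_lexicon)
```

A greedy matcher like this can pick the wrong split when words overlap, which is one reason a maintained break iterator is likely the better choice in practice.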

Discussion

  • Didier Briel
    2013-11-18

    • summary: Lao --> Implement Lao word breaking
     
  • Didier Briel
    2013-11-18

    Is there a Hunspell dictionary for Lao?

    If yes, have you checked what happens with a tokenizer?

    Didier

     
  • Not that I know of. I was working on a spellcheck-like mechanism myself, but it was of no use without ICU 52.1, since it couldn't figure out what to do with a language that doesn't use spaces or anything else to separate words.

    Here is the dictionary file I helped make: http://code.google.com/p/lao-dictionary/source/browse/

    It's a word list with most Lao words (it's what we submitted to ICU).

    What would be needed to make a Hunspell dictionary? Would it detect word boundaries in Lao correctly?

     
  • Didier Briel
    2013-11-19

    What would be needed to make a Hunspell dictionary?

    No idea.

    Would it detect word boundaries in Lao correctly?

    It works for Khmer, but they insert zero-width spaces between words.

    Another thought: have you checked what might be done with Lucene? There is a Thai tokenizer. I haven't tested it, but detecting words without spaces works fine with Chinese, Japanese and Korean.

    Didier
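
The Khmer remark above comes down to this: when a text already carries zero-width spaces (U+200B) between words, a tokenizer only needs to split on that character. A minimal sketch, with `split_on_zwsp` as a hypothetical helper name:

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, invisible when rendered


def split_on_zwsp(text):
    """Split text on zero-width spaces and drop empty pieces."""
    return [piece for piece in text.split(ZWSP) if piece]


# With explicit markers, the word boundaries are recoverable.
marked = split_on_zwsp("ab" + ZWSP + "cd")
# Without markers, the whole run comes back as a single token.
unmarked = split_on_zwsp("abcd")
```

Text that lacks such markers, as described for Lao in this request, gives a splitter like this nothing to work with, so it would need a dictionary-based approach instead.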

     
  • I just submitted what could work to LibreOffice.

    https://gerrit.libreoffice.org/#/c/6774/

    Would this do the trick?

    I will check out Lucene and see what they have. I think it's based on
    ICU, if I am not mistaken. Will let you know.

    Respectfully,

    Robert M Campbell

    [feature-requests:#916]
    http://sourceforge.net/p/omegat/feature-requests/916/ Implement Lao
    word breaking

    Status: open
    Labels: Lao
    Created: Wed Nov 13, 2013 11:44 AM UTC by Robert M Campbell
    Last Updated: Mon Nov 18, 2013 01:19 PM UTC
    Owner: nobody

     

  • Didier Briel
    2013-11-25

    I just submitted what could work to LibreOffice.
    https://gerrit.libreoffice.org/#/c/6774/
    Would this do the trick?

    I don't know. What you submitted is a Hunspell dictionary, right?

    Have you tried it in OmegaT?

    Didier

     
  • Yes, it's Hunspell.

    Not yet, but I will test it soon. Are the dictionaries bundled with OmegaT,
    or do they have to be loaded from external sources?

    Respectfully,

    Robert M Campbell

     

  • Didier Briel
    2013-11-26

    Are the dictionaries bundled with OmegaT
    or do they have to be loaded from external sources?

    No dictionaries are provided with OmegaT.
    They can be loaded from an external source or simply copied into a local folder.

    Didier
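
For the "copied into a local folder" route, and for the earlier question of what a Hunspell dictionary needs, here is a minimal sketch that turns a plain word list into a `.dic`/`.aff` pair. It relies only on two facts about the Hunspell format: the `.dic` file starts with an entry count followed by one word per line, and the `.aff` file can declare the encoding with `SET UTF-8`. The helper name `write_hunspell_pair` and the base name `lo_LA` are assumptions, and whether OmegaT picks such a pair up, or whether it helps with Lao word boundaries, has not been tested in this thread.

```python
from pathlib import Path


def write_hunspell_pair(words, folder, name="lo_LA"):
    """Write name.dic and name.aff into folder from an iterable of words."""
    # Strip whitespace, drop empty entries, and remove duplicates.
    unique = sorted({w.strip() for w in words if w.strip()})
    out = Path(folder)
    out.mkdir(parents=True, exist_ok=True)

    # .dic: first line is the entry count, then one word per line.
    dic_path = out / f"{name}.dic"
    dic_path.write_text("\n".join([str(len(unique)), *unique]) + "\n", encoding="utf-8")

    # .aff: minimal affix file that only declares the encoding, no affix rules.
    aff_path = out / f"{name}.aff"
    aff_path.write_text("SET UTF-8\n", encoding="utf-8")
    return dic_path, aff_path
```

A pair like this only lists whole words. It carries no affix rules and does nothing by itself to split unspaced text, so word breaking would still have to come from a tokenizer.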