
#253 best way to process a large hindi wordlist for hunspell

Milestone: v1.0 (example)
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-12-16
Created: 2014-10-18
Creator: shreeshrii
Private: No

I want to create an affix file and dictionary from a large Hindi wordlist. I see a number of tools: munch, wordlist2hunspell, affixcompress, doubleaffixcompress, makealias, etc.

Which is the recommended tool for handling a large UTF-8 word list in a complex script like Devanagari, which is used for Hindi?

BTW, munch does not seem to handle UTF-8 for Devanagari and does not create an .aff file.

Thanks.


Discussion

  • Ruud Baars

    Ruud Baars - 2014-10-28

    I could help in return for learning a bit about the Hindi language.
    Contact me at info at taaltik.nl

     
  • Ruud Baars

    Ruud Baars - 2014-10-28

    A good start is to think about the suffix and prefix mechanisms the language has.
    It is better to start from the language logic than from pure statistics.
    E.g., writing abcd(ef|gh|je) means that the word abcdef is the base word, and abcdgh and abcdje are derivatives. There are probably multiple word types, each with its own mechanism.
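The notation above can be tried out on a few real words before writing any affix rules. The following is only a minimal sketch in Python; the function name `expand_alternatives` is hypothetical and the notation is assumed to be exactly one stem followed by one parenthesised group of `|`-separated endings.

```python
import re


def expand_alternatives(pattern):
    """Expand 'stem(a|b|c)' into [stem+a, stem+b, stem+c].

    The first element is treated as the base word, the rest as derivatives.
    """
    match = re.fullmatch(r"([^()|]*)\(([^()]*)\)", pattern)
    if match is None:
        raise ValueError("expected the form stem(ending|ending|...)")
    stem, endings = match.group(1), match.group(2).split("|")
    return [stem + ending for ending in endings]


base, *derivatives = expand_alternatives("abcd(ef|gh|je)")
print(base, derivatives)
```

Writing out a handful of such groups per word type makes it easier to see which endings recur and could become shared suffix rules.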

     
  • shreeshrii

    shreeshrii - 2014-11-02

    The wordlist is over 300,000 words, so it would take a while to apply the language logic. I was looking for a quick way to get an affix file, assuming that it will speed up the spellchecking.

    For Hindi, there is a minimal affix file in
    https://anishpatil.fedorapeople.org/hi_in.1.0.0.tar.gz

    I have used affixcompress and makealias on the wordlist; the generated .dic and .aff files for Hindi are linked from http://sanskritdocuments.org/hindi/hunspell
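Before feeding a raw word list to any of these tools, it usually helps to deduplicate and sort it. The sketch below shows one way to do that in Python and to emit a plain Hunspell .dic body, whose first line is the (approximate) entry count. The function name `build_dic` is hypothetical, and normalising to NFC is an assumption made here so that visually identical Devanagari words compare equal; it is not something the tools above are known to require.

```python
import unicodedata


def build_dic(words):
    """Return the text of a flag-less .dic file: count line, then sorted unique words."""
    # Assumption: NFC normalisation, so the same word typed two ways is stored once.
    cleaned = {unicodedata.normalize("NFC", word.strip()) for word in words}
    cleaned.discard("")  # drop blank lines
    entries = sorted(cleaned)
    return "\n".join([str(len(entries))] + entries) + "\n"


print(build_dic(["विजय", "दिन", "दिन", ""]))
```

The output can be written to a file with `encoding="utf-8"` and paired with an .aff file that declares `SET UTF-8`.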

     

    Last edit: shreeshrii 2014-11-03
  • Mashru

    Mashru - 2014-12-16

    sdk,
    I am working on a Gujarati dictionary. I have taken a different approach, and I don't think the large word list approach can work. Can you write to me at mashru2@gmail.com?

     
    • shreeshrii

      shreeshrii - 2014-12-17

      Please see https://github.com/Shreeshrii/hindi-hunspell
      for newer files regarding Hindi Hunspell


      • Mashru

        Mashru - 2014-12-18

        Dear sdk,

        I see you have done a lot of work to create a large corpus of Hindi words. Looking at the affix rules and dictionary, it appears that you have decided to enumerate compounds (दिन and विजय are both valid words in their own right and can also be joined to create दिनविजय) and noun forms. I count about 50 variations of the word लडका, -की in your dictionary file.

        I found two issues with this approach:
        1. Accuracy: Many everyday words used in newspapers, magazines or websites are not correct words per a Hindi dictionary. One needs to decide whether the spell checker should distinguish true Hindi words from commonly used incorrect ones. Two specific examples: regional variations of words in MP, Jharkhand, etc., and blatantly incorrect Hindi words used in Mumbai (I am from Mumbai, so I can vouch for their incorrectness). When trying to collect a large corpus, one ends up collecting incorrect words too. I am curious how you solved this issue.
        2. Hunspell rules make it extremely simple to handle compounding (दिनविजय) and gender, tense and postposition suffixes (लडकीका, -की, -मे). As you know, Indic languages allow a writer to create compound words, and it is just not possible to enumerate all possible compounds. The best we can do is to define which words can be used to create compound words and let Hunspell figure out the valid compounds. In the case of noun suffixes, twofold affix rules keep the affix file compact. I reproduce one such example for "overt gender marker nouns" like लडका. This set of rules will generate about 90 word forms from a noun like लडका. The affix file is for Gujarati nouns. I hope you can read that much Gujarati.
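The compounding idea in point 2 can be illustrated outside Hunspell. In Hunspell itself this is configured in the .aff file with compounding directives such as `COMPOUNDFLAG`; the Python below is only a sketch of the principle (mark which words may take part in compounds, then accept any word that splits entirely into such parts), not Hunspell's actual algorithm. The function name `is_compound` and the two-part minimum are assumptions.

```python
from functools import lru_cache


def is_compound(word, compoundable):
    """True if `word` is a concatenation of two or more words from `compoundable`."""

    @lru_cache(maxsize=None)
    def can_split(start, parts_so_far):
        if start == len(word):
            return parts_so_far >= 2
        return any(
            word[start:end] in compoundable
            and can_split(end, min(parts_so_far + 1, 2))
            for end in range(start + 1, len(word) + 1)
        )

    return can_split(0, 0)


parts = frozenset({"दिन", "विजय"})
print(is_compound("दिनविजय", parts))  # both parts are marked as compoundable
print(is_compound("दिन", parts))      # a single word is not a compound
```

With this approach the dictionary only needs दिन and विजय once each; the compound never has to be listed.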

        With these two items handled, I estimate my Gujarati .dic file has no more than 40,000 entries, but it can handle at least a million different word forms. And because it is a smaller dictionary, I can get scholars to review the root (or stem) words for accuracy.

        I am still going through the words, so I don't have a final version to share with you.

        It is possible you have gone too far in your approach to change. If so, I wish you the best. You are doing a great service to the language by providing a comprehensive spell checker, which has been missing for so long. I would be glad to help in any way I can.

        All best

        Raj

        SET UTF-8
        FLAG long

        # Noun, pronoun case markers
        # -એ, -ને, -થી, -માં, -માંથી, (-નો, -ની, -નું, -ના, -નાં)

        SFX AA Y 10
        SFX AA 0 એ
        SFX AA 0 ને
        SFX AA 0 થી
        SFX AA 0 માં
        SFX AA 0 માંથી
        SFX AA 0 નો
        SFX AA 0 ની
        SFX AA 0 નું
        SFX AA 0 ના
        SFX AA 0 નાં

        # Overt gender marker, number marker nouns
        # example: છોકર્
        # -ઓ, -આઓ, -ઇ, -ઇઓ, -ઉં, -આંઓ
        # special cases: -આ, -આં
        # masculine singular
        # masculine singular when adding case marker changes from -ઓ to -આ. Ex: છોકરો to છોકરાએ
        # masculine plural
        # feminine singular
        # feminine plural
        # neuter singular
        # neuter singular when adding case marker changes from -ઉં to -આ. Ex: છોકરું to છોકરાએ
        # This generates the same forms as masculine plural. Repeated for clarity
        # neuter plural

        SFX BA Y 9
        SFX BA  ્  ો
        SFX BA  ્  ા/AA
        SFX BA  ્  ાઓ/AA
        SFX BA  ્  ી/AA
        SFX BA  ્  ીઓ/AA
        SFX BA  ્  ું
        SFX BA  ્  ા/AA
        SFX BA  ્  ાંઓ/AA
        SFX BA  ્  ાં/AA
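To see what the two rule classes above produce together, the expansion can be simulated in a few lines of Python. This is a sketch of how a twofold suffix (a BA rule whose result may continue with an AA case marker) multiplies forms; it is not Hunspell's implementation, and the names `CASE_MARKERS`, `GENDER_NUMBER` and `expand` are hypothetical. The tables are transcribed from the SFX AA and SFX BA rules above.

```python
# SFX AA: case markers, appended without stripping anything.
CASE_MARKERS = ["એ", "ને", "થી", "માં", "માંથી", "નો", "ની", "નું", "ના", "નાં"]

# SFX BA: (strip, add, continuation flag), in the same order as the rules above.
GENDER_NUMBER = [
    ("્", "ો", None),
    ("્", "ા", "AA"),
    ("્", "ાઓ", "AA"),
    ("્", "ી", "AA"),
    ("્", "ીઓ", "AA"),
    ("્", "ું", None),
    ("्".replace("्", "્"), "ા", "AA"),  # repeated rule, kept as in the original
    ("્", "ાંઓ", "AA"),
    ("્", "ાં", "AA"),
]


def expand(stem):
    """Return the set of surface forms for a stem carrying flag BA."""
    forms = set()
    for strip, add, continuation in GENDER_NUMBER:
        if not stem.endswith(strip):
            continue  # rule does not apply to this stem
        first = stem[: len(stem) - len(strip)] + add
        forms.add(first)
        if continuation == "AA":
            # Second suffix level: every case marker attaches to the first form.
            forms.update(first + marker for marker in CASE_MARKERS)
    return forms


forms = expand("છોકર્")
print(len(forms))
print("છોકરાએ" in forms)
```

A single dictionary entry such as `છોકર્/BA` therefore stands in for all of these forms, which is why the stem list can stay small enough to be reviewed by hand.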
