Please see https://github.com/Shreeshrii/hindi-hunspell
for newer files regarding Hindi Hunspell

On Wed, Dec 17, 2014 at 12:45 AM, Mashru rmashruwala@users.sf.net wrote:

sdk,
I am working on a Gujarati dictionary. I have taken a different approach,
and I don't think the large word list approach can work. Can you write to me at
mashru2@gmail.com?

[bugs:#253] http://sourceforge.net/p/hunspell/bugs/253 best way to
process a large hindi wordlist for hunspell

Status: open
Group: v1.0 (example)
Created: Sat Oct 18, 2014 09:18 AM UTC by sdk
Last Updated: Sun Nov 02, 2014 02:29 PM UTC
Owner: nobody

I want to create an affix file and dictionary from a large Hindi wordlist.
I see a number of tools: munch, wordlist2hunspell, affixcompress,
doubleaffixcompress, makealias, etc.

Which is the recommended tool for handling a large word list in UTF-8 in
a complex script like Devanagari, which is used for the Hindi language?

BTW, munch does not seem to handle UTF-8 for Devanagari, and it does not
create an aff file.

Thanks.

Sent from sourceforge.net because you indicated interest in
https://sourceforge.net/p/hunspell/bugs/253/

To unsubscribe from further messages, please visit
https://sourceforge.net/auth/subscriptions/

Dear sdk,

I see you have done a lot of work to create a large corpus of Hindi words. Looking at the affix rules and dictionary, it appears that you have decided to enumerate compound words (दिन and विजय are both valid words in their own right and can also be joined to form दिनविजय) and noun forms. I count about 50 variations of the word लडका, -की in your dictionary file.

I found two issues with this approach:
1. Accuracy: Many everyday words used in newspapers, magazines or websites are not strictly correct per a Hindi dictionary. One needs to decide whether the spell checker should distinguish correct Hindi words from commonly used incorrect ones. Two specific examples: regional variations of words in MP, Jharkhand, etc., and blatantly incorrect Hindi words used in Mumbai (I am from Mumbai, so I can vouch for their incorrectness). When trying to collect a large corpus, one ends up collecting incorrect words too. I am curious how you solved this issue.
2. Hunspell rules make it extremely simple to handle compounding (दिनविजय) and gender, tense and postposition suffixes (लडकीका, -की, -मे). As you know, Indic languages allow a writer to create compound words, and it is simply not possible to enumerate all of them. The best we can do is define which words can take part in compounds and let Hunspell work out the valid compounds. For noun suffixes, two-fold affix rules keep the affix file compact. I reproduce one such example for "overt gender marker nouns" like लडका. This set of rules will generate about 90 word forms from a noun like लडका. The affix file is for Gujarati nouns. I hope you can read that much Gujarati.
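To make the compounding point concrete, here is a rough Python sketch of the idea: mark which words may take part in compounds and accept any word that splits cleanly into such parts. This is only a simplified model, not Hunspell's actual algorithm; the function name `splits_into_compound`, the `COMPOUND_PARTS` set and the `min_len` parameter (loosely modelled on Hunspell's COMPOUNDMIN option) are assumptions made for illustration.

```python
# Simplified model of "let the checker work out valid compounds".
# Not Hunspell's implementation; all names here are illustrative.

# Hypothetical set of stems that would carry a compound flag in the .dic file.
COMPOUND_PARTS = {"दिन", "विजय"}


def splits_into_compound(word, parts, min_len=3):
    """Return True if `word` is a concatenation of two or more entries of `parts`.

    `min_len` is the minimum length (in code points) of each part, an
    assumption loosely modelled on Hunspell's COMPOUNDMIN setting.
    """

    def helper(rest, count):
        if not rest:
            # The whole word was consumed; require at least two parts.
            return count >= 2
        for i in range(min_len, len(rest) + 1):
            if rest[:i] in parts and helper(rest[i:], count + 1):
                return True
        return False

    return helper(word, 0)


if __name__ == "__main__":
    for candidate in ["दिनविजय", "दिन", "दिनविज"]:
        print(candidate, splits_into_compound(candidate, COMPOUND_PARTS))
```

With this approach the dictionary only lists दिन and विजय once each; the compound दिनविजय never has to be enumerated.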

With these two items handled, I estimate my Gujarati dic file has no more than 40,000 entries, but it can handle at least a million different word forms. And because it is a smaller dictionary, I can get scholars to review the root (or stem) words for accuracy.

I am still going through the words, so I don’t have a final version to share with you.

It is possible you have gone too far with your approach to change now. If so, I do wish you the best. You are doing a great service to the language by providing a comprehensive spell checker that has been missing for so long. I would be glad to help in any way I can.

All best

Raj

SET UTF-8
FLAG long

Noun, pronoun case markers

-એ, -ને, -થી, -માં, -માંથી, (-નો, -ની, -નું, -ના, -નાં)

SFX AA Y 10
SFX AA 0 એ
SFX AA 0 ને
SFX AA 0 થી
SFX AA 0 માં
SFX AA 0 માંથી
SFX AA 0 નો
SFX AA 0 ની
SFX AA 0 નું
SFX AA 0 ના
SFX AA 0 નાં

Overt gender marker, number marker nouns

example: છોકર્

-ઓ, -આઓ, -ઇ, -ઇઓ, -ઉં, -આંઓ

special cases: -આ, -આં

masculine singular

masculine singular: when a case marker is added, the ending changes from -ઓ to -આ. Ex: છોકરો to છોકરાએ

masculine plural

feminine singular

feminine plural

neuter singular

neuter singular: when a case marker is added, the ending changes from -ઉં to -આ. Ex: છોકરું to છોકરાએ

This generates the same forms as the masculine plural. Repeated for clarity

neuter plural

SFX BA Y 9
SFX BA ્ ો
SFX BA ્ ા/AA
SFX BA ્ ાઓ/AA
SFX BA ્ ી/AA
SFX BA ્ ીઓ/AA
SFX BA ્ ું
SFX BA ્ ા/AA
SFX BA ્ ાંઓ/AA
SFX BA ્ ાં/AA
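For anyone who wants to see what the two-fold rules above produce, the following Python sketch expands the example stem છોકર્ through the BA rules and then, where a rule carries /AA, through the AA case markers. It is a simplified model of Hunspell's suffix expansion, not Hunspell's own code: conditions are ignored (the rules above have none), and the names `RULES`, `apply_rule` and `expand` are assumptions made for illustration. Duplicate forms (the -ા rule appears twice) collapse in the set, so the count it prints may differ from the rough estimate in the letter.

```python
# Simplified model of Hunspell two-fold suffix expansion (illustrative only).
# Each rule is (strip, add, continuation_flags); a "0" strip field is written "".

CASE_MARKERS = ["એ", "ને", "થી", "માં", "માંથી", "નો", "ની", "નું", "ના", "નાં"]

RULES = {
    # SFX AA: case markers, nothing stripped, no further continuation.
    "AA": [("", marker, ()) for marker in CASE_MARKERS],
    # SFX BA: gender/number endings; most may be followed by an AA case marker.
    "BA": [
        ("્", "ો", ()),
        ("્", "ા", ("AA",)),
        ("્", "ાઓ", ("AA",)),
        ("્", "ી", ("AA",)),
        ("્", "ીઓ", ("AA",)),
        ("્", "ું", ()),
        ("્", "ા", ("AA",)),
        ("્", "ાંઓ", ("AA",)),
        ("્", "ાં", ("AA",)),
    ],
}


def apply_rule(word, strip, add):
    """Strip `strip` from the end of `word` and append `add`; None if it does not fit."""
    if strip:
        if not word.endswith(strip):
            return None
        word = word[: len(word) - len(strip)]
    return word + add


def expand(stem, flag):
    """Return the set of forms generated from `stem` by `flag`, one level of continuation deep."""
    forms = set()
    for strip, add, continuation in RULES[flag]:
        first = apply_rule(stem, strip, add)
        if first is None:
            continue
        forms.add(first)
        for next_flag in continuation:
            for strip2, add2, _ in RULES[next_flag]:
                second = apply_rule(first, strip2, add2)
                if second is not None:
                    forms.add(second)
    return forms


forms = expand("છોકર્", "BA")

if __name__ == "__main__":
    print(len(forms), "distinct forms")
    for form in sorted(forms):
        print(form)
```

The point of the sketch is the ratio: a single dictionary entry plus 19 suffix rules stands in for dozens of enumerated word forms.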

On December 16, 2014 at 9:15:43 PM, sdk (shreeshrii@users.sf.net) wrote:

Please see https://github.com/Shreeshrii/hindi-hunspell
for newer files regarding Hindi Hunspell


[bugs:#253] best way to process a large hindi wordlist for hunspell

Status: open
Group: v1.0 (example)
Created: Sat Oct 18, 2014 09:18 AM UTC by sdk
Last Updated: Tue Dec 16, 2014 07:15 PM UTC
Owner: nobody

