I want to create an affix file and dictionary from a large Hindi word list. I see a number of tools: munch, wordlist2hunspell, affixcompress, doubleaffixcompress, makealias, etc.
Which is the recommended tool for handling a large UTF-8 word list in a complex script like Devanagari, which is used for the Hindi language?
BTW, munch does not seem to handle UTF-8 for Devanagari or create an .aff file.
Thanks.
I could help in return for learning a bit about the Hindi language.
Contact me at info at taaltik.nl
A good start is to think about the prefix and suffix mechanisms the language has.
It is better to start from the language logic than from pure statistics.
E.g. writing abcd(ef|gh|je) means the word abcdef is the base word, and abcdgh and abcdje are derivatives. There are probably multiple word types, each with its own mechanism.
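To make the notation concrete, here is a minimal Python sketch that expands a pattern such as abcd(ef|gh|je) into its base word and derivatives. The function name `expand_pattern` is made up for illustration; it is not part of Hunspell or any of the tools mentioned in this thread, and it only handles a single trailing group.

```python
import re

def expand_pattern(pattern):
    """Expand 'stem(a|b|c)' into ['stema', 'stemb', 'stemc'].

    Hypothetical helper for the notation used in this thread; a pattern
    without a trailing (...) group is returned unchanged as a single word.
    """
    match = re.fullmatch(r"([^()|]*)\(([^()]*)\)", pattern)
    if match is None:
        return [pattern]
    stem = match.group(1)
    endings = match.group(2).split("|")
    return [stem + ending for ending in endings]

words = expand_pattern("abcd(ef|gh|je)")
base_word, derivatives = words[0], words[1:]
print("base word:", base_word)
print("derivatives:", derivatives)
```

Writing the word list in this form first makes it easier to see which endings recur across many stems, and those are the candidates for affix rules.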
The word list has over 300,000 words, so it would take a while to apply the language logic. I was looking for a quick way to get an affix file, assuming that it would speed up the spell checking.
For Hindi, there is a minimal affix file in
https://anishpatil.fedorapeople.org/hi_in.1.0.0.tar.gz
I have used affixcompress and makealias on the word list; the generated .dic and .aff files for Hindi are linked from http://sanskritdocuments.org/hindi/hunspell
Last edit: shreeshrii 2014-11-03
sdk,
I am working on a Gujarati dictionary. I have taken a different approach, and I don't think a large word list approach can work. Can you write to me at mashru2@gmail.com?
Please see https://github.com/Shreeshrii/hindi-hunspell
for newer files regarding Hindi Hunspell
On Wed, Dec 17, 2014 at 12:45 AM, Mashru rmashruwala@users.sf.net wrote:
Dear sdk,
I see you have done a lot of work to create a large corpus of Hindi words. Looking at the affix rules and dictionary, it appears that you have decided to enumerate word compounds (दिन and विजय are both valid words in their own right and can also be joined to create दिनविजय) and noun forms. I count about 50 variations of the word लडका, -की in your dictionary file.
I found two issues with this approach:
1. Accuracy: Many everyday words used in newspapers, magazines or websites are not strictly correct according to a Hindi dictionary. One needs to decide whether the spell checker should distinguish true Hindi words from commonly used incorrect ones. I will give two specific examples: regional variations of words in MP, Jharkhand, etc., and the blatantly incorrect Hindi words used in Mumbai (I am from Mumbai, so I can vouch for their incorrectness). When trying to collect a large corpus, one ends up collecting incorrect words too. I am curious how you solved this issue.
2. Hunspell rules make it extremely simple to handle compounding (दिनविजय) and gender, tense, and postposition suffixes (लडकीका, -की, -मे). As you know, Indic languages allow a writer to create compound words, so it is simply not possible to enumerate all possible compounds. The best we can do is to define which words can take part in compounds and let Hunspell work out the valid ones. In the case of noun suffixes, twofold affix rules keep the affix file compact. I reproduce one such example below, for "overt gender marker nouns" like लडका. This set of rules will generate about 90 word forms from a noun like लडका. The affix file is for Gujarati nouns; I hope you can read that much Gujarati.
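As a rough illustration of the compounding idea, the Python sketch below marks a few words as allowed in compounds and then checks whether a longer word can be split entirely into such parts. It is an assumption-laden toy, not Hunspell's algorithm: in a real affix file this is normally expressed with the COMPOUNDFLAG directive and a flag on each dictionary entry. The function name `is_valid_compound` and the `min_parts` parameter are hypothetical.

```python
def is_valid_compound(word, compoundable, min_parts=2):
    """Return True if `word` splits completely into at least `min_parts`
    entries of `compoundable` (a set of non-empty strings).

    Toy model of compounding: only the stems are listed, and the
    compounds themselves are never enumerated.
    """
    def split_ok(rest, parts_so_far):
        if not rest:
            return parts_so_far >= min_parts
        return any(
            rest.startswith(part) and split_ok(rest[len(part):], parts_so_far + 1)
            for part in compoundable
        )

    return split_ok(word, 0)

# Two stems taken from the example in the message above.
compoundable_words = {"दिन", "विजय"}

print(is_valid_compound("दिन" + "विजय", compoundable_words))
print(is_valid_compound("दिन", compoundable_words))
```

The point of the design is that the dictionary grows with the number of stems, not with the number of possible combinations.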
With these two items handled, I estimate my Gujarati .dic file has no more than 40,000 entries, but it can handle at least a million different word forms. And because it is a smaller dictionary, I can get scholars to review the root (or stem) words for accuracy.
I am still going through the words, so I don't have a final version to share with you.
It is possible you have gone too far with your approach to change it. If so, I do wish you the best. You are doing a great service to the language by providing a comprehensive spell checker, which has been missing for so long. I would be glad to help in any way I can.
All best
Raj
SET UTF-8
FLAG long
# Noun, pronoun case markers
# -એ, -ને, -થી, -માં, -માંથી, (-નો, -ની, -નું, -ના, -નાં)
SFX AA Y 10
SFX AA 0 એ
SFX AA 0 ને
SFX AA 0 થી
SFX AA 0 માં
SFX AA 0 માંથી
SFX AA 0 નો
SFX AA 0 ની
SFX AA 0 નું
SFX AA 0 ના
SFX AA 0 નાં
# Overt gender marker, number marker nouns
# example: છોકર્
# -ઓ, -આઓ, -ઇ, -ઇઓ, -ઉં, -આંઓ
# special cases: -આ, -આં
# masculine singular
# masculine singular when adding case marker changes from -ઓ to આ. Ex: છોકરો to છોકરાએ
# masculine plural
# feminine singular
# feminine plural
# neuter singular
# neuter singular when adding case marker changes from -ઉં to આ. Ex: છોકરું to છોકરાએ
# This generates same forms as masculine plural. Repeated for clarity
# neuter plural
SFX BA Y 9
SFX BA ્ ો
SFX BA ્ ા/AA
SFX BA ્ ાઓ/AA
SFX BA ્ ી/AA
SFX BA ્ ીઓ/AA
SFX BA ્ ું
SFX BA ્ ા/AA
SFX BA ્ ાંઓ/AA
SFX BA ્ ાં/AA
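To see how the two rule groups combine, here is a small Python sketch that imitates the twofold expansion for the example stem છોકર્. It is only a model of the rules quoted above, not Hunspell itself: the tables `CASE_MARKERS` and `GENDER_NUMBER_ENDINGS` and the function `expand_noun` are names invented for this illustration, and the sketch assumes a dictionary entry of the form છોકર્/BA. It counts distinct strings from these two groups only, so the number it prints need not match the figure of about 90 forms mentioned in the message.

```python
# Second-level suffixes: the case markers of flag AA.
CASE_MARKERS = ["એ", "ને", "થી", "માં", "માંથી", "નો", "ની", "નું", "ના", "નાં"]

# First-level suffixes of flag BA: (ending, may a case marker follow?).
# The second field mirrors the "/AA" continuation flag in the rules.
GENDER_NUMBER_ENDINGS = [
    ("ો", False),
    ("ા", True),
    ("ાઓ", True),
    ("ી", True),
    ("ીઓ", True),
    ("ું", False),
    ("ા", True),   # repeated in the affix file "for clarity"
    ("ાંઓ", True),
    ("ાં", True),
]

VIRAMA = "\u0acd"  # Gujarati sign virama, stripped by every BA rule

def expand_noun(stem):
    """Return the set of distinct forms generated from a stem ending in virama."""
    if not stem.endswith(VIRAMA):
        raise ValueError("stem must end in a virama, e.g. the example stem above")
    base = stem[:-len(VIRAMA)]
    forms = set()
    for ending, takes_case_marker in GENDER_NUMBER_ENDINGS:
        first_level = base + ending
        forms.add(first_level)
        if takes_case_marker:
            forms.update(first_level + marker for marker in CASE_MARKERS)
    return forms

forms = expand_noun("છોકર્")
print(len(forms), "distinct forms from one dictionary entry")
```

One dictionary line plus nineteen suffix rules stands in for dozens of enumerated word forms, which is where the size reduction in the dictionary comes from.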
[bugs:#253] http://sourceforge.net/p/hunspell/bugs/253 best way to process a large hindi wordlist for hunspell
Status: open
Group: v1.0 (example)
Created: Sat Oct 18, 2014 09:18 AM UTC by sdk
Last Updated: Sun Nov 02, 2014 02:29 PM UTC
Owner: nobody