You usually use finite-state machines (xfst/OpenFst) for the last task you mention here. We've got a fairly comprehensive list of 'lemmas' for this in xfst, which we need to convert to a reusable resource...


On 07/06/2010 12:58 AM, Buddhika Laknath wrote:
Great, thanks for the patch, mate. There are sure to be many bugs, and 
we will need to make this stable before going public.

  
I played around with your app for a little bit. However, Unicode 
characters are not displayed in the Java GUI for some reason (my box 
is Ubuntu 9.10 x64 with Sun Java 6).
    
Did you try selecting a Unicode font from Edit > Preferences > Font? 
I'm using Iskola Potha, if that helps.

  
Anyway, the crawler is quite useful for getting a word list. There 
were some bugs in the crawler; I fixed some of them (the patch is 
attached). However, there is a bug in the Trie data structure which 
causes an ArrayOutOfBounds exception, so I simply disabled ommitWords. 
You might want to look into that.
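
Without seeing the Trie code, the cause is only a guess, but a common 
source of this kind of exception is a node that keeps its children in a 
fixed-size array indexed by a character offset (say, 26 slots for a-z), 
which overflows for Sinhala code points (U+0D80 to U+0DFF). A minimal 
sketch of the map-based alternative, written in Python for brevity 
rather than the app's Java; the class and method names are hypothetical 
and not taken from the app:

```python
class TrieNode:
    def __init__(self):
        # Children are keyed by character, so any code point is a valid key
        # and there is no array index to overflow.
        self.children = {}
        self.is_word = False


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            # Create the child node on first use, then descend into it.
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word
```

In Java, the equivalent change would be to key the children by 
Character in a Map instead of indexing into an array.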

The word list is not displayed properly because of the display bug I 
mentioned, but I managed to save the words to a file successfully. I 
used the Wikipedia home page (http://si.wikipedia.org/) and it yielded 
402 words :)
    
The crawler needs some more testing, and I'll look into this issue. 
Thanks for letting me know.

  
As I see it, there are a lot of people involved in Sinhala Unicode 
these days. So maybe we can create a simple wiki-like interface, get a 
huge word list, and use a crowdsourcing system to filter out the 
correctly spelled words.
    
Yes, I was also thinking of such a system, because frankly this is 
beyond the scope of one person or even a group. We can only provide 
tools and maybe an initial wordlist (combining freely available 
wordlists), and then it's up to everyone else to improve it.

Another major issue with current Sinhala wordlists is that they don't 
have all forms of words (e.g. verbs). So I'm trying to devise a way to 
create Hunspell rules easily through a UI (a table), so people can give 
a base word, generate all forms of that word, and add them to the 
list. Let's see how well it goes.
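
To make the table idea concrete, here is a rough sketch of how rows of 
such a table could be expanded into word forms and written out as 
Hunspell SFX lines. Each row holds what to strip, what to add, and the 
ending condition, as in a Hunspell .aff file. The names SuffixRule, 
generate_forms and to_aff_lines are hypothetical, and the rules are 
English placeholders, not real Sinhala morphology:

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SuffixRule:
    strip: str      # characters removed from the end of the base word ("" for none)
    add: str        # characters appended after stripping
    condition: str  # Hunspell-style ending condition; "." matches any ending


def apply_rule(word, rule):
    """Return the derived form, or None if the rule does not apply to word."""
    if not re.search("(?:" + rule.condition + ")$", word):
        return None
    if rule.strip and not word.endswith(rule.strip):
        return None
    stem = word[: len(word) - len(rule.strip)] if rule.strip else word
    return stem + rule.add


def generate_forms(word, rules):
    """Return the base word followed by every form the rules produce."""
    forms = [word]
    for rule in rules:
        form = apply_rule(word, rule)
        if form is not None and form not in forms:
            forms.append(form)
    return forms


def to_aff_lines(flag, rules):
    """Render the rules as one SFX block of a Hunspell .aff file."""
    lines = [f"SFX {flag} Y {len(rules)}"]
    for r in rules:
        # Hunspell writes "0" for an empty strip or add field.
        lines.append(f"SFX {flag} {r.strip or '0'} {r.add or '0'} {r.condition}")
    return lines


# Placeholder rules (English), standing in for rows entered in the UI table.
RULES = [
    SuffixRule("", "s", "[^y]"),
    SuffixRule("y", "ies", "[^aeiou]y"),
    SuffixRule("", "ed", "[^ey]"),
]

if __name__ == "__main__":
    print(generate_forms("walk", RULES))
    print("\n".join(to_aff_lines("A", RULES)))
```

The generated forms can be shown back to the contributor for checking 
before the base word is added to the .dic file with the rule's flag 
(for example walk/A).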

Cheers,
Laknath