you usually use finite-state machines (xfst/openfst)
to do the last task you mention here. we've got a fairly comprehensive
list of 'lemmas' for this in xfst which we need to convert to a
On 07/06/2010 12:58 AM, Buddhika Laknath wrote:
Great, thanks for the patch mate. There are sure to be many bugs, and
I will need to make this stable before going public.
I played around with your app for a little bit. However, Unicode
characters are not displayed in the Java GUI for some reason (my box
is Ubuntu 9.10 x64 with Sun Java 6).
Did you try selecting a Unicode font from Edit > Preferences > Font ?
I'm using Iskola Potha if that helps.
Anyway, the crawler is quite useful for getting a word list. There
were some bugs in the crawler, and I fixed some of them (the patch is
attached). However, there is a bug in the Trie data structure which
causes an ArrayOutOfBounds exception, so I simply disabled ommitWords.
You might want to look into that.
The word list is not displayed properly due to that display bug I
talked about - but I managed to save the words successfully into a
file. I crawled the Wikipedia home page (http://si.wikipedia.org/) and
got 402 words :)
The crawler needs some more testing, and I'll look into this issue.
Thanks for letting me know.
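
For reference, the word-list step discussed above (pull the Sinhala
words out of a crawled page, keep each one once) can be sketched as
below. This is only an illustration in Python, not the app's actual
crawler code: the name extract_words and the regular expression are
assumptions. It treats any run of characters from the Unicode Sinhala
block (U+0D80 to U+0DFF) as a word and allows a zero-width joiner
(U+200D) inside a word. The block also contains a few digits and
punctuation marks, so a real crawler would likely need a stricter
pattern.

```python
import re

# Assumption: a "word" is a run of characters from the Sinhala block,
# optionally containing zero-width joiners (U+200D) between such runs.
SINHALA_WORD = re.compile(r"[\u0D80-\u0DFF]+(?:\u200D[\u0D80-\u0DFF]+)*")


def extract_words(text):
    """Return the distinct Sinhala words in text, in first-seen order."""
    seen = {}  # dicts keep insertion order, so this doubles as an ordered set
    for match in SINHALA_WORD.finditer(text):
        seen.setdefault(match.group(0), None)
    return list(seen)


if __name__ == "__main__":
    sample = "Welcome සිංහල wiki, සිංහල විකිපීඩියා 2010"
    for word in extract_words(sample):
        print(word)
```

Latin text, digits and punctuation are skipped, and a repeated word is
reported once, which is the shape of list a later filtering or
crowd-sourced review step would start from.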
As I see it, there are a lot of people involved in Sinhala Unicode
these days. So maybe we can create a simple wiki-like interface, get a
huge word list and use a crowdsourcing system to filter out the
correctly spelled words.
Yes, I was also thinking of such a system, because frankly this is
beyond the scope of one person or even a group. We can only provide
tools and maybe an initial wordlist (combining freely available
wordlists), and then it's up to everyone else to improve it.
Another major issue with current Sinhala wordlists is that they don't
have all forms of words (e.g. verbs). So I'm trying to devise a way to
create Hunspell rules easily through a UI (table), so people can give
a base word, generate all the forms of that word and add them to the
list. Let's see how well it goes.
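
To make the idea concrete, here is a minimal sketch of expanding a base
word with Hunspell-style suffix rules. It is written in Python purely
for illustration and is not the planned UI or Hunspell itself: the
names SuffixRule and expand are hypothetical, the condition is treated
as an ordinary regular expression, and the rules are English
placeholders rather than real Sinhala morphology. Each rule mirrors the
fields of a Hunspell SFX line: what to strip from the end of the base
word, what to add, and a condition on the word's ending.

```python
import re
from typing import NamedTuple


class SuffixRule(NamedTuple):
    strip: str      # characters removed from the end of the base word ("" for none)
    add: str        # characters appended after stripping
    condition: str  # regex that the end of the base word must match


def expand(base, rules):
    """Return the base word followed by every form the rules generate."""
    forms = [base]
    for rule in rules:
        # The rule applies only if the word's ending satisfies the condition.
        if not re.search(f"(?:{rule.condition})$", base):
            continue
        # The characters to strip must actually be present at the end.
        if rule.strip and not base.endswith(rule.strip):
            continue
        stem = base[: len(base) - len(rule.strip)] if rule.strip else base
        form = stem + rule.add
        if form not in forms:
            forms.append(form)
    return forms


# Placeholder rules (English, for illustration only).
RULES = [
    SuffixRule("", "s", "[^sy]"),          # walk -> walks
    SuffixRule("y", "ies", "[^aeiou]y"),   # try  -> tries
    SuffixRule("", "ed", "[^ey]"),         # walk -> walked
]

if __name__ == "__main__":
    print(expand("walk", RULES))
    print(expand("try", RULES))
```

A table in the UI could map directly onto such rules, one row per
(strip, add, condition) triple, with the generated forms shown for
review before they are added to the wordlist.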