Thread: [sinhala-technical] hunspell-si
Brought to you by:
aratnaweera,
harshula
From: Harshula <har...@gm...> - 2009-11-28 15:50:17
|
Hi Sandaruwan, I noticed you created a word list which can be used for spell checking on GNU/Linux: https://addons.mozilla.org/en-US/firefox/addon/13981/ Would you be interested in maintaining a hunspell-si project? We have a CVS repository on sourceforge that you could use: http://sinhala.cvs.sourceforge.net/viewvc/sinhala/sinhala/spell/ cya, # |
From: Gihan D. <gi...@uo...> - 2009-11-28 16:08:57
|
On 28-11-2009 at 20:49, Harshula wrote: > Hi Sandaruwan, > > I noticed you created a word list which can be used for spell checking > on GNU/Linux: > https://addons.mozilla.org/en-US/firefox/addon/13981/ > > Would you be interested in maintaining a hunspell-si project? > > We have a CVS repository on sourceforge that you could use: > http://sinhala.cvs.sourceforge.net/viewvc/sinhala/sinhala/spell/ > One of our students, Laknath, is working on Sinhala support for hunspell. I will ask him to contact you. Gihan |
From: Harshula <har...@gm...> - 2009-11-29 13:55:30
|
Hi Sandaruwan & Buddhika, Could you both communicate with each other and decide how to proceed? I'd like to see the software maintained in a globally accessible repository so that the distros can package it. You are both welcome to use the sinhala.sourceforge.net CVS repository. cya, # On Sat, 2009-11-28 at 21:24 +0530, Gihan Dias wrote: > On 28-11-2009 at 20:49, Harshula wrote: > > Hi Sandaruwan, > > > > I noticed you created a word list which can be used for spell checking > > on GNU/Linux: > > https://addons.mozilla.org/en-US/firefox/addon/13981/ > > > > Would you be interested in maintaining a hunspell-si project? > > > > We have a CVS repository on sourceforge that you could use: > > http://sinhala.cvs.sourceforge.net/viewvc/sinhala/sinhala/spell/ > > > One of our students, Laknath, is working on Sinhala support for > hunspell. I will ask him to contact you. > > Gihan |
From: Buddhika L. <bla...@gm...> - 2009-11-29 19:07:57
|
Hi Harshula, First of all, thanks for the fast response. I know Sandaruwan through the local Twitter community and will discuss this with him and let you know how we'd go on this. I'd like to say a bit about my project goals. The original idea was to do a syntax editor using Hunspell (for both FF and OO) and to do a grammar validator for Sinhala. But since Sandaruwan has already done a FF addon, it may give me the chance to expand my original scope. However, I think you get the general idea. I'd like to use the current CVS repository as you have suggested, provided that it doesn't introduce any issues for the university in evaluating my project, and I'll let you know on this. But again, thanks for the kind offer. Cheers, Laknath PS: As an afterthought, this was my group's university project last year. It was to add more support for Java, including a Locale and an English to Sinhala phonetic IME. You can get it from this Git repo - http://github.com/buddhika/java_sinhalization Harshula wrote: > Hi Sandaruwan & Buddhika, > > Could you both communicate with each other and decide how to proceed? > I'd like to see the software maintained in a globally accessible > repository so that the distros can package it. You are both welcome to > use the sinhala.sourceforge.net CVS repository. > > cya, > # > > On Sat, 2009-11-28 at 21:24 +0530, Gihan Dias wrote: > >> On 28-11-2009 at 20:49, Harshula wrote: >> >>> Hi Sandaruwan, >>> >>> I noticed you created a word list which can be used for spell checking >>> on GNU/Linux: >>> https://addons.mozilla.org/en-US/firefox/addon/13981/ >>> >>> Would you be interested in maintaining a hunspell-si project? >>> >>> We have a CVS repository on sourceforge that you could use: >>> http://sinhala.cvs.sourceforge.net/viewvc/sinhala/sinhala/spell/ >>> >>> >> One of our students, Laknath, is working on Sinhala support for >> hunspell. I will ask him to contact you. >> >> Gihan >> > > -- I blog at http://mytechgossips.com |
From: Harshula <har...@gm...> - 2010-07-04 16:30:11
|
Hi Sandaruwan, I've been informed that the word list from EnSiTip had a licensing problem. Have you found a replacement word list? cya, # On Sun, 2009-11-29 at 02:19 +1100, Harshula wrote: > Hi Sandaruwan, > > I noticed you created a word list which can be used for spell checking > on GNU/Linux: > https://addons.mozilla.org/en-US/firefox/addon/13981/ > > Would you be interested in maintaining a hunspell-si project? > > We have a CVS repository on sourceforge that you could use: > http://sinhala.cvs.sourceforge.net/viewvc/sinhala/sinhala/spell/ > > cya, > # |
From: Sandaruwan G. <san...@gu...> - 2010-07-04 16:32:06
|
What about the Sinhala word list on the UCSC language lab page? http://www.ucsc.cmb.ac.lk/ltrl/?page=downloads I switched the word list to that in spellchecker version 0.2. Regards, Sandaruwan On Sun, Jul 4, 2010 at 9:49 PM, Harshula <har...@gm...> wrote: > Hi Sandaruwan, > > I've been informed that the word list from EnSiTip had a licensing > problem. Have you found a replacement word list? > > cya, > # > > On Sun, 2009-11-29 at 02:19 +1100, Harshula wrote: > > Hi Sandaruwan, > > > > I noticed you created a word list which can be used for spell checking > > on GNU/Linux: > > https://addons.mozilla.org/en-US/firefox/addon/13981/ > > > > Would you be interested in maintaining a hunspell-si project? > > > > We have a CVS repository on sourceforge that you could use: > > http://sinhala.cvs.sourceforge.net/viewvc/sinhala/sinhala/spell/ > > > > cya, > > # > > > -- Best Regards, Sandaruwan Gunathilake |
From: Harshula <har...@gm...> - 2010-07-04 18:28:01
|
Hi Sandaruwan, On Sun, 2010-07-04 at 22:01 +0530, Sandaruwan Gunathilake wrote: > What about the sinhala words list on UCSC language lab page? > > http://www.ucsc.cmb.ac.lk/ltrl/?page=downloads > > I switched the word list to that in spellchecker version 0.2. The LTRL word list states it has 70142 distinct Sinhala words. si-LK.dic appears to have 26707 words. Did you take a subset of the words from the LTRL word list? cya, # |
From: Sandaruwan G. <san...@gu...> - 2010-07-04 19:30:27
|
Hi, On Sun, Jul 4, 2010 at 11:57 PM, Harshula <har...@gm...> wrote: > Hi Sandaruwan, > > On Sun, 2010-07-04 at 22:01 +0530, Sandaruwan Gunathilake wrote: > > What about the sinhala words list on UCSC language lab page? > > > > http://www.ucsc.cmb.ac.lk/ltrl/?page=downloads > > > > I switched the word list to that in spellchecker version 0.2. > > The LTRL word list states it has 70142 distinct Sinhala words. si-LK.dic > appears to have 26707 words. Did you take a subset of the words from the > LTRL word list? > No, everything is there. I just compressed the word list with the "affixcompress" utility and added a few extra rules at the top of the .aff file to support "ණ/න/ල/ළ", etc. -- Best Regards, Sandaruwan Gunathilake |
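To see why a 26707-entry .dic can still cover all 70142 words: affix compression stores a stem once and tags it with flags that point at rules in the .aff file, so one entry can stand for several surface forms. The toy sketch below is only an illustration of that idea. The stems, flags and rules are made-up ASCII placeholders (not taken from si-LK.dic), and the expansion logic is a simplification, not Hunspell's actual algorithm.

```python
# Toy illustration of affix compression; NOT Hunspell's real algorithm.
# All stems, flags and suffixes here are hypothetical placeholders.

# Hypothetical suffix rules: flag -> list of (strip, add) pairs,
# loosely modelled on Hunspell "SFX flag strip add condition" lines.
SUFFIX_RULES = {
    "A": [("", "s"), ("", "ed")],
    "B": [("e", "ing")],
}

# Hypothetical .dic-style entries: "stem/FLAGS" or a bare word.
DIC_ENTRIES = ["walk/A", "bake/B", "tree"]


def expand(entry):
    """Return every surface form that one .dic-style entry stands for."""
    stem, _, flags = entry.partition("/")
    forms = [stem]
    for flag in flags:
        for strip, add in SUFFIX_RULES.get(flag, []):
            if strip and not stem.endswith(strip):
                continue  # this rule does not apply to the stem
            base = stem[: len(stem) - len(strip)] if strip else stem
            forms.append(base + add)
    return forms


all_forms = [form for entry in DIC_ENTRIES for form in expand(entry)]
print(len(DIC_ENTRIES), "entries expand to", len(all_forms), "words")
```

Counting the lines of a compressed .dic therefore undercounts the words it accepts; expanding the entries first (Hunspell ships an unmunch tool for that) gives the comparable figure.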
From: Harshula <har...@gm...> - 2010-07-05 15:04:10
|
Hi Sandaruwan, On Mon, 2010-07-05 at 00:59 +0530, Sandaruwan Gunathilake wrote: > No, everything is there. I just used compressed the words list with > "affixcompress" utility and added few extra rules at the top of .aff > file to support "ණ/න/ල/ළ", etc. Thanks for the explanation, I was unaware of this. cya, # |
From: Harshula <har...@gm...> - 2010-07-05 17:36:37
|
Hi Laknath, Good to know! Feel free to update us on what you are working on. You may want to mention it on the sinhala-unicode google group too, if you haven't already. cya, # On Mon, 2010-07-05 at 21:08 +0530, Buddhika Laknath wrote: > Hi guys, > > I apologize for getting in the middle of a discussion. But I thought > this project which I have been doing as a university project would be > helpful here. > It's intended as a UI for Hunspell but targeting mainly Sinhala (but > usable for other languages as well). You can easily crawl, harvest words > consisting of characters within a given range, process between two > wordlists (diff, merge, add, remove words), munch and unmunch wordlists > back and forth to Hunspell format at the moment > > Git repo - http://github.com/buddhika/Sinhala-Dictionary-Tools > > Sadly I couldn't do any documentation so if you would like to use it to > improve current wordfiles and face any trouble just let me know. Anyway, > I'd add documentation soon enough. > > Apologize again if this seems like a shameless advertising but I just > want to have some input and actually use this to improve current Sinhala > dictionaries. > > Cheers, > Laknath |
From: Buddhika L. <bla...@gm...> - 2010-07-05 19:00:13
|
Thanks Harshula. I'll put a message on the sinhala-unicode group as well when the software is stable enough (and documented) to be used by non-techies, or people will feel like banging their heads :) Also, there is quite a bit that could be added to the software, and the ultimate goal is a Sinhala word list accurate enough that people will use it by default, as they do for English. But that will need lots of love from the community to happen. Thanks again for the support. Cheers, Laknath Harshula wrote: > Hi Laknath, > > Good to know! Feel free update us on what you are working on. You may > want to mention it on the sinhala-unicode google group too, if you > haven't already. > > cya, > # > > On Mon, 2010-07-05 at 21:08 +0530, Buddhika Laknath wrote: > >> Hi guys, >> >> I apologize for getting in the middle of a discussion. But I thought >> this project which I have been doing as a university project would be >> helpful here. >> It's intended as a UI for Hunspell but targeting mainly Sinhala (but >> usable for other languages as well). You can easily crawl, harvest words >> consisting of characters within a given range, process between two >> wordlists (diff, merge, add, remove words), munch and unmunch wordlists >> back and forth to Hunspell format at the moment >> >> Git repo - http://github.com/buddhika/Sinhala-Dictionary-Tools >> >> Sadly I couldn't do any documentation so if you would like to use it to >> improve current wordfiles and face any trouble just let me know. Anyway, >> I'd add documentation soon enough. >> >> Apologize again if this seems like a shameless advertising but I just >> want to have some input and actually use this to improve current Sinhala >> dictionaries. >> >> Cheers, >> Laknath >> > > -- I blog at http://mytechgossips.com |
From: Gihan D. <gi...@cs...> - 2010-07-19 06:25:27
|
Conference on Localised Systems and Applications - 2010 (CLSA 2010)
20-21 August
Centre of Excellence on Localised Applications (LAKapps Centre), University of Moratuwa, Sri Lanka

Call for Presentations

The Conference on Localised Systems and Applications will be held on 20-21 August, 2010. This will provide people working in local language computing an opportunity to present and discuss their work with others in the field.

You are invited to present your original work in the areas of localised systems and applications including, but not limited to:
• local language support in hardware and operating systems
• localised software
• local language content and
• promotion and dissemination of local language computing.

If you would like to present your work at this symposium, please submit the title and a short (1-page) summary of your presentation by email to pa...@la... by 31st July, 2010. |
From: Buddhika L. <bla...@gm...> - 2010-07-05 19:28:12
|
Great, thanks for the patch mate. There are sure to be many bugs and I will need to make this stable before going public. > I played around with your app for a little bit. However, unicode > characters are not displayed in the java GUI for some reason (my box > is ubuntu 9.10 x64 with sun java6). Did you try selecting a Unicode font from Edit > Preferences > Font? I'm using Iskoola Pota if that helps. > Anyway, the crawler is quite useful for getting a word list. There > were some bugs with the crawler, I fixed some (the patch is attached) > - However, there is a bug in the Trie data structure which causes an > ArrayOutOfBounds exception, I simply disabled ommitWords. you might > want to check into that. > > The word list is not displayed properly due to that display bug I > talked about - but I managed to save the words successfully into a > file. I used the wikipedia home page (http://si.wikipedia.org/) and > there are 402 words :) The crawler needs some more testing and I'll look into this issue. Thanks for letting me know. > As I see, there are lot of people involved in sinhala unicode these > days. So, may be we can created a simple wiki like interface, get a > huge word list and use a crowd sourcing system for filtering out the > correctly spelled words. Yes, I was also thinking of such a system, because frankly this is beyond the scope of one person or even a group. We can only provide tools and maybe an initial wordlist (combining freely available wordlists), and then it's up to everyone else to improve it. Another major issue with current Sinhala wordlists is that they don't have all forms of words (e.g. verbs). So I'm trying to devise a way to create Hunspell rules easily through a UI (table), so people can give a base word, generate all forms of that word and add them to the list. Let's see how well it goes. Cheers, Laknath -- I blog at http://mytechgossips.com |
From: Ruvan W. <ar...@uc...> - 2010-07-06 15:41:05
|
you usually use finite-state machines (xfst/openfst) to do the last task you mention here. we've got a fairly comprehensive list of 'lemmas' for this in xfst which we need to convert to a re-usable resource... On 07/06/2010 12:58 AM, Buddhika Laknath wrote: > Great, thanks for the patch mate. There sure to be many bugs and will > need to make this stable before going public. > > >> I played around with your app for a little bit. However, unicode >> characters are not displayed in the java GUI for some reason (my box >> is ubuntu 9.10 x64 with sun java6). >> > Did you try selecting an Unicode font from Edit> Preferences> Font ? > I'm using Iskola Potha if that helps. > > >> Anyway, the crawler is quite useful for getting a word list. There >> were some bugs with the crawler, I fixed some (the patch is attached) >> - However, there is a bug in the Trie data structure which causes an >> ArrayOutOfBounds exception, I simply disabled ommitWords. you might >> want to check into that. >> >> The word list is not displayed properly due to that display bug I >> talked about - but I managed to save the words successfully into a >> file. I used the wikipedia home page (http://si.wikipedia.org/) and >> there are 402 words :) >> > Crawler needs some more testing and I'll look into this issue. Thanks > for letting me know. > > >> As I see, there are lot of people involved in sinhala unicode these >> days. So, may be we can created a simple wiki like interface, get a >> huge word list and use a crowd sourcing system for filtering out the >> correctly spelled words. >> > Yes, I was also thinking of such a system because frankly this is out of > the scope of one person or even a group. We can only provide tools and > may be an initial wordlist (combining freely available wordlists) and > then it's up to all others to make it improve. > > Another major issue with current Sinhala wordlists is that they don't > have all forms of words (ex: verbs). 
So I'm trying to device a way to > make it possible to create Hunspell rules easily by using a UI (table) > so people can give a base word and then generate all combinations of a > word and add it the the list. Let's see how well it goes. > > Cheers, > Laknath > > |
From: Buddhika L. <bla...@gm...> - 2010-07-05 15:38:47
|
Hi guys, I apologize for getting in the middle of a discussion. But I thought this project, which I have been doing as a university project, would be helpful here. It's intended as a UI for Hunspell, targeting mainly Sinhala (but usable for other languages as well). At the moment you can crawl, harvest words consisting of characters within a given range, process two wordlists (diff, merge, add, remove words), and munch and unmunch wordlists back and forth to Hunspell format. Git repo - http://github.com/buddhika/Sinhala-Dictionary-Tools Sadly I couldn't do any documentation, so if you would like to use it to improve current word files and face any trouble, just let me know. Anyway, I'll add documentation soon enough. Apologies again if this seems like shameless advertising, but I just want to have some input and actually use this to improve current Sinhala dictionaries. Cheers, Laknath Harshula wrote: > Hi Sandaruwan, > > On Mon, 2010-07-05 at 00:59 +0530, Sandaruwan Gunathilake wrote: > > >> No, everything is there. I just used compressed the words list with >> "affixcompress" utility and added few extra rules at the top of .aff >> file to support "ණ/න/ල/ළ", etc. >> > > Thanks for the explanation, I was unaware of this. > > cya, > # > -- I blog at http://mytechgossips.com |
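For readers who want a feel for the "harvest words within a character range" and "diff/merge" steps described above, here is a minimal Python sketch. It is not the code from Sinhala-Dictionary-Tools (which is a Java application); the function names are hypothetical, and the range used is the Unicode Sinhala block U+0D80 to U+0DFF, with the zero-width joiner U+200D allowed inside a word on the assumption that conjunct forms should be kept intact.

```python
import re

# One or more characters from the Sinhala block (U+0D80..U+0DFF).
# U+200D (zero-width joiner) is allowed after the first letter because
# Sinhala conjuncts are commonly encoded with it (an assumption of this
# sketch, not something stated in the thread).
SINHALA_WORD = re.compile("[\u0D80-\u0DFF][\u0D80-\u0DFF\u200D]*")


def harvest(text):
    """Return the set of Sinhala-script tokens found in text."""
    return set(SINHALA_WORD.findall(text))


def merge(list_a, list_b):
    """Union of two wordlists, sorted and de-duplicated."""
    return sorted(set(list_a) | set(list_b))


def diff(list_a, list_b):
    """Words that are in list_a but missing from list_b."""
    return sorted(set(list_a) - set(list_b))


sample = "Hunspell සිංහල spell checker ශබ්දකෝෂය 2010"
words = harvest(sample)
```

Latin text, digits and punctuation fall outside the range and are dropped, so the same function can be pointed at crawled page text to build a raw candidate list that still needs human review for spelling.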
From: Sandaruwan G. <san...@gu...> - 2010-07-05 18:24:24
Attachments:
0001-fixed-few-bugs-on-crawler.patch
|
Hey, I played around with your app for a little bit. However, Unicode characters are not displayed in the Java GUI for some reason (my box is Ubuntu 9.10 x64 with Sun Java 6). Anyway, the crawler is quite useful for getting a word list. There were some bugs in the crawler and I fixed some of them (the patch is attached). However, there is a bug in the Trie data structure which causes an ArrayIndexOutOfBoundsException; I simply disabled ommitWords. You might want to look into that. The word list is not displayed properly due to the display bug I mentioned, but I managed to save the words successfully into a file. I used the Wikipedia home page (http://si.wikipedia.org/) and there are 402 words :) As I see it, there are a lot of people involved in Sinhala Unicode these days. So maybe we can create a simple wiki-like interface, get a huge word list and use a crowdsourcing system for filtering out the correctly spelled words. Best Regards, Sandaruwan On Mon, Jul 5, 2010 at 11:06 PM, Harshula <har...@gm...> wrote: > Hi Laknath, > > Good to know! Feel free update us on what you are working on. You may > want to mention it on the sinhala-unicode google group too, if you > haven't already. > > cya, > # > > On Mon, 2010-07-05 at 21:08 +0530, Buddhika Laknath wrote: > > Hi guys, > > > > I apologize for getting in the middle of a discussion. But I thought > > this project which I have been doing as a university project would be > > helpful here. > > It's intended as a UI for Hunspell but targeting mainly Sinhala (but > > usable for other languages as well). You can easily crawl, harvest words > > consisting of characters within a given range, process between two > > wordlists (diff, merge, add, remove words), munch and unmunch wordlists > > back and forth to Hunspell format at the moment > > > > Git repo - http://github.com/buddhika/Sinhala-Dictionary-Tools > > > > Sadly I couldn't do any documentation so if you would like to use it to > > improve current wordfiles and face any trouble just let me know. Anyway, > > I'd add documentation soon enough. > > > > Apologize again if this seems like a shameless advertising but I just > > want to have some input and actually use this to improve current Sinhala > > dictionaries. > > > > Cheers, > > Laknath > -- Best Regards, Sandaruwan Gunathilake |
From: Harshula <har...@gm...> - 2012-08-23 05:15:59
|
Hi Sandaruwan, Parag (CC'd) is wondering where the upstream source tarball for the word list went? cya, # On Mon, 2010-07-05 at 00:59 +0530, Sandaruwan Gunathilake wrote: > Hi, > > On Sun, Jul 4, 2010 at 11:57 PM, Harshula <har...@gm...> wrote: > Hi Sandaruwan, > > On Sun, 2010-07-04 at 22:01 +0530, Sandaruwan Gunathilake > wrote: > > What about the sinhala words list on UCSC language lab page? > > > > http://www.ucsc.cmb.ac.lk/ltrl/?page=downloads > > > > I switched the word list to that in spellchecker version > 0.2. > > > The LTRL word list states it has 70142 distinct Sinhala words. > si-LK.dic > appears to have 26707 words. Did you take a subset of the > words from the > LTRL word list? > > No, everything is there. I just used compressed the words list with > "affixcompress" utility and added few extra rules at the top of .aff > file to support "ණ/න/ල/ළ", etc. > > -- > Best Regards, > Sandaruwan Gunathilake |
From: Sandaruwan G. <san...@gu...> - 2012-08-23 05:40:20
|
The original word list is still available in the UCSC page : http://www.ucsc.cmb.ac.lk/ltrl/?page=downloads I don't have the processed file at the moment - I'll dig up my backups and check whether I still have them. It's still in the firefox addon though : https://addons.mozilla.org/en-us/firefox/addon/sinhala-spellchecker/ On Thu, Aug 23, 2012 at 10:45 AM, Harshula <har...@gm...> wrote: > Hi Sandaruwan, > > Parag (CC'd) is wondering where the upstream source tarball for the word > list went? > > cya, > # > > On Mon, 2010-07-05 at 00:59 +0530, Sandaruwan Gunathilake wrote: > > Hi, > > > > On Sun, Jul 4, 2010 at 11:57 PM, Harshula <har...@gm...> wrote: > > Hi Sandaruwan, > > > > On Sun, 2010-07-04 at 22:01 +0530, Sandaruwan Gunathilake > > wrote: > > > What about the sinhala words list on UCSC language lab page? > > > > > > http://www.ucsc.cmb.ac.lk/ltrl/?page=downloads > > > > > > I switched the word list to that in spellchecker version > > 0.2. > > > > > > The LTRL word list states it has 70142 distinct Sinhala words. > > si-LK.dic > > appears to have 26707 words. Did you take a subset of the > > words from the > > LTRL word list? > > > > No, everything is there. I just used compressed the words list with > > "affixcompress" utility and added few extra rules at the top of .aff > > file to support "ණ/න/ල/ළ", etc. > > > > -- > > Best Regards, > > Sandaruwan Gunathilake > > > -- Best Regards, Sandaruwan Gunathilake |
From: Harshula <har...@gm...> - 2012-08-23 05:54:54
|
Can the processing steps be automated in a shell script or makefile? That way Parag can d/l the UCSC word list and build the final output file. On Thu, 2012-08-23 at 11:09 +0530, Sandaruwan Gunathilake wrote: > The original word list is still available in the UCSC > page : http://www.ucsc.cmb.ac.lk/ltrl/?page=downloads > > I don't have the processed file at the moment - I'll dig up my backups > and check whether I still have them. It's still in the firefox addon > though : https://addons.mozilla.org/en-us/firefox/addon/sinhala-spellchecker/ > > On Thu, Aug 23, 2012 at 10:45 AM, Harshula <har...@gm...> wrote: > Hi Sandaruwan, > > Parag (CC'd) is wondering where the upstream source tarball > for the word > list went? > > cya, > # > > On Mon, 2010-07-05 at 00:59 +0530, Sandaruwan Gunathilake > wrote: > > > Hi, > > > > On Sun, Jul 4, 2010 at 11:57 PM, Harshula > <har...@gm...> wrote: > > Hi Sandaruwan, > > > > On Sun, 2010-07-04 at 22:01 +0530, Sandaruwan > Gunathilake > > wrote: > > > What about the sinhala words list on UCSC language > lab page? > > > > > > http://www.ucsc.cmb.ac.lk/ltrl/?page=downloads > > > > > > I switched the word list to that in spellchecker > version > > 0.2. > > > > > > The LTRL word list states it has 70142 distinct > Sinhala words. > > si-LK.dic > > appears to have 26707 words. Did you take a subset > of the > > words from the > > LTRL word list? > > > > No, everything is there. I just used compressed the words > list with > > "affixcompress" utility and added few extra rules at the top > of .aff > > file to support "ණ/න/ල/ළ", etc. > > > > -- > > Best Regards, > > Sandaruwan Gunathilake > > > > > > > > -- > Best Regards, > Sandaruwan Gunathilake > |
From: Laknath <bla...@gm...> - 2012-08-23 05:55:01
|
Maybe this will work as well, if it's the Hunspell-formatted dic/aff file pair you are looking for - https://github.com/laknath/Sinhala-Dictionary/downloads Cheers, Laknath On 8/23/12 11:09 AM, Sandaruwan Gunathilake wrote: > The original word list is still available in the UCSC page : > http://www.ucsc.cmb.ac.lk/ltrl/?page=downloads > > I don't have the processed file at the moment - I'll dig up my backups > and check whether I still have them. It's still in the firefox addon > though : > https://addons.mozilla.org/en-us/firefox/addon/sinhala-spellchecker/ > > On Thu, Aug 23, 2012 at 10:45 AM, Harshula <har...@gm...> wrote: > > Hi Sandaruwan, > > Parag (CC'd) is wondering where the upstream source tarball for > the word > list went? > > cya, > # > > On Mon, 2010-07-05 at 00:59 +0530, Sandaruwan Gunathilake wrote: > > Hi, > > > > On Sun, Jul 4, 2010 at 11:57 PM, Harshula <har...@gm...> wrote: > > Hi Sandaruwan, > > > > On Sun, 2010-07-04 at 22:01 +0530, Sandaruwan Gunathilake > > wrote: > > > What about the sinhala words list on UCSC language lab > page? > > > > > > http://www.ucsc.cmb.ac.lk/ltrl/?page=downloads > > > > > > I switched the word list to that in spellchecker version > > 0.2. > > > > > > The LTRL word list states it has 70142 distinct Sinhala > words. > > si-LK.dic > > appears to have 26707 words. Did you take a subset of the > > words from the > > LTRL word list? > > > > No, everything is there. I just used compressed the words list with > > "affixcompress" utility and added few extra rules at the top of .aff > > file to support "ණ/න/ල/ළ", etc. > > > > -- > > Best Regards, > > Sandaruwan Gunathilake > > > > > > -- > Best Regards, > Sandaruwan Gunathilake > > |
From: Harshula <har...@gm...> - 2012-08-23 14:35:29
|
On Thu, 2012-08-23 at 11:28 +0530, Laknath wrote: > May be this will work as well if it's the Hunspell formatted dic/aff > file pair you are looking for - > https://github.com/laknath/Sinhala-Dictionary/downloads What are the origins and licensing of the word list? cya, # |
From: Sandaruwan G. <san...@gu...> - 2012-08-23 06:02:29
|
Yes, it can be automated. You can use affixcompress. Then a few extra rules can be added to the .aff file, such as "ණ <-> න". On Thu, Aug 23, 2012 at 11:24 AM, Harshula <har...@gm...> wrote: > Can the processing steps be automated in a shell script or makefile? > That way Parag can d/l the UCSC word list and build the final output > file. > > On Thu, 2012-08-23 at 11:09 +0530, Sandaruwan Gunathilake wrote: > > The original word list is still available in the UCSC > > page : http://www.ucsc.cmb.ac.lk/ltrl/?page=downloads > > > > I don't have the processed file at the moment - I'll dig up my backups > > and check whether I still have them. It's still in the firefox addon > > though : > https://addons.mozilla.org/en-us/firefox/addon/sinhala-spellchecker/ > > > > On Thu, Aug 23, 2012 at 10:45 AM, Harshula <har...@gm...> wrote: > > Hi Sandaruwan, > > > > Parag (CC'd) is wondering where the upstream source tarball > > for the word > > list went? > > > > cya, > > # > > > > On Mon, 2010-07-05 at 00:59 +0530, Sandaruwan Gunathilake > > wrote: > > > > > Hi, > > > > > > On Sun, Jul 4, 2010 at 11:57 PM, Harshula > > <har...@gm...> wrote: > > > Hi Sandaruwan, > > > > > > On Sun, 2010-07-04 at 22:01 +0530, Sandaruwan > > Gunathilake > > > wrote: > > > > What about the sinhala words list on UCSC language > > lab page? > > > > > > > > http://www.ucsc.cmb.ac.lk/ltrl/?page=downloads > > > > > > > > I switched the word list to that in spellchecker > > version > > > 0.2. > > > > > > > > > The LTRL word list states it has 70142 distinct > > Sinhala words. > > > si-LK.dic > > > appears to have 26707 words. Did you take a subset > > of the > > > words from the > > > LTRL word list? > > > > > > No, everything is there. I just used compressed the words > > list with > > > "affixcompress" utility and added few extra rules at the top > > of .aff > > > file to support "ණ/න/ல/ළ", etc. 
> > > > > > -- > > > Best Regards, > > > Sandaruwan Gunathilake > > > > > > > > > > > > > > > > -- > > Best Regards, > > Sandaruwan Gunathilake > > > > > -- Best Regards, Sandaruwan Gunathilake |
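[Editor's sketch] The "few extra rules" step described above could be scripted roughly as follows. This is only a minimal sketch under stated assumptions: the thread does not say which Hunspell directive was used, so the REP (replacement suggestion) table is assumed here; the function names `rep_table` and `add_rep_rules` are hypothetical; and the .aff text is assumed to have no existing REP table. Running affixcompress itself is left out.

```python
# Hypothetical helper, not the author's actual script: builds a Hunspell
# REP table for letters that are commonly confused in Sinhala spelling.
# Assumption: REP is the directive used; the thread does not confirm this.
CONFUSABLE_PAIRS = [("ණ", "න"), ("ල", "ළ")]


def rep_table(pairs):
    """Return REP lines covering both directions of each letter pair."""
    rules = []
    for first, second in pairs:
        rules.append(f"REP {first} {second}")
        rules.append(f"REP {second} {first}")
    # Hunspell expects a header line giving the number of REP rules.
    return [f"REP {len(rules)}"] + rules


def add_rep_rules(aff_text, pairs=CONFUSABLE_PAIRS):
    """Insert the REP table after a leading SET line, or at the top if none.

    Assumes aff_text does not already contain a REP table.
    """
    lines = aff_text.splitlines()
    table = rep_table(pairs)
    if lines and lines[0].startswith("SET "):
        merged = lines[:1] + table + lines[1:]
    else:
        merged = table + lines
    return "\n".join(merged) + "\n"
```

In a build script this would be applied to the .aff text produced by affixcompress before packaging the dic/aff pair, so that the UCSC word list stays the only manually downloaded input.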
From: Parag N. <pn...@re...> - 2012-08-23 06:04:02
|
Hi Sandaruwan, I do have this file at http://paragn.fedorapeople.org/si-LK.tar.gz but I need some upstream source download URL. If you can host it somewhere then that will be helpful. Thanks, Parag. On 23/08/12 11:24, Harshula wrote: > Can the processing steps be automated in a shell script or makefile? > That way Parag can d/l the UCSC word list and build the final output > file. |
From: Sandaruwan G. <san...@gu...> - 2012-08-23 06:12:39
|
Here you go : http://www.sandaru1.com/si-LK.tar.gz I also uploaded it to github if anyone is interested in maintaining it : https://github.com/sandaru1/si-LK On Thu, Aug 23, 2012 at 11:33 AM, Parag Nemade <pn...@re...> wrote: > Hi Sandaruwan, > I do have this file at http://paragn.fedorapeople.org/si-LK.tar.gz but I need some upstream source download URL. If you can host it somewhere > then that will be helpful. > Thanks, > Parag. -- Best Regards, Sandaruwan Gunathilake |
From: Parag N. <pn...@re...> - 2012-08-23 06:16:45
|
Hi, Thanks for the first link. I will use it in spec file and build a new hunspell-si package. Regards, Parag. On 23/08/12 11:42, Sandaruwan Gunathilake wrote: > Here you go : http://www.sandaru1.com/si-LK.tar.gz > > I also uploaded it to github if anyone is interested in maintaining it > : https://github.com/sandaru1/si-LK |