Re: [sinhala-technical] hunspell-si

Hi Sandaruwan,

1) I did the following:

iconv -f UTF-16 -t UTF-8 $src | awk '{print $1}' | \
    LANG=si_LK.UTF-8 sort -u -k 1 > $dst

affixcompress $dst 2> /dev/null
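
For anyone repeating step 1, here is a minimal sketch of the same pipeline wrapped in a function with quoted paths. The names build_wordlist and SORT_LOCALE are hypothetical and only there for illustration; the commands are the ones shown above, and the sketch assumes the source file carries a byte-order mark so that "UTF-16" is detected correctly.

```shell
#!/bin/sh
# Sketch only: build_wordlist and SORT_LOCALE are illustrative names,
# not part of hunspell or of the UCSC tooling.
build_wordlist() {
    src=$1   # UTF-16 word list, e.g. DistinctWords.txt
    dst=$2   # output: UTF-8, one word per line, sorted and de-duplicated

    # Convert to UTF-8, keep the first whitespace-separated field of
    # each line, then sort uniquely under the chosen collation locale.
    # Note: if LC_ALL is set in the environment it overrides LANG.
    iconv -f UTF-16 -t UTF-8 "$src" \
        | awk '{print $1}' \
        | LANG="${SORT_LOCALE:-si_LK.UTF-8}" sort -u -k 1 > "$dst"
}
```

Called as build_wordlist DistinctWords.txt "$dst", it produces the file that is then handed to affixcompress as above.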

2) I noted you added the following:
--------------------------------------
SET UTF-8

TRY ්ාෘුැූෑිීෙ

REP 25
REP න ණ
REP ණ න
REP ල ළ
REP ළ ල
REP ස ෂ
REP ෂ ස
REP ස ශ
REP ශ ස
REP ච ඡ
REP ඡ ච
REP බ භ
REP භ බ
REP ද ධ
REP ධ ද
REP ර් ්‍ර
REP ට ඨ
REP ඨ ට
REP ක ඛ
REP ඛ ක
REP ඩ ඪ
REP ඪ ඩ
REP ඉ ඊ
REP ඊ ඉ
REP ප ඵ
REP ඵ ප
--------------------------------------
Does that complete the steps from the UCSC word list file to the dictionary?
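
One thing that is easy to get out of sync when editing the header by hand is the count on the "REP 25" line, which has to match the number of REP entries that follow. A rough sketch of a check, assuming the header lives in an .aff file whose path is passed in (check_rep_count is a hypothetical helper, not a hunspell tool):

```shell
#!/bin/sh
# Sketch only: check_rep_count is an illustrative helper.
# Usage: check_rep_count path/to/file.aff
check_rep_count() {
    awk '
        $1 == "REP" && NF == 2 { declared = $2 }  # header line: REP <count>
        $1 == "REP" && NF == 3 { entries++ }      # entry line:  REP <from> <to>
        END {
            printf "declared=%d entries=%d\n", declared, entries
            exit (declared == entries ? 0 : 1)
        }
    ' "$1"
}
```

It prints both numbers and returns a non-zero status when they differ.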

3) Were you able to determine the license under which the UCSC word list
is distributed? The word list license would impact your right to
distribute a derived work.

cya,
#

On Thu, 2012-08-23 at 14:07 +0530, Sandaruwan Gunathilake wrote:
> I just took a look into the file. I can remember it used to be a
> SQLite (or it might have been some other word source). The new file
> seems to be UTF-16 Little Endian. You can simply convert it using
> "iconv".
> 
> 
> iconv -f 'UTF-16LE' DistinctWords.txt
> 
> On Thu, Aug 23, 2012 at 12:35 PM, Harshula <har...@gm...> wrote:
>         The UCSC word list isn't in sqlite format, right?
>         
>         DistinctWords.txt: DBase 3 data file with memo(s) (3211273
>         records)