Re: [Senseclusters-developers] senseclusters-developers Digest, Vol 7, Issue 2
From: Ted P. <dul...@gm...> - 2008-04-13 20:48:46
On Sat, Apr 12, 2008 at 5:19 AM, Teshome Kassie <tk...@ya...> wrote:
> Hello all;
>
> Does SenseClusters support UTF-8?
>
> Teshome

Great question, and unfortunately I think the answer is no. The main issue is not so much SenseClusters as it is Text::NSP, which we use for a significant portion of our feature extraction.

There has been considerable discussion about how to make Text::NSP better at handling different character sets. If you are interested in the history of that discussion, you can see the most recent version of it here:

http://www.mail-archive.com/ng...@ya.../msg00156.html

The short version is that I've decided the right thing to do is to use the Perl module Encode in Text::NSP to provide full Unicode support. The only drawback is that this requires a bit of work, and right now it hasn't risen high enough in the queue. But it's getting there, especially since SenseClusters depends so heavily on Text::NSP.

http://search.cpan.org/dist/Encode/

So that's the long-term solution I have planned. Unfortunately that doesn't help much in the shorter term. Sorry I don't have a better answer. Other suggestions are most welcome.

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse