From: Dominic W. <dwi...@cs...> - 2004-08-02 14:54:00
Dear Colin and Viktor,

I'm afraid I can't be of much help right now because unfortunately (or rather, fortunately for me, I guess) I've just started a new and very exciting job in Pittsburgh, Pennsylvania, for a group of mad scientists called MAYA Design. I'm doing lots of fuzzy geometry for dealing with imprecise spatial data, and they're really interested in forming a good general representation for temporal concepts and events, so I might end up learning a lot more about the semantics of verbs than I ever managed to on the Infomap project!

I do know that Shuji Yamaguchi implemented a system for turning Japanese characters into ASCII-like text so that the Infomap software could build vectors from Japanese corpora. I don't know whether Shuji would be able to help you with a development version of whatever he did.

Sorry I can't be more help, but good luck.

-Dominic

On Fri, 30 Jul 2004, Viktor Tron wrote:

> Dear Colin,
>
> Infomap had a lot of features hardwired that influenced tokenization.
> Since it was made for English, the upper half of the 8-bit character
> table (bytes above 127) was not treated as word characters. By making
> this user-driven, any 8-bit character encoding can now be used. This
> change has been incorporated into the program and can be found in the
> CVS source tree.
>
> As far as I can see, what you mean is Unicode, or any encoding in which
> characters are more than one byte. Although I know nothing about this,
> I reckon that the C I/O routines cannot handle such encodings and
> therefore read characters byte-wise. Since tokenization into words is
> character-based (and a character is now one byte, I reckon),
> segmentation rules (see the documentation) can only be given correctly
> for a multibyte language if the sets of bytes that occur in word
> characters and in non-word characters are disjoint. And there might be
> other problems as well, I guess.
>
> Is that correct, or complete rubbish?
>
> This problem, as well as the rather crude nature of hard-wired,
> character-set-based tokenization, is why I thought an improvement was
> in order. My idea was to have a mode where segmentation is totally
> user-defined with word tags (similar to the doc/text tags already
> built in); that is, in the first tokenization stage any entity between
> <w> and </w> is considered a word.
>
> Does anyone have time to implement it?
>
> Best,
> Viktor
>
> On Fri, 30 Jul 2004 15:35:46 +0100, Colin J Bannard <C.J...@ed...> wrote:
>
> > Hi Viktor,
> >
> > Earlier in the year you mentioned to me some changes that you had
> > made to the InfoMap code to enable it to handle multibyte languages.
> > I have been asked about using InfoMap by a researcher in Japan who
> > says that the version currently available from SourceForge doesn't
> > handle Japanese. Do you know what happened to the changes you made?
> > If they haven't been included in the official release yet, would you
> > be willing to provide my friend in Kobe with the improved version?
> >
> > Hope you are well.
> >
> > See you soon,
> > Colin
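
To make the byte-wise tokenization issue Viktor describes concrete, here is a minimal sketch in C (Infomap itself is written in C, but none of the names below come from its source): each byte is classified independently as word or non-word, which is why a multibyte encoding only tokenizes correctly when no byte value occurs both inside word characters and inside separators.

/* Byte-wise, character-class tokenization in the style described
 * above (hypothetical sketch, not Infomap source).  Each byte is
 * classified on its own, so a multibyte encoding only works if no
 * byte value appears both inside word characters and inside
 * non-word characters. */
#include <stdio.h>
#include <string.h>

static int is_word_byte[256];          /* user-defined byte classes */

static void set_word_bytes(const unsigned char *bytes)
{
    memset(is_word_byte, 0, sizeof is_word_byte);
    for (; *bytes; bytes++)
        is_word_byte[*bytes] = 1;
}

int main(void)
{
    /* Example class: ASCII letters only; a user of an 8-bit encoding
     * would also list the byte values of their accented letters here. */
    set_word_bytes((const unsigned char *)
                   "abcdefghijklmnopqrstuvwxyz"
                   "ABCDEFGHIJKLMNOPQRSTUVWXYZ");

    int c, in_word = 0;
    while ((c = getchar()) != EOF) {
        if (is_word_byte[c]) {          /* word byte: part of a token */
            putchar(c);
            in_word = 1;
        } else if (in_word) {           /* separator byte: token ends */
            putchar('\n');
            in_word = 0;
        }
    }
    if (in_word)
        putchar('\n');
    return 0;
}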
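
A similarly minimal sketch of the <w>/</w> mode Viktor proposes (again hypothetical, not code from the Infomap tree): the first tokenization stage simply emits whatever sits between the tags, so byte-level character classes, and the disjointness condition above, drop out entirely for pre-segmented corpora.

/* First-stage tokenizer for the proposed <w>...</w> mode
 * (hypothetical sketch, not Infomap source).  Anything between the
 * tags is emitted as one word, regardless of which bytes it contains,
 * so a pre-segmented multibyte corpus needs no byte-level classes. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[8192];

    while (fgets(line, sizeof line, stdin)) {
        const char *p = line;
        for (;;) {
            const char *start = strstr(p, "<w>");
            if (!start)
                break;
            start += 3;                 /* step past "<w>" */
            const char *end = strstr(start, "</w>");
            if (!end)
                break;                  /* tag pair split across lines: skipped here */
            fwrite(start, 1, (size_t)(end - start), stdout);
            putchar('\n');              /* one word per output line */
            p = end + 4;                /* step past "</w>" */
        }
    }
    return 0;
}

Something along these lines would let a Japanese corpus be segmented (or transliterated, as in Shuji's approach) once, tagged, and then fed to the indexer without touching any byte-class rules.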