From: Viktor T. <v....@ed...> - 2004-07-30 18:17:47
Dear Colin,

Infomap had a lot of features hardwired that influenced tokenization. Since it was made for English, the upper half of the 8-bit ASCII table was not considered to contain word characters. By making this user-driven, any 8-bit character encoding can now be used. This change has been incorporated into the program and is in the CVS source tree.

As far as I can see, what you mean is Unicode, or encodings where characters are more than one byte. Although I know little about this, I reckon that C I/O cannot handle these directly and therefore reads characters byte-wise. Since tokenization into words is character-based (where a character is now one byte, I reckon), segmentation rules (see the documentation) can only be given correctly for a multibyte language if the set of bytes that occur in word characters and the set that occur in non-word characters are disjoint. And there might be other problems as well, I guess. Is that correct, or complete rubbish?

This problem, as well as the rather crude nature of hard-wired character-set-based tokenization, is why I thought improvement is in order. My idea was to have a mode where segmentation is totally user-defined with word tags (similar to the doc/text tags already built in): in the first tokenization stage, any entity between <w> and </w> is considered a word. Anyone having time to implement it?

Best,
Viktor

On Fri, 30 Jul 2004 15:35:46 +0100, Colin J Bannard <C.J...@ed...> wrote:
> Hi Viktor,
>
> Earlier in the year you mentioned to me some changes that you had made to the
> InfoMap code to enable it to handle multibyte languages. I have been asked
> about using InfoMap by a researcher in Japan who says that the version
> currently available from Sourceforge doesn't handle Japanese. Do you know what
> happened to the changes you made? If they haven't been included in the official
> release yet, would you be willing to provide my friend in Kobe with the improved
> version?
>
> hope you are well.
>
> see you soon,
> Colin