From: Viktor T. <v....@ed...> - 2004-07-30 18:17:47
Dear Colin,

Infomap had a lot of features hardwired that influenced tokenization. Since it was made for English, the upper half of the 8-bit ASCII table was not considered to contain word characters. By making this user-driven, any 8-bit character encoding can now be used. This change has been incorporated into the program and is in the CVS source tree.

As far as I can see, what you mean is Unicode, or encodings where characters are more than one byte. Although I know little about this, I reckon that C I/O cannot handle these directly and therefore reads characters byte-wise. Since tokenization into words is character-based (where a character is now one byte, I reckon), segmentation rules (see the documentation) can only be given correctly for a multibyte language if the set of bytes that occur in word characters and the set that occur in non-word characters are disjoint. And there might be other problems as well, I guess. Is that correct, or complete rubbish?

This problem, as well as the rather crude nature of hard-wired character-set-based tokenization, is why I thought improvement is in order. My idea was to have a mode where segmentation is totally user-defined with word tags (similar to the doc/text tags already built in): in the first tokenization stage, any entity between <w> and </w> is considered a word. Anyone having time to implement it?

Best,
Viktor

On Fri, 30 Jul 2004 15:35:46 +0100, Colin J Bannard <C.J...@ed...> wrote:
> Hi Viktor,
>
> Earlier in the year you mentioned to me some changes that you had made to the
> InfoMap code to enable it to handle multibyte languages. I have been asked
> about using InfoMap by a researcher in Japan who says that the version
> currently available from Sourceforge doesn't handle Japanese. Do you know what
> happened to the changes you made? If they haven't been included in the official
> release yet, would you be willing to provide my friend in Kobe with the improved
> version?
>
> hope you are well.
>
> see you soon,
> Colin