From: Dominic W. <dwi...@cs...> - 2004-08-02 14:54:00
Dear Colin and Viktor,

I'm afraid I can't be of much help right now because unfortunately (or rather, fortunately for me, I guess) I've just started a new and very exciting job in Pittsburgh, Pennsylvania, for a group of mad scientists called MAYA Design. I'm doing lots of fuzzy geometry for dealing with imprecise spatial data, and they're really interested in forming a good general representation for temporal concepts and events, so I might end up learning a lot more about the semantics of verbs than I ever managed to on the Infomap project!

I do know that Shuji Yamaguchi implemented a system for turning Japanese characters into ASCII-like text so that the Infomap software could build vectors from Japanese corpora. I don't know whether Shuji would be able to help you with a development version of whatever he did.

Sorry I can't be more help, but good luck.

-Dominic

On Fri, 30 Jul 2004, Viktor Tron wrote:

> Dear Colin,
>
> Infomap had a lot of features hardwired that influenced tokenization.
> Since it was made for English, the upper half of the 8-bit character
> table (bytes above 127) was not treated as word characters. By making
> this user-driven, any 8-bit character encoding can now be used. This
> change has been incorporated into the program and can be found in the
> CVS source tree.
>
> As far as I can see, what you mean is Unicode, or any encoding in which
> characters are more than one byte. Although I know nothing about this,
> I reckon that the C I/O routines cannot handle such encodings and
> therefore read characters byte-wise. Since tokenization into words is
> character-based (and a character is now one byte, I reckon),
> segmentation rules (see the documentation) can only be given correctly
> for a multibyte language if the sets of bytes that occur in word
> characters and in non-word characters are disjoint. And there might be
> other problems as well, I guess.
>
> Is that correct, or complete rubbish?
>
> This problem, as well as the rather crude nature of hard-wired,
> character-set-based tokenization, is why I thought an improvement was
> in order. My idea was to have a mode where segmentation is totally
> user-defined with word tags (similar to the doc/text tags already
> built in); that is, in the first tokenization stage any entity between
> <w> and </w> is considered a word.
>
> Does anyone have time to implement it?
>
> Best,
> Viktor
>
> On Fri, 30 Jul 2004 15:35:46 +0100, Colin J Bannard <C.J...@ed...> wrote:
>
> > Hi Viktor,
> >
> > Earlier in the year you mentioned to me some changes that you had
> > made to the InfoMap code to enable it to handle multibyte languages.
> > I have been asked about using InfoMap by a researcher in Japan who
> > says that the version currently available from SourceForge doesn't
> > handle Japanese. Do you know what happened to the changes you made?
> > If they haven't been included in the official release yet, would you
> > be willing to provide my friend in Kobe with the improved version?
> >
> > Hope you are well.
> >
> > See you soon,
> > Colin
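
To make the byte-wise tokenization issue Viktor describes concrete, here is a minimal sketch in C (Infomap itself is written in C, but none of the names below come from its source): each byte is classified independently as word or non-word, which is why a multibyte encoding only tokenizes correctly when no byte value occurs both inside word characters and inside separators.

/* Byte-wise, character-class tokenization in the style described
 * above (hypothetical sketch, not Infomap source).  Each byte is
 * classified on its own, so a multibyte encoding only works if no
 * byte value appears both inside word characters and inside
 * non-word characters. */
#include <stdio.h>
#include <string.h>

static int is_word_byte[256];          /* user-defined byte classes */

static void set_word_bytes(const unsigned char *bytes)
{
    memset(is_word_byte, 0, sizeof is_word_byte);
    for (; *bytes; bytes++)
        is_word_byte[*bytes] = 1;
}

int main(void)
{
    /* Example class: ASCII letters only; a user of an 8-bit encoding
     * would also list the byte values of their accented letters here. */
    set_word_bytes((const unsigned char *)
                   "abcdefghijklmnopqrstuvwxyz"
                   "ABCDEFGHIJKLMNOPQRSTUVWXYZ");

    int c, in_word = 0;
    while ((c = getchar()) != EOF) {
        if (is_word_byte[c]) {          /* word byte: part of a token */
            putchar(c);
            in_word = 1;
        } else if (in_word) {           /* separator byte: token ends */
            putchar('\n');
            in_word = 0;
        }
    }
    if (in_word)
        putchar('\n');
    return 0;
}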
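
A similarly minimal sketch of the <w>/</w> mode Viktor proposes (again hypothetical, not code from the Infomap tree): the first tokenization stage simply emits whatever sits between the tags, so byte-level character classes, and the disjointness condition above, drop out entirely for pre-segmented corpora.

/* First-stage tokenizer for the proposed <w>...</w> mode
 * (hypothetical sketch, not Infomap source).  Anything between the
 * tags is emitted as one word, regardless of which bytes it contains,
 * so a pre-segmented multibyte corpus needs no byte-level classes. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[8192];

    while (fgets(line, sizeof line, stdin)) {
        const char *p = line;
        for (;;) {
            const char *start = strstr(p, "<w>");
            if (!start)
                break;
            start += 3;                 /* step past "<w>" */
            const char *end = strstr(start, "</w>");
            if (!end)
                break;                  /* tag pair split across lines: skipped here */
            fwrite(start, 1, (size_t)(end - start), stdout);
            putchar('\n');              /* one word per output line */
            p = end + 4;                /* step past "</w>" */
        }
    }
    return 0;
}

Something along these lines would let a Japanese corpus be segmented (or transliterated, as in Shuji's approach) once, tagged, and then fed to the indexer without touching any byte-class rules.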