From: Beate D. <do...@im...> - 2006-02-21 08:19:14
Hi Neal,

There is a file called valid_chars.en (in the admin directory) that lists the characters to be kept. You can adjust this file to your needs by adding ~, $, |, 1, etc. This will leave your words "unharmed".

In addition, you might want to replace the English stoplist with an Arabic one (to ignore determiners, pronouns, etc.).

Best wishes,
Beate

On Mon, 20 Feb 2006, Neal Snider wrote:

> Does anyone know how to prevent the text processing that infomap does on its
> corpora? I'm using (trying anyway) infomap to work on a project to try to
> induce Arabic verb clusters. My data are already lemmatized and they use the
> Buckwalter transliteration system, so they look rather funny:
>
> HalAwaY_1  jan~ap_1   muriyd_1   taHoDiyr_1     |l_2
> HalAwap_2  jaraH-a_1  muro$id_1  taHoDiyriy~_1  |laY_1
> HalAyib_2  jaraY-i_1  muroDiy_1  taHoSiyl_1     |lam_1
>
> but the dic file after infomap processing shows that it takes out a lot of the
> important Arabic characters:
>
> 16474 3508 0 adaf_
> 16403 3434 0 amokan_
> 15606 3404 0 ar_
> 14308 3263 0 ieotabar_
> 13134 3180 0 ra
> 12666 2933 0 daea
> 12290 3055 0 ay
> 11965 2849 0 wasal
> 11558 2824 0 hasal
> 11173 2772 0 qad~am_
> 11148 2997 0 nolemma
>
> How can I keep it from doing this?
>
> Thanks!
>
> ------
> Neal Snider
> Ph.D. Student
> Department of Linguistics
> Stanford University
> Margaret Jacks Hall, Bldg 460 - Room 118
> Stanford CA 94305-2150
> (650) 723-4284; Fax: (650) 723-5666
> http://www.stanford.edu/~snider
>
> _______________________________________________
> infomap-nlp-users mailing list
> inf...@li...
> https://lists.sourceforge.net/lists/listinfo/infomap-nlp-users
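
One way to find out which characters would need adding to valid_chars.en is to scan the lemmatized corpus for everything that is not a plain ASCII letter. The sketch below is only an illustration in Python, not part of infomap: the function name special_chars and the sample tokens are made up for this example, and it assumes whitespace-separated tokens. It does not read or write valid_chars.en itself (the layout of that file is not described in this thread), so the characters it reports would still have to be added by hand.

```python
from collections import Counter


def special_chars(tokens, already_valid=""):
    """Count characters that are neither ASCII letters nor in already_valid.

    `tokens` is any iterable of strings; `already_valid` is a string of
    extra characters assumed to be kept already (hypothetical parameter).
    """
    counts = Counter()
    for tok in tokens:
        for ch in tok:
            is_ascii_letter = ch.isascii() and ch.isalpha()
            if not is_ascii_letter and ch not in already_valid:
                counts[ch] += 1
    return counts


# A few Buckwalter-style tokens taken from the question above.
sample = "HalAwaY_1 jan~ap_1 muro$id_1 |laY_1 jaraH-a_1".split()
found = special_chars(sample)
print(sorted(found))  # ['$', '-', '1', '_', '|', '~']
```

On a real corpus, the tokens could be produced by reading the file line by line and calling split() on each line. The counts show how often each character occurs, which may help decide whether a character is worth keeping.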