From: Neal S. <sn...@st...> - 2006-02-21 06:23:26
Does anyone know how to prevent the text processing that infomap does on its corpora? I'm using (or trying to use, anyway) infomap on a project to induce Arabic verb clusters. My data are already lemmatized and use the Buckwalter transliteration system, so they look rather funny:

    HalAwaY_1 jan~ap_1 muriyd_1 taHoDiyr_1 |l_2
    HalAwap_2 jaraH-a_1 muro$id_1 taHoDiyriy~_1 |laY_1
    HalAyib_2 jaraY-i_1 muroDiy_1 taHoSiyl_1 |lam_1

But the dic file after infomap processing shows that it takes out a lot of the important Arabic characters:

    16474 3508 0 adaf_
    16403 3434 0 amokan_
    15606 3404 0 ar_
    14308 3263 0 ieotabar_
    13134 3180 0 ra
    12666 2933 0 daea
    12290 3055 0 ay
    11965 2849 0 wasal
    11558 2824 0 hasal
    11173 2772 0 qad~am_
    11148 2997 0 nolemma

How can I keep it from doing this?

Thanks!

------
Neal Snider
Ph.D. Student
Department of Linguistics
Stanford University
Margaret Jacks Hall, Bldg 460 - Room 118
Stanford CA 94305-2150
(650) 723-4284; Fax: (650) 723-5666
http://www.stanford.edu/~snider
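
One possible workaround that does not depend on any infomap setting would be to re-encode every token so that it contains only lowercase ASCII letters before building the model, and to decode the dic entries afterwards. The sketch below is only an illustration under that assumption: the function names `encode_token` and `decode_token`, the escape letter `x`, and the 16-letter alphabet are all made up here and are not part of infomap. It also assumes that plain lowercase letters pass through infomap untouched, and it makes tokens longer, which may or may not matter.

```python
import string

# Hypothetical escape scheme (not an infomap feature):
# every character that is not a plain lowercase letter, plus the
# escape letter itself, becomes ESC followed by two letters that
# stand in for the two hex digits of its code point.
ESC = "x"
HEX = "abcdefghijklmnop"  # 'a' = 0 ... 'p' = 15


def encode_token(tok):
    """Return a lowercase, letters-only version of a Buckwalter token."""
    out = []
    for ch in tok:
        if ch in string.ascii_lowercase and ch != ESC:
            out.append(ch)  # safe letter, keep as-is
        else:
            code = ord(ch)
            if code > 255:
                raise ValueError("expected an ASCII/Latin-1 character: %r" % ch)
            out.append(ESC + HEX[code >> 4] + HEX[code & 15])
    return "".join(out)


def decode_token(enc):
    """Invert encode_token."""
    out = []
    i = 0
    while i < len(enc):
        if enc[i] == ESC:
            hi = HEX.index(enc[i + 1])
            lo = HEX.index(enc[i + 2])
            out.append(chr(hi * 16 + lo))
            i += 3
        else:
            out.append(enc[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for tok in ["HalAwaY_1", "jan~ap_1", "|l_2", "jaraH-a_1"]:
        enc = encode_token(tok)
        print(tok, enc, decode_token(enc) == tok)
```

The idea would be to run `encode_token` over each whitespace-separated token of the corpus, and `decode_token` over the last column of the dic file. Uppercase letters, digits, and symbols such as `|`, `$`, `~`, `-` and `_` are all escaped, so case distinctions like `H` versus `h` survive even if the tool lowercases its input.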