From: <MS...@FS...> - 2005-02-08 10:25:38
|
"There is no limit to the number of neighbors. You'll simply have to change= =20 MAX_NEIGHBOR and PRINT_NEIGHBOR in associate.h to a bigger number and use= =20 associate -n <big_number>." Got it, thanks ! > Another, unrelated question that should be easily answered then. If I=20 > recall correctly, some words in infomap are not taken into account,=20 > for example, the word 'the' does not hold semantic content and is=20 > therefore not taken into consideration in computing correspondences.=20 > Am I right there? Since I am trying to get this program to work using=20 > Dutch documents, I was wondering which info-map file holds the words=20 > that are ignored in the computation, so it would be more easy to adapt=20 > it to the current languange. "It's "stop.list" in the admin directory which contains the words to be=20 disregarded. If you want to use a different set of stopwords, you can=20 simply set STOPLIST_FILE in "default-params.in" to point to a different=20 stoplist, the format of which should be such that each line contains=20 exactly one stopword." I see. Taking a look at the stop.list right now, I must say it looks rather= extensive. Could you or anyone else tell me what exactly the reasons are s= ome words are in this stoplist and others not? Several words, such as 'arti= cles', 'americans', 'training', etc, would seem to hold semantic content bu= t appear here nevertheless. Is there any literature why these words are in = the list, or could the underlying reasoning be explained? Thanks, Mich (strolling the internet right now looking for Dutch books - the Guthen= berg project, although having an archive of quite a few of these, seems to = be quite outdated in terms of spelling changes that arose over the last two= centuries). ********************************************************************** This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to whom they are addressed. 
If you have received this email in error please notify the system manager. ********************************************************************** |
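[Editor's note: the stoplist format described above is simply one stopword per line. As a minimal illustration of how such a file can be read and applied, here is a hedged Python sketch; the file name "stop.list.nl" and the tiny Dutch word list are invented for the example and are not part of the infomap distribution.]

```python
# Sketch: load a stoplist in the format infomap expects
# (exactly one stopword per line) and filter a token stream.
# The file name and contents below are hypothetical.

def load_stoplist(path):
    """Read one stopword per line; skip blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def remove_stopwords(tokens, stoplist):
    """Drop any token that appears in the stoplist."""
    return [t for t in tokens if t.lower() not in stoplist]

if __name__ == "__main__":
    # Write a tiny Dutch-style stoplist inline for the demo.
    with open("stop.list.nl", "w", encoding="utf-8") as f:
        f.write("de\nhet\neen\nen\nvan\n")

    stoplist = load_stoplist("stop.list.nl")
    tokens = "de kat van de buren en het boek".split()
    print(remove_stopwords(tokens, stoplist))  # ['kat', 'buren', 'boek']
```

Pointing STOPLIST_FILE in "default-params.in" at a file in this one-word-per-line shape is all the adaptation a new language should need.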
From: <lei...@gm...> - 2005-02-08 12:32:32
|
Hi Mich,

if you create a stoplist for a specific domain, some of the normally content-bearing words that are too common in that domain should be included in the stop list. I supervised a master's thesis about IR once, in which the students worked with the intranet at Volvo. Their stoplist included words like 'car' and 'Volvo' to get better search results. As long as the results do not get worse, it can be a good idea to include high-frequency words in the stop list.

Regards,
Leif

On Tue, 8 Feb 2005 11:25:15 +0100, Spapé, Michiel <MS...@fs...> wrote:
> [...]

--=20
Leif Grönqvist, GSLT, le...@li..., www.ling.gu.se/~leifg, 031-821515 (home)
School of Mathematics and Systems Engineering, Växjö University, 0707164380 (mob)
Department of Linguistics, Göteborg University, +46 31 773 1177, 773 4853 (fax)
|
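[Editor's note: the domain-specific approach Leif describes can be semi-automated by counting word frequencies over the target collection and reviewing the top words by hand. A hedged Python sketch follows; the toy documents and the cutoff are made up for illustration and any candidates it produces would still need human inspection before going into stop.list.]

```python
# Sketch: propose stoplist candidates for a specific domain by raw
# frequency, echoing the Volvo-intranet example. The corpus below is
# invented; in practice you would count over your real documents.

from collections import Counter

def stoplist_candidates(documents, top_n=3):
    """Return the top_n most frequent words across the documents."""
    counts = Counter(word.lower() for doc in documents for word in doc.split())
    return [word for word, _ in counts.most_common(top_n)]

if __name__ == "__main__":
    docs = [
        "volvo car safety report",
        "volvo truck engine car",
        "car maintenance volvo manual",
    ]
    # 'volvo' and 'car' dominate this toy intranet: plausible stoplist
    # candidates even though they carry content in general text.
    print(stoplist_candidates(docs, top_n=2))
```

Whether such words actually belong in the stoplist is an empirical question, as Leif notes: add them only if retrieval results do not get worse.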