From: <MS...@FS...> - 2005-02-07 11:07:27
Sorry to get your hopes up; I actually have no clue. My question, however, was related to yours in that I would like to know if it would be possible to extend infomap's output to multiple documents/words. Put more clearly:

- Using the -n tag, one can extend the number of neighbours of associate's output to 200. Is this a hard limit or could this be altered to the theoretical maximum that is near the vocabulary size?

Another, unrelated question that should be easy to answer. If I recall correctly, some words in infomap are not taken into account. For example, the word 'the' does not hold semantic content and is therefore not taken into consideration in computing correspondences. Am I right there? Since I am trying to get this program to work using Dutch documents, I was wondering which infomap file holds the words that are ignored in the computation, so it would be easier to adapt it to the current language.

Lastly, I was wondering if people here have done any experimentation with the length of documents. The typical text files used here seem to have been taken from Project Gutenberg ebooks, so they are usually very large (3000+ words or something). We've been working here with small documents (about 200 words) and although it has its uses, I'm getting more and more the impression that despite our growing collection of these small files (currently at 220 files, which should grow to 1000 files), they provide information that is too sparse for LSI. Thus, I was wondering if we could share thoughts on:

1. minimal size of entire corpus (in words)
2. effects of average document size of files in corpus

I was thinking of maybe padding the corpus with a 'bulk' of some Dutch ebooks so the model could have a head start in providing more accurate analysis.

Thanks for listening,
Mich

-----Original Message-----
From: inf...@li... [mailto:infomap-nlp-users-...@li...] On Behalf Of Leif Grönqvist
Sent: Friday, February 04, 2005 4:49 PM
To: inf...@li...
Subject: [infomap-nlp-users] spare matrix format

Hi!

Infomap is now using ordinary C-matrices with 4 bytes per cell. Has anyone tried to rewrite the matrix handling code using a sparse matrix format like, for example, Harwell-Boeing? This would make it possible to run on a much larger vocabulary, and it would also remove the limit on the matrix size in the second dimension.

I would like to run it on 500 million running words or so, which leads to 3.5 million word types...

What do you developers think? How big would that task be?

Regards,
Leif

--
Leif Grönqvist, GSLT, le...@li..., www.ling.gu.se/~leifg, 031-821515 (home)
School of Mathematics and Systems Engineering, Växjö University 0707164380 (mob)
Department of Linguistics, Göteborg University, +46 31 773 1177, 773 4853 (fax)

_______________________________________________
infomap-nlp-users mailing list
inf...@li...
https://lists.sourceforge.net/lists/listinfo/infomap-nlp-users
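To make the sparse-matrix question in the quoted message concrete, here is a minimal sketch in Python (not Infomap's C code) of compressed column storage, the layout that Harwell-Boeing files use, together with the memory arithmetic for a dense matrix at 4 bytes per cell. The function names, the 4-byte index size, and the column count of 1000 used in the usage note are assumptions for illustration only; nothing here reflects how Infomap actually lays out its matrices.

```python
def dense_bytes(rows, cols, cell_bytes=4):
    """Memory needed by a dense matrix with a fixed cell size."""
    return rows * cols * cell_bytes


def to_csc(dense):
    """Convert a list-of-rows matrix to compressed column storage.

    Returns (values, row_idx, col_ptr). The nonzeros of column j are
    values[col_ptr[j]:col_ptr[j + 1]], and row_idx holds their row numbers.
    """
    n_rows = len(dense)
    n_cols = len(dense[0]) if n_rows else 0
    values, row_idx, col_ptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            if dense[i][j] != 0:
                values.append(dense[i][j])
                row_idx.append(i)
        # Close column j: the next column starts where this one ended.
        col_ptr.append(len(values))
    return values, row_idx, col_ptr


def csc_bytes(nnz, n_cols, value_bytes=4, index_bytes=4):
    """Memory for CSC: a value and a row index per nonzero, plus column pointers."""
    return nnz * (value_bytes + index_bytes) + (n_cols + 1) * index_bytes
```

As a rough usage note: with the 3.5 million word types mentioned above and a hypothetical 1000 columns, `dense_bytes(3_500_000, 1000)` comes to 14,000,000,000 bytes, whereas the sparse cost grows with the number of nonzero cells rather than with rows times columns.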
From: Beate D. <do...@IM...> - 2005-02-07 11:43:00
Hi Mich,

> - Using the -n tag, one can extend the number of neighbours of
> associate's output to 200. Is this a hard limit or could this
> be altered to the theoretical maximum that is near the vocabulary size?

There is no limit to the number of neighbors. You'll simply have to change MAX_NEIGHBOR and PRINT_NEIGHBOR in associate.h to a bigger number, rebuild, and use associate -n <big_number>.

> Another, unrelated question that should be easily answered then.
> If I recall correctly, some words in infomap are not taken into account,
> for example, the word 'the' does not hold semantic content and is
> therefore not taken into consideration in computing correspondences. Am
> I right there? Since I am trying to get this program to work using Dutch
> documents, I was wondering which info-map file holds the words that are
> ignored in the computation, so it would be easier to adapt it to the
> current language.

The words to be disregarded are listed in "stop.list" in the admin directory. If you want to use a different set of stopwords, you can simply set STOPLIST_FILE in "default-params.in" to point to a different stoplist. The file must contain exactly one stopword per line.

Best wishes,
Beate
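The one-stopword-per-line format described above is simple enough to generate with a small script. The following is a minimal sketch, not part of Infomap: the function name, the output file name, the sample Dutch words (a handful of common function words, far from a complete stoplist), and the UTF-8 default are all assumptions. The thread does not say which character encoding Infomap expects, so adjust the `encoding` argument to whatever your corpus uses.

```python
from pathlib import Path

# Illustrative Dutch function words only; a real stoplist would be much longer.
DUTCH_STOPWORDS = ["de", "het", "een", "en", "van", "in", "is", "dat", "op", "te"]


def write_stoplist(words, path, encoding="utf-8"):
    """Write a stoplist file with exactly one stopword per line.

    Surrounding whitespace is stripped, blank entries and duplicates are
    dropped, and an entry containing internal whitespace is rejected
    because it would not be a single word. Returns the words written.
    """
    seen = set()
    kept = []
    for word in words:
        word = word.strip()
        if not word:
            continue
        if len(word.split()) > 1:
            raise ValueError(f"not a single word: {word!r}")
        if word not in seen:
            seen.add(word)
            kept.append(word)
    Path(path).write_text("\n".join(kept) + "\n", encoding=encoding)
    return kept
```

For example, `write_stoplist(DUTCH_STOPWORDS, "stop.list.nl")` would produce a file (the name is hypothetical) that STOPLIST_FILE could then point to.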