From: Dominic W. <dwi...@cs...> - 2005-05-17 23:48:45
Dear Vladimir,

I'm glad you asked this question; you have certainly got down to one of the rules of thumb in the infomap code that we really just learned from experience. The infomap vector-building algorithm is described in some detail in http://infomap.stanford.edu/book/ (Chapters 5 and 6). However, I don't think that this, or any other published description, contains the decision we made about not multiplying by singular values.

When we did multiply by singular values, we found that many cosine similarities were very high, in the 0.98 to 0.99 region, and so not very discriminating. This also meant that vector negation was very chaotic: it removed practically the whole of the word you were interested in and sent your query way off course. So we just removed the scaling factor, effectively treating all latent axes as having equivalent importance (at least, I think this is the upshot of the decision).

It would be good to have a better understanding of why this happens and what a truly "principled" method would be. In practice it's quite hard to find encouragement for such a study, because data is noisy and so improvements are marginal, leaving little hope of detecting any measurably "best" configuration. The strategy for our research was always something along the lines of "get it working and show that it does interesting things". Given that LSA vector models aren't configured to accept annotated training data, you can always find a supervised machine learning method that outperforms infomap on currently recognised tasks, so any papers we wrote claiming that we'd found a principled "optimal" way of doing things would meet with the criticism that LSA is not optimal compared with other methods. This certainly spurred us to search for innovation rather than perfection.

Sorry I can't give you a better answer right now!

Best wishes,
Dominic

On Tue, 17 May 2005 vl...@sp... wrote:

> Hello,
>
> I want to ask about one thing in the Infomap algorithm.
>
> Let matrix A be the co-occurrence matrix, where the i,j-th element is the co-occurrence count for word i within a window of content-bearing word j throughout the corpus. If the rows of matrix A are normalized, the cosine similarity between two words (the i-th and j-th) can be computed as the dot product of the i-th and j-th row vectors of A, that is, the i,j-th element of the matrix A(A^T). After Singular Value Decomposition A=US(V^T) and keeping only the k largest singular values in S, the matrix A'=U'S'(V'^T) is an approximation of A. So A'(A'^T) = U'S'(V'^T)(U'S'(V'^T))^T = U'S'(U'S')^T is an approximation of A(A^T), and the i-th row of the matrix U'S' should be the vector in WordSpace for the i-th word. I'm missing the reason why the Infomap algorithm uses only the rows of U' as vectors in WordSpace.
>
> The multiplication of U' by S' can be done by this commented-out part of the code in encode_wordvec.c:
>
>   /* Scale the vector if you want to
>
>      Commented out by Dominic, June 2001, because the first
>      principal component only really seems to say that all the
>      data is in the positive quadrant; declare i if you want to
>      use these two lines of code. (Also, comment back in the
>      code to declare the singvals array and to open SINGVAL_FILE
>      and read its contents into that array.)
>
>      for( i=0; i<SINGVALS; i++)
>        vector[i] *= singvals[i]; */
>
> Is it true that the "first principal component only really seems to say that all the data is in the positive quadrant"?
>
> I will be very happy if someone could clarify the infomap algorithm to me.
>
> Best regards,
> Vladimir Repisky
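A minimal sketch of the two configurations discussed above is given below, in the spirit of the commented-out loop in encode_wordvec.c. The array names follow the quoted fragment (vector, singvals); the function itself and its normalization step are illustrative, not the exact Infomap code.

  #include <math.h>
  #include <stddef.h>

  /* Illustrative sketch: two ways of turning the SVD output into word
     vectors.  With scale == 0 the result is a (normalized) row of U',
     which is what Infomap ships with; with scale == 1 it is a row of
     U'S', i.e. the commented-out scaling loop switched back on. */
  static void make_word_vector(float *vector, const float *singvals,
                               size_t dims, int scale)
  {
      size_t i;
      double norm = 0.0;

      if (scale)
          for (i = 0; i < dims; i++)
              vector[i] *= singvals[i];      /* row of U'S' instead of U' */

      /* Normalize so that the dot product of two vectors is their cosine. */
      for (i = 0; i < dims; i++)
          norm += (double)vector[i] * (double)vector[i];
      norm = sqrt(norm);
      if (norm > 0.0)
          for (i = 0; i < dims; i++)
              vector[i] /= (float)norm;
  }

Because the vectors are normalized either way, the choice only changes the relative weight of the latent axes: scaling by S' stretches the space along the highest-variance directions, which is consistent with the very high, undiscriminating cosines Dominic describes.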
From: <vl...@sp...> - 2005-05-17 22:33:07
Hello,

I want to ask about one thing in the Infomap algorithm.

Let matrix A be the co-occurrence matrix, where the i,j-th element is the co-occurrence count for word i within a window of content-bearing word j throughout the corpus. If the rows of matrix A are normalized, the cosine similarity between two words (the i-th and j-th) can be computed as the dot product of the i-th and j-th row vectors of A, that is, the i,j-th element of the matrix A(A^T). After Singular Value Decomposition A=US(V^T) and keeping only the k largest singular values in S, the matrix A'=U'S'(V'^T) is an approximation of A. So A'(A'^T) = U'S'(V'^T)(U'S'(V'^T))^T = U'S'(U'S')^T is an approximation of A(A^T), and the i-th row of the matrix U'S' should be the vector in WordSpace for the i-th word. I'm missing the reason why the Infomap algorithm uses only the rows of U' as vectors in WordSpace.

The multiplication of U' by S' can be done by this commented-out part of the code in encode_wordvec.c:

  /* Scale the vector if you want to

     Commented out by Dominic, June 2001, because the first
     principal component only really seems to say that all the
     data is in the positive quadrant; declare i if you want to
     use these two lines of code. (Also, comment back in the
     code to declare the singvals array and to open SINGVAL_FILE
     and read its contents into that array.)

     for( i=0; i<SINGVALS; i++)
       vector[i] *= singvals[i]; */

Is it true that the "first principal component only really seems to say that all the data is in the positive quadrant"?

I will be very happy if someone could clarify the infomap algorithm to me.

Best regards,
Vladimir Repisky
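The algebra in this question can be written out compactly. The following restates it in LaTeX, under the same assumption as above that the rows of A have been normalized to unit length so that dot products of rows are cosines:

  \[
  A \approx A' = U' S' (V')^{T}, \qquad
  A'(A')^{T} = U' S' (V')^{T} V' S' (U')^{T} = (U' S')(U' S')^{T},
  \]

since the columns of V' are orthonormal ((V')^{T} V' = I) and S' is diagonal. The i,j-th entry of (U' S')(U' S')^{T} therefore approximates the cosine similarity of words i and j, which is why the rows of U' S' are the "natural" word vectors; using the rows of U' alone, as the released code does, rescales every latent axis to equal weight, the trade-off discussed in Dominic's reply above.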
From: <vl...@sp...> - 2005-05-09 11:34:43
Hi Beate,

Thank you very much for your advice; everything works fine now. I was confused because the count of vectors in the 'left' file was the same as the count of words in the 'dic' file: the 'left' file contains zero vectors at the end to match the count of words in 'dic', and 'wordvec.bin' contains only the non-stopword vectors. I find it unnecessary to run the SVD with the dimension of all words; it can be done with the dimension of the non-stopword count.

Best regards,
Vladimir Repisky

On Mon, 09 May 2005 11:18:30 +0200, Beate Dorow <do...@im...> wrote:

> Hi Vladimir,
>
>> I want to ask if I've correctly understood the source code: that the associate command for word associating only reads vectors from the 'wordvec.bin' file and returns the sorted top words with greatest cosine similarity to the query word vector? And if the query word is a non-stopword from the 'dic' file, then the vector for this word is the same as the corresponding vector stored in the 'wordvec.bin' file?
>
> Yes, that's exactly how it works.
>
>> I assume that the words in 'wordvec.bin' are normalized vectors from the 'left' file, where vectors are successively associated with words in the 'dic' file. Is this right?
>
> This is right, except that the dictionary also contains the stopwords. So the vectors from the "left" file are successively associated with the non-stopwords (value of the third column is 0) in "dic".
>
> To check where things go wrong, you could retrieve the vectors of word1 and word2 using "associate -q ..." and check whether the vectors retrieved are consistent with the vectors you obtain.
>
> Further, you can compute the cosine similarity of word1 and word2 using "compare_words.pl <options> word1 word2" and see if you get the same result using your own technique.
>
> Good luck!
> Best wishes,
> Beate
From: Beate D. <do...@im...> - 2005-05-09 09:19:11
Hi Vladimir,

> I want to ask if I've correctly understood the source code: that the associate command for word associating only reads vectors from the 'wordvec.bin' file and returns the sorted top words with greatest cosine similarity to the query word vector? And if the query word is a non-stopword from the 'dic' file, then the vector for this word is the same as the corresponding vector stored in the 'wordvec.bin' file?

Yes, that's exactly how it works.

> I assume that the words in 'wordvec.bin' are normalized vectors from the 'left' file, where vectors are successively associated with words in the 'dic' file. Is this right?

This is right, except that the dictionary also contains the stopwords. So the vectors from the "left" file are successively associated with the non-stopwords (value of the third column is 0) in "dic".

To check where things go wrong, you could retrieve the vectors of word1 and word2 using "associate -q ..." and check whether the vectors retrieved are consistent with the vectors you obtain.

Further, you can compute the cosine similarity of word1 and word2 using "compare_words.pl <options> word1 word2" and see if you get the same result using your own technique.

Good luck!
Best wishes,
Beate
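A small consistency check along the lines Beate suggests can be done directly against the model files. The sketch below counts how many dictionary entries should receive a vector, assuming a whitespace-separated 'dic' file whose third column is the stopword flag described above; the other fields and their types are assumptions made for illustration, not the documented format.

  #include <stdio.h>

  /* Count the dictionary entries that should be paired with a row of the
     "left" (SVD output) file, i.e. the non-stopwords, marked here by a 0
     in the third column as described in the reply above.  The first two
     fields (word and count) are assumed for illustration only. */
  int count_vector_words(const char *dic_path)
  {
      FILE *dic = fopen(dic_path, "r");
      char word[256];
      long count;
      int stop_flag;
      int n_vectors = 0;

      if (!dic)
          return -1;
      while (fscanf(dic, "%255s %ld %d", word, &count, &stop_flag) == 3) {
          if (stop_flag == 0)       /* non-stopword: gets the next vector */
              n_vectors++;
      }
      fclose(dic);
      return n_vectors;
  }

Comparing this count with the number of vectors in 'wordvec.bin' (or with the zero-padded count in 'left', which Vladimir notes elsewhere in the thread) is a quick way to catch off-by-one pairing mistakes.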
From: <vl...@sp...> - 2005-05-09 00:14:54
Hello,

I want to build a word-to-word matrix which will contain similarities between words. Running the associate command for all non-stopwords in the 'dic' file is very slow, so I tried to read the vectors from the 'wordvec.bin' file and compute the cosine similarity between them. I'm not getting the same results as the associate command.

I want to ask if I've correctly understood the source code: that the associate command for word associating only reads vectors from the 'wordvec.bin' file and returns the sorted top words with greatest cosine similarity to the query word vector? And if the query word is a non-stopword from the 'dic' file, then the vector for this word is the same as the corresponding vector stored in the 'wordvec.bin' file?

I assume that the words in 'wordvec.bin' are normalized vectors from the 'left' file, where vectors are successively associated with words in the 'dic' file. Is this right?

Best regards,
Vladimir Repisky
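Since the stored vectors are normalized (as confirmed in Beate's reply above), the similarity Vladimir wants reduces to a plain dot product once the vectors are in memory. The sketch below assumes two float arrays of length dims have already been loaded; the on-disk layout of 'wordvec.bin' is not specified in this thread, so no file-reading code is shown.

  #include <stddef.h>

  /* Cosine similarity of two unit-length word vectors is just their dot
     product; if the vectors are not known to be normalized, divide by the
     product of their Euclidean norms instead. */
  static double cosine_unit(const float *a, const float *b, size_t dims)
  {
      double dot = 0.0;
      size_t i;
      for (i = 0; i < dims; i++)
          dot += (double)a[i] * (double)b[i];
      return dot;
  }

Filling the full word-by-word matrix is then two nested loops over the non-stopword vocabulary (only the upper triangle is needed, since the measure is symmetric). If the results disagree with the associate command, the mismatch is more likely in pairing vectors with the right dictionary entries than in the arithmetic, which is what Beate's reply above suggests checking.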
From: Deane, P. <pd...@et...> - 2005-04-29 13:19:43
Has anyone successfully compiled infomap on 64-bit Linux? (AMD64 running Debian with 64-bit libraries only.) When I tried to compile 0.8.5 (using the 64-bit version of GCC), I got the following error messages during compile:

myutils.c: In function `mymalloc':
myutils.c:167: error: conflicting types for 'malloc'
make[3]: *** [myutils.o] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1

When I comment out the offending line, it compiles, but I get memory errors when I attempt to run the program, i.e., the end of the log looks like this:

==================================================
Building target: /export/home/bragi2/pdeane/test/left
Prerequisites: /export/home/bragi2/pdeane/test/coll /export/home/bragi2/pdeane/test/indx
Fri Apr 29 09:06:56 EDT 2005
..................................................
cd /export/home/bragi2/pdeane/test && rm svd_diag left \
rght sing
rm: cannot remove `svd_diag': No such file or directory
rm: cannot remove `left': No such file or directory
rm: cannot remove `rght': No such file or directory
rm: cannot remove `sing': No such file or directory
make: [/export/home/bragi2/pdeane/test/left] Error 1 (ignored)
cd /export/home/bragi2/pdeane/test && svdinterface \
    -singvals 100 \
    -iter 100
This is svdinterface.
Writing to: left
Writing to: rght
Writing to: sing
Writing to: svd_diag
Reading: indx
Reading: indx
Reading: coll
make: *** [/export/home/bragi2/pdeane/test/left] Error 139
---------------------------------------------------------------------------------------------

When I check the directory the model was being created in, this is the state of things, with the evidence suggesting that something went nastily wrong in the svdinterface code that calls the mymalloc function:

-rw-r--r-- 1 pdeane nlp 2819650 Apr 29 09:06 coll
-rw-r--r-- 1 pdeane nlp     374 Apr 29 09:06 corpus_format.bin
-rw-r--r-- 1 pdeane nlp  157223 Apr 29 09:06 dic
-rw-r--r-- 1 pdeane nlp    3939 Apr 29 09:06 indx
-rw-r--r-- 1 pdeane nlp       0 Apr 29 09:06 left
-rw-r--r-- 1 pdeane nlp      28 Apr 29 09:06 model_info.bin
-rw-r--r-- 1 pdeane nlp   16396 Apr 29 09:06 model_params.bin
-rw-r--r-- 1 pdeane nlp       4 Apr 29 09:06 numDocs
-rw-r--r-- 1 pdeane nlp       0 Apr 29 09:06 rght
-rw-r--r-- 1 pdeane nlp       0 Apr 29 09:06 sing
-rw-r--r-- 1 pdeane nlp       0 Apr 29 09:06 svd_diag
-rw-r--r-- 1 pdeane nlp  540345 Apr 29 09:06 wordlist

I tried changing the defines (e.g. BIGINT, BIGFLOAT) in fixed.h to 64-bit values, but this made no difference, and that was the only obvious 32-bit dependency I could see. Interestingly, the original SVDPACKC compiles and runs in my environment, so whatever is going on looks like it must be a function of the svdinterface code and associated changes. Also, btw, I have had no problem running infomap on a 32-bit Linux machine nor under cygwin on a Windows machine.

Can anybody help me out here?

Thanks
Paul Deane
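The "conflicting types for 'malloc'" error typically comes from a source file declaring malloc by hand instead of including <stdlib.h>; on a 64-bit platform, a declaration (or an implicit declaration) that returns int truncates the returned pointer, which would also explain the later Error 139 (segfault) in svdinterface. The actual contents of myutils.c are not shown in this thread, so the fragment below only illustrates the general pattern and its fix; it is not a patch against the real file.

  /* Problematic pattern (illustration only): a hand-written declaration
     that disagrees with the real prototype on 64-bit systems, where
     pointers are wider than int.

     char *malloc();     <- declarations like this cause the conflict */

  #include <stdio.h>
  #include <stdlib.h>    /* correct prototype: void *malloc(size_t) */

  /* A checked allocator in the spirit of mymalloc(): rely on the system
     header for malloc's type and fail loudly on exhaustion. */
  void *checked_malloc(size_t nbytes)
  {
      void *p = malloc(nbytes);
      if (p == NULL) {
          fprintf(stderr, "checked_malloc: out of memory (%zu bytes)\n", nbytes);
          exit(EXIT_FAILURE);
      }
      return p;
  }

If line 167 of myutils.c is such a declaration, deleting it and relying on <stdlib.h> should fix both the compile error and the crash; merely commenting out the line that triggers the error can leave malloc implicitly declared as returning int, which on a 64-bit machine truncates the returned pointer.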
From: Jeremy H. <hi...@le...> - 2005-04-26 18:43:10
Hi all,

I'm trying to install the infomap package on my Mandrake 10.0 machine. I ran ./configure. It told me to go download and install the gnudbm package, which I did with no problem. But the infomap ./configure is still complaining. I moved the gdbm-... folder to /usr/local/include/gdbm and that fixed some of the error messages, but now I'm really stuck. This is my output from infomap ./configure:

[root@localhost infomap-nlp-0.8.5]# ./configure
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ANSI C... none needed
checking for style of include used by make... GNU
checking dependency style of gcc... gcc3
checking how to run the C preprocessor... gcc -E
checking for a BSD-compatible install... /usr/bin/install -c
checking whether make sets $(MAKE)... (cached) yes
checking for sqrt in -lm... yes
checking for dbm_open in -lgdbm... no
checking for dbm_open in -lgdbm_compat... no
checking for egrep... grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking fcntl.h usability... yes
checking fcntl.h presence... yes
checking for fcntl.h... yes
checking locale.h usability... yes
checking locale.h presence... yes
checking for locale.h... yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking sys/file.h usability... yes
checking sys/file.h presence... yes
checking for sys/file.h... yes
checking sys/time.h usability... yes
checking sys/time.h presence... yes
checking for sys/time.h... yes
checking for unistd.h... (cached) yes
checking ndbm.h usability... yes
checking ndbm.h presence... yes
checking for ndbm.h... yes
checking for an ANSI C-conforming const... yes
checking for off_t... yes
checking for size_t... yes
checking for stdlib.h... (cached) yes
checking for GNU libc compatible malloc... yes
checking whether lstat dereferences a symlink specified with a trailing slash... yes
checking whether stat accepts an empty string... no
checking for getcwd... yes
checking for memset... yes
checking for setlocale... yes
checking for sqrt... yes
checking for strchr... yes
checking for strdup... yes
checking for strstr... yes
checking for dbm_open... no
checking for dbm_close... no
checking for dbm_fetch... no
checking for dbm_store... no
configure: WARNING: configure could not find all of the necessary DBM functions (dbm_open, dbm_close, dbm_fetch, dbm_store). This probably means that configure could not find your DBM library, or that you do not have a DBM library installed. (See above output for the results of library checks.) Please install a DBM-compatible library (such as GNU DBM) or write to inf...@li... for advice.
configure: error: One or more errors found (see output above for details). Configuration unsuccessful.

I'm not really sure where I should go from here. Thanks!!!!!!

hise

--
Jeremy Hise
http://h153.dr0ne.info/
From: Mich <pse...@zo...> - 2005-03-03 19:52:32
Hello Trevor,

Wednesday, March 2, 2005, 8:04:37 PM, you wrote:

TC> Hello and thank you to the developers of infomap, it is proving most useful. I have an issue regarding document x document comparison. Looking through the archive, I found this comment:
TC>
TC> "To actually compare two documents, you can use "associate -q -i d ..." for each of the two documents to obtain their vector representations, and then simply compute the scalar product."
TC>
TC> This would be great, but the 0.8.5 version of infomap that I'm using has an associate that doesn't take the "-i" option.
TC>
TC> Is there a way in this version to compare documents to one another or retrieve their vectors?

It was me who originally asked this question concerning document x document comparisons. Incidentally, someone approached me today concerning a case of possible plagiarism and whether infomap could be used for this. My idea is that it could, but mostly using something such as you mentioned there; i.e., if you have enough documents in your corpus, including the suspected case of plagiarism, it should be possible to find out which document yields the greatest similarity.

Anyway, the suggestion was put forward of entering the document for comparison as a full plaintext parameter to the 'associate' program (i.e., "associate ... -d [document as plaintext]"). This would be a 'crude way' of doing comparisons.

Let me know if you find anything, and if you succeed in getting those vector representations and scalar products.

Cheers,
Mich
From: Trevor C. <tr...@ya...> - 2005-03-02 19:04:49
Hello, and thank you to the developers of infomap; it is proving most useful. I have an issue regarding document x document comparison. Looking through the archive, I found this comment:

"To actually compare two documents, you can use "associate -q -i d ..." for each of the two documents to obtain their vector representations, and then simply compute the scalar product."

This would be great, but the 0.8.5 version of infomap that I'm using has an associate that doesn't take the "-i" option.

Is there a way in this version to compare documents to one another or retrieve their vectors?

Regards,
Trevor
From: Mich <pse...@zo...> - 2005-02-21 02:34:20
Hello Jeff,

Sunday, February 20, 2005, 11:17:12 PM, you wrote:

JP> I think I already know the answer to this, but I'm being forced to ask this by a higher-up on this research project I've been hired on to help with.
JP> Does anyone know of or done a windows conversion for the infomap software?
JP> All our servers for the research project are on Windows boxes, and the idea of running a *nix machine, while a good idea to me, scares the project leaders.
JP> If not, any other equivalent projects anyone knows of for Windows would be helpful.

Well, till now I have worked with infomap using cygwin, which takes quite some time to install well, but it seems to do the trick. After that (local) install, I built some sort of Windows wrap around it, which uses a few batch files and a GUI, but I don't think it will easily be translatable to another PC. Still, I suggest cygwin; and if you find a way to port the infomap-build wrapper to Windows, let me know!

Cheers,
Mich
From: Jeff P. <jeffsp@u.washington.edu> - 2005-02-20 22:17:16
I think I already know the answer to this, but I'm being forced to ask this by a higher-up on this research project I've been hired on to help with.

Does anyone know of, or has anyone done, a Windows conversion of the infomap software? All our servers for the research project are on Windows boxes, and the idea of running a *nix machine, while a good idea to me, scares the project leaders. If not, any other equivalent projects anyone knows of for Windows would be helpful.

Thanks.

Jeff Pollard
PGIST Assistant
University of Washington
From: Felice Dell'O. <fel...@il...> - 2005-02-17 14:03:32
Hi,

I'm Felice Dell'Orletta and I'm working for the ILC Institute at CNR in Pisa. I'm writing to you hoping you may help me to solve a problem. I'm using the Infomap NLP software, and the problem is: how can I extract the reduced Word Space vectors of all the content-bearing words? And what is the output of the "-f" command-line option (of the "associate" command)? When I use this command-line option on two different words it returns different Word Space vectors for the same word; why? And in those Word Space vectors I get a particular value, "nan nan"; what is it?

I wish you would help me, and excuse me for having bothered you. Thank you in advance.

Felice Dell'Orletta.
From: Trevor C. <tr...@ya...> - 2005-02-11 23:24:40
Thanks Michiel - I've noticed the wordlist file generated in the <modeltag> subdirectory has document labels, so they could be collected there also. Still need to check all of this for consistency somehow, though...

regards,
trevor

-------------

Maybe it will not help you, but I have similar problems with print_doc; that is, it states "Segmentation fault, core dumped" and puts the following in a text file:

"Exception: STATUS_ACCESS_VIOLATION at eip=610D3DA0
eax=61792D11 ebx=61792D0C ecx=61792D11 edx=00000000 esi=10010210 edi=61792D11
ebp=0022E308 esp=0022E2FC program=F:\Cygwin\usr\local\bin\print_doc.exe, pid 1912, thread main
cs=001B ds=0023 es=0023 fs=0038 gs=0000 ss=0023
Stack trace:
Frame    Function  Args
0022E308 610D3DA0 (61792D11, 00000000, 0022E328, 6100650A)
0022E328 610D3C35 (61792D0C, 00000000, 00000000, 00000000)
0022EFC8 004010E5 (00000007, 61792C88, 100100A8, 0022F020)
0022F008 61006145 (0022F020, 0022F31C, 77F64EAC, 0022F340)
0022FF88 61006350 (00000000, 00000000, 00000000, 00000000)
End of stack trace"

----------

As you might notice, I'm using cygwin, which works, except for print_doc. Would it be possible to take the doc id and, using that, find back the title without using print_doc? What I'm currently doing is using a side route: the document ids seem to increase as a function of their placement in the corpus file. So, what I'm doing is feeding the model a word which I know is present in every document, thus feeding back the output. Then, I sort the doc ids and put them next to the list of files in the corpus (I have all documents appended one to another at an earlier stage) and 'bind' the ids to the titles. Is this a valid approach? If it is, maybe it's useful to Trevor as well.

Thanks,
Mich
From: <MS...@FS...> - 2005-02-10 10:49:56
First of all: thanks Leif & Beate.

-----Original Message-----
From: inf...@li... [mailto:infomap-nlp-users-...@li...] On Behalf Of Trevor Cohen
Sent: Wednesday, February 09, 2005 11:25 PM
To: inf...@li...
Subject: [infomap-nlp-users] print_doc and infomap-install problems

Hi,

I've managed to construct a semantic space from a sizeable corpus and generate term and document associations. Unfortunately print_doc hangs indefinitely when trying to retrieve docs, and attempts to install any models generate the following error:

io(/home/trc9003)> infomap-install toto
/home/trc9003/local/bin/infomap-install: install: not found
Using install control file "/home/trc9003/local/share/infomap-nlp/install.control"
/home/trc9003/local/bin/infomap-install: !: not found
/home/trc9003/local/bin/infomap-install: install: not found"  --> this message is repeated several times

Does anyone have any idea what the problem is here? The same thing happens with the kjbible.txt file. Strangely, I have a recollection of successfully using print_doc with this when first experimenting with the software, but it crashes now.

Regards,
Trevor

--------------

Maybe it will not help you, but I have similar problems with print_doc; that is, it states "Segmentation fault, core dumped" and puts the following in a text file:

"Exception: STATUS_ACCESS_VIOLATION at eip=610D3DA0
eax=61792D11 ebx=61792D0C ecx=61792D11 edx=00000000 esi=10010210 edi=61792D11
ebp=0022E308 esp=0022E2FC program=F:\Cygwin\usr\local\bin\print_doc.exe, pid 1912, thread main
cs=001B ds=0023 es=0023 fs=0038 gs=0000 ss=0023
Stack trace:
Frame    Function  Args
0022E308 610D3DA0 (61792D11, 00000000, 0022E328, 6100650A)
0022E328 610D3C35 (61792D0C, 00000000, 00000000, 00000000)
0022EFC8 004010E5 (00000007, 61792C88, 100100A8, 0022F020)
0022F008 61006145 (0022F020, 0022F31C, 77F64EAC, 0022F340)
0022FF88 61006350 (00000000, 00000000, 00000000, 00000000)
End of stack trace"

----------

As you might notice, I'm using cygwin, which works, except for print_doc. Would it be possible to take the doc id and, using that, find back the title without using print_doc? What I'm currently doing is using a side route: the document ids seem to increase as a function of their placement in the corpus file. So, what I'm doing is feeding the model a word which I know is present in every document, thus feeding back the output. Then, I sort the doc ids and put them next to the list of files in the corpus (I have all documents appended one to another at an earlier stage) and 'bind' the ids to the titles. Is this a valid approach? If it is, maybe it's useful to Trevor as well.

Thanks,
Mich
From: Trevor C. <tr...@ya...> - 2005-02-09 22:25:17
Hi,

I've managed to construct a semantic space from a sizeable corpus and generate term and document associations. Unfortunately print_doc hangs indefinitely when trying to retrieve docs, and attempts to install any models generate the following error:

io(/home/trc9003)> infomap-install toto
/home/trc9003/local/bin/infomap-install: install: not found
Using install control file "/home/trc9003/local/share/infomap-nlp/install.control"
/home/trc9003/local/bin/infomap-install: !: not found
/home/trc9003/local/bin/infomap-install: install: not found"  --> this message is repeated several times

Does anyone have any idea what the problem is here? The same thing happens with the kjbible.txt file. Strangely, I have a recollection of successfully using print_doc with this when first experimenting with the software, but it crashes now.

Regards,
Trevor
From: <lei...@gm...> - 2005-02-08 12:32:32
Hi Mich,

If you create a stoplist for a specific domain, some of the normally content-bearing words that are too common in this domain should be included in the stop list. I supervised a master's thesis about IR once, and they worked with the intranet at Volvo. Their stoplist included words like "car" and "Volvo" to get better search results. As long as the result is not getting worse, it might be a good idea to include high-frequency words in the stop list.

Regards,
Leif

On Tue, 8 Feb 2005 11:25:15 +0100, Spapé, Michiel <MS...@fs...> wrote:

> "There is no limit to the number of neighbors. You'll simply have to change MAX_NEIGHBOR and PRINT_NEIGHBOR in associate.h to a bigger number and use associate -n <big_number>."
>
> Got it, thanks!
>
>> Another, unrelated question that should be easily answered then. If I recall correctly, some words in infomap are not taken into account; for example, the word 'the' does not hold semantic content and is therefore not taken into consideration in computing correspondences. Am I right there? Since I am trying to get this program to work using Dutch documents, I was wondering which infomap file holds the words that are ignored in the computation, so it would be easier to adapt it to the current language.
>
> "It's "stop.list" in the admin directory which contains the words to be disregarded. If you want to use a different set of stopwords, you can simply set STOPLIST_FILE in "default-params.in" to point to a different stoplist, the format of which should be such that each line contains exactly one stopword."
>
> I see. Taking a look at the stop.list right now, I must say it looks rather extensive. Could you or anyone else tell me what exactly the reasons are that some words are in this stoplist and others not? Several words, such as 'articles', 'americans', 'training', etc., would seem to hold semantic content but appear here nevertheless. Is there any literature on why these words are in the list, or could the underlying reasoning be explained?
>
> Thanks,
>
> Mich (strolling the internet right now looking for Dutch books - the Gutenberg project, although having an archive of quite a few of these, seems to be quite outdated in terms of spelling changes that arose over the last two centuries).

--
Leif Grönqvist, GSLT, le...@li..., www.ling.gu.se/~leifg, 031-821515 (home)
School of Mathematics and Systems Engineering, Växjö University, 0707164380 (mob)
Department of Linguistics, Göteborg University, +46 31 773 1177, 773 4853 (fax)
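Since the stoplist format is simply one word per line (see the quoted explanation above), building a domain-specific list like the Volvo example is mostly a matter of editing a text file and pointing STOPLIST_FILE at it. For completeness, a small illustrative C routine for loading such a file is sketched below; the function name and buffer sizes are invented for the example.

  #include <stdio.h>
  #include <string.h>

  /* Illustrative loader for a stoplist with one word per line, e.g. a
     domain-specific list that also contains over-frequent content words
     such as "car" or "volvo".  Returns the number of words read, or -1
     if the file cannot be opened. */
  int load_stoplist(const char *path, char words[][64], int max_words)
  {
      FILE *fp = fopen(path, "r");
      char line[64];
      int n = 0;

      if (!fp)
          return -1;
      while (n < max_words && fgets(line, sizeof line, fp)) {
          line[strcspn(line, "\r\n")] = '\0';   /* strip the newline */
          if (line[0] != '\0')
              strcpy(words[n++], line);
      }
      fclose(fp);
      return n;
  }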
From: <MS...@FS...> - 2005-02-08 10:25:38
"There is no limit to the number of neighbors. You'll simply have to change MAX_NEIGHBOR and PRINT_NEIGHBOR in associate.h to a bigger number and use associate -n <big_number>."

Got it, thanks!

> Another, unrelated question that should be easily answered then. If I recall correctly, some words in infomap are not taken into account; for example, the word 'the' does not hold semantic content and is therefore not taken into consideration in computing correspondences. Am I right there? Since I am trying to get this program to work using Dutch documents, I was wondering which infomap file holds the words that are ignored in the computation, so it would be easier to adapt it to the current language.

"It's "stop.list" in the admin directory which contains the words to be disregarded. If you want to use a different set of stopwords, you can simply set STOPLIST_FILE in "default-params.in" to point to a different stoplist, the format of which should be such that each line contains exactly one stopword."

I see. Taking a look at the stop.list right now, I must say it looks rather extensive. Could you or anyone else tell me what exactly the reasons are that some words are in this stoplist and others not? Several words, such as 'articles', 'americans', 'training', etc., would seem to hold semantic content but appear here nevertheless. Is there any literature on why these words are in the list, or could the underlying reasoning be explained?

Thanks,

Mich (strolling the internet right now looking for Dutch books - the Gutenberg project, although having an archive of quite a few of these, seems to be quite outdated in terms of spelling changes that arose over the last two centuries).
From: Beate D. <do...@IM...> - 2005-02-07 11:43:00
Hi Mich,

> Using the -n tag, one can extend the number of neighbours of associate's output to 200. Is this a hard limit, or could this be altered to the theoretical maximum that is near the vocabulary size?

There is no limit to the number of neighbors. You'll simply have to change MAX_NEIGHBOR and PRINT_NEIGHBOR in associate.h to a bigger number and use associate -n <big_number>.

> Another, unrelated question that should be easily answered then. If I recall correctly, some words in infomap are not taken into account; for example, the word 'the' does not hold semantic content and is therefore not taken into consideration in computing correspondences. Am I right there? Since I am trying to get this program to work using Dutch documents, I was wondering which infomap file holds the words that are ignored in the computation, so it would be easier to adapt it to the current language.

It's "stop.list" in the admin directory which contains the words to be disregarded. If you want to use a different set of stopwords, you can simply set STOPLIST_FILE in "default-params.in" to point to a different stoplist, the format of which should be such that each line contains exactly one stopword.

Best wishes,
Beate
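For concreteness, the change Beate describes would look roughly like the following in associate.h. The old value of 200 is taken from the question above; the new value is arbitrary, and since the surrounding header is not shown in this thread, treat this as a sketch rather than a verbatim patch.

  /* associate.h (sketch): raise the ceiling on how many neighbours can be
     computed and printed, then ask for them at run time with
     "associate -n <big_number>".  Recompile after editing. */

  #define MAX_NEIGHBOR    2000    /* was 200: cap on neighbours computed */
  #define PRINT_NEIGHBOR  2000    /* was 200: cap on neighbours printed  */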
From: <MS...@FS...> - 2005-02-07 11:07:27
Sorry to give you hope; I actually have no clue. My question, however, was related to yours in that I would like to know if it would be possible to extend infomap's output to multiple documents/words. Put more clearly:

- Using the -n tag, one can extend the number of neighbours of associate's output to 200. Is this a hard limit, or could this be altered to the theoretical maximum that is near the vocabulary size?

Another, unrelated question that should be easily answered then. If I recall correctly, some words in infomap are not taken into account; for example, the word 'the' does not hold semantic content and is therefore not taken into consideration in computing correspondences. Am I right there? Since I am trying to get this program to work using Dutch documents, I was wondering which infomap file holds the words that are ignored in the computation, so it would be easier to adapt it to the current language.

Lastly, I was wondering if people here have done any experimentation with the length of documents. The typical text files used here seem to have been taken from the Gutenberg project ebooks, so they are usually very large (3000+ words or something). We've been working here with small documents (+-200 words) and although it has its uses, I'm getting more and more the impression that despite our growing collection of these small files (currently at 220 files, which should grow to 1000 files), they provide too sparse information for LSI. Thus, I was wondering if we could share thoughts on:

1. minimal size of entire corpus (in words)
2. effects of average document size of files in corpus

I was thinking to maybe provide the corpus with a corpus 'bulk' of some Dutch ebooks so the model could have a headstart in providing more accurate analysis.

Thanks for listening,
Mich

-----Original Message-----
From: inf...@li... [mailto:infomap-nlp-users-...@li...] On Behalf Of Leif Grönqvist
Sent: Friday, February 04, 2005 4:49 PM
To: inf...@li...
Subject: [infomap-nlp-users] sparse matrix format

Hi!

Infomap is now using ordinary C matrices with 4 bytes per cell. Has anyone tried to rewrite the matrix-handling code using a sparse matrix format like, for example, Harwell-Boeing? This would make it possible to run on a much larger vocabulary and also would not limit the matrix size in the second dimension.

I would like to run it on 500 million running words or so, which leads to 3.5 million word types...

What do you developers think? How big would that task be?

Regards,
Leif

--
Leif Grönqvist, GSLT, le...@li..., www.ling.gu.se/~leifg, 031-821515 (home)
School of Mathematics and Systems Engineering, Växjö University, 0707164380 (mob)
Department of Linguistics, Göteborg University, +46 31 773 1177, 773 4853 (fax)
From: Rob M. <rm...@ma...> - 2005-02-04 17:56:02
On Fri, 2005-02-04 at 07:37, Leif Grönqvist wrote:

> Hi!
>
> Infomap is now using ordinary C matrices with 4 bytes per cell. Has anyone tried to rewrite the matrix-handling code using a sparse matrix format like, for example, Harwell-Boeing? This would make it possible to run on a much larger vocabulary and also would not limit the matrix size in the second dimension.
>
> I would like to run it on 500 million running words or so, which leads to 3.5 million word types...
>
> What do you developers think? How big would that task be?

I'm not sure what you mean. The mathematical guts of infomap (svdinterface/las2.c) does use a sparse representation for the word co-occurrence matrix, and there would be no advantage to using a sparse format for the reduced matrix.

Is count_wordvec where you're running into trouble? If so, I think it would be fairly easy to replace that one stage with something more robust. In fact, one of the things on my to-do list is to rewrite count_wordvec in Python (it will be much slower, but also much more flexible, more easily integrated with other NLP tools, and better suited for students to tinker with). That would make what you're asking for trivial.

--
Rob Malouf <rm...@ma...>
Department of Linguistics and Oriental Languages
San Diego State University
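For readers unfamiliar with the format Leif mentions: Harwell-Boeing is essentially a compressed sparse column layout that stores only the non-zero co-occurrence counts. The sketch below is a generic illustration of that idea, not the representation actually used inside svdinterface/las2.c.

  /* Compressed sparse column (CSC) storage, the layout underlying the
     Harwell-Boeing format: memory scales with the number of observed
     co-occurrences rather than with rows * columns. */
  typedef struct {
      long   nrows, ncols, nnz;  /* dimensions and number of non-zeros    */
      long  *colptr;             /* ncols+1 entries: start of each column */
      long  *rowind;             /* nnz entries: row index of each value  */
      float *value;              /* nnz entries: the stored counts        */
  } sparse_matrix;

  /* Example traversal: sum the entries of one column. */
  static double column_sum(const sparse_matrix *m, long col)
  {
      double sum = 0.0;
      long k;
      for (k = m->colptr[col]; k < m->colptr[col + 1]; k++)
          sum += m->value[k];
      return sum;
  }

With 3.5 million word types, a dense matrix of 4-byte cells is out of reach, but a sparse layout only pays for cells that are actually non-zero, which fits Rob's point that the SVD stage already uses a sparse representation.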
From: <lei...@gm...> - 2005-02-04 15:48:41
Hi!

Infomap is now using ordinary C matrices with 4 bytes per cell. Has anyone tried to rewrite the matrix-handling code using a sparse matrix format like, for example, Harwell-Boeing? This would make it possible to run on a much larger vocabulary and also would not limit the matrix size in the second dimension.

I would like to run it on 500 million running words or so, which leads to 3.5 million word types...

What do you developers think? How big would that task be?

Regards,
Leif

--
Leif Grönqvist, GSLT, le...@li..., www.ling.gu.se/~leifg, 031-821515 (home)
School of Mathematics and Systems Engineering, Växjö University, 0707164380 (mob)
Department of Linguistics, Göteborg University, +46 31 773 1177, 773 4853 (fax)
From: <lei...@gm...> - 2005-02-04 15:38:12
Hi!

Infomap is now using ordinary C matrices with 4 bytes per cell. Has anyone tried to rewrite the matrix-handling code using a sparse matrix format like, for example, Harwell-Boeing? This would make it possible to run on a much larger vocabulary and also would not limit the matrix size in the second dimension.

I would like to run it on 500 million running words or so, which leads to 3.5 million word types...

What do you developers think? How big would that task be?

Regards,
Leif

--
Leif Grönqvist, GSLT, le...@li..., www.ling.gu.se/~leifg, 031-821515 (home)
School of Mathematics and Systems Engineering, Växjö University, 0707164380 (mob)
Department of Linguistics, Göteborg University, +46 31 773 1177, 773 4853 (fax)
From: Rob M. <rm...@ma...> - 2005-01-30 20:11:27
Hi,

On Jan 29, 2005, at 6:27 PM, Dominic Widdows wrote:

> * If you want to check out the version from CVS you can request to become a developer on the project - it's a simple process to sign up for a sourceforge account and then you can check out the current source code.

I don't think you need to be a developer or even have a sourceforge account to get the latest version from CVS:

http://sourceforge.net/cvs/?group_id=99109

---
Rob Malouf <rm...@ma...>
Department of Linguistics and Oriental Languages
San Diego State University
From: Dominic W. <dwi...@cs...> - 2005-01-30 02:27:35
Dear All,

A couple of users recently have had problems installing the infomap software, and I have to apologise that with my current day job, new dog, and an old house to fix up, I'm not going to be able to be much help any time soon. At the moment, infomap-nlp-0.8.5 is not working for me on MacOS either, but the development version from CVS at sourceforge is fine. I therefore have two suggestions:

* If you want to check out the version from CVS you can request to become a developer on the project - it's a simple process to sign up for a sourceforge account and then you can check out the current source code.

* You could wait a couple of weeks in the hope that we get round to releasing a version 0.8.6. I'm going to wait a bit on this because a couple of new users have kindly offered to become developers and check in some bug fixes.

Again, apologies for this lack of responsiveness - all of the original infomap project team have moved on from Stanford, and the only way that the software is going to improve is if people find it useful enough to take a lead in development. I'm already very grateful that this has happened in a few cases, so hopefully the future is positive. Deep thanks to those who have already contributed.

Please feel free to register bugs directly on the sourceforge site - even if we're not good at fixing them, it's the best way to keep a record and there's always the chance that a new user might fix an old bug.

Best wishes,
Dominic
From: <m92...@st...> - 2005-01-29 09:09:14
I have some problem installing infomap. Here is the message:

-------------------
temin [~/infomap-nlp-0.8.5] -victor-> make
make  all-recursive
Making all in .
Making all in admin
"Makefile", line 421: Need an operator
Error expanding embedded variable.
*** Error code 1
Stop in /usr/home/victor/infomap-nlp-0.8.5.
*** Error code 1
Stop in /usr/home/victor/infomap-nlp-0.8.5.
--------------------

Do I lack some installation step? Thanks for help!

Tzu-Hsiang Lin
nsysu, Taiwan

----------------------------------------------------------
temin [~/infomap-nlp-0.8.5] -victor-> ./configure
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ANSI C... none needed
checking for style of include used by make... GNU
checking dependency style of gcc... gcc3
checking how to run the C preprocessor... gcc -E
checking for a BSD-compatible install... /usr/bin/install -c
checking whether make sets $(MAKE)... (cached) yes
checking for sqrt in -lm... yes
checking for dbm_open in -lgdbm... no
checking for dbm_open in -lgdbm_compat... no
checking for egrep... grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking fcntl.h usability... yes
checking fcntl.h presence... yes
checking for fcntl.h... yes
checking locale.h usability... yes
checking locale.h presence... yes
checking for locale.h... yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking sys/file.h usability... yes
checking sys/file.h presence... yes
checking for sys/file.h... yes
checking sys/time.h usability... yes
checking sys/time.h presence... yes
checking for sys/time.h... yes
checking for unistd.h... (cached) yes
checking ndbm.h usability... yes
checking ndbm.h presence... yes
checking for ndbm.h... yes
checking for an ANSI C-conforming const... yes
checking for off_t... yes
checking for size_t... yes
checking for stdlib.h... (cached) yes
checking for GNU libc compatible malloc... yes
checking whether lstat dereferences a symlink specified with a trailing slash... no
checking whether stat accepts an empty string... no
checking for getcwd... yes
checking for memset... yes
checking for setlocale... yes
checking for sqrt... yes
checking for strchr... yes
checking for strdup... yes
checking for strstr... yes
checking for dbm_open... yes
checking for dbm_close... yes
checking for dbm_fetch... yes
checking for dbm_store... yes
configure: creating ./config.status
./config.status: line 305: NLP config.status 0.8.5
configured by ./configure, generated by GNU Autoconf 2.59,
  with options ""
Copyright (C) 2003 Free Software Foundation, Inc.
This config.status script is free software; the Free Software Foundation
gives unlimited permission to copy, distribute and modify it.: No such file or directory
config.status: creating Makefile
config.status: creating preprocessing/Makefile
config.status: creating search/Makefile
config.status: creating svd/Makefile
config.status: creating svd/svdinterface/Makefile
config.status: creating lib/Makefile
config.status: creating admin/Makefile
config.status: creating admin/infomap-build
config.status: creating admin/infomap-install
config.status: creating admin/Makefile.data
config.status: creating admin/default-params
config.status: creating doc/Makefile
config.status: creating doc/man/Makefile
config.status: creating doc/man/man1/Makefile
config.status: creating doc/man/man5/Makefile
config.status: creating doc/html/Makefile
config.status: creating doc/html/tutorial/Makefile
config.status: creating config.h
config.status: config.h is unchanged
config.status: executing depfiles commands