From: Christian P. <pr...@gm...> - 2006-09-14 22:20:28
Dear Ted and Dominic,

Thank you for all the helpful information. The "folding in" is a great idea, but I am concerned that with a growing number of rows and columns the result will become increasingly imprecise (and Perl's performance might be a concern too). The database used in my thesis has 700k+ terms and 200k+ documents. Eventually only a more "economical" adaptation of INFOMAP will solve the problem, in my opinion. The information about the SVDPACK format is very welcome in this regard. A modification of INFOMAP is quite possible, but I am uncertain whether I will have the time available to do so. In case of a successful modification I will post a patch on this list, but don't hold your breath yet.

Cheers,
Christian

p.s. If anyone on this list has further ideas or hints, feel free to send an email any time.

ted pedersen wrote:
> Hi Christian,
>
> I have been following your notes on the infomap mailing list, and wanted to mention that we have used SVDPACKC a fair bit, and I think it might scale reasonably well to your particular situation. The problem with SVDPACK is that it uses a rather obscure input format, and the output format is equally obscure. :) But we have created some programs that try to deal with that in the SenseClusters package.
>
> http://senseclusters.sourceforge.net
>
> There are two programs that might help. The first is called
>
> mat2harbo.pl
>
> which takes a matrix in a fairly standard (sparse) adjacency matrix representation and converts it to Harwell-Boeing format, which is what SVDPACKC requires. It also helps set up the parameters that SVDPACKC needs to run, and then goes ahead and runs las2 (one of the types of SVD supported by SVDPACKC, and to our mind the most standard and reliable).
>
> Then, a program called svdpackout.pl is run to read the binary files generated by las2 and produce more readable output, which allows you to see the post-SVD matrix in a plain text form that you can then use for whatever you need to do.
>
> I hope this might help you try out SVDPACKC. I don't know if it will solve your problem exactly, but I think it has a good chance of doing so. We have run matrices of approximately the size you describe with SenseClusters.
>
> BTW, SVDPACKC is the C version of SVDPACK; download and install instructions are included with SenseClusters in the INSTALL file.
>
> Cordially,
> Ted
>
> On Mon, 11 Sep 2006, Dominic Widdows wrote:
>
>> Dear Christian,
>>
>> I'm afraid the deafening silence in response to your question seems to suggest that there isn't a very good answer to it - at least, not one that anyone has actively used yet.
>>
>> In answer to your SVD question - I don't think that SVD-Pack would necessarily run into the same problems, because it uses a sparse representation. (At least, I know that it reads a fairly sparse column-major representation from disk, though I don't really know its internals.) It would certainly have scaling issues at some point, but I don't know how these would compare with infomap's initial matrix generation.
>>
>> Computing and writing the matrix in blocks would certainly be an effort - one I'd very much appreciate someone doing, but not to be taken on lightly.
>>
>> Here is one sort-of solution I've used in the past for extending a basic model to individual rare words or phrases. Compute a basic infomap model within the 50k x 1k safe area.
>> Once you've done this, you can generate word vectors for rare words using the same "folding in" method you might use to get context vectors, document vectors, etc. That is, for a single rare word W, collect the words V_1, ..., V_n that occur near W (using grep or some more principled method), take an average of those V_i that already have word vectors, and call this the word vector for W. In this way, you can build a framework from the common words, and use this as scaffolding to get vectors for rare words.
>>
>> Used naively, the method scales pretty poorly - if you wanted to create vectors for another 50k words, you'd be pretty sad to run 50,000 greps end to end. Obviously you wouldn't do this in practice; you'd write something to keep track of your next 50k words and their word vectors as you go along. For example, some data structure that recorded "word, vector, count_of_neighbors_used" would enable you to update the word vector when you encountered new neighbors in text, using the count to weight changes to the vector. In this case, memory requirements to add a lot of new words would be pretty minimal. For large-scale work, you'd then want to find a way of adding these vectors to the database files you already have for the common words.
>>
>> So, there is work to do, but I think it's simpler than refactoring the matrix algebra. If you only want word vectors for a few rare words, it's really easy. Let me know if this is the case; I have a (very grubby) perl script already that might help you out.
>>
>> Sorry for the delay in answering, I hope this helps.
>> Dominic
>>
>> On Sep 8, 2006, at 3:55 AM, Christian Prokopp wrote:
>>
>>> Hello,
>>>
>>> I am running INFOMAP on a 32-bit Linux machine and have problems when I try to use a large matrix, e.g. beyond 40k x 2k or 80k x 1k. My suspicion is that the matrix allocation in initialize_matrix() in matrix.c fails because it runs out of address space at around 3GB. Does anyone have a solution besides using a 64-bit system?
>>> It seems very possible to rewrite the parts of INFOMAP that compute and write the matrix so they work in blocks rather than on the whole matrix at once, but (a) that is a lot of work and (b) would SVD-Pack run into the same problem?
>>>
>>> Any thoughts are appreciated!
>>>
>>> Cheers,
>>> Christian
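
To make Ted's Harwell-Boeing step a little more concrete, here is a minimal Perl sketch of that kind of conversion. It is not mat2harbo.pl: the input format (whitespace-separated 1-based "row column value" triples on stdin), the title and key strings, and the chosen Fortran field widths are assumptions made for the example, and the real script also sets up and runs las2, which this does not.

    #!/usr/bin/perl
    # Toy Harwell-Boeing writer: reads "row col value" triples (1-based) on
    # stdin and writes a real rectangular assembled (RRA) matrix in
    # compressed-column form. Input format is made up for illustration.
    use strict;
    use warnings;

    my ($nrow, $ncol) = (0, 0);
    my %cols;                      # column => [ [row, value], ... ]
    while (<STDIN>) {
        my ($r, $c, $v) = split;
        next unless defined $v;
        $nrow = $r if $r > $nrow;
        $ncol = $c if $c > $ncol;
        push @{ $cols{$c} }, [ $r, $v ];
    }

    # Compressed-column arrays with 1-based pointers, rows sorted per column.
    my @ptr = (1);
    my (@ind, @val);
    for my $c (1 .. $ncol) {
        for my $e (sort { $a->[0] <=> $b->[0] } @{ $cols{$c} || [] }) {
            push @ind, $e->[0];
            push @val, $e->[1];
        }
        push @ptr, scalar(@ind) + 1;
    }
    my $nnz = scalar @ind;

    # Card counts: 10 integers per line (I8), 5 reals per line (E16.8).
    my $ptrcrd = int((scalar(@ptr) + 9) / 10);
    my $indcrd = int(($nnz + 9) / 10);
    my $valcrd = int(($nnz + 4) / 5);
    my $totcrd = $ptrcrd + $indcrd + $valcrd;

    printf "%-72s%-8s\n", "converted matrix", "toy";
    printf "%14d%14d%14d%14d%14d\n", $totcrd, $ptrcrd, $indcrd, $valcrd, 0;
    printf "%-3s%11s%14d%14d%14d%14d\n", "RRA", "", $nrow, $ncol, $nnz, 0;
    printf "%-16s%-16s%-20s%-20s\n", "(10I8)", "(10I8)", "(5E16.8)", "(5E16.8)";

    print_block(\@ptr, 10, "%8d");
    print_block(\@ind, 10, "%8d");
    print_block(\@val, 5, "%16.8e");

    sub print_block {
        my ($a, $per, $fmt) = @_;
        for (my $i = 0; $i < @$a; $i += $per) {
            my $end = $i + $per - 1;
            $end = $#$a if $end > $#$a;
            print join("", map { sprintf $fmt, $_ } @{$a}[$i .. $end]), "\n";
        }
    }

The essential point is the compressed-column layout: one pointer per column into parallel arrays of row indices and values, all 1-based, written as fixed-width cards matching the formats declared in the header.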
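
Dominic's folding-in recipe for a single rare word is easy to sketch as well. This is not his perl script; the %word_vec hash and the hard-coded neighbour list below are toy stand-ins for the infomap word-vector database and for the grep step.

    #!/usr/bin/perl
    # Folding in one rare word: average the existing word vectors of its
    # observed neighbours. %word_vec and @neighbours are toy placeholders;
    # in a real run the vectors come from the infomap model and the
    # neighbours from grep (or something more principled).
    use strict;
    use warnings;

    my %word_vec = (
        disk   => [ 0.8, 0.1, 0.0 ],
        memory => [ 0.7, 0.2, 0.1 ],
        kernel => [ 0.1, 0.9, 0.2 ],
    );

    # Words seen near the rare word W; duplicates simply weight the average.
    my @neighbours = qw(disk memory memory unknownword kernel);

    if (my $w_vec = fold_in(\@neighbours, \%word_vec)) {
        print join(" ", map { sprintf "%.4f", $_ } @$w_vec), "\n";
    }

    sub fold_in {
        my ($neighbours, $vecs) = @_;
        my @known = grep { exists $vecs->{$_} } @$neighbours;  # skip words without vectors
        return undef unless @known;
        my $dim = scalar @{ $vecs->{ $known[0] } };
        my @avg = (0) x $dim;
        for my $n (@known) {
            $avg[$_] += $vecs->{$n}[$_] for 0 .. $dim - 1;
        }
        $_ /= scalar @known for @avg;                           # plain average
        return \@avg;
    }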
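
And one plausible reading of the "word, vector, count_of_neighbors_used" bookkeeping is a running average that folds each newly seen neighbour into the stored vector, weighting by how many neighbours have already contributed. The data structure and the exact update rule below are assumptions, not code from infomap or SenseClusters.

    #!/usr/bin/perl
    # Running-average variant of folding in: keep "word, vector,
    # count_of_neighbors_used" per rare word so the vector can be updated as
    # new neighbours turn up in text. Structure and update rule are assumed.
    use strict;
    use warnings;

    my %rare;   # rare word => { vec => [...], count => N }

    # Fold one more neighbour vector into the running average for $word.
    sub update_rare_word {
        my ($word, $neighbour_vec) = @_;
        my $entry = $rare{$word} ||= { vec => [ (0) x scalar @$neighbour_vec ], count => 0 };
        my $c = $entry->{count};
        for my $i (0 .. $#$neighbour_vec) {
            # old average weighted by how many neighbours built it, plus the new one
            $entry->{vec}[$i] = ($entry->{vec}[$i] * $c + $neighbour_vec->[$i]) / ($c + 1);
        }
        $entry->{count} = $c + 1;
    }

    # Example: a rare word seen near two common words in turn.
    update_rare_word("rareword", [ 0.8, 0.1, 0.0 ]);   # e.g. the vector of "disk"
    update_rare_word("rareword", [ 0.1, 0.9, 0.2 ]);   # e.g. the vector of "kernel"

    printf "rareword (%d neighbours): %s\n",
        $rare{rareword}{count},
        join(" ", map { sprintf "%.4f", $_ } @{ $rare{rareword}{vec} });

Because only the current average and the count are stored per word, memory stays small no matter how many occurrences are seen, which is the point Dominic makes about adding another 50k words.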