From: Dominic W. <wi...@ma...> - 2006-09-12 01:56:53
Dear Christian,

I'm afraid the deafening silence in response to your question seems to
suggest that there isn't a very good answer - at least, not one that
anyone has actively used yet.

In answer to your SVD question: I don't think that SVD-Pack would
necessarily run into the same problems, because it uses a sparse
representation. (At least, I know that it reads a fairly sparse
column-major representation from disk, though I don't really know its
internals.) It would certainly have scaling issues at some point, but I
don't know how these would compare with infomap's initial matrix
generation. Computing and writing the matrix in blocks would certainly
be an effort - one I'd very much appreciate someone doing, but not to
be taken on lightly.

Here is one sort-of solution I've used in the past for extending a
basic model to individual rare words or phrases. Compute a basic
infomap model within the 50k x 1k safe area. Once you've done this, you
can generate word vectors for rare words using the same "folding in"
method you might use to get context vectors, document vectors, etc.
That is, for a single rare word W, collect the words V_1, ..., V_n that
occur near W (using grep or some more principled method), take an
average of those V_i that already have word vectors, and call this the
word vector for W. In this way, you can build a framework from the
common words, and use it as scaffolding to get vectors for rare words.
(A rough sketch of this step is appended at the end of this message.)

Used naively, the method scales pretty poorly - if you wanted to create
vectors for another 50k words, you'd be pretty sad to run 50,000 greps
end to end. Obviously you wouldn't do this in practice; you'd write
something to keep track of your next 50k words and their word vectors
as you go along. For example, a data structure that recorded "word,
vector, count_of_neighbors_used" would let you update a word's vector
whenever you encountered new neighbors in text, using the count to
weight changes to the vector. (The second sketch at the end shows one
way to do this.) In this case, the memory needed to add a lot of new
words would be pretty minimal. For large-scale work, you'd then want to
find a way of adding these vectors to the database files you already
have for the common words.

So, there is work to do, but I think it's simpler than refactoring the
matrix algebra. If you only want word vectors for a few rare words,
it's really easy. Let me know if this is the case; I have a (very
grubby) perl script already that might help you out.

Sorry for the delay in answering. I hope this helps.

Dominic

On Sep 8, 2006, at 3:55 AM, Christian Prokopp wrote:

> Hello,
>
> I am running INFOMAP on a 32-bit Linux machine and have problems when
> I try to use a large matrix, e.g. beyond 40k x 2k or 80k x 1k. My
> suspicion is that the matrix allocated in initialize_matrix() in
> matrix.c exits because it runs out of address space at around 3GB.
> Does anyone have a solution besides using a 64-bit system?
> It seems very possible to rewrite the parts of INFOMAP to compute and
> write the matrix in blocks rather than in its entirety, but (a) that
> is a lot of work and (b) would SVD-Pack run into the same problem?
>
> Any thoughts are appreciated!
>
> Cheers,
> Christian
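P.S. In case it helps, here is a rough sketch of the folding-in step
for a single rare word - in Python rather than perl, and with invented
names (fold_in, common_vectors and so on are just for illustration, not
part of infomap). It assumes you've already loaded the common words'
vectors into a dict of numpy arrays, however you choose to read them
out of the model files:

    import numpy as np

    def fold_in(rare_word, corpus_lines, common_vectors, window=5):
        """Average the vectors of known words seen near rare_word."""
        total, count = None, 0
        for line in corpus_lines:
            tokens = line.split()
            for i, tok in enumerate(tokens):
                if tok != rare_word:
                    continue
                lo, hi = max(0, i - window), i + window + 1
                for neighbor in tokens[lo:i] + tokens[i + 1:hi]:
                    vec = common_vectors.get(neighbor)
                    if vec is None:
                        continue  # neighbor has no vector; skip it
                    total = vec.copy() if total is None else total + vec
                    count += 1
        # None if the word never appeared next to a known word
        return total / count if count else None

This is the one-word-at-a-time version: fine for a handful of rare
words, hopeless for another 50k of them run end to end.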
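And here is a sketch of the incremental bookkeeping for the many-words
case, again with invented names. The "word, vector,
count_of_neighbors_used" record becomes a dict entry, and the
count-weighted update is done as a running mean - one reasonable
reading of "use the count to weight changes", though certainly not the
only one:

    import numpy as np

    class RareWordTable:
        """One pass over the corpus, averaging neighbors as we go."""

        def __init__(self, common_vectors, window=5):
            self.common = common_vectors   # word -> numpy array
            self.window = window
            self.table = {}                # word -> [vector, count]

        def observe(self, tokens):
            """Fold one tokenized line into the rare-word vectors."""
            for i, tok in enumerate(tokens):
                if tok in self.common:
                    continue               # only track rare words
                lo, hi = max(0, i - self.window), i + self.window + 1
                for neighbor in tokens[lo:i] + tokens[i + 1:hi]:
                    vec = self.common.get(neighbor)
                    if vec is None:
                        continue
                    entry = self.table.setdefault(
                        tok, [np.zeros_like(vec, dtype=float), 0])
                    entry[1] += 1
                    # running mean: each new neighbor moves the
                    # vector less as the neighbor count grows
                    entry[0] += (vec - entry[0]) / entry[1]

You'd drive it with something like:

    table = RareWordTable(common_vectors)
    with open("corpus.txt") as f:
        for line in f:
            table.observe(line.split())

after which table.table holds a vector (and neighbor count) for every
rare word seen near at least one common word. Writing those vectors
back into the existing database files is the part that's still real
work.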