From: Dominic W. <wi...@ma...> - 2006-09-12 01:56:53
Dear Christian,

I'm afraid the deafening silence in response to your question seems to
suggest that there isn't a very good answer - at least, not one that
anyone has actively used yet.

In answer to your SVD question: I don't think that SVD-Pack would
necessarily run into the same problems, because it uses a sparse
representation. (At least, I know that it reads a fairly sparse
column-major representation from disk, though I don't really know its
internals.) It would certainly have scaling issues at some point, but I
don't know how these would compare with infomap's initial matrix
generation. Computing and writing the matrix in blocks would certainly
be an effort - one I'd very much appreciate someone doing, but not to
be taken on lightly.

Here is one sort-of solution I've used in the past for extending a
basic model to individual rare words or phrases. Compute a basic
infomap model within the 50k x 1k safe area. Once you've done this, you
can generate word vectors for rare words using the same "folding in"
method you might use to get context vectors, document vectors, etc.
That is, for a single rare word W, collect the words V_1, ..., V_n that
occur near W (using grep or some more principled method), take an
average of those V_i that already have word vectors, and call this the
word vector for W. In this way, you can build a framework from the
common words, and use it as scaffolding to get vectors for rare words.
(A rough sketch of this step is appended at the end of this message.)

Used naively, the method scales pretty poorly - if you wanted to create
vectors for another 50k words, you'd be pretty sad to run 50,000 greps
end to end. Obviously you wouldn't do this in practice; you'd write
something to keep track of your next 50k words and their word vectors
as you go along. For example, a data structure that recorded "word,
vector, count_of_neighbors_used" would let you update a word's vector
whenever you encountered new neighbors in text, using the count to
weight changes to the vector. (The second sketch at the end shows one
way to do this.) In this case, the memory needed to add a lot of new
words would be pretty minimal. For large-scale work, you'd then want to
find a way of adding these vectors to the database files you already
have for the common words.

So, there is work to do, but I think it's simpler than refactoring the
matrix algebra. If you only want word vectors for a few rare words,
it's really easy. Let me know if this is the case; I have a (very
grubby) perl script already that might help you out.

Sorry for the delay in answering. I hope this helps.

Dominic

On Sep 8, 2006, at 3:55 AM, Christian Prokopp wrote:

> Hello,
>
> I am running INFOMAP on a 32-bit Linux machine and have problems when
> I try to use a large matrix, e.g. beyond 40k x 2k or 80k x 1k. My
> suspicion is that the matrix allocated in initialize_matrix() in
> matrix.c exits because it runs out of address space at around 3GB.
> Does anyone have a solution besides using a 64-bit system?
> It seems very possible to rewrite the parts of INFOMAP to compute and
> write the matrix in blocks rather than in its entirety, but (a) that
> is a lot of work and (b) would SVD-Pack run into the same problem?
>
> Any thoughts are appreciated!
>
> Cheers,
> Christian
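P.S. In case it helps, here is a rough sketch of the folding-in step
for a single rare word - in Python rather than perl, and with invented
names (fold_in, common_vectors and so on are just for illustration, not
part of infomap). It assumes you've already loaded the common words'
vectors into a dict of numpy arrays, however you choose to read them
out of the model files:

    import numpy as np

    def fold_in(rare_word, corpus_lines, common_vectors, window=5):
        """Average the vectors of known words seen near rare_word."""
        total, count = None, 0
        for line in corpus_lines:
            tokens = line.split()
            for i, tok in enumerate(tokens):
                if tok != rare_word:
                    continue
                lo, hi = max(0, i - window), i + window + 1
                for neighbor in tokens[lo:i] + tokens[i + 1:hi]:
                    vec = common_vectors.get(neighbor)
                    if vec is None:
                        continue  # neighbor has no vector; skip it
                    total = vec.copy() if total is None else total + vec
                    count += 1
        # None if the word never appeared next to a known word
        return total / count if count else None

This is the one-word-at-a-time version: fine for a handful of rare
words, hopeless for another 50k of them run end to end.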
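And here is a sketch of the incremental bookkeeping for the many-words
case, again with invented names. The "word, vector,
count_of_neighbors_used" record becomes a dict entry, and the
count-weighted update is done as a running mean - one reasonable
reading of "use the count to weight changes", though certainly not the
only one:

    import numpy as np

    class RareWordTable:
        """One pass over the corpus, averaging neighbors as we go."""

        def __init__(self, common_vectors, window=5):
            self.common = common_vectors   # word -> numpy array
            self.window = window
            self.table = {}                # word -> [vector, count]

        def observe(self, tokens):
            """Fold one tokenized line into the rare-word vectors."""
            for i, tok in enumerate(tokens):
                if tok in self.common:
                    continue               # only track rare words
                lo, hi = max(0, i - self.window), i + self.window + 1
                for neighbor in tokens[lo:i] + tokens[i + 1:hi]:
                    vec = self.common.get(neighbor)
                    if vec is None:
                        continue
                    entry = self.table.setdefault(
                        tok, [np.zeros_like(vec, dtype=float), 0])
                    entry[1] += 1
                    # running mean: each new neighbor moves the
                    # vector less as the neighbor count grows
                    entry[0] += (vec - entry[0]) / entry[1]

You'd drive it with something like:

    table = RareWordTable(common_vectors)
    with open("corpus.txt") as f:
        for line in f:
            table.observe(line.split())

after which table.table holds a vector (and neighbor count) for every
rare word seen near at least one common word. Writing those vectors
back into the existing database files is the part that's still real
work.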