Dear Christian,
I'm afraid the deafening silence in response to your question seems
to suggest that there isn't a very good answer - at least, not one
that anyone has actively used yet.
In answer to your SVD question - I don't think that SVD-Pack would
necessarily run into the same problems, because it uses a sparse
representation. (At least, I know that it reads a fairly sparse
column-major representation from disk, though I don't really know its
internals.) It would certainly have scaling issues at some point, but
I don't know how these would compare with infomap's initial matrix
generation.
Computing and writing the matrix in blocks would certainly be an
effort - one I'd very much appreciate someone doing, but not to be
taken on lightly.
Here is one sort-of solution I've used in the past for extending a
basic model to individual rare words or phrases. Compute a basic
infomap model within the 50k x 1k safe area. Once you've done this,
you can generate word vectors for rare words using the same "folding
in" method you might use to get context vectors, document vectors,
etc. That is, for a single rare word W, collect the words V_1, ... ,
V_n that occur near W (using grep or some more principled method),
take an average of those V_i that already have word vectors, and call
this the word vector for W. In this way, you can build a framework
from the common words, and use this as scaffolding to get vectors for
rare words.
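To make that concrete, here is a minimal sketch of the folding-in
step in Python (my grubby script is Perl, but the idea is the same).
The names, the window size, and the plain-text corpus format are all
my own illustration, not anything from infomap itself:

    import numpy as np

    def fold_in(rare_word, corpus_lines, common_vectors, window=5):
        """Average the vectors of known words occurring near rare_word.

        common_vectors: dict mapping each common word to its numpy
        vector from the basic infomap model (hypothetical format).
        """
        total = None
        count = 0
        for line in corpus_lines:
            tokens = line.split()
            for i, tok in enumerate(tokens):
                if tok != rare_word:
                    continue
                # Collect neighbors within the context window.
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                for neighbor in tokens[lo:i] + tokens[i + 1:hi]:
                    vec = common_vectors.get(neighbor)
                    if vec is not None:
                        total = vec.copy() if total is None else total + vec
                        count += 1
        # Average over however many known neighbors we saw.
        return None if count == 0 else total / count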
Used naively, the method scales pretty poorly - if you wanted to
create vectors for another 50k words, you'd be pretty sad to run
50,000 greps end to end. Obviously you wouldn't do this in practice;
you'd write something to keep track of your next 50k words and their
word vectors as you go along. For example, some data structure that
recorded "word, vector, count_of_neighbors_used" would enable you to
update the word vector when you encountered new neighbors in text,
using the count to weight changes to the vector. In this case, memory
requirements to add a lot of new words would be pretty minimal. For
large scale work, you'd then want to find a way of adding these
vectors to the database files you already have for the common words.
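For what it's worth, here is a sketch of what that record and its
update might look like (again Python, again hypothetical names - the
point is just the count-weighted running average):

    import numpy as np

    class RareWordTable:
        """Tracks (vector, count_of_neighbors_used) per rare word."""

        def __init__(self, dim):
            self.dim = dim
            self.entries = {}  # word -> (vector, neighbor count)

        def add_neighbor(self, word, neighbor_vec):
            vec, count = self.entries.get(word,
                                          (np.zeros(self.dim), 0))
            # Fold the new neighbor into the running average,
            # weighting the old vector by how many neighbors it
            # already incorporates.
            new_vec = (vec * count + neighbor_vec) / (count + 1)
            self.entries[word] = (new_vec, count + 1)

        def vector(self, word):
            entry = self.entries.get(word)
            return None if entry is None else entry[0]

Each word costs only one vector plus an integer, which is why the
memory overhead of adding many new words stays small.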
So, there is work to do, but I think it's simpler than refactoring
the matrix algebra. If you only want word vectors for a few rare
words, it's really easy. Let me know if this is the case, as I
already have a (very grubby) Perl script that might help you out.
Sorry for the delay in answering; I hope this helps.
Dominic
On Sep 8, 2006, at 3:55 AM, Christian Prokopp wrote:
> Hello,
>
> I am running INFOMAP on a 32bit Linux machine and have problems when I
> try to use a large matrix, e.g. beyond 40k x 2k or 80k x 1k. My
> suspicion is that the matrix allocated in initialize_matrix() in
> matrix.c exits because it runs out of address space at around 3GB.
> Does anyone have a solution besides using a 64bit system?
> It seems very possible to rewrite the parts of INFOMAP to compute and
> write the matrix in blocks rather than in its entirety but (a) that
> is a
> lot of work and (b) would SVD-Pack run into the same problem?
>
> Any thoughts are appreciated!
>
> Cheers,
> Christian
>