From: Rob M. <rm...@ma...> - 2005-02-04 17:56:02
On Fri, 2005-02-04 at 07:37, Leif Grönqvist wrote:
> Hi!
>
> Infomap is now using ordinary C-matrices with 4 bytes per cell. Has
> anyone tried to rewrite the matrix handling code using a sparse matrix
> format like for example Harwell-Boeing? This would make it possible to
> run on a much larger vocabulary and also, not limiting the matrix size
> in the second dimension.
>
> I would like to run it on 500 million running words or so, which leads
> to 3.5 million word types...
>
> What do you developers think? How big would that task be?

I'm not sure what you mean. The mathematical guts of infomap
(svdinterface/las2.c) do use a sparse representation for the word
co-occurrence matrix, and there would be no advantage to using a sparse
format for the reduced matrix.

Is count_wordvec where you're running into trouble? If so, I think it
would be fairly easy to replace that one stage with something more
robust. In fact, one of the things on my to-do list is to rewrite
count_wordvec in Python (it will be much slower, but also much more
flexible, more easily integrated with other NLP tools, and better suited
for students to tinker with). That would make what you're asking for
trivial.

-- 
Rob Malouf <rm...@ma...>
Department of Linguistics and Oriental Languages
San Diego State University
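
For illustration only, here is a minimal sketch of what a sparse
co-occurrence counting stage might look like in Python. It is not the
actual count_wordvec code: the function name count_cooccurrences, the
window parameter, and the dict-of-dicts layout are all assumptions made
for this example. The point is just that only nonzero cells are stored,
so neither dimension needs a fixed size up front.

```python
from collections import defaultdict


def count_cooccurrences(tokens, content_words, window=2):
    """Count co-occurrences of each token with nearby content words.

    Hypothetical stand-in for a counting stage, not Infomap's own code.
    Returns counts[word][content_word] -> int, storing nonzero cells only.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        # Clamp the context window to the bounds of the token list.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in content_words:
                counts[word][tokens[j]] += 1
    return counts


tokens = "the cat sat on the mat the dog sat on the rug".split()
counts = count_cooccurrences(tokens, content_words={"cat", "dog", "sat"})

# Number of cells actually stored, versus a dense rows-by-columns array.
nonzero = sum(len(row) for row in counts.values())
```

A row is created only when a word first co-occurs with a content word,
so memory grows with the number of nonzero cells rather than with
vocabulary size times the number of columns.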