On Fri, 2005-02-04 at 07:37, Leif Grönqvist wrote:
> Hi!
>
> Infomap is now using ordinary C-matrices with 4 bytes per cell. Has
> anyone tried to rewrite the matrix handling code using a sparse matrix
> format like for example Harwell-Boeing? This would make it possible to
> run on a much larger vocabulary without limiting the matrix size
> in the second dimension.
>
> I would like to run it on 500 million running words or so, which leads
> to 3.5 million word types...
>
> What do you developers think? How big would that task be?
I'm not sure what you mean. The mathematical guts of infomap
(svdinterface/las2.c) do use a sparse representation for the word
co-occurrence matrix, and there would be no advantage to using a sparse
format for the reduced matrix, which is dense.
Is count_wordvec where you're running into trouble? If so, I think it
would be fairly easy to replace that one stage with something more
robust. In fact, one of the things on my to-do list is to rewrite
count_wordvec in Python (it will be much slower, but also much more
flexible, more easily integrated with other NLP tools, and better suited
for students to tinker with). That would make what you're asking for
trivial.
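
To give a rough idea of what a Python counting stage could look like,
here is a minimal sketch. It is not Infomap's count_wordvec: the
function name count_cooccurrences, the window parameter, and the toy
sentence are all made up for illustration. It stores only the nonzero
cells, as a dictionary of per-word counters, so memory grows with the
number of word pairs actually observed rather than with the square of
the vocabulary size.

```python
from collections import Counter, defaultdict


def count_cooccurrences(tokens, window=2):
    """Count how often each word appears within `window` tokens of another.

    Returns a dict mapping word -> Counter of neighbor -> count.
    Only pairs that actually occur are stored (a sparse representation).
    Hypothetical helper for illustration; not part of Infomap.
    """
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target word itself
                counts[word][tokens[j]] += 1
    return counts


# Toy input, assumed for the example only.
tokens = "the cat sat on the mat".split()
counts = count_cooccurrences(tokens, window=1)
print(dict(counts["the"]))
```

A structure like this can be converted to whatever sparse format the
SVD stage expects once counting is done.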
--
Rob Malouf <rm...@ma...>
Department of Linguistics and Oriental Languages
San Diego State University