From: Dominic W. <wi...@ma...> - 2005-10-22 12:49:03
Dear Rahul,

Sorry for not getting back to you sooner.

> Is there a known limit for number of files in a
> multi-document corpus, expecting optimal performance?
> Is there a known break-down point?

I know of no maximum number of files for a multi-document corpus, as
far as the Infomap software is concerned. However, this isn't because
we've stretched the system and found that it doesn't break; it's
because we haven't really used it for this very much. I've only ever
built a couple of multi-document models, and have had sporadic reports
of this functionality not working at all. If we were embarking on a
new, well-resourced project, we'd look into this straight away.

> Does the uniformity (or lack of it) in the sizes of
> individual documents in a corpus affect the quality
> model?

It certainly matters much less than in a standard search / LSA engine,
because the Infomap software relies on cooccurrence within a
fixed-width window rather than cooccurrence within a document. This is
discussed properly in Ch. 6 of Geometry and Meaning. There's a sketch
on the web at

http://infomap.stanford.edu/book/chapters/chapter6.html

but of course, finding yourself a copy of the book would be more
useful ;-)

Best wishes,
Dominic
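
To illustrate the window point above, here is a minimal sketch in Python of the two ways of counting cooccurrence. It is not Infomap's actual code: the function names, the toy tokens, and the window width of 2 are all assumptions made for illustration, and it does not attempt to show how Infomap treats document boundaries.

```python
from collections import Counter


def window_cooccurrences(tokens, width=2):
    """Count unordered word pairs that occur within `width` tokens of each other.

    Hypothetical helper: `width` is an illustrative value, not an Infomap default.
    """
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look only at the next `width` tokens, so each token contributes
        # at most `width` pairs however long the text is.
        for other in tokens[i + 1 : i + 1 + width]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts


def document_cooccurrences(documents):
    """Count unordered pairs of distinct words that share a document."""
    counts = Counter()
    for tokens in documents:
        vocab = sorted(set(tokens))
        # Every distinct word pairs with every other distinct word in the
        # same document, so long documents contribute many more pairs.
        for i, word in enumerate(vocab):
            for other in vocab[i + 1 :]:
                counts[(word, other)] += 1
    return counts


if __name__ == "__main__":
    doc = ["a", "b", "c", "d"]
    print(window_cooccurrences(doc, width=2))
    print(document_cooccurrences([doc]))
```

With window counting, the number of pairs grows roughly linearly with the length of the text, whereas with document counting it grows quadratically with the number of distinct words in each document. That is one way to see why uneven document sizes should matter less for a window-based model.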