[infomap-nlp-users] SVD, Valid Chars Questions


Hi All, 

I am building models with large corpora (mostly more
than 60,000 words). I have the following questions
(please bear with me, they are long):

(1) The "content bearing words (CBW) - words" matrix
that is input to the SVD has dimensions: number of
words (at most ROWS in default-params.in) X number of
CBWs (specified by COLUMNS). The dictionary identifies
which words are stop words. However, when the matrix
is formed, the printf statements show that the number
of rows is equal to min(ROWS, total number of words).
When ROWS > total number of words, the matrix has as
many rows as there are words in total. That is, it
seems that the stop words are also given rows. Is
that intended?

Illustration: 
  ROWS=20000 COLUMNS=1000
  Dic entries = 10608, Non-stop word types = 9951
  "Entering write_matrix_svd; rows = 10608 and columns = 1000."
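
To make the arithmetic behind this illustration explicit, here is a small
sketch in Python (not Infomap code; the function name and the two
hypothesis labels are made up for this example). It compares the reported
row count with what each hypothesis would predict:

```python
def expected_rows(rows_param, dict_entries, non_stop_types):
    """Row count predicted by two hypotheses about the matrix.

    "all_types":     every dictionary entry gets a row, stop words included.
    "non_stop_only": only non-stop word types get a row.
    Both are capped by the ROWS parameter.
    """
    return {
        "all_types": min(rows_param, dict_entries),
        "non_stop_only": min(rows_param, non_stop_types),
    }


# Numbers taken from the illustration above.
counts = expected_rows(rows_param=20000, dict_entries=10608, non_stop_types=9951)
observed = 10608  # "rows" value printed by write_matrix_svd

# Keep only the hypotheses that agree with the printed value.
matches = [name for name, n in counts.items() if n == observed]
print(counts, matches)
```

Only the "all_types" prediction agrees with the printed value, which is
why it looks as if stop words are given rows as well.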

(2) The number of singular values is controlled by
SINGVALS and SVD_ITER. As SVD_ITER is increased, we
obtain more singular values (limited by SINGVALS and
by the maximum number of singular values the SVD can
actually compute). Is there a good number of computed
singular values (hashcomp) that we should aim for,
using more iterations, for a given number of words
(rows) in the input matrix? In other words, if time
is NOT a constraint, computing more singular values
increases the dimensions (rows) of the resultant
matrices, but it could also increase accuracy.

Example: SINGVALS = 200, SVD_ITER = 400 for ROWS=50000
COLUMNS=1000. This could give us, for example, 150
singular values.
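
One generic heuristic I am aware of (not something Infomap does, as far as
I know) is to keep the smallest number of singular values whose squared
sum reaches some fraction of the total. A minimal sketch in Python, where
the function name, the 0.9 default and the sample values are all made up
for illustration. Note that the fraction is relative to the values that
were actually computed, not to the full spectrum:

```python
def smallest_k(singular_values, target=0.9):
    """Smallest k such that the k largest singular values hold at least
    `target` of the total squared mass of the values passed in."""
    ordered = sorted(singular_values, reverse=True)
    total = sum(s * s for s in ordered)
    if total == 0:
        return 0
    running = 0.0
    for k, s in enumerate(ordered, start=1):
        running += s * s
        if running / total >= target:
            return k
    return len(ordered)


# Hypothetical, rapidly decaying singular values.
sample_values = [10.0, 5.0, 2.0, 1.0, 0.5]
print(smallest_k(sample_values, target=0.9))
print(smallest_k(sample_values, target=0.99))
```

If the computed values decay quickly, a small k already covers most of
the mass and extra iterations may buy little; if they decay slowly, the
cut-off is less clear and this heuristic alone does not settle it.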

(3) The valid chars file in the new release works for
discarding tokens that contain, for example, digits,
if we don't include digits as valid chars. But
non-standard chars from the corpus, like the copyright
symbol, the registered trademark symbol, etc., still
appear in words in the dictionary. Any hints for a
quick code change to fix this?
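
As a stop-gap that avoids touching the Infomap source, the corpus could be
filtered before it is handed to the model builder. A minimal sketch in
Python of the behaviour I am after. The set of valid characters, the
function names and the sample sentence are my own assumptions, and this
does not read the actual valid chars file:

```python
import string

# Assumed whitelist: ASCII letters plus apostrophe and hyphen.
VALID_CHARS = set(string.ascii_letters + "'-")


def is_valid_token(token, valid_chars=VALID_CHARS):
    """True only if the token is non-empty and every char is whitelisted."""
    return bool(token) and all(ch in valid_chars for ch in token)


def filter_tokens(text, valid_chars=VALID_CHARS):
    """Split on whitespace and drop tokens containing any other char."""
    return [tok for tok in text.split() if is_valid_token(tok, valid_chars)]


# \u00ae is the registered trademark sign, \u00a9 the copyright sign.
sample = "Widget\u00ae costs 42 dollars \u00a9Acme today"
print(filter_tokens(sample))
```

Here the tokens carrying the two symbols and the number are dropped
whole, which is what I would expect the valid chars check to do.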

Your replies are greatly appreciated.

Thanks!
Rahul.
