From: Rahul J. <rj...@ya...> - 2005-12-02 20:53:33
Hi All,

I am building models with large corpora (mostly >60000 words) and have the following questions (please bear with me, they are long):

(1) The "content bearing words (CBW) - words" matrix that is formed (i.e., the input to SVD) has dimensions: number of words (at most ROWS in default-params.in) X number of CBWs (specified by COLUMNS). The dictionary identifies which words are stop words. However, when the matrix is formed, the printf statements show that the number of rows is equal to min(ROWS, total number of words). When ROWS > total number of words, the matrix has as many rows as there are words in total. That is, it seems that the stop words are included as rows. Is that right?

Illustration:
  ROWS=20000 COLUMNS=1000
  Dic entries = 10608, Non-stop word types = 9951
  "Entering write_matrix_svd; rows = 10608 and columns = 1000."

(2) The number of singular values is controlled by SINGVALS and SVD_ITER. As SVD_ITER is increased, we obtain more singular values (limited by SINGVALS and by the maximum number of singular values that can actually be computed). Is there a good number of computed singular values (hashcomp) that we should aim for, by using more iterations, given the number of words (rows) in the input matrix? In other words, if time is NOT a constraint, increasing the number of singular values could increase the dimensions (rows) of the resultant matrices, but it could also increase accuracy.

Example: SINGVALS = 200, SVD_ITER = 400 for ROWS=50000 COLUMNS=1000. This could give us, for example, 150 singular values.

(3) The valid chars file in the new release is effective at discarding tokens that contain, for example, numbers, if we don't include numbers as valid chars. But non-standard chars from the corpus, like the copyright symbol, the registered trademark symbol, etc., still appear in words in the dictionary. Any hints for making a quick code change?

Your replies are greatly appreciated.

Thanks!
Rahul.
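
For question (3), the kind of quick change being asked about might look roughly like the sketch below: a whole-token check that rejects any token containing a byte outside a whitelist. This is only an illustration under stated assumptions. The names `token_is_valid` and `VALID_CHARS` are hypothetical and do not come from the actual source, the whitelist is hard-coded here rather than read from the valid chars file, and tokens are assumed to be NUL-terminated byte strings (so a Latin-1 copyright sign is the single byte 0xA9).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical whitelist for illustration only; in practice this would be
   the set of characters loaded from the valid chars file. */
static const char *VALID_CHARS = "abcdefghijklmnopqrstuvwxyz'-";

/* Returns true only if token is non-empty and every byte of it
   appears in valid_chars. */
static bool token_is_valid(const char *token, const char *valid_chars)
{
    assert(token != NULL && valid_chars != NULL);

    if (*token == '\0')
        return false;  /* reject empty tokens */

    /* strspn() counts the leading bytes of token that occur in valid_chars.
       The token is clean exactly when that run covers the whole string. */
    return strspn(token, valid_chars) == strlen(token);
}
```

Assuming the tokenizer has one place where a candidate word is about to be added to the dictionary, a call such as `token_is_valid(word, VALID_CHARS)` at that point would drop words like "model\xA9" or "abc123" while keeping "model". Whether the existing code has such a single entry point is not something this sketch can confirm.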