Currently, content-bearing words can only be selected
by their frequency rank.
The START_COLUMNS constant in
preprocessing/preprocessing_env.h sets how many of
the most frequent words to skip; the COLUMNS
parameter in admin/default-params sets how many
content-bearing words to use.
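
As a rough illustration of the rank-based selection described above, the
sketch below computes which ranks end up as content-bearing words. It is
not the code in count_wordvec: the function name select_column_range is
hypothetical, and it assumes the vocabulary is already sorted by
descending frequency and that the COLUMNS words are taken immediately
after the skipped ones.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of Infomap. Given a vocabulary sorted by
 * descending frequency, compute the half-open rank range [*first, *last)
 * of the content-bearing words. start_columns is the number of top-ranked
 * words to skip; columns is the number of words to keep after that.
 * Returns the number of words actually selected. */
static size_t select_column_range(size_t vocab_size, size_t start_columns,
                                  size_t columns, size_t *first, size_t *last)
{
    size_t available;

    assert(first != NULL && last != NULL);

    /* Clamp so that a small vocabulary yields a shorter (possibly empty)
     * range instead of running past the end. */
    *first = start_columns < vocab_size ? start_columns : vocab_size;
    available = vocab_size - *first;
    *last = *first + (columns < available ? columns : available);
    return *last - *first;
}
```

With a vocabulary of 1000 words, start_columns = 50 and columns = 100,
the selected ranks would be 50 through 149 (counting from 0).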
The following should be done:
- START_COLUMNS should be a command-line
argument to count_wordvec, so that it can be set
through default-params and the infomap-build script.
- It should be possible for the user to specify a file
listing the desired content-bearing words.
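
The two changes proposed above could be approached roughly as follows.
This is only a sketch under stated assumptions, not existing Infomap
code: the function names parse_start_columns and load_word_list, the
MAX_WORD_LEN limit, and the one-word-per-line file format are all
hypothetical. How the value would be passed on the count_wordvec command
line is left open here.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORD_LEN 64   /* assumed limit, for illustration only */

/* Hypothetical: parse the string value of a start-columns argument.
 * Returns 0 and stores the value in *out on success; returns -1 if arg
 * is not a non-negative decimal integer. */
static int parse_start_columns(const char *arg, long *out)
{
    char *end;
    long value;

    assert(arg != NULL && out != NULL);
    errno = 0;
    value = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0' || value < 0)
        return -1;
    *out = value;
    return 0;
}

/* Hypothetical: read a user-supplied list of content-bearing words,
 * one word per line, skipping blank lines. Returns the number of words
 * stored. In this sketch, a line longer than MAX_WORD_LEN - 1 characters
 * is split across several entries rather than rejected. */
static size_t load_word_list(FILE *fp, char words[][MAX_WORD_LEN],
                             size_t max_words)
{
    char line[MAX_WORD_LEN];
    size_t n = 0;

    assert(fp != NULL && words != NULL);
    while (n < max_words && fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* strip the line ending */
        if (line[0] == '\0')
            continue;                         /* skip blank lines */
        strcpy(words[n++], line);
    }
    return n;
}
```

A caller would validate the argument with parse_start_columns before
use, and would fall back to rank-based selection when no word-list file
is given.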