Term Space Reduction
Status: Alpha
Brought to you by: kostia76
A Document-Term matrix can be very large. This hurts both computational performance and accuracy, as explained in the following paper:
D. Koller and M. Sahami (1997). "Hierarchically classifying documents using very few words." Proceedings of the 14th International Conference on Machine Learning (ICML) (pp. 170-178).
http://robotics.stanford.edu/users/sahami/papers-dir/ml97-hier.pdf
An overly large Document-Term matrix is a problem for Text Classification and Clustering: its Singular Value Decomposition takes too much time.
D. Koller and M. Sahami suggest removing terms that appear fewer than 10 or more than 1000 times (Zipf's-law-based feature selection), as sketched below.
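A minimal sketch of this frequency-based pruning; the function name and the exact thresholds below are illustrative, not taken from the paper or from any particular library:

from collections import Counter

MIN_FREQ = 10    # terms rarer than this carry little statistical evidence
MAX_FREQ = 1000  # terms more frequent than this behave like stopwords

def prune_by_frequency(documents):
    # documents: an iterable of token lists.
    # Returns the vocabulary kept after dropping terms whose total
    # corpus frequency falls outside [MIN_FREQ, MAX_FREQ].
    counts = Counter(token for doc in documents for token in doc)
    return {term for term, freq in counts.items()
            if MIN_FREQ <= freq <= MAX_FREQ}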
Yang and Pedersen [1] have shown that it is possible to reduce the dimensionality by a factor of 10 with no loss in effectiveness (a reduction by a factor of 100 brings only a small loss).
[1] Y. Yang and J. O. Pedersen (1997). "A comparative study on feature selection in text categorization." Proceedings of the 14th International Conference on Machine Learning (ICML) (pp. 412-420).
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.9956
In Yang and Pedersen's study, the best feature selection method is DF (Document Frequency) thresholding, and the best threshold is 3000: it is the optimal trade-off between computational performance and precision.
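A minimal sketch of DF thresholding, under the assumption that the threshold of 3000 is read as the number of top-DF terms to keep (the names below are illustrative, not a specific library's API):

from collections import Counter

def df_threshold(documents, n_terms=3000):
    # documents: an iterable of token lists.
    # DF counts each term at most once per document, hence set(doc).
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    # Keep the n_terms terms occurring in the most documents.
    return [term for term, _ in df.most_common(n_terms)]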
See also:
"Survey of text mining II: clustering, classification, and retrieval, Volume 2"
keywords: document frequency thresholding