Term Space Reduction
Status: Alpha
Brought to you by: kostia76
A Document-Term matrix can be very large. This hurts both computational performance and accuracy, as explained in the following paper:
D. Koller and M. Sahami (1997). "Hierarchically classifying documents using very few words." Proceedings of the 14th International Conference on Machine Learning (ICML) (pp. 170-178).
http://robotics.stanford.edu/users/sahami/papers-dir/ml97-hier.pdf
An overly large Document-Term matrix is a problem for Text Classification and Clustering: its Singular Value Decomposition takes too much time.
D. Koller and M. Sahami suggest removing terms that appear fewer than 10 or more than 1000 times (Zipf's-law-based feature selection), as sketched below.
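A minimal sketch of this frequency-based pruning; the function name and the exact thresholds below are illustrative, not taken from the paper or from any particular library:

from collections import Counter

MIN_FREQ = 10    # terms rarer than this carry little statistical evidence
MAX_FREQ = 1000  # terms more frequent than this behave like stopwords

def prune_by_frequency(documents):
    # documents: an iterable of token lists.
    # Returns the vocabulary kept after dropping terms whose total
    # corpus frequency falls outside [MIN_FREQ, MAX_FREQ].
    counts = Counter(token for doc in documents for token in doc)
    return {term for term, freq in counts.items()
            if MIN_FREQ <= freq <= MAX_FREQ}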
Yang and Pedersen [1] have shown that it is possible to reduce the dimensionality by a factor of 10 with no loss in effectiveness (a reduction by a factor of 100 brings only a small loss).
[1] Y. Yang and J. O. Pedersen (1997). "A comparative study on feature selection in text categorization." Proceedings of the 14th International Conference on Machine Learning (ICML) (pp. 412-420).
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.9956
In Yang and Pedersen's study, the best feature selection method is DF (Document Frequency) thresholding, and the best threshold is 3000: it is the optimal trade-off between computational performance and precision.
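A minimal sketch of DF thresholding, under the assumption that the threshold of 3000 is read as the number of top-DF terms to keep (the names below are illustrative, not a specific library's API):

from collections import Counter

def df_threshold(documents, n_terms=3000):
    # documents: an iterable of token lists.
    # DF counts each term at most once per document, hence set(doc).
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    # Keep the n_terms terms occurring in the most documents.
    return [term for term, _ in df.most_common(n_terms)]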
See also:
"Survey of text mining II: clustering, classification, and retrieval, Volume 2"
keywords: document frequency thresholding