#56 Term Space Reduction

Milestone: Release 1.0
Status: new
Owner: Kostia
Labels: None
Component: Text-Analysis
Priority: blocker
Version: 1.0
Type: defect
Updated: 2011-09-25
Created: 2011-09-23
Creator: Kostia
Private: No

A Document-Term matrix may be very large. This harms both computational performance and accuracy, as explained in the paper:

D. Koller and M. Sahami (1997). "Hierarchically classifying documents using very few words." Proceedings of the 14th International Conference on Machine Learning (ICML) (pp. 170-178).
http://robotics.stanford.edu/users/sahami/papers-dir/ml97-hier.pdf

An overly large Document-Term matrix proved critical for Text Classification and Clustering: the Singular Value Decomposition took too much time.
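
To make the cost concrete, here is a minimal sketch, assuming scikit-learn and SciPy (not necessarily what this project uses), of building a sparse Document-Term matrix and running the truncated SVD whose runtime grows with the vocabulary size; load_corpus is a hypothetical stand-in for any list of document strings:

    from scipy.sparse.linalg import svds
    from sklearn.feature_extraction.text import CountVectorizer

    docs = load_corpus()  # hypothetical loader returning raw text strings
    X = CountVectorizer().fit_transform(docs)  # n_docs x n_terms sparse counts
    print(X.shape)  # on a real corpus, n_terms easily exceeds 10^5
    # Truncated SVD as used for LSA-style classification/clustering; its cost
    # grows with the number of term columns, so pruning the term space
    # speeds it up directly.
    U, s, Vt = svds(X.asfptype(), k=100)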

Discussion

  • Kostia

    Kostia - 2011-09-23

    D. Koller and M. Sahami suggest removing terms that appear fewer than 10 or more than 1000 times (Zipf's-law-based feature selection); see the sketch at the end of this comment.

    Yang and Pedersen [1] have shown that it is possible to reduce the dimensionality by a factor of 10 with no loss in effectiveness (a reduction by a factor of 100 brings only a small loss).

    [1] Yang, Y. and Pedersen, J. O. (1997). A comparative study on feature selection
    in text categorization. In Proceedings of ICML-97, 14th International Conference
    on Machine Learning (Nashville, US), pp. 412–420.
    http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.9956
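
    A minimal sketch of that frequency-based pruning (my own illustration, not code from this project; prune_terms and the pre-tokenized input are assumptions):

        from collections import Counter

        def prune_terms(tokenized_docs, low=10, high=1000):
            # Total corpus frequency of each term.
            counts = Counter(t for doc in tokenized_docs for t in doc)
            # Keep the middle of the Zipf curve: drop very rare terms
            # (< low occurrences) and near-stopwords (> high occurrences).
            keep = {t for t, c in counts.items() if low <= c <= high}
            return [[t for t in doc if t in keep] for doc in tokenized_docs]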
  • Kostia

    Kostia - 2011-09-25

    In Yang and Pedersen's study the best feature selection method is DF (Document Frequency) thresholding. The best threshold is 3000: the optimal trade-off between computational performance and precision. A sketch follows below.
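
    Reading the 3000 above as the number of terms to keep, a minimal illustrative sketch of DF thresholding (again my own code, not the project's; df_threshold is an assumed name):

        from collections import Counter

        def df_threshold(tokenized_docs, k=3000):
            # Document frequency: the number of documents containing each term.
            df = Counter(t for doc in tokenized_docs for t in set(doc))
            # Keep only the k terms with the highest document frequency.
            keep = {t for t, _ in df.most_common(k)}
            return [[t for t in doc if t in keep] for doc in tokenized_docs]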
  • Kostia

    Kostia - 2011-09-25

    See also:
    "Survey of text mining II: clustering, classification, and retrieval, Volume 2"
    keywords: document frequency thresholding