We are pleased to announce the release of SenseClusters version 0.95.
SenseClusters is a freely available package that allows you to cluster similar contexts, or to identify clusters of related words. It is fully
unsupervised, and can automatically discover the optimal number of clusters in your text.
As of version 0.95, we fully support Latent Semantic Analysis (LSA) for context and word clustering, and we continue to improve the native
SenseClusters methods, which include the ability to cluster first and second order representations of context.
SenseClusters can be downloaded from:
You can also try out SenseClusters via our web interface:
In both native and LSA modes, SenseClusters relies on lexical features (such as unigrams, bigrams, and co-occurrences) that can be identified in raw text. Tokenization is flexible and is defined via Perl regular expressions, so you can work with many languages besides English. You can also use tokenization schemes other than white-space separated words, such as character-based tokens (for example, 2 letter sequences).
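To illustrate how regular expressions can define different token schemes, here is a minimal sketch in Python. It is not SenseClusters code: SenseClusters takes Perl regular expressions, and the names WORD_TOKEN, TWO_CHAR_TOKEN, tokenize, and bigrams are hypothetical, chosen only for this example.

```python
import re

# Hypothetical token definitions. SenseClusters itself reads Perl regexes;
# these Python patterns only mimic the idea.
WORD_TOKEN = re.compile(r"\w+")      # runs of word characters
TWO_CHAR_TOKEN = re.compile(r"\w\w")  # 2 letter sequences


def tokenize(text, pattern):
    """Return the non-overlapping matches of pattern, scanning left to right."""
    return pattern.findall(text.lower())


def bigrams(tokens):
    """Pair each token with its immediate successor."""
    return list(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    text = "Fine wine, fine dine"
    words = tokenize(text, WORD_TOKEN)
    print(words)                          # word tokens (unigrams)
    print(bigrams(words))                 # adjacent word pairs
    print(tokenize(text, TWO_CHAR_TOKEN))  # character-based tokens
```

Swapping the pattern is all it takes to move from word tokens to character-based tokens, which is the kind of flexibility described above.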
The native SenseClusters methods support traditional first order context clustering, where you identify a feature set, and then determine which of those features occur in the contexts you are clustering. The native
methods also support second order context clustering, where each word is represented by a vector of the words with which it co-occurs.
All the words in a context to be clustered are replaced by their associated vectors, and these vectors are averaged together to represent
that context. Note that you can also cluster the word vectors to identify sets of related words.
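The two native representations can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the SenseClusters implementation: contexts are lists of tokens, first order vectors are binary, co-occurrence means sharing a context, and the function names are hypothetical.

```python
from collections import defaultdict


def first_order(context, features):
    """Binary vector marking which features occur in this context."""
    present = set(context)
    return [1 if f in present else 0 for f in features]


def cooccurrence_vectors(training_contexts, vocab):
    """Map each word to counts of the vocab words that share a context with it."""
    index = {w: i for i, w in enumerate(vocab)}
    vectors = defaultdict(lambda: [0] * len(vocab))
    for ctx in training_contexts:
        for i, word in enumerate(ctx):
            for j, other in enumerate(ctx):
                if i != j and other in index:
                    vectors[word][index[other]] += 1
    return vectors


def second_order(context, vectors, dim):
    """Average the co-occurrence vectors of the words found in the context."""
    known = [vectors[w] for w in context if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


if __name__ == "__main__":
    training = [["river", "bank", "water"], ["bank", "money", "loan"]]
    vocab = ["water", "money", "loan", "river"]
    word_vectors = cooccurrence_vectors(training, vocab)
    print(first_order(["river", "bank", "water"], ["bank", "loan"]))
    print(second_order(["river", "water"], word_vectors, len(vocab)))
```

The entries of word_vectors are the word vectors themselves, so the same structure could be passed to a clustering routine to group related words.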
Latent Semantic Analysis differs from the native SenseClusters methods in that each feature is represented by a vector that shows the contexts in which that feature occurs. Then, all the features in a context to be clustered are replaced by their associated vectors, and these are averaged together to represent the context. Note that you can also cluster the feature vectors directly to identify sets of related features.
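The feature-by-context representation described above can be sketched in Python as follows. This is a simplified illustration with hypothetical function names, not the SenseClusters implementation. In particular, it omits the dimensionality reduction (typically Singular Value Decomposition) that is usually part of Latent Semantic Analysis, and it uses raw occurrence counts.

```python
def feature_by_context(contexts, features):
    """One row per feature; column k counts its occurrences in context k."""
    return {f: [ctx.count(f) for ctx in contexts] for f in features}


def context_vector(context, feature_vectors, n_contexts):
    """Average the feature vectors of the features found in the context."""
    rows = [feature_vectors[t] for t in context if t in feature_vectors]
    if not rows:
        return [0.0] * n_contexts
    return [sum(col) / len(rows) for col in zip(*rows)]


if __name__ == "__main__":
    contexts = [["river", "bank"], ["bank", "loan"], ["river", "water"]]
    features = ["river", "bank", "loan"]
    vectors = feature_by_context(contexts, features)
    print(vectors)
    # "boat" is not a feature, so it is ignored when averaging.
    print(context_vector(["river", "loan", "boat"], vectors, len(contexts)))
```

Here each feature vector has one dimension per context, in contrast to the native second order vectors, which have one dimension per co-occurring word.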