Notes:
NAME
SenseClusters
SYNOPSIS
SenseClusters is a suite of Perl programs that together provide
unsupervised clustering of similar contexts. It is a complete system: it
takes users from preprocessing of raw text, through selection of the
most discriminating features, context representation and clustering, to
extensive analysis and performance evaluation.
INTRODUCTION
The problem of clustering a set of contexts can be approached in
various ways. Supervised approaches learn the necessary knowledge from
training data before clustering the test data. SenseClusters adopts an
unsupervised approach in which the clustering decisions are based solely
on the knowledge embedded in the test data itself. The contextual
similarity between the contexts is used to perform the clustering.
SenseClusters supports various degrees of context clustering, where the
degree refers to the granularity of the context. In other words, the
contexts to be clustered can be single words, as in "Word-Clustering";
a few words surrounding a particular head word, as in "With Head
Clustering"; or complete documents, as in "Headless Clustering".
Though "Word-Clustering" is currently in a very nascent state, one can
expect the output to consist of groups, each containing words that
exhibit a particular kind of relationship with all the other words in
that group. "Word-Clustering" can be viewed as a potential method for
clustering synonyms and antonyms.
"With Head Clustering" discriminates among the contexts containing a
particular ambiguous word (the head word). Each identified cluster
should therefore contain contexts that refer to a single underlying
identity. This clustering type translates readily into applications
such as proper name discrimination and word sense disambiguation.
"Headless Clustering" is performed on contexts without any particular
head word, hence the name. In this method the contents of the complete
document usually drive the clustering decision. This clustering method
is typically used in applications such as email clustering, document
clustering and text classification.
Starting with version 0.83, SenseClusters can also predict the number of
clusters (k) that should be used for a given data set. We refer to this
process of predicting k as cluster stopping. Four cluster stopping
measures have been introduced: PK1, PK2, PK3 and the Adapted Gap
Statistic. Each tries to predict k using clustering criterion
functions.
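As a rough illustration of the idea behind cluster stopping, the sketch
below picks k from a list of criterion function values, one per
candidate k. The stopping rule used here (stop at the first k after
which the relative improvement falls below a threshold) is an assumed
simplification chosen for illustration; it is not the definition of PK1,
PK2, PK3 or the Adapted Gap Statistic. The function name predict_k and
the tolerance parameter are hypothetical.

```python
def predict_k(crfun_values, tolerance=0.05):
    """Return the k after which the criterion function stops improving.

    crfun_values[i] is the criterion function value obtained when the
    data is clustered into k = i + 1 clusters. Higher values are assumed
    to be better (an assumption of this sketch).
    """
    if not crfun_values:
        raise ValueError("need at least one criterion function value")
    for i in range(1, len(crfun_values)):
        previous, current = crfun_values[i - 1], crfun_values[i]
        # Relative gain from moving from k = i to k = i + 1 clusters.
        if previous:
            gain = (current - previous) / abs(previous)
        else:
            gain = float("inf")
        if gain < tolerance:
            return i  # one more cluster barely helps, so stop at k = i
    return len(crfun_values)


# Hypothetical criterion function values for k = 1..5.
scores = [10.0, 16.0, 19.0, 19.4, 19.6]
print(predict_k(scores))  # prints 3
```

With these made-up values the improvement from k = 3 to k = 4 is about
2 percent, which is below the 5 percent threshold, so the sketch stops
at k = 3.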
APPLICATIONS
SenseClusters can be used for various Natural Language Processing tasks
like information retrieval, document clustering/indexing, synonymy
identification, word sense disambiguation, text/topic summarization,
text classification and even in special applications like plagiarism
detection or email sorting/indexing.
Broadly speaking, SenseClusters can be used for any task that requires
clustering of contextually similar text units.
UNIQUE FEATURES
* Efficiency and Choices
SenseClusters integrates highly proficient and specialised tools
like the Ngram Statistics package (NSP), SVDPACK and CLUTO to
provide a variety of choices and high efficiency at each step in its
processing like feature selection, dimensionality reduction,
clustering, post-clustering analysis and visualization.
* Features
SenseClusters supports a number of lexical features such as unigrams,
bigrams, co-occurrences and target co-occurrences via NSP. Powerful
features can be selected by performing statistical tests of
association (such as log-likelihood, Fisher's exact test and
pointwise mutual information) and by discarding features with low
association scores.
* LSA Model
SenseClusters follows a cognitive/psychological learning model,
Latent Semantic Analysis (LSA), by converting a word level feature
space into a concept level semantic space that is intended to reduce
errors due to polysemy and synonymy in natural languages. It
represents each feature word and each context of the target word as a
vector in a high dimensional feature space and supports the
dimensionality reduction technique, SVD, by providing an interface to
SVDPACK.
SenseClusters allows two distinct context vector representations
namely the first order (or order1) vectors and second order (or
order2) vectors. The first order context vectors represent context
by a vector of features that directly occur in the context while the
second order context vectors are the average of the word vectors of
these contextual features.
* Clustering
For extensive clustering options, SenseClusters supports a
specialised clustering toolkit, CLUTO. CLUTO implements a number of
clustering algorithms and criterion functions, as well as cluster
analysis and visualization techniques. It reports to the user the
inter-cluster and intra-cluster similarity, the standard deviation,
the most discriminating features characterizing each cluster, and a
dendrogram tree view, all of which may be of interest to SenseClusters
users.
* Evaluation
In addition to the analysis and visualization techniques indirectly
supported via CLUTO, SenseClusters also reports the accuracy of
discrimination in terms of the precision, recall and f-measure
values when the true classification of the discriminated instances
is known.
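The evaluation measures named above can be sketched as follows. This is
a minimal illustration, not SenseClusters' own code: it assumes that
each cluster is mapped to a distinct true class so that agreement is
maximal, that precision is computed over the instances that were
actually clustered, and that recall is computed over all instances. The
function name evaluate and the brute-force search over mappings are
choices made for this sketch and are only practical for a handful of
clusters.

```python
from itertools import permutations


def evaluate(confusion, total_instances):
    """Return (precision, recall, f_measure) for a clustering.

    confusion[c][s] is the number of instances placed in cluster c whose
    true class is s. The sketch assumes there are no more clusters than
    classes. total_instances may exceed the sum of the matrix when some
    instances were left unclustered.
    """
    n_clusters = len(confusion)
    n_classes = len(confusion[0])
    clustered = sum(sum(row) for row in confusion)
    if clustered == 0 or total_instances == 0:
        return 0.0, 0.0, 0.0
    # Try every one-to-one mapping of clusters to classes and keep the
    # one that labels the most instances correctly.
    best = 0
    for assignment in permutations(range(n_classes), n_clusters):
        correct = sum(confusion[c][s] for c, s in enumerate(assignment))
        best = max(best, correct)
    precision = best / clustered
    recall = best / total_instances
    if best == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical result: 20 of 25 instances were clustered.
p, r, f = evaluate([[8, 2], [1, 9]], total_instances=25)
print(round(p, 2), round(r, 2), round(f, 2))  # prints 0.85 0.68 0.76
```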
In addition to the above, SenseClusters is an open source software
project that is freely distributed under the GNU General Public License
(GPL). The packages used by SenseClusters in its processing, such as
NSP, SVDPACK and CLUTO, are also freely distributed and easy to install.
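The first order and second order context representations described
under "LSA Model" above can be illustrated with a small sketch. It
assumes binary first order vectors (1 if a feature occurs in the
context, 0 otherwise) and takes the word vectors as given; the function
names and the toy data are hypothetical and do not come from the
SenseClusters code.

```python
def first_order_vector(context_tokens, features):
    """Binary vector marking which features occur directly in the context."""
    present = set(context_tokens)
    return [1 if feature in present else 0 for feature in features]


def second_order_vector(context_tokens, word_vectors, dimensions):
    """Average of the word vectors of the feature words in the context."""
    found = [word_vectors[t] for t in context_tokens if t in word_vectors]
    if not found:
        # No known feature word in this context: return the zero vector.
        return [0.0] * dimensions
    return [sum(v[d] for v in found) / len(found) for d in range(dimensions)]


features = ["bank", "river", "money"]
# Toy two-dimensional word vectors, invented for this example.
word_vectors = {
    "bank": [0.5, 0.5],
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
}
context = "the river bank was muddy".split()

print(first_order_vector(context, features))         # prints [1, 1, 0]
print(second_order_vector(context, word_vectors, 2)) # prints [0.75, 0.25]
```

The first order vector only records which features appear, while the
second order vector places the context at the average position of its
feature words, so two contexts with no words in common can still end up
close together.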
DOWNLOAD
SenseClusters
The most recent version of SenseClusters can be downloaded from -
http://sourceforge.net/projects/senseclusters/
You can also check out the current development version of SenseClusters
from the SourceForge CVS repository by running the following command on
your Unix terminal -
cvs -d :pserver:anonymous@cvs.sf.net:/cvsroot/senseclusters checkout -P SC
You may browse the SenseClusters CVS repository ONLINE from -
http://cvs.sourceforge.net/viewcvs.py/senseclusters/SC/
Other Packages Used by SenseClusters :
NSP
http://www.d.umn.edu/~tpederse/nsp.html
PDL
http://search.cpan.org/dist/PDL/
CLUTO
http://www-users.cs.umn.edu/~karypis/cluto/
SVDPACK
http://netlib.org/svdpack/
Bit::Vector
http://search.cpan.org/dist/Bit-Vector/
Sparse
http://search.cpan.org/dist/Sparse/
Set::Scalar
http://search.cpan.org/dist/Set-Scalar/
Algorithm::Munkres
http://search.cpan.org/dist/Algorithm-Munkres/
Algorithm::RandomMatrixGeneration
http://search.cpan.org/dist/Algorithm-RandomMatrixGeneration/
DOCUMENTATION
SenseClusters' documentation is available ONLINE at -
http://senseclusters.sourceforge.net/SenseClusters-Code-README.html
For OFFLINE browsing, the directory Docs/HTML is provided in
SenseClusters' main package directory; the file
SenseClusters-Code-README.html can be found there and browsed locally.
All programs have inline source code documentation written in pod style,
which can be browsed from the command line as a man page. These man
pages are installed automatically on your system as part of the
installation. The documentation of any program can be viewed with the
man command, e.g. 'man bitsimat.pl' displays the man page of the program
bitsimat.pl.
GETTING STARTED
You may first want to run the Demo scripts in the Demos/ directory to
get an idea of SenseClusters' usage and functionality.
SenseClusters can be used in two ways -
* using the wrapper programs that provide an easy to use interface and
take you through the complete processing
* running the tools provided by SenseClusters in conjunction with NSP,
SVDPACK and CLUTO separately for customized experiments.
Directory Demos/ contains scripts that illustrate both these ways. If
you are planning to run the programs separately without wrappers, you
will find the flowcharts provided in Docs/Flows/ directory very useful.
Step 1. Input
To use SenseClusters, you will first need to convert your data into
Senseval-2 format. We provide with this distribution a
pre-processing program text2sval.pl in Toolkit/preprocess/plain/
that converts data in plain text format (with a single text instance
on each line) into Senseval-2 format.
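To give an idea of what such a conversion involves, the sketch below
wraps each line of plain text in a Senseval-2 style instance. It is only
an illustration and not a replacement for text2sval.pl: the element
layout follows the Senseval-2 lexical sample format as assumed here, and
the placeholder values "LEXELT" and "NOTAG" as well as the function name
to_senseval2 are assumptions of this sketch, so the output of
text2sval.pl itself may differ.

```python
from xml.sax.saxutils import escape


def to_senseval2(lines, lexelt="LEXELT", senseid="NOTAG"):
    """Wrap one text instance per input line in Senseval-2 style markup."""
    out = ['<corpus lang="english">', f'<lexelt item="{lexelt}">']
    index = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines instead of creating empty instances
        out.append(f'<instance id="{index}">')
        out.append(f'<answer instance="{index}" senseid="{senseid}"/>')
        out.append("<context>")
        out.append(escape(text))  # escape &, < and > in the raw text
        out.append("</context>")
        out.append("</instance>")
        index += 1
    out.extend(["</lexelt>", "</corpus>"])
    return "\n".join(out)


sample = ["He sat on the river bank.", "", "The bank raised its rates."]
print(to_senseval2(sample))
```

Each non-empty input line becomes one instance with its own id, which is
the unit that SenseClusters later clusters.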
Step 2. Using Wrapper
This is the easiest and quickest way to use SenseClusters. The
wrapper script (discriminate.pl found in SenseClusters' main
directory) integrates SenseClusters' code with other packages like
NSP, CLUTO or SVDPACK to provide an easy command line interface to
carry out a complete discrimination experiment.
Once you have the data in Senseval-2 format, you are ready to use
the wrapper, discriminate.pl, and try its various options. (A
detailed explanation of discriminate.pl can be viewed using
"perldoc discriminate.pl".)
You can also optionally provide this wrapper with a separate
training file in plain text format, which is not clustered but is
used as a source for selecting features.
Step 3. Using Toolkit
Check the Demo scripts in Demos/ and flowcharts in Docs/Flows/ to
understand the sequence of commands to run a full experiment. Run
the programs with relevant options as shown in the demo scripts.
Short descriptions and organization of various scripts under the
Toolkit directory can be found in Toolkit/README.Toolkit.pod. All
programs in Toolkit/ are documented in pod format and their
documentation can be browsed as a man page. Running any program with
the --help option displays a quick summary of its options and usage.
PACKAGE ORGANIZATION
After downloading and unpacking SenseClusters, you should find the
following files and directories within SenseClusters' directory.
* README.SC.pod
This file.
* INSTALL
An installation guide.
* discriminate.pl
Clusters the given text instances based on their contextual
similarities.
* Demos/
A directory of scripts that demonstrate SenseClusters' usage and
functionality.
* Toolkit/
A directory of Perl programs implemented and used by SenseClusters.
Users who are interested in using SenseClusters' tools individually,
without the wrapper programs, are encouraged to browse through the
Toolkit and Toolkit.pod.
* Docs/
A directory of SenseClusters' documentation in html format.
Directory Docs/Flows/ contains flow diagrams that illustrate how to
put together the programs provided in SenseClusters' Toolkit with
other packages like NSP, SVDPACK and CLUTO to run experiments
without wrappers.
* Testing/
A directory of test cases, written as C-shell scripts, that test
whether the package is installed properly.
* Web/
Contains an easy to use and install web interface for SenseClusters.
* Changes/
A directory of changelogs that document the changes and improvements
made in each version.
* Makefile.PL
Generates a Makefile on running 'perl Makefile.PL'.
* GPL.txt
A copy of the GNU General Public License, the terms under which
SenseClusters is distributed.
* FDL.txt
A copy of the GNU Free Documentation License, the terms under which
the documentation of SenseClusters is distributed.
CONTACT US
SenseClusters was originally developed and maintained by Amruta
Purandare and Ted Pedersen from September 2002 until August 2004. Since
then it has been developed and maintained by Anagha Kulkarni and Ted
Pedersen.
Please join our mailing lists to take part in package-related
discussions, to post questions or bug reports, and to suggest
enhancements to the package functionality.
To subscribe to the users' mailing list, visit :
http://lists.sourceforge.net/lists/listinfo/senseclusters-users
A low volume news list can be joined by visiting:
http://lists.sourceforge.net/lists/listinfo/senseclusters-news
To subscribe to the developers' mailing list, visit :
http://lists.sourceforge.net/lists/listinfo/senseclusters-developers
The most recent version of SenseClusters can be downloaded from :
http://senseclusters.sourceforge.net/
SEE ALSO
SenseClusters' ONLINE Documentation at
http://senseclusters.sourceforge.net/SenseClusters-Code-README.html
AUTHORS
Ted Pedersen
University of Minnesota, Duluth
tpederse@d.umn.edu
http://www.d.umn.edu/~tpederse/
Amruta Purandare
University of Pittsburgh
amruta@cs.pitt.edu
http://www.cs.pitt.edu/~amruta/
Anagha Kulkarni
University of Minnesota, Duluth
kulka020@d.umn.edu
http://www.d.umn.edu/~kulka020/
ACKNOWLEDGMENTS
This work has been partially supported by a National Science Foundation
Faculty Early CAREER Development award (Grant #0092784).
We would also like to express our special thanks to
* Dr. George Karypis and his research group for developing CLUTO
* Michael Berry and co-developers of SVDPACK
* Christian Soeller and PDL developers' team
* Satanjeev Banerjee for developing the Ngram Statistics Package
COPYRIGHT
Copyright (c) 2003-2006
Ted Pedersen, University of Minnesota, Duluth
tpederse@d.umn.edu
Amruta Purandare, University of Pittsburgh
amruta@cs.pitt.edu
Anagha Kulkarni, University of Minnesota, Duluth
kulka020@d.umn.edu
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to
The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
MA 02111-1307, USA.
Changes:
Changes made in SenseClusters from version 0.85 to version 0.87
Ted Pedersen tpederse@d.umn.edu
Anagha Kulkarni kulka020@d.umn.edu
1. Fixed a bug in clusterstopping.pl related to the case of an empty
column, i.e., when one or more features do not occur in any of the
contexts/instances. -Anagha
2. Updated INSTALL and Makefile.PL to require v0.03 of
Algorithm::RandomMatrixGeneration. -Anagha
(Changelog-v0.85to0.87 Last Updated on 05/16/2006 by Anagha)