Notes:
NAME
SenseClusters
SYNOPSIS
SenseClusters is a complete Word Sense Discrimination system. It takes
users from preprocessing of raw text through the discrimination itself
(selection of the most discriminating features, context representation,
and clustering) to analysis and performance evaluation of the results.
INTRODUCTION
Words in natural language are often ambiguous, and their meaning changes
as the context around them changes. Word Sense Discrimination attempts
to distinguish the possible meanings of a word by identifying which
occurrences of an ambiguous word are likely to illustrate the same or a
similar meaning. Since semantically related word meanings are often
used in similar contexts, SenseClusters achieves sense discrimination by
grouping together the contexts of the target words (the words being
discriminated) that are strongly similar to each other.
In short, SenseClusters creates clusters of given text instances (an
instance could be a group of one or more sentences that form a context
of the target word) such that instances grouped together in the same
cluster are contextually more similar to each other than to other
instances and thus are more likely to use the same meaning of the target
word. Each cluster thus represents a single word meaning that is shared
by the instances belonging to that cluster.
APPLICATIONS
SenseClusters can be used for various Natural Language Processing tasks
such as information retrieval, document clustering/indexing, synonymy
identification, word sense disambiguation, text/topic summarization and
text classification, as well as special applications such as plagiarism
detection or email sorting/indexing.
Broadly speaking, SenseClusters can be used for any task that requires
clustering of contextually similar text units.
UNIQUE FEATURES
* Efficiency and Choices
SenseClusters integrates specialised tools such as the Ngram
Statistics Package (NSP), SVDPACK, CLUTO and GCLUTO to provide a
variety of choices and high efficiency at each processing step:
feature selection, dimensionality reduction, clustering,
post-clustering analysis and visualization.
* Features
SenseClusters supports a number of lexical features, such as
unigrams, bigrams and co-occurrences, via NSP. Powerful features can
be selected by performing statistical tests of association (such as
log-likelihood, Fisher's exact test or pointwise mutual information)
and discarding features with low association scores.
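As a rough illustration of how an association score can be used to rank
and discard candidate bigram features, the sketch below computes
pointwise mutual information from raw counts in plain Python. It is not
how NSP is implemented; the function name pmi, the toy sentence and the
CUTOFF value are all assumptions made for this example.

```python
import math
from collections import Counter

def pmi(bigram_count, left_count, right_count, total_bigrams):
    """Pointwise mutual information (base 2) of a bigram from raw counts."""
    p_joint = bigram_count / total_bigrams
    p_left = left_count / total_bigrams
    p_right = right_count / total_bigrams
    return math.log2(p_joint / (p_left * p_right))

# Toy corpus (hypothetical); real experiments use much larger text.
tokens = "the bank of the river and the bank of england".split()
bigrams = list(zip(tokens, tokens[1:]))

bigram_counts = Counter(bigrams)
left_counts = Counter(w1 for w1, _ in bigrams)   # word in first position
right_counts = Counter(w2 for _, w2 in bigrams)  # word in second position
total = len(bigrams)

scores = {
    bg: pmi(n, left_counts[bg[0]], right_counts[bg[1]], total)
    for bg, n in bigram_counts.items()
}

# Discard bigrams whose score falls below a (hypothetical) cutoff.
CUTOFF = 1.5
selected = sorted(bg for bg, s in scores.items() if s >= CUTOFF)
print(selected)
```

On such a tiny sample the scores are not statistically meaningful; the
point is only the shape of the computation: count, score, then filter.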
* LSA Model
SenseClusters simulates a cognitive/psychological learning model,
Latent Semantic Analysis (LSA), by converting a word level feature
space into a concept level semantic space, which helps reduce errors
caused by polysemy and synonymy in natural language. It represents
each feature word and each context of the target word as a vector in
a high dimensional feature space and supports dimensionality
reduction by Singular Value Decomposition (SVD) through an interface
to SVDPACK.
SenseClusters allows two distinct context vector representations,
namely first order (or order1) vectors and second order (or order2)
vectors. A first order context vector represents a context by the
features that occur directly in it, while a second order context
vector is the average of the word vectors of those contextual
features.
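The difference between the two representations can be sketched in a few
lines of Python. This is a minimal illustration and not SenseClusters'
own code: the feature list, the two-dimensional word vectors and both
function names are hypothetical.

```python
from collections import Counter

def first_order_vector(context_tokens, features):
    """Count how often each selected feature occurs directly in the context."""
    counts = Counter(context_tokens)
    return [counts[f] for f in features]

def second_order_vector(context_tokens, word_vectors, dims):
    """Average the word vectors of the feature words found in the context."""
    found = [word_vectors[t] for t in context_tokens if t in word_vectors]
    if not found:
        return [0.0] * dims
    return [sum(column) / len(found) for column in zip(*found)]

# Hypothetical features and word vectors, invented for this example.
features = ["money", "river", "loan"]
word_vectors = {
    "money": [2.0, 0.0],
    "loan": [4.0, 0.0],
    "river": [0.0, 3.0],
}
context = "the bank approved the loan and released the money".split()

o1 = first_order_vector(context, features)
o2 = second_order_vector(context, word_vectors, dims=2)
print(o1)  # [1, 0, 1]
print(o2)  # [3.0, 0.0]
```

The first order vector has one dimension per feature, whereas the second
order vector lives in the space of the word vectors, so two contexts
that share no features directly can still end up close to each other.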
* Clustering
SenseClusters implements three agglomerative algorithms: single,
complete and average link clustering. For more extensive clustering
options, SenseClusters recommends and supports a specialised
clustering toolkit, CLUTO. CLUTO implements a number of clustering
algorithms and criteria functions, as well as cluster analysis and
visualization techniques. It reports the inter-cluster and
intra-cluster similarity, the standard deviation and the most
discriminating features characterizing each cluster, and it offers a
dendrogram tree view and a 3D mountain view of the clusters, all of
which may be helpful to SenseClusters users.
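The three link criteria differ only in how the similarity between two
clusters is derived from the pairwise similarities of their members. The
following is a small, unoptimised sketch of that idea, assuming cosine
similarity between context vectors; it is not the code used by
SenseClusters or CLUTO, and the function names and toy vectors are
hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def link_similarity(c1, c2, vectors, link):
    """Similarity between two clusters under the chosen link criterion."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    if link == "single":
        return max(sims)          # closest pair of members
    if link == "complete":
        return min(sims)          # farthest pair of members
    return sum(sims) / len(sims)  # average over all pairs

def agglomerate(vectors, k, link="average"):
    """Merge the two most similar clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = link_similarity(clusters[a], clusters[b], vectors, link)
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Four toy context vectors: two point one way, two point the other way.
vectors = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
print(agglomerate(vectors, k=2, link="average"))  # [[0, 1], [2, 3]]
```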
* Evaluation
In addition to the analysis and visualization techniques indirectly
supported via CLUTO, SenseClusters also reports the accuracy of
discrimination in terms of the precision, recall and f-measure
values when the true classification of the discriminated instances
is known.
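To make the three evaluation measures concrete, here is a minimal
sketch of one common way to compute them for a clustering. It rests on
assumptions that may not match SenseClusters' own evaluation programs:
clusters are mapped one-to-one onto senses so as to maximise agreement,
precision is taken over the instances that were actually clustered, and
recall over all instances. The function name and toy data are
hypothetical.

```python
from itertools import permutations

def evaluate(cluster_ids, gold_senses):
    """Return (precision, recall, f_measure) for a clustering.

    cluster_ids[i] is the cluster of instance i, or None if the instance
    was left unclustered. The best one-to-one cluster-to-sense mapping is
    found by brute force, which is only practical for toy sizes.
    """
    clusters = sorted({c for c in cluster_ids if c is not None})
    senses = sorted(set(gold_senses))
    if len(clusters) > len(senses):
        raise ValueError("this sketch assumes no more clusters than senses")

    best_correct = 0
    for assignment in permutations(senses, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        correct = sum(
            1
            for c, s in zip(cluster_ids, gold_senses)
            if c is not None and mapping[c] == s
        )
        best_correct = max(best_correct, correct)

    attempted = sum(1 for c in cluster_ids if c is not None)
    precision = best_correct / attempted if attempted else 0.0
    recall = best_correct / len(gold_senses) if gold_senses else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure

# Six instances; the last one was not assigned to any cluster.
cluster_ids = [0, 0, 0, 1, 1, None]
gold_senses = ["money", "money", "river", "river", "river", "river"]
p, r, f = evaluate(cluster_ids, gold_senses)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.667 0.727
```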
In addition to the above, SenseClusters is an open source software
project that is freely distributed under the GNU General Public License
(GPL). The packages used by SenseClusters in its processing, such as
NSP, SVDPACK and CLUTO, are also freely distributed and easy to install.
DOWNLOAD
SenseClusters
The most recent version of SenseClusters can be downloaded from -
http://sourceforge.net/projects/senseclusters/
You can also check out the current development version of SenseClusters
from the SourceForge CVS repository by running the following command on
your Unix terminal -
cvs -d :pserver:anonymous@cvs.sf.net:/cvsroot/senseclusters checkout -P SC
You may browse the SenseClusters CVS repository ONLINE from -
http://cvs.sourceforge.net/viewcvs.py/senseclusters/SC/
Other Packages Used by SenseClusters:
NSP
http://www.d.umn.edu/~tpederse/nsp.html
SenseTools
http://www.d.umn.edu/~tpederse/sensetools.html
PDL
http://search.cpan.org/dist/PDL/
CLUTO
http://www-users.cs.umn.edu/~karypis/cluto/
SVDPACK
http://netlib.org/svdpack/
Bit::Vector
http://search.cpan.org/dist/Bit-Vector/
Sparse
http://search.cpan.org/dist/Sparse/
Set::Scalar
http://search.cpan.org/dist/Set-Scalar/
DOCUMENTATION
SenseClusters' documentation is available ONLINE at -
http://senseclusters.sourceforge.net/Docs.html
For OFFLINE browsing, directory Docs/ is provided in SenseClusters' main
package directory.
All programs have inline source code documentation written in POD style,
which can be browsed from the command line as a man page. These man
pages are installed automatically as part of the installation. The
documentation of any program can be viewed with the man command; e.g.
'man bitsimat.pl' displays the man page of the program bitsimat.pl.
GETTING STARTED
You may first want to run the Demo scripts in the Demos/ directory to
get an idea of SenseClusters' usage and functionality.
SenseClusters can be used in two ways -
* using the wrapper programs that provide an easy to use interface and
take you through the complete processing
* running the tools provided by SenseClusters in conjunction with NSP,
SVDPACK and CLUTO separately for customized experiments.
Directory Demos/ contains scripts that illustrate both these ways. If
you are planning to run the programs separately without wrappers, you
will find the flowcharts provided in Docs/Flows/ directory very useful.
Step 1. Input
To use SenseClusters, you will first need to convert your data into
Senseval-2 format. This distribution provides a preprocessing
program, text2sval.pl, in Toolkit/preprocess/plain/ that converts
data in plain text format (with a single text instance on each line)
into Senseval-2 format.
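For readers unfamiliar with the target format, the sketch below wraps
one-instance-per-line text in simplified Senseval-2 style markup. It is
only an illustration of the general shape of the data and is not a
substitute for text2sval.pl, whose exact output may differ. The tag
layout shown (corpus, lexelt, instance, context, with the target word
marked by head tags in the input) is an assumption, and the function
name and sample sentences are hypothetical.

```python
def to_senseval2_like(lines, lexelt="LEXELT"):
    """Wrap one-instance-per-line text in simplified Senseval-2 style markup.

    Assumed layout: corpus > lexelt > instance > context. The text of each
    line is copied verbatim, so any <head> tags in the input are preserved.
    """
    out = ['<corpus lang="english">', '<lexelt item="%s">' % lexelt]
    instance_id = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank lines do not produce instances
        out.append('<instance id="%d">' % instance_id)
        out.extend(["<context>", text, "</context>", "</instance>"])
        instance_id += 1
    out.extend(["</lexelt>", "</corpus>"])
    return "\n".join(out)

# Hypothetical input: one context per line, target word marked with <head>.
plain_text = [
    "She sat on the <head>bank</head> of the river.",
    "",
    "The <head>bank</head> approved the loan.",
]
doc = to_senseval2_like(plain_text, lexelt="bank")
print(doc)
```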
Step 2. Using Wrappers
This is the easiest and quickest way to use SenseClusters. Wrappers
(setup.pl and discriminate.pl found in SenseClusters' main
directory) integrate SenseClusters' code with other packages like
NSP, SenseTools, CLUTO or SVDPACK to provide an easy command line
interface to carry out a complete discrimination experiment.
Once you have the data in Senseval-2 format, run setup.pl, which
tokenizes and preprocesses the data. This step is not required but
is recommended, as it also performs some cleanup operations to make
sure that the data is in the right format for the rest of the
programs. Note that the setup.pl script splits the given data based
on the 'lexelt' tag values, separating the instances that belong to
different lexelt tags (target words).
Use the preprocessed and tokenized test instance file in Senseval-2
format as input to discriminate.pl to discriminate the instances in
this file. You can also optionally provide this wrapper with a
separate training file in plain text format (the *training.count
file if setup.pl was run) as a source for selecting features.
Step 3. Using Toolkit
Check the Demo scripts in Demos/ and flowcharts in Docs/Flows/ to
understand the sequence of commands to run a full experiment. Run
the programs with relevant options as shown in the demo scripts.
You may find the individual program descriptions in
Toolkit/Toolkit.pod helpful while using the Toolkit. All programs in
Toolkit/ are documented in pod format and their documentation can be
browsed as a man page. Running any program with the --help option
displays a quick summary of its options and usage.
PACKAGE ORGANIZATION
After downloading and unpacking SenseClusters, you should find the
following files/directories within SenseClusters' directory.
* README.SC.pod
This file.
* INSTALL
An installation guide.
* setup.pl
A preprocessing wrapper that provides an interface to SenseClusters'
data preprocessing programs.
* discriminate.pl
Clusters the given text instances based on their contextual
similarities.
* Demos/
A directory of scripts that demonstrate SenseClusters' usage and
functionality.
* Toolkit/
A directory of Perl programs implemented and used by SenseClusters.
Users who are interested in using SenseClusters' tools individually,
without the wrapper programs, are encouraged to browse through the
Toolkit and Toolkit.pod.
* Docs/
A directory of SenseClusters' documentation in html format.
Directory Docs/Flows/ contains flow diagrams that illustrate how to
put together the programs provided in SenseClusters' Toolkit with
other packages like NSP, SenseTools, SVDPACK and CLUTO to run
experiments without wrappers.
* Testing/
A directory of test cases, written as C-shell scripts, that test
whether the package is installed properly.
* Scripts/
A directory of C-Shell scripts used by SenseClusters.
* Web/
Contains an easy to use and install web interface for SenseClusters.
* Changes/
A directory of changelogs that document the changes and improvements
made in each version.
* Makefile.PL
Generates a Makefile on running 'perl Makefile.PL'.
* GPL.txt
A copy of the GNU General Public License, the terms under which
SenseClusters is distributed.
* FDL.txt
A copy of the GNU Free Documentation License, the terms under which
the documentation of SenseClusters is distributed.
CONTACT US
SenseClusters is developed and maintained by Amruta Purandare, Anagha
Kulkarni, and Ted Pedersen via Sourceforge (http://sourceforge.net/).
Please join our mailing lists to participate in package-related
discussions, to post questions or bug reports, and to suggest
enhancements to the package functionality.
To subscribe to the Users' mailing list, visit -
http://lists.sourceforge.net/lists/listinfo/senseclusters-users
To join the news mailing list, visit -
http://lists.sourceforge.net/lists/listinfo/senseclusters-news
The most recent version of SenseClusters can be downloaded from
http://senseclusters.sourceforge.net/
SEE ALSO
SenseClusters' ONLINE Documentation at
http://senseclusters.sourceforge.net/Docs.html
AUTHORS
Amruta Purandare
University of Pittsburgh
amruta@cs.pitt.edu
http://www.cs.pitt.edu/~amruta/
Anagha Kulkarni
University of Minnesota, Duluth
kulka020@d.umn.edu
http://www.d.umn.edu/~kulka020/
Ted Pedersen
University of Minnesota, Duluth
tpederse@d.umn.edu
http://www.d.umn.edu/~tpederse/
ACKNOWLEDGMENTS
This work has been partially supported by a National Science Foundation
Faculty Early CAREER Development award (Grant #0092784).
We would also like to express our special thanks to
* Dr. George Karypis and his research group for developing CLUTO and
GCLUTO
* Michael Berry and co-developers of SVDPACK
* Christian Soeller and PDL developers' team
* Satanjeev Banerjee for developing the Ngram Statistics Package and
SenseTools package
And finally, to Guergana Savova for her valuable feedback on, and use
and testing of, SenseClusters, starting with beta versions 0.37 and 0.39
and continuing to the present.
COPYRIGHT
Copyright (c) 2003-2005
Amruta Purandare, University of Pittsburgh amruta@cs.pitt.edu
Ted Pedersen, University of Minnesota, Duluth. tpederse@umn.edu
This program is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to
The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
MA 02111-1307, USA.
Changes:
Changes made in SenseClusters from version 0.65 to version 0.67
Anagha Kulkarni kulka020@d.umn.edu
Amruta Purandare amruta@cs.pitt.edu
Ted Pedersen tpederse@d.umn.edu
1. Renamed the "co-occurrence" feature type for first order context
representation to "target co-occurrence" in discriminate.pl -Anagha
2. Made "bigram" the default feature type for both order1 and
order2 context representations. -Anagha
3. Added the new feature type of normal "co-occurrences" to order1
representation -Anagha
4. Added the new feature type of "target co-occurrences" to order2
representation -Anagha
5. Updated the web interface. It now begins with a start-up page
where one can select the type of experiment (With Head,
Headless or Word Clustering); only the applicable options
are then shown on the following screens. -Anagha
(Changelog-v0.65to0.67 Last Updated on 06/04/2005 by Anagha)