File Release Notes and Changelog
Notes:
=head1 NAME
SenseClusters
=head1 SYNOPSIS
SenseClusters is a complete Word Sense Discrimination system. It takes users
from preprocessing of raw text through the discrimination itself, which
involves selecting the most discriminating features, building context
representations and clustering, and on to extensive analysis and performance
evaluation.
=head1 INTRODUCTION
Words in natural language are often ambiguous, and their meaning changes as the
context around them changes. Word Sense Discrimination attempts to distinguish
the different possible meanings of a word by identifying which occurrences of
an ambiguous word are likely to carry the same or a similar meaning. Since
semantically related word meanings are often used in similar contexts,
SenseClusters achieves sense discrimination by grouping together the contexts
of the target words (the words being discriminated) that are strongly
similar to each other.
In short, SenseClusters creates clusters of the given text instances
(an instance could be a group of one or more sentences that form a context of
the target word) such that instances grouped in the same cluster are
contextually more similar to each other than to other instances, and are thus
more likely to use the same meaning of the target word. Each cluster therefore
represents a single word meaning shared by the instances belonging to
that cluster.
=head1 APPLICATIONS
SenseClusters can be used for various Natural Language Processing tasks like
information retrieval, document clustering/indexing, synonymy identification,
word sense disambiguation, text/topic summarization, text classification and
even in special applications like plagiarism detection or
email sorting/indexing.
Broadly speaking, SenseClusters can be used for any task that requires
clustering of contextually similar text units.
=head1 UNIQUE FEATURES
=over 4
=item * Efficiency and Choices
SenseClusters integrates efficient, specialized tools such as
the Ngram Statistics Package (NSP), SVDPACK, CLUTO and GCLUTO to provide a
variety of choices and high efficiency at each processing step: feature
selection, dimensionality reduction, clustering, post-clustering
analysis and visualization.
=item * Features
SenseClusters supports a number of lexical features, such as unigrams, bigrams
and co-occurrences, via NSP. Powerful features can be selected by performing
statistical tests of association (such as log-likelihood, Fisher's exact test
or pointwise mutual information) and discarding features with low association
scores.
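As a rough illustration of scoring and filtering features, the sketch below
computes pointwise mutual information for adjacent word pairs and keeps only
the pairs that reach a cutoff. It is not how NSP is implemented; the function
names bigram_pmi and select_features, the toy token list and the cutoff value
are assumptions made for this example.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI score} for every adjacent word pair in tokens."""
    bigrams = list(zip(tokens, tokens[1:]))
    total = len(bigrams)
    pair_counts = Counter(bigrams)
    left_counts = Counter(w1 for w1, _ in bigrams)    # w1 seen as first word
    right_counts = Counter(w2 for _, w2 in bigrams)   # w2 seen as second word
    scores = {}
    for (w1, w2), n12 in pair_counts.items():
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), estimated from bigram counts
        scores[(w1, w2)] = math.log2(
            n12 * total / (left_counts[w1] * right_counts[w2])
        )
    return scores


def select_features(scores, cutoff):
    """Discard pairs whose score is below the (hypothetical) cutoff."""
    return sorted(pair for pair, score in scores.items() if score >= cutoff)


# Toy input, invented for this example.
tokens = "the river bank and the savings bank near the river bank".split()
scores = bigram_pmi(tokens)
for pair in select_features(scores, cutoff=1.0):
    print(pair, round(scores[pair], 3))
```

Other association measures, such as log-likelihood, would slot into the same
score-then-filter pattern with a different scoring formula.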
=item * LSA Model
SenseClusters simulates a cognitive/psychological learning model, Latent
Semantic Analysis (LSA), by converting a word level feature space into a
concept level semantic space that aims to avoid errors caused by polysemy and
synonymy in natural language. It represents each feature word and each context
of the target word as a vector in a high dimensional feature space, and
supports dimensionality reduction by Singular Value Decomposition (SVD)
through an interface to SVDPACK.
SenseClusters supports two distinct context vector representations:
first order (or order1) vectors and second order (or order2) vectors.
A first order context vector represents a context by the features that occur
directly in it, while a second order context vector is the average of the
word vectors of those contextual features.
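The difference between the two representations can be sketched in a few lines.
This is only an illustration under stated assumptions: the first order vector
is taken to be binary (a frequency count would work the same way), and the
feature list, the word vectors and the function names are hypothetical.

```python
def first_order_vector(context_tokens, features):
    """Order1: mark which features occur directly in the context (binary)."""
    present = set(context_tokens)
    return [1 if feature in present else 0 for feature in features]


def second_order_vector(context_tokens, word_vectors, dims):
    """Order2: average the word vectors of the context words that have one."""
    rows = [word_vectors[t] for t in context_tokens if t in word_vectors]
    if not rows:
        # No feature word in this context: fall back to a zero vector.
        return [0.0] * dims
    return [sum(column) / len(rows) for column in zip(*rows)]


# Hypothetical features and word vectors, invented for this example.
features = ["money", "loan", "river"]
word_vectors = {
    "money": [2.0, 0.0],
    "loan": [0.0, 4.0],
}
context = "the bank approved the loan and sent the money".split()

order1 = first_order_vector(context, features)
order2 = second_order_vector(context, word_vectors, dims=2)
print(order1, order2)
```

A first order vector has one dimension per feature, whereas a second order
vector lives in the space of the word vectors, so two contexts that share no
feature can still end up close to each other.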
=item * Clustering
SenseClusters implements three agglomerative algorithms: single, complete
and average link clustering. For more extensive clustering options, SenseClusters
recommends and seamlessly supports a specialized clustering toolkit, CLUTO.
CLUTO implements a number of clustering algorithms and criteria functions,
as well as cluster analysis and visualization techniques.
It reports the inter-cluster and intra-cluster similarity,
standard deviation, the most discriminating features characterizing each cluster,
a dendrogram tree view and a 3D mountain view of clusters, all of which can be
useful to SenseClusters users.
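To make the three linkage criteria concrete, here is a minimal sketch of
agglomerative clustering over a similarity matrix. It is not the SenseClusters
or CLUTO code; the function name, the stopping rule (a fixed number of
clusters k) and the toy matrix are assumptions made for this example.

```python
def agglomerative(sim, k, linkage="average"):
    """Merge the two most similar clusters until only k clusters remain.

    sim is a symmetric matrix of pairwise similarities between instances.
    """
    combine = {
        "single": max,                        # most similar pair of members
        "complete": min,                      # least similar pair of members
        "average": lambda v: sum(v) / len(v)  # mean over all member pairs
    }[linkage]
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_sims = [sim[i][j] for i in clusters[a] for j in clusters[b]]
                score = combine(pair_sims)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy similarity matrix for four instances, invented for this example.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
print(agglomerative(sim, k=2, linkage="single"))
```

The only thing that changes between the three algorithms is how the
similarity between two clusters is derived from the similarities of their
members.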
=item * Evaluation
In addition to the analysis and visualization techniques indirectly supported
via Cluto, SenseClusters also reports the accuracy of discrimination in terms
of the precision and recall values when the true classification of the
discriminated instances is known.
=back
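For the Evaluation item above, the following sketch shows one way precision
and recall can be computed once clusters have been matched to known senses.
It rests on assumptions that are not taken from the SenseClusters code: the
clusters are mapped to senses one-to-one so as to maximize agreement,
precision is the number of correct instances divided by the number of
clustered instances, and recall is the number of correct instances divided by
the total number of instances. The function names and the toy counts are
hypothetical.

```python
from itertools import permutations


def best_mapping(confusion):
    """Find the one-to-one cluster-to-sense mapping with the most agreement.

    confusion[c][s] is the number of instances in cluster c whose true sense
    is s (square matrix). Brute force, so only suitable for a few clusters.
    """
    n = len(confusion)
    best_perm, best_hits = None, -1
    for perm in permutations(range(n)):
        hits = sum(confusion[c][perm[c]] for c in range(n))
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return best_perm, best_hits


def precision_recall(correct, clustered, total):
    """Assumed definitions: correct/clustered and correct/total."""
    precision = correct / clustered if clustered else 0.0
    recall = correct / total if total else 0.0
    return precision, recall


# Toy counts, invented for this example: 18 of 20 instances were clustered.
confusion = [
    [1, 8],   # cluster 0: 1 instance of sense 0, 8 of sense 1
    [7, 2],   # cluster 1: 7 instances of sense 0, 2 of sense 1
]
mapping, correct = best_mapping(confusion)
precision, recall = precision_recall(correct, clustered=18, total=20)
print(mapping, correct, round(precision, 3), round(recall, 3))
```

Trying every permutation grows factorially with the number of clusters, so a
dedicated assignment-problem solver is the better choice for larger matrices.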
In addition to the above, SenseClusters is an open source software project that is
freely distributed under the GNU General Public License (GPL). The packages
used by SenseClusters in its processing, such as NSP, SVDPACK and CLUTO, are
also freely distributed and easy to install.
=head1 DOWNLOAD
=head2 SenseClusters
The most recent version of SenseClusters can be downloaded from -
http://sourceforge.net/projects/senseclusters/
You can also check out the current development version of SenseClusters from
the Sourceforge CVS repository by running the following command in a Unix
terminal -
    cvs -d :pserver:anonymous@cvs.sf.net:/cvsroot/senseclusters checkout -P SC
You may browse the SenseClusters CVS repository ONLINE at -
http://cvs.sourceforge.net/viewcvs.py/senseclusters/SC/
=head2 Other Packages Used by SenseClusters
=head3 NSP
http://www.d.umn.edu/~tpederse/nsp.html
=head3 SenseTools
http://www.d.umn.edu/~tpederse/sensetools.html
=head3 PDL
http://search.cpan.org/dist/PDL/
=head3 CLUTO
http://www-users.cs.umn.edu/~karypis/cluto/
=head3 SVDPACK
http://netlib.org/svdpack/
=head3 Bit::Vector
http://search.cpan.org/dist/Bit-Vector/
=head3 Sparse
http://search.cpan.org/dist/Sparse/
=head3 Set::Scalar
http://search.cpan.org/dist/Set-Scalar/
=head1 DOCUMENTATION
SenseClusters' documentation is available ONLINE at -
http://senseclusters.sourceforge.net/Docs.html
For OFFLINE browsing, the Docs/ directory is provided in SenseClusters' main
package directory.
All programs have inline source code documentation written in pod
style, which can be browsed from the command line as a man page. These man
pages are installed automatically as part of the installation, so the
documentation of any program can be viewed with the man command.
For example, 'man bitsimat.pl' displays the man page of the program bitsimat.pl.
=head1 GETTING STARTED
You may want to start by running the demo scripts in the Demos/ directory to
get an idea of SenseClusters' usage and functionality.
SenseClusters can be used in two ways -
=over
=item * using the wrapper programs that provide an easy to use interface and take you through the complete processing
=item * running the tools provided by SenseClusters in conjunction with NSP, SVDPACK and CLUTO separately for customized experiments.
=back
Directory Demos/ contains scripts that illustrate both these ways. If you are
planning to run the programs separately without wrappers, you will find the
flowcharts provided in Docs/Flows/ directory very useful.
=over 4
=item Step 1. Input
To use SenseClusters, you will first need to convert your data into Senseval-2
format. We provide with this distribution a pre-processing program
text2sval.pl in Toolkit/preprocess/plain/ that converts data in plain text
format (with a single text instance on each line) into Senseval-2 format.
=item Step 2. Using Wrappers
This is the easiest and quickest way to use SenseClusters. Wrappers
(setup.pl and discriminate.pl found in SenseClusters' main directory)
integrate SenseClusters' code with other packages like NSP, SenseTools, CLUTO
or SVDPACK to provide an easy command line interface to carry out a complete
discrimination experiment.
Once you have the data in Senseval-2 format, run setup.pl, which
tokenizes and preprocesses the data. This step is not required but is
recommended, because it also performs some cleanup operations to make sure that
the data is in the format required by the rest of the programs. Note that
setup.pl splits the given data based on the 'lexelt' tag values,
separating the instances that belong to different lexelt tags (target words).
Use the preprocessed and tokenized test instance file in Senseval-2 format
as input to discriminate.pl, which discriminates the instances
in this file. You can also optionally give this wrapper a separate training
file in plain text format (the *training.count file if setup.pl was run) as a
source for selecting features.
=item Step 3. Using Toolkit
Check the Demo scripts in Demos/ and flowcharts in Docs/Flows/ to understand
the sequence of commands to run a full experiment. Run the programs
with relevant options as shown in the demo scripts.
You may find the individual program descriptions in Toolkit/Toolkit.pod
helpful while using the Toolkit. All programs in Toolkit/ are documented in
pod format and their documentation can be browsed as a man page.
Running any program with the --help option displays a quick summary of
its options and usage.
=back
=head1 PACKAGE ORGANIZATION
After downloading and unpacking SenseClusters, you should find the following
files and directories within SenseClusters' directory.
=over 4
=item * README.SC.pod
This file.
=item * INSTALL
An installation guide.
=item * setup.pl
A preprocessing wrapper that provides an interface to SenseClusters' data
preprocessing programs.
=item * discriminate.pl
Clusters the given text instances based on their contextual similarities.
=item * Demos/
A directory of scripts that demonstrate SenseClusters' usage and functionality.
=item * Toolkit/
A directory of Perl programs implemented and used by SenseClusters. Users
who are interested in using SenseClusters' tools individually,
without the wrapper programs, are encouraged to browse through the
Toolkit and Toolkit.pod.
=item * Docs/
A directory of SenseClusters' documentation in html format.
Directory Docs/Flows/ contains flow diagrams that illustrate how to put
together the programs provided in SenseClusters' Toolkit with other packages
like NSP, SenseTools, SVDPACK and CLUTO to run experiments without wrappers.
=item * Testing/
A directory of test cases written as C-shell scripts that check whether the
package is installed properly.
=item * Scripts/
A directory of C-Shell scripts used by SenseClusters.
=item * Changes/
A directory of changelogs that document the changes and improvements made
in each version.
=item * Makefile.PL
Generates a Makefile on running 'perl Makefile.PL'.
=item * GPL.txt
A copy of the GNU General Public License, the terms under which SenseClusters
is distributed.
=item * FDL.txt
A copy of the GNU Free Documentation License, the terms under which the
documentation of SenseClusters is distributed.
=back
=head1 CONTACT US
SenseClusters is developed and maintained by Amruta Purandare, Anagha
Kulkarni, and Ted Pedersen via Sourceforge (http://sourceforge.net/).
Please join our mailing lists to participate in package related discussions,
to post questions or bug reports, and to suggest enhancements to the
package functionality.
To subscribe to the Users' mailing list, visit -
http://lists.sourceforge.net/lists/listinfo/senseclusters-users
To subscribe to the News mailing list, visit -
http://lists.sourceforge.net/lists/listinfo/senseclusters-news
The most recent version of SenseClusters can be downloaded from
http://senseclusters.sourceforge.net/
=head1 SEE ALSO
SenseClusters' ONLINE Documentation at
http://senseclusters.sourceforge.net/Docs.html
=head1 AUTHORS
Amruta Purandare
University of Minnesota, Duluth
pura0010@d.umn.edu
Anagha Kulkarni
University of Minnesota, Duluth
kulka020@d.umn.edu
Ted Pedersen
University of Minnesota, Duluth
tpederse@d.umn.edu
=head1 ACKNOWLEDGMENTS
This work has been partially supported by a National Science Foundation
Faculty Early CAREER Development award (Grant #0092784).
We would also like to express our special thanks to
=over
=item * Dr. George Karypis and his research group for developing CLUTO and GCLUTO
=item * Michael Berry and co-developers of SVDPACK
=item * Christian Soeller and PDL developers' team
=item * Satanjeev Banerjee for developing the Ngram Statistics Package and
SenseTools package
=back
And finally, to Guergana Savova for her valuable feedback, use and testing of
SenseClusters' beta versions 0.37 and 0.39.
=head1 COPYRIGHT
Copyright (c) 2004,
Amruta Purandare, University of Minnesota, Duluth.
pura0010@d.umn.edu
Ted Pedersen, University of Minnesota, Duluth.
tpederse@umn.edu
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to
The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA 02111-1307, USA.
=cut
Changes:
Changes made to SenseClusters version 0.55 during the development of version 0.57
Amruta Purandare amruta@cs.pitt.edu
Ted Pedersen tpederse@d.umn.edu
Anagha Kulkarni kulka020@d.umn.edu
1. removed the following programs from SenseClusters:
Toolkit/cluster/agglom.pl --Ted
Toolkit/preprocess/plain/shuffle.pl --Ted
Toolkit/preprocess/plain/token.pl --Ted
Toolkit/evaluate/prelabel.pl --Ted
Toolkit/evaluate/tagger.pl --Anagha
2. removed all test cases found in /Testing for 1. --Ted and Anagha
3. Added pod-template.pl to Docs. --Amruta
4. Added script format_clusters.pl to
Toolkit/evaluate to format Cluto's
clustering solution file --Amruta
Added the following functionality to
Toolkit/evaluate/format_clusters.pl: --Anagha
   a. Displays contexts along with the
      instance tags grouped by cluster id
   b. Displays the Senseval-2 format input file
      with the cluster id assigned to each instance
      in the answer tag.
5. Added test cases for format_clusters.pl at --Anagha
Testing/evaluate/format_clusters/
6. Updated discriminate.pl to create
prefix.clusters file that contains
the formatted clusters --Amruta
7. Updated Makefile.PL to reflect new and deleted programs. -- Ted
8. Updated discriminate.pl to support a --format option, to
allow for adjustment of format when using larger amounts
of data. -- Ted
9. Updated mat2harbo.pl to display message indicating minimum
size needed for las2.h parameter LMNTW. -- Ted
10. Removed status check in discriminate.pl for las2, vcluster,
and scluster. Return values don't appear reliable. --Ted
11. Updated discriminate.pl to have format_clusters.pl use
--senseval2 option by default. --Ted
12. Removed Toolkit.(dia|pdf|fig) - plan to incorporate info into
Flows/flowchart when those are updated. --Ted
13. Removed /Docs/Toolkit from CVS. This is automatically
generated and doesn't need to be a part of CVS. -- Ted
14. Moved Todo.pod to /Docs/Todo.SC.pod, renamed README.Intro.pod
as README.SC.pod. --Ted
15. Tried to clarify and simplify INSTALL instructions. -- Ted
16. Modified label.pl to use CPAN module Algorithm::Munkres
for solving the Assignment problem. --Anagha
17. Added a test case for label.pl which checks for 25x25
Sense X Cluster matrix. --Anagha
18. Modified Toolkit/evaluate/report.pl to make output
formatting more spacious. --Anagha
19. Renamed various READMEs to README.foldername.pod and
changed a few from plain text to pod documents.
Added Acknowledgement to all READMEs --Anagha
21. Changed README.Toolkit.pod's format --Anagha
22. Moved create_doc.sh and traverse.sh from Scripts
folder to a new folder HTML under Docs folder.
Added README.HTML.pod for this folder. --Anagha
23. Modified create_doc.sh to create .html files for SC's
root level .pl files in Docs/HTML directory and those
for Toolkit folder in Toolkit_Docs folder under
Docs/HTML directory --Anagha
24. Renamed GPL and FDL as GPL.txt and FDL.txt --Anagha
25. Updated Makefile.PL to include new and renamed
pod files and to reflect change of location for
create_doc.sh and traverse.sh --Anagha
26. Changed test scripts that check erroneous conditions
for mat2harbo.pl. The modified test scripts check for
particular error message in the output file. --Anagha
27. Changed test-A3f.sh, test-A1g.reqd, test-A1h.reqd
and test-A2.reqd to handle cross-platform compatibility
of test scripts. --Anagha
28. Renamed Changelogs as Changes
--Ted
(Changelog-v0.55to0.57 Last Updated on 12/11/2004 by Ted)