SenseClusters

File Release Notes and Changelog

Release Name: SenseClusters-v0.57

Notes:
=head1 NAME

SenseClusters

=head1 SYNOPSIS

SenseClusters is a complete Word Sense Discrimination system that takes users 
from preprocessing of raw text through to discrimination itself. This involves 
selecting the most discriminating features, building context representations, 
and clustering, followed by extensive analysis and performance evaluation.

=head1 INTRODUCTION

Words in natural language are often ambiguous, and their meaning changes as the 
context around them changes. Word Sense Discrimination attempts to distinguish 
the different possible meanings of a word by identifying which occurrences of 
an ambiguous word are likely to illustrate the same or a similar meaning. Since 
semantically related word meanings are often used in similar contexts, 
SenseClusters achieves sense discrimination by grouping together the contexts 
of the target words (the words being discriminated) that are strongly similar 
to each other.

In short, SenseClusters creates clusters of the given text instances 
(an instance is a group of one or more sentences that form a context of 
the target word) such that instances grouped in the same cluster are 
contextually more similar to each other than to other instances, and thus are 
more likely to use the same meaning of the target word. Each cluster thus 
represents a single word meaning that is shared by the instances belonging to 
that cluster.
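
The idea of contextual similarity can be illustrated with a minimal Python 
sketch. It is not SenseClusters code: the bag-of-words representation, the 
tiny stop list and the example sentences are all assumptions made for this 
illustration. It only shows that two contexts using the same meaning of 
"bank" tend to share more words than two contexts using different meanings.

```python
import math
from collections import Counter

# Hypothetical stop list, chosen only to keep this toy example readable.
STOPWORDS = {"the", "of", "and", "on", "at", "an", "he", "she", "was", "after"}

def bag_of_words(context):
    """Lowercase the context, split on whitespace, drop stop words, count tokens."""
    return Counter(t for t in context.lower().split() if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

contexts = [
    "he sat on the bank of the river and watched the water",
    "the river bank was muddy after the water rose",
    "she deposited the money at the bank and opened an account",
]
vectors = [bag_of_words(c) for c in contexts]

sim_river = cosine(vectors[0], vectors[1])  # both contexts: river bank
sim_mixed = cosine(vectors[0], vectors[2])  # river bank vs. money bank
print(round(sim_river, 2), round(sim_mixed, 2))  # 0.6 0.2
```

A clustering algorithm that groups instances by such similarity scores would 
place the first two contexts together and the third in a separate cluster.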

=head1 APPLICATIONS

SenseClusters can be used for various Natural Language Processing tasks such as 
information retrieval, document clustering/indexing, synonymy identification,
word sense disambiguation, text/topic summarization, and text classification, 
and even in special applications such as plagiarism detection or 
email sorting/indexing.

Broadly speaking, SenseClusters can be used for any task that requires 
clustering of contextually similar text units.

=head1 UNIQUE FEATURES

=over 4

=item * Efficiency and Choices 

SenseClusters integrates specialised tools such as the Ngram Statistics 
Package (NSP), SVDPACK, CLUTO and GCLUTO to provide a variety of choices and 
high efficiency at each processing step: feature selection, dimensionality 
reduction, clustering, post-clustering analysis and visualization.

=item * Features 

SenseClusters supports a number of lexical features, such as unigrams, bigrams 
and co-occurrences, via NSP. Powerful features can be selected by performing 
statistical tests of association (such as log-likelihood, Fisher's exact test 
and pointwise mutual information) and discarding features with low association 
scores.

=item * LSA Model 

SenseClusters simulates a cognitive/psychological learning model, Latent 
Semantic Analysis (LSA), by converting a word level feature space into a 
concept level semantic space, which helps avoid errors due to polysemy and 
synonymy in natural languages. It represents each feature word and each 
context of the target word as a vector in a high dimensional feature space, 
and supports dimensionality reduction by Singular Value Decomposition (SVD) 
through an interface to SVDPACK.

SenseClusters supports two distinct context vector representations: 
first order (order1) vectors and second order (order2) vectors. A first order 
context vector represents a context by the features that occur directly in 
it, while a second order context vector is the average of the word vectors of 
those contextual features.

=item * Clustering

SenseClusters implements three agglomerative algorithms: single, complete 
and average link clustering. For more extensive clustering options, 
SenseClusters recommends and supports a specialised clustering toolkit, CLUTO. 
CLUTO implements a number of clustering algorithms and criteria functions, 
as well as cluster analysis and visualization techniques. It reports the 
inter-cluster and intra-cluster similarity, the standard deviation, the most 
discriminating features characterizing each cluster, a dendrogram tree view, 
and a 3D mountain view of the clusters, all of which may be helpful to 
SenseClusters users too.

=item * Evaluation

In addition to the analysis and visualization techniques indirectly supported 
via CLUTO, SenseClusters also reports the accuracy of discrimination in terms 
of precision and recall when the true classification of the discriminated 
instances is known.

=back
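
The evaluation step described in the last item can be sketched in Python as 
follows. This is an illustration under stated assumptions, not the code used 
by SenseClusters. It assumes that each cluster is mapped to a distinct sense 
so that agreement is maximised, that there are no more clusters than senses, 
that precision is the fraction of clustered instances labelled correctly, and 
that recall is the fraction of all instances labelled correctly. The mapping 
is found here by brute force over permutations; a Hungarian (Munkres) solver 
would be the scalable choice. The function name and data are hypothetical.

```python
from itertools import permutations

def best_mapping(clusters, senses):
    """Try every one-to-one assignment of clusters to senses and keep the
    one that labels the most instances correctly. Instances whose cluster
    is None are treated as unclustered and never count as correct."""
    cluster_ids = sorted({c for c in clusters if c is not None})
    sense_ids = sorted(set(senses))
    best, best_correct = {}, -1
    for perm in permutations(sense_ids, len(cluster_ids)):
        mapping = dict(zip(cluster_ids, perm))
        correct = sum(1 for c, s in zip(clusters, senses)
                      if c is not None and mapping[c] == s)
        if correct > best_correct:
            best, best_correct = mapping, correct
    return best, best_correct

# Hypothetical data: cluster id per instance (None = left unclustered)
# and the known true sense of each instance.
clusters = [0, 0, 0, 1, 1, 1, None, 1]
senses = ["river", "river", "money", "money", "money", "river", "money", "money"]

mapping, correct = best_mapping(clusters, senses)
attempted = sum(1 for c in clusters if c is not None)

precision = correct / attempted    # correct among clustered instances
recall = correct / len(senses)     # correct among all instances
print(mapping, correct, attempted)  # {0: 'river', 1: 'money'} 5 7
```

With this data, precision is 5/7 and recall is 5/8; the two differ only 
because one instance was left unclustered.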

In addition to the above, SenseClusters is an open source software project 
that is freely distributed under the GNU General Public License (GPL). 
Packages such as NSP, SVDPACK and CLUTO, which SenseClusters uses in its 
processing, are also freely distributed and easy to install.
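
The difference between the first order and second order context vectors 
described under "LSA Model" above can be sketched in Python. This is a toy 
illustration, not SenseClusters code: the feature list and the word vectors 
(which stand in for rows of a word co-occurrence matrix, possibly reduced by 
SVD) are invented for the example.

```python
# Hypothetical feature words selected from training data.
features = ["river", "water", "money", "loan"]

# Hypothetical word vectors, one per feature word, e.g. rows of a
# word-by-word co-occurrence matrix.
word_vectors = {
    "river": [4.0, 3.0, 0.0, 0.0],
    "water": [3.0, 5.0, 0.0, 1.0],
    "money": [0.0, 0.0, 6.0, 4.0],
    "loan":  [0.0, 1.0, 4.0, 2.0],
}

def first_order(context_tokens):
    """One entry per feature: 1 if it occurs directly in the context, else 0."""
    present = set(context_tokens)
    return [1 if f in present else 0 for f in features]

def second_order(context_tokens):
    """Average the word vectors of the feature words found in the context.
    Each occurrence of a feature word contributes once to the average."""
    found = [word_vectors[t] for t in context_tokens if t in word_vectors]
    if not found:
        return [0.0] * len(features)
    return [sum(column) / len(found) for column in zip(*found)]

context = "the river carried muddy water past the bank".split()
order1 = first_order(context)
order2 = second_order(context)
print(order1)  # [1, 1, 0, 0]
print(order2)  # [3.5, 4.0, 0.0, 0.5]
```

Note that the second order vector has a non-zero entry in the last position 
even though "loan" never occurs in the context: it captures indirect 
associations carried by the word vectors.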

=head1 DOWNLOAD

=head2 SenseClusters

The most recent version of SenseClusters can be downloaded from -
http://sourceforge.net/projects/senseclusters/

You can also check out the current development version of SenseClusters from 
the SourceForge CVS repository by running the following command in your Unix 
terminal -

 cvs -d :pserver:anonymous@cvs.sf.net:/cvsroot/senseclusters checkout -P SC

You may browse the SenseClusters CVS repository ONLINE from - 
http://cvs.sourceforge.net/viewcvs.py/senseclusters/SC/

=head2 Other Packages Used by SenseClusters

=head3 NSP 

http://www.d.umn.edu/~tpederse/nsp.html

=head3 SenseTools 

http://www.d.umn.edu/~tpederse/sensetools.html

=head3 PDL 

http://search.cpan.org/dist/PDL/

=head3 CLUTO 

http://www-users.cs.umn.edu/~karypis/cluto/

=head3 SVDPACK 

http://netlib.org/svdpack/

=head3 Bit::Vector 

http://search.cpan.org/dist/Bit-Vector/

=head3 Sparse 

http://search.cpan.org/dist/Sparse/

=head3 Set::Scalar

http://search.cpan.org/dist/Set-Scalar/

=head1 DOCUMENTATION

SenseClusters' documentation is available ONLINE at -
http://senseclusters.sourceforge.net/Docs.html

For OFFLINE browsing, the Docs/ directory is provided in SenseClusters' main
package directory.

All programs have inline source code documentation written in POD format, 
which can be browsed from the command line as a man page. These man pages are
installed automatically as part of the installation. The documentation of any 
program can be viewed with the man command; for example, 'man bitsimat.pl' 
displays the man page of the program bitsimat.pl.

=head1 GETTING STARTED

You may first want to run the demo scripts in the Demos/ directory to 
get an idea of SenseClusters' usage and functionality.

SenseClusters can be used in two ways - 

=over 

=item * using the wrapper programs, which provide an easy-to-use interface and take you through the complete processing

=item * running the tools provided by SenseClusters separately, in conjunction with NSP, SVDPACK and CLUTO, for customized experiments

=back

The Demos/ directory contains scripts that illustrate both approaches. If you 
plan to run the programs separately, without the wrappers, you will find the 
flowcharts in the Docs/Flows/ directory very useful.

=over 4

=item Step 1. Input 

To use SenseClusters, you will first need to convert your data into Senseval-2
format. This distribution provides a preprocessing program, text2sval.pl, in 
Toolkit/preprocess/plain/ that converts data in plain text format (with a 
single text instance on each line) into Senseval-2 format.

=item Step 2. Using Wrappers 

This is the easiest and quickest way to use SenseClusters. Wrappers 
(setup.pl and discriminate.pl found in SenseClusters' main directory) 
integrate SenseClusters' code with other packages like NSP, SenseTools, CLUTO
or SVDPACK to provide an easy command line interface to carry out a complete 
discrimination experiment.

Once you have the data in Senseval-2 format, run setup.pl, which will 
tokenize and preprocess the data. This step is not required but is recommended, 
as it also performs some cleanup operations to make sure that the data is 
in the format required by the rest of the programs. Note that the
setup.pl script splits the given data based on the 'lexelt' tag values,
separating the instances that belong to different lexelt tags (target words).

Use the preprocessed and tokenized test instance file in Senseval-2 format 
as input to discriminate.pl, which discriminates the instances in that file. 
You can also optionally give this wrapper a separate training file in plain 
text format (the *training.count file, if setup.pl was run) as a source for 
selecting features.

=item Step 3. Using Toolkit

Check the Demo scripts in Demos/ and flowcharts in Docs/Flows/ to understand 
the sequence of commands to run a full experiment. Run the programs
with relevant options as shown in the demo scripts.

You may find the individual program descriptions in Toolkit/Toolkit.pod 
helpful while using the Toolkit. All programs in Toolkit/ are documented in 
POD format and their documentation can be browsed as a man page. 
Running any program with the --help option displays a quick summary of 
its options and usage.

=back
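
The kind of conversion described in Step 1 (one plain text instance per line 
turned into Senseval-2 style markup) can be sketched in Python. This is not 
text2sval.pl and does not claim to reproduce its output: the element names 
corpus, lexelt, instance and context follow the usual Senseval-2 lexical 
sample layout, while the function name, the id scheme and the default lexelt 
value are assumptions made for this example. Details such as answer tags or 
marking the target word are omitted; consult the text2sval.pl man page for 
the actual format.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def to_senseval2(lines, lexelt="word"):
    """Wrap each non-empty line as one <instance> inside a single <lexelt>.
    Instance ids are simply running numbers (an assumption of this sketch)."""
    out = ['<corpus lang="english">', f"<lexelt item={quoteattr(lexelt)}>"]
    count = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        out.append(f'<instance id="{count}">')
        out.append("<context>")
        out.append(escape(text))  # protect &, < and > in the raw text
        out.append("</context>")
        out.append("</instance>")
        count += 1
    out.append("</lexelt>")
    out.append("</corpus>")
    return "\n".join(out)

plain = [
    "He sat on the bank of the river.",
    "",
    "The bank approved the loan.",
]
doc = to_senseval2(plain, lexelt="bank")

# Parse the result back to confirm it is well-formed.
root = ET.fromstring(doc)
instances = root.findall("./lexelt/instance")
print(len(instances))  # 2
```

Grouping all instances under one lexelt element matches the note above that 
setup.pl later splits the data by 'lexelt' tag value.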

=head1 PACKAGE ORGANIZATION

After downloading and unpacking SenseClusters, you should find the following 
files/directories within SenseClusters' directory.

=over 4

=item * README.SC.pod

This file.

=item * INSTALL

An installation guide.

=item * setup.pl 

A preprocessing wrapper that provides an interface to SenseClusters' data 
preprocessing programs.

=item * discriminate.pl

Clusters the given text instances based on their contextual similarities.

=item * Demos/

A directory of scripts that demonstrate SenseClusters' usage and functionality.

=item * Toolkit/

A directory of Perl programs implemented and used by SenseClusters. Users
who are interested in using SenseClusters' tools individually, without the 
wrapper programs, are encouraged to browse through the Toolkit and 
Toolkit.pod.

=item * Docs/

A directory of SenseClusters' documentation in html format. 

Directory Docs/Flows/ contains flow diagrams that illustrate how to put 
together the programs provided in SenseClusters' Toolkit with other packages 
like NSP, SenseTools, SVDPACK and CLUTO to run experiments without wrappers.

=item * Testing/ 

A directory of test cases written as C-shell scripts that test whether the 
package is installed properly.

=item * Scripts/

A directory of C-Shell scripts used by SenseClusters. 

=item * Changes/

A directory of changelogs that document the changes and improvements made 
in each version.

=item * Makefile.PL 

Generates a Makefile on running 'perl Makefile.PL'.

=item * GPL.txt

A copy of the GNU General Public License, the terms under which SenseClusters
is distributed.

=item * FDL.txt

A copy of the GNU Free Documentation License, the terms under which the
documentation of SenseClusters is distributed.

=back

=head1 CONTACT US

SenseClusters is developed and maintained by Amruta Purandare, Anagha 
Kulkarni, and Ted Pedersen via SourceForge (http://sourceforge.net/). Please 
join our mailing lists to participate in package related discussions, to post 
your questions or bug reports, and to suggest enhancements to the package 
functionality.

To subscribe to the Users' mailing list, visit - 
http://lists.sourceforge.net/lists/listinfo/senseclusters-users

To join the news mailing list, visit - 
http://lists.sourceforge.net/lists/listinfo/senseclusters-news

The most recent version of SenseClusters can be downloaded from
http://senseclusters.sourceforge.net/

=head1 SEE ALSO

SenseClusters' ONLINE Documentation at 
http://senseclusters.sourceforge.net/Docs.html

=head1 AUTHORS

Amruta Purandare
University of Minnesota, Duluth
pura0010@d.umn.edu

Anagha Kulkarni
University of Minnesota, Duluth
kulka020@d.umn.edu

Ted Pedersen
University of Minnesota, Duluth
tpederse@d.umn.edu

=head1 ACKNOWLEDGMENTS

This work has been partially supported by a National Science Foundation 
Faculty Early CAREER Development award (Grant #0092784). 

We would also like to express our special thanks to 

=over

=item * Dr. George Karypis and his research group for developing CLUTO and GCLUTO 

=item * Michael Berry and co-developers of SVDPACK

=item * Christian Soeller and PDL developers' team

=item * Satanjeev Banerjee for developing the Ngram Statistics Package and 
SenseTools package

=back

And finally, to Guergana Savova for her valuable feedback, use and testing of 
SenseClusters' beta versions 0.37 and 0.39.

=head1 COPYRIGHT

Copyright (c) 2004,

Amruta Purandare, University of Minnesota, Duluth.
pura0010@d.umn.edu

Ted Pedersen, University of Minnesota, Duluth.
tpederse@umn.edu

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to

The Free Software Foundation, Inc.,
59 Temple Place - Suite 330,
Boston, MA  02111-1307, USA.

=cut


Changes:

Changes made in Sense-Clusters version 0.55 during version 0.57

Amruta Purandare amruta@cs.pitt.edu
Ted Pedersen tpederse@d.umn.edu
Anagha Kulkarni kulka020@d.umn.edu

1. removed the following programs from SenseClusters:

     Toolkit/cluster/agglom.pl --Ted
     Toolkit/preprocess/plain/shuffle.pl --Ted
     Toolkit/preprocess/plain/token.pl --Ted
     Toolkit/evaluate/prelabel.pl --Ted
     Toolkit/evaluate/tagger.pl --Anagha

2. removed all test cases found in /Testing for 1. --Ted and Anagha

3. Added pod-template.pl to Docs. --Amruta

4. Added script format_clusters.pl to Toolkit/evaluate to format Cluto's
   clustering solution file --Amruta

   Added following functionality to Toolkit/evaluate/format_clusters.pl :
   --Anagha

     1. Displays contexts along with the instance tags grouped by cluster id
     2. Displays Senseval2 format input file with cluster id assigned to
        each instance in the answer tag.

5. Added test cases for format_clusters.pl at
   Testing/evaluate/format_clusters/ --Anagha

6. Updated discriminate.pl to create prefix.clusters file that contains the
   formatted clusters --Amruta

7. Updated Makefile.PL to reflect new and deleted programs. --Ted

8. Updated discriminate.pl to support a --format option, to allow for
   adjustment of format when using larger amounts of data. --Ted

9. Updated mat2harbo.pl to display message indicating minimum size needed
   for las2.h parameter LMNTW. --Ted

10. Removed status check in discriminate.pl for las2, vcluster, and
    scluster. Return values don't appear reliable. --Ted

11. Updated discriminate.pl to have format_clusters.pl use --senseval2
    option by default. --Ted

12. Removed Toolkit.(dia|pdf|fig) - plan to incorporate info into
    Flows/flowchart when those are updated. --Ted

13. Removed /Docs/Toolkit from CVS. This is automatically generated and
    doesn't need to be a part of CVS. --Ted

14. Moved Todo.pod to /Docs/Todo.SC.pod, renamed README.Intro.pod as
    README.SC.pod. --Ted

15. Tried to clarify and simplify INSTALL instructions. --Ted

16. Modified label.pl to use CPAN module Algorithm::Munkres for solving the
    Assignment problem. --Anagha

17. Added a test case for label.pl which checks for 25x25 Sense X Cluster
    matrix. --Anagha

18. Modified Toolkit/evaluate/report.pl to make output formatting more
    spacious. --Anagha

19. Renamed various README's to README.foldername.pod and changed few from
    plain to pod documents. Added Acknowledgement to all README's --Anagha

21. Changed README.Toolkit.pod's format --Anagha

22. Moved create_doc.sh and traverse.sh from Scripts folder to a new folder
    HTML under Docs folder. Added README.HTML.pod for this folder. --Anagha

23. Modified create_doc.sh to create .html files for SC's root level .pl
    files in Docs/HTML directory and those for Toolkit folder in
    Toolkit_Docs folder under Docs/HTML directory --Anagha

24. Renamed GPL and FDL as GPL.txt and FDL.txt --Anagha

25. Update Makefile.PL to include new and renamed pod files and to reflect
    change of location for create_doc.sh and traverse.sh --Anagha

26. Changed test scripts that check erroneous conditions for mat2harbo.pl.
    The modified test scripts check for particular error message in the
    output file. --Anagha

27. Changed test-A3f.sh test-A1g.reqd, test-A1h.reqd and test-A2.reqd to
    handle cross-platform compatibility of test scripts. --Anagha

28. Renamed Changelogs as Changes --Ted

(Changelog-v0.55to0.57 Last Updated on 12/11/2004 by Ted)