SenseClusters

File Release Notes and Changelog

Release Name: SenseClusters-v0.67

Notes:
NAME
    SenseClusters

SYNOPSIS
    SenseClusters is a complete Word Sense Discrimination system that takes
    users from the preprocessing of raw text through to the discrimination
    itself: selecting the most discriminating features, building context
    representations and clustering them, followed by extensive analysis and
    performance evaluation.

INTRODUCTION
    Words in natural language are often ambiguous, and their meaning changes
    as the context around them changes. Word Sense Discrimination attempts
    to distinguish the possible meanings of a word by identifying which
    occurrences of that ambiguous word are likely to carry the same or a
    similar meaning. Since semantically related word meanings are often
    used in similar contexts, SenseClusters achieves sense discrimination by
    grouping together the contexts of the target words (the words being
    discriminated) that are strongly similar to each other.

    In short, SenseClusters clusters the given text instances (an instance
    is a group of one or more sentences that form a context of the target
    word) so that instances in the same cluster are contextually more
    similar to each other than to instances in other clusters, and are thus
    more likely to use the same meaning of the target word. Each cluster
    therefore represents a single word meaning shared by the instances
    belonging to that cluster.
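
    The idea of grouping contextually similar instances can be pictured
    with a small Python sketch. It is only an illustration and not
    SenseClusters' own code (which is written in Perl and can delegate
    clustering to CLUTO): the stop list, the function names and the sample
    contexts are all made up for this example, and plain word counts with
    cosine similarity and average link merging stand in for the richer
    options described later in this document.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical stop list, only for this example.
STOP_WORDS = {"the", "in", "on", "and", "was", "after", "near", "a", "of"}


def context_vector(text):
    """Represent a context as counts of its lowercased non-stop words."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_link(contexts, k):
    """Merge the two most similar clusters until only k clusters remain."""
    vectors = [context_vector(c) for c in contexts]
    clusters = [[i] for i in range(len(contexts))]

    def similarity(a, b):
        # Average pairwise similarity between the members of two clusters.
        pairs = [(i, j) for i in a for j in b]
        return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > k:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


contexts = [
    "deposit money in the bank account",
    "the bank approved the loan and the account",
    "fishing on the river bank near the water",
    "the river bank was muddy after the water rose",
]
print(average_link(contexts, 2))  # prints [[0, 1], [2, 3]]
```

    The two "money" contexts and the two "river" contexts of the target
    word "bank" end up in separate clusters, each standing for one sense.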

APPLICATIONS
    SenseClusters can be used for various Natural Language Processing tasks
    like information retrieval, document clustering/indexing, synonymy
    identification, word sense disambiguation, text/topic summarization,
    text classification and even in special applications like plagiarism
    detection or email sorting/indexing.

    Broadly speaking, SenseClusters can be used for any task that requires
    clustering of contextually similar text units.

UNIQUE FEATURES
    * Efficiency and Choices
        SenseClusters integrates specialised tools such as the Ngram
        Statistics Package (NSP), SVDPACK, CLUTO and GCLUTO to provide a
        variety of choices and high efficiency at each processing step:
        feature selection, dimensionality reduction, clustering,
        post-clustering analysis and visualization.

    * Features
        SenseClusters supports a number of lexical features, such as
        unigrams, bigrams and co-occurrences, via NSP. Powerful features can
        be selected by performing statistical tests of association (such as
        log-likelihood, Fisher's exact test or pointwise mutual information)
        and discarding features with low association scores.

    * LSA Model
        SenseClusters simulates a cognitive/psychological learning model,
        Latent Semantic Analysis (LSA), by converting a word level feature
        space into a concept level semantic space that is meant to reduce
        errors due to polysemy and synonymy in natural languages. It
        represents each feature word and each context of the target word as
        a vector in a high dimensional feature space and supports
        dimensionality reduction by Singular Value Decomposition (SVD)
        through an interface to SVDPACK.

        SenseClusters allows two distinct context vector representations,
        namely first order (or order1) vectors and second order (or order2)
        vectors. A first order context vector represents a context by the
        features that occur directly in it, while a second order context
        vector is the average of the word vectors of those contextual
        features.

    * Clustering
        SenseClusters implements three agglomerative algorithms, namely
        single, complete and average link clustering. For more extensive
        clustering options, SenseClusters recommends and seamlessly supports
        a specialised clustering toolkit, CLUTO. CLUTO implements a number
        of clustering algorithms and criterion functions, as well as cluster
        analysis and visualization techniques. It reports the inter-cluster
        and intra-cluster similarity, the standard deviation and the most
        discriminating features characterizing each cluster, and offers a
        dendrogram tree view and a 3D mountain view of the clusters, all of
        which may be helpful and interesting to SenseClusters users.

    * Evaluation
        In addition to the analysis and visualization techniques indirectly
        supported via CLUTO, SenseClusters also reports the accuracy of
        discrimination in terms of precision, recall and F-measure when the
        true classification of the discriminated instances is known.
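
    The difference between the first order and second order context
    representations listed under "LSA Model" above can be sketched in a few
    lines of Python. This is a simplified illustration under stated
    assumptions, not the package's implementation: the function names are
    hypothetical, and the word vectors here are plain same-context
    co-occurrence counts, whereas SenseClusters derives them from NSP
    features and can reduce them with SVD.

```python
from collections import Counter, defaultdict


def word_vectors(training_contexts):
    """Build a co-occurrence count vector for every word in the training text."""
    vectors = defaultdict(Counter)
    for context in training_contexts:
        words = context.lower().split()
        for w in words:
            for other in words:
                if other != w:
                    vectors[w][other] += 1
    return vectors


def order1_vector(context, features):
    """First order: count the features that occur directly in the context."""
    counts = Counter(context.lower().split())
    return {f: counts[f] for f in features if counts[f]}


def order2_vector(context, vectors):
    """Second order: average the word vectors of the context's known words."""
    known = [w for w in context.lower().split() if w in vectors]
    total = Counter()
    for w in known:
        total.update(vectors[w])
    return {f: c / len(known) for f, c in total.items()} if known else {}


training = ["bank loan money", "money interest loan"]
vectors = word_vectors(training)
features = ["bank", "loan", "money", "interest"]

print(order1_vector("loan interest", features))
print(order2_vector("loan interest", vectors))
```

    The first order vector only records "loan" and "interest", the words
    actually present, while the second order vector also gives weight to
    "money" and "bank" because they co-occur with those words in the
    training text.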

    In addition to the above, SenseClusters is an open source software
    project that is freely distributed under the GNU General Public License
    (GPL). The packages used by SenseClusters in its processing, such as
    NSP, SVDPACK and CLUTO, are also freely distributed and easy to
    install.

DOWNLOAD
  SenseClusters
    The most recent version of SenseClusters can be downloaded from -
    http://sourceforge.net/projects/senseclusters/

    You can also check out the current development version of SenseClusters
    from the SourceForge CVS repository by running the following command in
    a Unix terminal -

     cvs -d :pserver:anonymous@cvs.sf.net:/cvsroot/senseclusters checkout -P SC

    You may browse the SenseClusters CVS repository ONLINE from -
    http://cvs.sourceforge.net/viewcvs.py/senseclusters/SC/

  Other Packages Used by SenseClusters :
   NSP
    http://www.d.umn.edu/~tpederse/nsp.html

   SenseTools
    http://www.d.umn.edu/~tpederse/sensetools.html

   PDL
    http://search.cpan.org/dist/PDL/

   CLUTO
    http://www-users.cs.umn.edu/~karypis/cluto/

   SVDPACK
    http://netlib.org/svdpack/

   Bit::Vector
    http://search.cpan.org/dist/Bit-Vector/

   Sparse
    http://search.cpan.org/dist/Sparse/

   Set::Scalar
    http://search.cpan.org/dist/Set-Scalar/

DOCUMENTATION
    SenseClusters' documentation is available ONLINE at -
    http://senseclusters.sourceforge.net/Docs.html

    For OFFLINE browsing, directory Docs/ is provided in SenseClusters' main
    package directory.

    All programs have inline source code documentation written in pod
    style, which can be browsed from the command line as a man page. These
    man pages are installed automatically as part of the installation. The
    documentation of any program can be viewed with the man command; for
    example, 'man bitsimat.pl' displays the man page of the program
    bitsimat.pl.

GETTING STARTED
    You may first like to run the demo scripts in the Demos/ directory to
    get an idea of SenseClusters' usage and functionality.

    SenseClusters can be used in two ways -

    * using the wrapper programs that provide an easy to use interface and
    take you through the complete processing
    * running the tools provided by SenseClusters in conjunction with NSP,
    SVDPACK and CLUTO separately for customized experiments.

    Directory Demos/ contains scripts that illustrate both these ways. If
    you are planning to run the programs separately without wrappers, you
    will find the flowcharts provided in Docs/Flows/ directory very useful.

    Step 1. Input
        To use SenseClusters, you will first need to convert your data into
        Senseval-2 format. We provide with this distribution a
        pre-processing program text2sval.pl in Toolkit/preprocess/plain/
        that converts data in plain text format (with a single text instance
        on each line) into Senseval-2 format.

    Step 2. Using Wrappers
        This is the easiest and quickest way to use SenseClusters. Wrappers
        (setup.pl and discriminate.pl found in SenseClusters' main
        directory) integrate SenseClusters' code with other packages like
        NSP, SenseTools, CLUTO or SVDPACK to provide an easy command line
        interface to carry out a complete discrimination experiment.

        Once you have the data in Senseval-2 format, run setup.pl, which
        will tokenize and preprocess the data. This step is not required but
        is recommended, as it also performs some cleanup operations to make
        sure that the data is in the format required by the rest of the
        programs. Note that setup.pl splits the given data on the 'lexelt'
        tag values, separating the instances that belong to different
        lexelt tags (target words).

        Use the preprocessed and tokenized test instance file in Senseval-2
        format as input to discriminate.pl to discriminate the instances in
        that file. You can also optionally give this wrapper a separate
        training file in plain text format (the *training.count file, if
        setup.pl was run) as a source for selecting features.

    Step 3. Using Toolkit
        Check the Demo scripts in Demos/ and flowcharts in Docs/Flows/ to
        understand the sequence of commands to run a full experiment. Run
        the programs with relevant options as shown in the demo scripts.

        You may find the individual program descriptions in
        Toolkit/Toolkit.pod helpful while using the Toolkit. All programs in
        Toolkit/ are documented in pod format and their documentation can be
        browsed as a man page. Running any program with the --help option
        displays a quick summary of its options and usage.
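
    As a rough picture of the Step 1 conversion, the Python sketch below
    wraps plain text (one instance per line) in Senseval-2 style markup.
    It is not text2sval.pl and its output may differ in detail from what
    that program writes: the function name, the default lexelt value and
    the numeric instance ids are assumptions made for this example, and the
    sketch does not mark the target word inside each context. Use
    text2sval.pl itself to prepare real input.

```python
from xml.sax.saxutils import escape, quoteattr


def to_senseval2(lines, lexelt="word"):
    """Wrap one-instance-per-line text in Senseval-2 style markup."""
    out = ['<corpus lang="english">', "<lexelt item=%s>" % quoteattr(lexelt)]
    for number, line in enumerate(lines):
        text = line.strip()
        if not text:
            continue  # skip blank lines rather than emit empty instances
        out.append('<instance id="%d">' % number)
        out.append("<context>")
        out.append(escape(text))  # escape &, < and > in the raw text
        out.append("</context>")
        out.append("</instance>")
    out.append("</lexelt>")
    out.append("</corpus>")
    return "\n".join(out)


sample = ["the bank approved a loan", "", "fish & chips by the river bank"]
print(to_senseval2(sample, lexelt="bank"))
```

    Each non-empty input line becomes one instance under a single lexelt,
    which matches the description of setup.pl splitting data by 'lexelt'
    tag values in Step 2.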

PACKAGE ORGANIZATION
    After downloading and unpacking SenseClusters, you should find the
    following files/directories within SenseClusters' directory.

    * README.SC.pod
        This file.

    * INSTALL
        An installation guide.

    * setup.pl
        A preprocessing wrapper that provides an interface to SenseClusters'
        data preprocessing programs.

    * discriminate.pl
        Clusters the given text instances based on their contextual
        similarities.

    * Demos/
        A directory of scripts that demonstrate SenseClusters' usage and
        functionality.

    * Toolkit/
        A directory of Perl programs implemented and used by SenseClusters.
        Users who are interested in using SenseClusters' tools individually,
        without the wrapper programs, are encouraged to browse through the
        Toolkit and Toolkit.pod.

    * Docs/
        A directory of SenseClusters' documentation in html format.

        Directory Docs/Flows/ contains flow diagrams that illustrate how to
        put together the programs provided in SenseClusters' Toolkit with
        other packages like NSP, SenseTools, SVDPACK and CLUTO to run
        experiments without wrappers.

    * Testing/
        A directory of test cases written as C-shell scripts that check
        whether the package is installed properly.

    * Scripts/
        A directory of C-Shell scripts used by SenseClusters.

    * Web/
        Contains an easy to use and install web interface for SenseClusters.

    * Changes/
        A directory of changelogs that document the changes and improvements
        made in each version.

    * Makefile.PL
        Generates a Makefile on running 'perl Makefile.PL'.

    * GPL.txt
        A copy of the GNU General Public License, the terms under which
        SenseClusters is distributed.

    * FDL.txt
        A copy of the GNU Free Documentation License, the terms under which
        the documentation of SenseClusters is distributed.

CONTACT US
    SenseClusters is developed and maintained by Amruta Purandare, Anagha
    Kulkarni, and Ted Pedersen via Sourceforge (http://sourceforge.net/).
    Please join our mailing lists to participate in the package related
    discussions, to post your questions or bugs and also to suggest
    enhancements to the package functionality.

    To subscribe to the Users' mailing list, visit -
    http://lists.sourceforge.net/lists/listinfo/senseclusters-users

    To join the News mailing list, visit -
    http://lists.sourceforge.net/lists/listinfo/senseclusters-news

    The most recent version of SenseClusters can be downloaded from
    http://senseclusters.sourceforge.net/

SEE ALSO
    SenseClusters' ONLINE Documentation at
    http://senseclusters.sourceforge.net/Docs.html

AUTHORS
     Amruta Purandare 
     University of Pittsburgh
     amruta@cs.pitt.edu
     http://www.cs.pitt.edu/~amruta/

     Anagha Kulkarni
     University of Minnesota, Duluth
     kulka020@d.umn.edu
     http://www.d.umn.edu/~kulka020/

     Ted Pedersen
     University of Minnesota, Duluth
     tpederse@d.umn.edu
     http://www.d.umn.edu/~tpederse/

ACKNOWLEDGMENTS
    This work has been partially supported by a National Science Foundation
    Faculty Early CAREER Development award (Grant #0092784).

    We would also like to express our special thanks to

    * Dr. George Karypis and his research group for developing CLUTO and
    GCLUTO
    * Michael Berry and co-developers of SVDPACK
    * Christian Soeller and PDL developers' team
    * Satanjeev Banerjee for developing the Ngram Statistics Package and
    SenseTools package

    And finally, to Guergana Savova for her valuable feedback, use and
    testing of SenseClusters, starting with beta versions 0.37 and 0.39 and
    continuing to the present.

COPYRIGHT
    Copyright (c) 2003-2005

    Amruta Purandare, University of Pittsburgh amruta@cs.pitt.edu

    Ted Pedersen, University of Minnesota, Duluth. tpederse@umn.edu

    This program is free software; you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
    Free Software Foundation; either version 2 of the License, or (at your
    option) any later version.

    This program is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
    Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program; if not, write to

    The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
    MA 02111-1307, USA.



Changes:
Changes made in Sense-Clusters version 0.65 during version 0.67

Anagha Kulkarni kulka020@d.umn.edu
Amruta Purandare amruta@cs.pitt.edu
Ted Pedersen tpederse@d.umn.edu

1. Renamed the "co-occurrence" feature type for first order context
   representation to "target co-occurrence" in discriminate.pl -Anagha

2. Made "bigram" as the default feature type for both order1 and order2
   context representation. -Anagha

3. Added the new feature type of normal "co-occurrences" to order1
   representation -Anagha

4. Added the new feature type of "target co-occurrences" to order2
   representation -Anagha

5. Updated the web-interface. Now it starts with a start-up page where one
   can select the type of experiment (With Head, Headless or Word
   Clustering) and then the applicable options are only shown on the
   following screens. -Anagha

(Changelog-v0.65to0.67 Last Updated on 06/04/2005 by Anagha)