SenseClusters

File Release Notes and Changelog

Release Name: SenseClusters-v0.87

Notes:
NAME
    SenseClusters

SYNOPSIS
    SenseClusters is a suite of Perl programs that together provide
    unsupervised clustering of similar contexts. It is a complete system
    that takes users from preprocessing of raw text through discrimination,
    which involves selecting the most discriminating features, building
    context representations and clustering, followed by extensive analysis
    and performance evaluation.

INTRODUCTION
    The problem of clustering a set of contexts can be approached in
    various ways. Supervised approaches learn the necessary knowledge from
    training data before clustering the test data. SenseClusters adopts an
    unsupervised approach in which the clustering decisions are based solely
    on the knowledge embedded in the test data itself. The contextual
    similarity between the contexts is used to perform the clustering.
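
    As a rough illustration of what "contextual similarity" can mean, the
    sketch below compares contexts by the cosine of their word count
    vectors. It is a simplification for illustration only: the function
    names and sample sentences are made up, and SenseClusters itself works
    with selected features rather than every word in a context.

```python
import math
from collections import Counter


def context_vector(text):
    """Represent a context as a sparse vector of word counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up contexts: two about finance, one about a river.
contexts = [
    "the bank approved the loan",
    "the bank raised its loan rates",
    "we fished from the river bank",
]
vectors = [context_vector(c) for c in contexts]
sim_finance = cosine(vectors[0], vectors[1])
sim_mixed = cosine(vectors[0], vectors[2])
print(sim_finance, sim_mixed)
```

    The two finance contexts share more words with each other than either
    shares with the river context, so a clustering method driven by this
    similarity would tend to group them together.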

    SenseClusters supports various degrees of context clustering, where the
    degree of context clustering refers to the granularity of the context.
    In other words, the contexts to be clustered can be single words, as in
    "Word-Clustering", a few words surrounding a particular head word, as
    in "With Head Clustering", or complete documents, as in "Headless
    Clustering".

    Though "Word-Clustering" is currently in a very nascent state, one can
    expect the output to consist of various groups, each containing words
    that exhibit a particular kind of relationship with all the other words
    in that group. "Word-Clustering" can be viewed as a potential synonym
    and antonym clustering method.

    "With Head Clustering" discriminates among contexts containing a
    particular ambiguous word (the head word). Each identified cluster
    should therefore contain contexts referring to a particular underlying
    identity. This clustering type translates readily into applications
    such as proper name discrimination and word sense disambiguation.

    "Headless Clustering" is performed on contexts without any particular
    head word, hence the name. In this method the contents of the complete
    document usually drive the clustering decision. This clustering method
    is typically used in applications such as email clustering, document
    clustering and text classification.

    Starting with version 0.83, SenseClusters can also predict the number
    of clusters (k) that should be used for a given data set. We refer to
    this process of predicting k as cluster stopping. Four cluster stopping
    measures have been introduced: PK1, PK2, PK3 and the Adapted Gap
    Statistic. Each tries to predict k using clustering criterion
    functions.
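
    To give a feel for how a criterion function can drive cluster
    stopping, the sketch below computes three scores from criterion
    function values obtained at successive values of k. The formulas are
    our reading of how PK1, PK2 and PK3 are usually described (a
    standardised score, a ratio to the previous k, and a ratio to the two
    neighbouring k values); they are assumptions and are not taken from
    clusterstopping.pl. The input values are invented, and the rule that
    turns the scores into a final k is deliberately left out.

```python
import statistics


def pk1(crfun):
    """Standardise each criterion value: (value - mean) / standard deviation."""
    values = list(crfun.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population deviation; an assumption
    if sd == 0:
        return {k: 0.0 for k in crfun}
    return {k: (v - mean) / sd for k, v in crfun.items()}


def pk2(crfun):
    """Ratio of the criterion value at k to the value at k - 1."""
    return {k: crfun[k] / crfun[k - 1] for k in crfun if k - 1 in crfun}


def pk3(crfun):
    """Twice the value at k over the sum of the values at k - 1 and k + 1."""
    return {
        k: 2 * crfun[k] / (crfun[k - 1] + crfun[k + 1])
        for k in crfun
        if k - 1 in crfun and k + 1 in crfun
    }


# Hypothetical criterion function values for k = 1..5 (made up).
crfun = {1: 10.0, 2: 16.0, 3: 19.0, 4: 20.0, 5: 20.5}
for name, scores in (("PK1", pk1(crfun)), ("PK2", pk2(crfun)), ("PK3", pk3(crfun))):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

    In this made-up series the criterion function improves quickly at
    first and then flattens out, so the ratio-based scores approach 1 as
    k grows. Measures of this kind look for the point where adding more
    clusters stops paying off.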

APPLICATIONS
    SenseClusters can be used for various Natural Language Processing tasks
    like information retrieval, document clustering/indexing, synonymy
    identification, word sense disambiguation, text/topic summarization,
    text classification and even in special applications like plagiarism
    detection or email sorting/indexing.

    Broadly speaking, SenseClusters can be used for any task that requires
    clustering of contextually similar text units.

UNIQUE FEATURES
    * Efficiency and Choices
        SenseClusters integrates specialised tools such as the Ngram
        Statistics Package (NSP), SVDPACK and CLUTO to provide a variety of
        choices and high efficiency at each step of its processing: feature
        selection, dimensionality reduction, clustering, post-clustering
        analysis and visualization.

    * Features
        SenseClusters supports a number of lexical features, such as
        unigrams, bigrams, co-occurrences and target co-occurrences, via
        NSP. Powerful features can be selected by performing statistical
        tests of association (such as log-likelihood, Fisher's exact test
        or pointwise mutual information) and by discarding features with
        low association scores.

    * LSA Model
        SenseClusters simulates a cognitive/psychological learning model,
        Latent Semantic Analysis (LSA), by converting a word level feature
        space into a concept level semantic space, which helps avoid errors
        due to polysemy and synonymy in natural languages. It represents
        each feature word and each context of the target word as a vector
        in a high dimensional feature space and supports the dimensionality
        reduction technique Singular Value Decomposition (SVD) by providing
        an interface to SVDPACK.

        SenseClusters allows two distinct context vector representations
        namely the first order (or order1) vectors and second order (or
        order2) vectors. The first order context vectors represent context
        by a vector of features that directly occur in the context while the
        second order context vectors are the average of the word vectors of
        these contextual features.

    * Clustering
        For extensive clustering options, SenseClusters supports a
        specialised clustering toolkit, CLUTO. CLUTO implements a number of
        clustering algorithms and criterion functions as well as cluster
        analysis and visualization techniques. It reports to the user the
        inter-cluster and intra-cluster similarity, the standard deviation,
        the most discriminating features characterizing each cluster, and a
        dendrogram tree view, all of which may be of interest to
        SenseClusters users.

    * Evaluation
        In addition to the analysis and visualization techniques indirectly
        supported via CLUTO, SenseClusters also reports the accuracy of
        discrimination in terms of precision, recall and F-measure values
        when the true classification of the discriminated instances is
        known.
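
    The evaluation described under "Evaluation" above can be sketched as
    follows. This is a minimal illustration under stated assumptions, not
    the SenseClusters implementation: clusters are mapped one-to-one onto
    the known classes so as to maximise agreement (found here by brute
    force, which is only practical for a handful of clusters), precision
    is computed over the instances that were actually clustered, and
    recall over all instances. The function name and sample data are
    hypothetical.

```python
from itertools import permutations


def evaluate(cluster_ids, sense_ids):
    """Return (precision, recall, f_measure) for a clustering.

    cluster_ids[i] is the cluster of instance i, or None if the instance
    was left unclustered; sense_ids[i] is its true class.
    """
    clusters = sorted({c for c in cluster_ids if c is not None})
    senses = sorted(set(sense_ids))
    # Pad with None so that every cluster can be given a distinct label.
    labels = senses + [None] * max(0, len(clusters) - len(senses))

    best_correct = 0
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(
            1
            for c, s in zip(cluster_ids, sense_ids)
            if c is not None and mapping[c] == s
        )
        best_correct = max(best_correct, correct)

    clustered = sum(1 for c in cluster_ids if c is not None)
    precision = best_correct / clustered if clustered else 0.0
    recall = best_correct / len(sense_ids) if sense_ids else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Six made-up instances; the last one was not assigned to any cluster.
cluster_ids = [0, 0, 0, 1, 1, None]
sense_ids = ["a", "a", "b", "b", "b", "a"]
precision, recall, f_measure = evaluate(cluster_ids, sense_ids)
print(precision, recall, f_measure)
```

    Here the best mapping labels cluster 0 as "a" and cluster 1 as "b",
    which gets 4 instances right: precision is 4 of 5 clustered instances
    and recall is 4 of 6 instances overall.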

    In addition to the above, SenseClusters is an open source software
    project that is freely distributed under the GNU General Public License
    (GPL). The packages used by SenseClusters in its processing, such as
    NSP, SVDPACK and CLUTO, are also freely distributed and easy to install.
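
    To make the two context representations described under "LSA Model"
    above more concrete, the sketch below builds a first order vector
    (which selected features occur in the context) and a second order
    vector (the average of the word vectors of the features that occur).
    The feature list, the word vectors and the function names are invented
    for illustration; in SenseClusters these come from feature selection
    and co-occurrence data, and the first order vectors need not be
    binary.

```python
def first_order(context_tokens, features):
    """Binary vector marking which selected features occur in the context."""
    present = set(context_tokens)
    return [1 if f in present else 0 for f in features]


def second_order(context_tokens, word_vectors):
    """Average the word vectors of the feature words found in the context.

    Returns None when no word in the context has a word vector.
    """
    rows = [word_vectors[t] for t in context_tokens if t in word_vectors]
    if not rows:
        return None
    dims = len(rows[0])
    return [sum(r[d] for r in rows) / len(rows) for d in range(dims)]


# Hypothetical selected features and word vectors. The two dimensions of
# each word vector could stand for co-occurrence with "money" and "water".
features = ["loan", "interest", "river"]
word_vectors = {
    "loan": [4.0, 0.0],
    "interest": [2.0, 0.0],
    "river": [0.0, 6.0],
}

tokens = "the bank raised interest on the loan".split()
order1 = first_order(tokens, features)
order2 = second_order(tokens, word_vectors)
print(order1)
print(order2)
```

    The first order vector has one dimension per selected feature, whereas
    the second order vector lives in the space of the word vectors, so two
    contexts that share no features directly can still end up close to
    each other if their features co-occur with similar words.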

DOWNLOAD
  SenseClusters
    The most recent version of SenseClusters can be downloaded from -
    http://sourceforge.net/projects/senseclusters/

    You can also check out the current development version of SenseClusters
    from the SourceForge CVS repository by running the following command in
    your Unix terminal -

     cvs -d :pserver:anonymous@cvs.sf.net:/cvsroot/senseclusters checkout -P SC

    You may browse the SenseClusters CVS repository ONLINE from -
    http://cvs.sourceforge.net/viewcvs.py/senseclusters/SC/

  Other Packages Used by SenseClusters :
   NSP
    http://www.d.umn.edu/~tpederse/nsp.html

   PDL
    http://search.cpan.org/dist/PDL/

   CLUTO
    http://www-users.cs.umn.edu/~karypis/cluto/

   SVDPACK
    http://netlib.org/svdpack/

   Bit::Vector
    http://search.cpan.org/dist/Bit-Vector/

   Sparse
    http://search.cpan.org/dist/Sparse/

   Set::Scalar
    http://search.cpan.org/dist/Set-Scalar/

   Algorithm::Munkres
    http://search.cpan.org/dist/Algorithm-Munkres/

   Algorithm::RandomMatrixGeneration
    http://search.cpan.org/dist/Algorithm-RandomMatrixGeneration/

DOCUMENTATION
    SenseClusters' documentation is available ONLINE at -
    http://senseclusters.sourceforge.net/SenseClusters-Code-README.html

    For OFFLINE browsing, the directory Docs/HTML is provided in
    SenseClusters' main package directory; the file
    SenseClusters-Code-README.html can be found there and browsed locally.

    All programs have inline source code documentation written in pod style,
    which can be browsed from the command line as a man page. These man
    pages are installed automatically as part of the installation.
    Documentation of any program can be viewed with the man command, e.g.
    'man bitsimat.pl' will display the man page of the program bitsimat.pl.

GETTING STARTED
    You may first want to run the demo scripts in the Demos/ directory to
    get an idea of SenseClusters' usage and functionality.

    SenseClusters can be used in two ways -

    * using the wrapper programs that provide an easy to use interface and
    take you through the complete processing
    * running the tools provided by SenseClusters in conjunction with NSP,
    SVDPACK and CLUTO separately for customized experiments.

    Directory Demos/ contains scripts that illustrate both these ways. If
    you are planning to run the programs separately without wrappers, you
    will find the flowcharts provided in Docs/Flows/ directory very useful.

    Step 1. Input
        To use SenseClusters, you will first need to convert your data into
        Senseval-2 format. We provide with this distribution a
        pre-processing program text2sval.pl in Toolkit/preprocess/plain/
        that converts data in plain text format (with a single text instance
        on each line) into Senseval-2 format.
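
        As a rough picture of what this conversion involves, the sketch
        below wraps each line of plain text in an instance element. It is
        not text2sval.pl: the element and attribute names follow our
        understanding of the Senseval-2 lexical sample layout, while the
        default values "LEXELT" and "NOTAG", the instance numbering and
        the XML escaping are assumptions made for this example. Check the
        output of text2sval.pl itself for the exact format.

```python
from xml.sax.saxutils import escape, quoteattr


def to_senseval2(lines, lexelt="LEXELT", sense="NOTAG"):
    """Wrap one-instance-per-line plain text in Senseval-2 style markup."""
    out = ['<corpus lang="english">', "<lexelt item=%s>" % quoteattr(lexelt)]
    instance_id = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines instead of emitting empty instances
        attr_id = quoteattr(str(instance_id))
        out.append("<instance id=%s>" % attr_id)
        out.append(
            "<answer instance=%s senseid=%s/>" % (attr_id, quoteattr(sense))
        )
        out.append("<context>")
        out.append(escape(text))
        out.append("</context>")
        out.append("</instance>")
        instance_id += 1
    out.append("</lexelt>")
    out.append("</corpus>")
    return "\n".join(out)


# Made-up input: one text instance per line, with a blank line to skip.
sample = ["The bank approved the loan.", "", "Tom & Ann fished from the bank."]
result = to_senseval2(sample)
print(result)
```

        Each non-empty input line becomes one instance with its own id, so
        the later stages can refer to individual contexts.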

    Step 2. Using Wrapper
        This is the easiest and quickest way to use SenseClusters. The
        wrapper script (discriminate.pl found in SenseClusters' main
        directory) integrates SenseClusters' code with other packages like
        NSP, CLUTO or SVDPACK to provide an easy command line interface to
        carry out a complete discrimination experiment.

        Once you have the data in Senseval-2 format, you are ready to use
        the wrapper, discriminate.pl, and try its various options. (A
        detailed explanation of discriminate.pl can be viewed using
        "perldoc discriminate.pl".)

        You can also optionally provide this wrapper with a separate
        training file in plain text format (which is not clustered) as a
        source for selecting features.

    Step 3. Using Toolkit
        Check the Demo scripts in Demos/ and flowcharts in Docs/Flows/ to
        understand the sequence of commands to run a full experiment. Run
        the programs with relevant options as shown in the demo scripts.

        Short descriptions and organization of various scripts under the
        Toolkit directory can be found in Toolkit/README.Toolkit.pod. All
        programs in Toolkit/ are documented in pod format and their
        documentation can be browsed as a man page. Running any program with
        the --help option displays a quick summary of its options and
        usage.

PACKAGE ORGANIZATION
    After downloading and unpacking SenseClusters, you should find the
    following files/directories within SenseClusters' directory.

    * README.SC.pod
        This file.

    * INSTALL
        An installation guide.

    * discriminate.pl
        Clusters the given text instances based on their contextual
        similarities.

    * Demos/
        A directory of scripts that demonstrate SenseClusters' usage and
        functionality.

    * Toolkit/
        A directory of Perl programs implemented and used by SenseClusters.
        Users who are interested in using SenseClusters' tools individually,
        without the wrapper programs, are encouraged to browse through the
        Toolkit and Toolkit.pod.

    * Docs/
        A directory of SenseClusters' documentation in html format.

        Directory Docs/Flows/ contains flow diagrams that illustrate how to
        put together the programs provided in SenseClusters' Toolkit with
        other packages like NSP, SVDPACK and CLUTO to run experiments
        without wrappers.

    * Testing/
        A directory of test cases written as C-shell scripts that will test
        if the package is installed properly or not.

    * Web/
        Contains an easy to use and install web interface for SenseClusters.

    * Changes/
        A directory of changelogs that document the changes and improvements
        made in each version.

    * Makefile.PL
        Generates a Makefile on running 'perl Makefile.PL'.

    * GPL.txt
        A copy of the GNU General Public License, the terms under which
        SenseClusters is distributed.

    * FDL.txt
        A copy of the GNU Free Documentation License, the terms under which
        the documentation of SenseClusters is distributed.

CONTACT US
    SenseClusters was originally developed and maintained by Amruta
    Purandare and Ted Pedersen from September 2002 until August 2004. Since
    then it has been developed and maintained by Anagha Kulkarni and Ted
    Pedersen.

    Please join our mailing lists to participate in the package related
    discussions, to post your questions or bugs and also to suggest
    enhancements to the package functionality.

    To subscribe to the user's mailing list, visit :
    http://lists.sourceforge.net/lists/listinfo/senseclusters-users

    A low volume news list can be joined by visiting:
    http://lists.sourceforge.net/lists/listinfo/senseclusters-news

    To subscribe to the developer's mailing list, visit :
    http://lists.sourceforge.net/lists/listinfo/senseclusters-developers

    The most recent version of SenseClusters can be downloaded from :
    http://senseclusters.sourceforge.net/

SEE ALSO
    SenseClusters' ONLINE Documentation at
    http://senseclusters.sourceforge.net/SenseClusters-Code-README.html

AUTHORS
     Ted Pedersen
     University of Minnesota, Duluth
     tpederse@d.umn.edu
     http://www.d.umn.edu/~tpederse/

     Amruta Purandare
     University of Pittsburgh
     amruta@cs.pitt.edu
     http://www.cs.pitt.edu/~amruta/

     Anagha Kulkarni
     University of Minnesota, Duluth
     kulka020@d.umn.edu
     http://www.d.umn.edu/~kulka020/

ACKNOWLEDGMENTS
    This work has been partially supported by a National Science Foundation
    Faculty Early CAREER Development award (Grant #0092784).

    We would also like to express our special thanks to

    * Dr. George Karypis and his research group for developing CLUTO
    * Michael Berry and co-developers of SVDPACK
    * Christian Soeller and PDL developers' team
    * Satanjeev Banerjee for developing the Ngram Statistics Package

COPYRIGHT
    Copyright (c) 2003-2006

     Ted Pedersen, University of Minnesota, Duluth
     tpederse@d.umn.edu

     Amruta Purandare, University of Pittsburgh
     amruta@cs.pitt.edu

     Anagha Kulkarni, University of Minnesota, Duluth
     kulka020@d.umn.edu

    This program is free software; you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
    Free Software Foundation; either version 2 of the License, or (at your
    option) any later version.

    This program is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
    Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program; if not, write to

    The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
    MA 02111-1307, USA.




Changes:
    Changes made in SenseClusters version 0.85 during version 0.87

    Ted Pedersen
    tpederse@d.umn.edu

    Anagha Kulkarni
    kulka020@d.umn.edu

    1. Fixed a bug in clusterstopping.pl related to the case of an empty
       column, i.e., when one or more features do not occur in any of the
       contexts/instances. -Anagha

    2. Updated INSTALL and Makefile.PL to require v0.03 of
       Algorithm::RandomMatrixGeneration. -Anagha

    (Changelog-v0.85to0.87 Last Updated on 05/16/2006 by Anagha)