senseclusters-developers Mailing List for SenseClusters (Page 2)
Archive activity by month:

2003: Nov (2)
2004: Jan (6), Apr (2), Jun (1), Jul (1), Aug (2), Sep (5), Oct (30), Nov (7), Dec (11)
2005: Jan (51), Feb (8), Mar (3), Apr (2), May (2), Jun (2), Jul (5), Aug (20), Sep (5), Oct (2), Nov (2)
2006: Jan (8), Feb (2), Mar (7), Apr (2), May (4), Jun (16), Jul (7), Aug (6), Sep (1), Oct (4), Nov (1)
2007: Mar (1)
2008: Mar (2), Apr (10), Jun (1), Jul (1), Dec (2)
2009: Aug (2)
2010: May (2)
2013: Jun (3)
2015: May (1), Oct (1)
From: Ted P. <tpederse@d.umn.edu> - 2007-03-21 22:18:49
This is a low priority item, but SenseClusters should probably have more aggressive checking of Senseval-2 input formats, or format_clusters or preceding programs should fail more gracefully. For some reason I confused myself about the Senseval-2 input format, and instead of creating input formatted like this:

<instance id="7">
<answer instance="7" senseid="1"/>
<context>
The Mahatma <,> or <``> great souled one <, ''> instigated several campaigns of passive resistance against the British government in India <.> Unfortunately <,> according to Webster <'> s Biographical Dictionary <, ``> His policies went beyond his control and resulted <...> in riots and disturbances <''> and later a renewed campaign of civil disobedience <``> resulted in rioting and a second imprisonment <. ''> I am not a proponent of everything Gandhi did <,> but some of his law breaking was justified because India was then under occupation by a foreign power <,> and Indians were not able to participate fully in decisions that vitally <head> affected </head> them <.> It is difficult <,> however <,> to justify civil disobedience <,> non <-> violent or not <,> where citizens have full recourse to the ballot box to effect change <.> Where truly representative governments are safeguarded by constitutional protections of human rights and an independent judiciary to construe those rights <,> there is no excuse for breaking the law because some individual or group disagrees with it <.>
</context>
</instance>

I created something that looks like this:

<instance id="7"/>
<answer instance="7" senseid="1"/>
<context>
The Mahatma <,> or <``> great souled one <, ''> instigated several campaigns of passive resistance against the British government in India <.> Unfortunately <,> according to Webster <'> s Biographical Dictionary <, ``> His policies went beyond his control and resulted <...> in riots and disturbances <''> and later a renewed campaign of civil disobedience <``> resulted in rioting and a second imprisonment <. ''> I am not a proponent of everything Gandhi did <,> but some of his law breaking was justified because India was then under occupation by a foreign power <,> and Indians were not able to participate fully in decisions that vitally <head> affected </head> them <.> It is difficult <,> however <,> to justify civil disobedience <,> non <-> violent or not <,> where citizens have full recourse to the ballot box to effect change <.> Where truly representative governments are safeguarded by constitutional protections of human rights and an independent judiciary to construe those rights <,> there is no excuse for breaking the law because some individual or group disagrees with it <.>
</context>

The only real difference is in the <instance> tag, and while both are valid XML (I think), only the first is valid Senseval-2 format. However, when I tried to process the second one via the web interface, I ended up getting a huge number of errors/warnings from format_clusters, and the web interface was essentially hung. These errors appeared in the logfile in /usr/local/apache2/cgi-bin/SC-cgi/user_data:

Use of uninitialized value in pattern match (m//) at /space/SC095/tools/bin/format_clusters.pl line 318, <SCON> line 231.
Use of uninitialized value in pattern match (m//) at /space/SC095/tools/bin/format_clusters.pl line 318, <SCON> line 231.
Use of uninitialized value in pattern match (m//) at /space/SC095/tools/bin/format_clusters.pl line 318, <SCON> line 231.

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
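A minimal sketch of the kind of input check suggested above (not part of SenseClusters; the script is hypothetical). It flags self-closing <instance .../> tags, which are well-formed XML but not valid Senseval-2:

    use strict;
    use warnings;

    my $file = shift @ARGV or die "usage: $0 sval2-file\n";
    open( my $fh, '<', $file ) or die "cannot open $file: $!";
    while ( my $line = <$fh> ) {
        # a self-closing instance tag has no separate </instance> close
        if ( $line =~ m{<instance\b[^>]*/>} ) {
            warn "$file:$.: self-closing <instance/> tag is not valid Senseval-2\n";
        }
    }
    close $fh;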
From: Richard W. <rwi...@sw...> - 2006-11-29 19:41:17
Hi Ted and other SenseClusters folks,

I've updated the svdpackout.pl file so that I can extract the component U, S, and V' (V-transpose) matrices rather than only being able to extract the reconstructed matrix or the rows only. I've attached my version to this message. Feel free to use it as you see fit.

Here's a brief changelog:

* Added a feature to output the component U, S, and V' matrices.
* Added a new command-line option "--output" with three values:
    reconstruct - reconstructs the rank-k matrix (default)
    rowonly     - same as --rowonly
    components  - outputs the U, S, V' matrices to U.txt, S.txt, VT.txt
* Added a new command-line option "--negatives": allows negative values; otherwise all negative values are set to 0 (except in component output).
* The new options maintain backward compatibility.
* Updated the documentation.
* Passes all tests (testA1-A4, B1-B2).

Hope this helps you -- it helped my students!

As an aside -- and I'd be happy to post this to the main newsgroup if you'd rather -- what is the purpose of the "rowonly" feature? Why do you multiply U by the sqrt of the S values? Is there some theoretical reason to do this?

Thanks!
-Rich

--
Richard Wicentowski
Assistant Professor
Computer Science Department
Swarthmore College
(610) 690-5643
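One common rationale for this scaling in the LSA/SVD literature, offered here as background and not necessarily the reasoning behind svdpackout.pl: with a rank-k truncated SVD $A \approx U_k \Sigma_k V_k^T$, the row-to-row dot products satisfy

    $A A^T \approx (U_k \Sigma_k)(U_k \Sigma_k)^T$,

so representing each row by the corresponding row of $U_k \Sigma_k$ preserves those similarities exactly within the rank-k approximation. Splitting the singular values symmetrically, $\Sigma_k = \Sigma_k^{1/2} \Sigma_k^{1/2}$, describes rows by $U_k \Sigma_k^{1/2}$ and columns by $V_k \Sigma_k^{1/2}$, a compromise that weights the row and column spaces equally; several LSI implementations use that convention.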
From: ted p. <tpederse@d.umn.edu> - 2006-10-15 19:09:19
Archival Post. Script used to run CICLING 2007 experiments where the number of clusters is specified ahead of time. A related script uses cluster stopping instead.

=========================================================================

#!/bin/csh

########### This script shows how to acquire features from a separate
########### set of training data and use them to represent context
########### vectors in the SenseClusters native order 2 methodology.
###########
########### By Ted Pedersen, October 2006
###########

########### DATA PREPARATION

## root directory
set HOMEDIR = /home/ted/Web

## where test files are, in sval2 (xml) format
set TESTDIR = $HOMEDIR/Test

# where training data resides, in plain text format
set TRAINDIR = $HOMEDIR/TrainNYT

# make sure test and training directories are really there!

if (! -e $TESTDIR) then
    echo "No Test Dir <$TESTDIR>"
    exit 1;
endif

if (! -e $TRAINDIR) then
    echo "No Train Dir <$TRAINDIR>"
    exit 1;
endif

# run through several different combinations of corpora and settings...

foreach CORPUS (25 75)
foreach STAT (leftFisher ll pmi odds)
foreach REMOVE (5 10 20 50)
foreach MEASURE (pk2 pk3 gap)
foreach TEST (alston2.xml connor2.xml miller3.xml collins4.xml pedersen4.xml)

set TRAIN = nyt-$CORPUS-$REMOVE.$STAT

echo "---------running $TRAIN $MEASURE--------"

########### CREATE FEATURE MATCH PATTERNS

nsp2regex.pl $TRAINDIR/$TRAIN > $TRAINDIR/$TRAIN.regex

########### SECOND ORDER CONTEXT REPRESENTATION

# create order 2 vec with bigram features

wordvec.pl $TRAINDIR/$TRAIN --feats $TRAIN.feats > $TRAIN.wordvec
nsp2regex.pl $TRAIN.feats > $TRAIN.regex.feats

order2vec.pl --rclass $TRAIN.rclass --rlabel $TRAIN.rlabel $TESTDIR/$TEST $TRAIN.wordvec $TRAIN.regex.feats > $TRAIN.vector

echo "order2vec done"

########### best case, set number of clusters exactly

if ($TEST == "alston2.xml") then
    set CLUSTERS = 2
else if ($TEST == "connor2.xml") then
    set CLUSTERS = 2
else if ($TEST == "miller3.xml") then
    set CLUSTERS = 3
else if ($TEST == "collins4.xml") then
    set CLUSTERS = 4
else if ($TEST == "pedersen4.xml") then
    set CLUSTERS = 4
else
    echo "cluster setting error"
    exit
endif

vcluster -rclass $TRAIN.rclass -rlabel $TRAIN.rlabel $TRAIN.vector $CLUSTERS -clustfile $TRAIN.cluto.out > $TRAIN.cluto.report

########### EVALUATION

format_clusters.pl $TRAIN.cluto.out $TRAIN.rlabel --context $TESTDIR/$TEST > $TRAIN.clusters.context
clusterlabeling.pl $TRAIN.clusters.context > $TRAIN.clusterlabeling

cluto2label.pl $TRAIN.cluto.out key*key > $TRAIN.prelabel
label.pl $TRAIN.prelabel > $TRAIN.label
report.pl $TRAIN.label $TRAIN.prelabel > $TRAIN.report

mkdir $TEST-$TRAIN-$MEASURE
mv $TRAIN* $TEST-$TRAIN-$MEASURE

rm -fr key*
rm -fr expr*

end
end
end
end
end
From: ted p. <tpederse@d.umn.edu> - 2006-10-15 18:51:22
Archival Post. This script was used to run experiments for the CICLING 2007 submission where features were generated from external training data.

=======================================================================

#!/bin/csh

########### This script shows how to acquire features from a separate
########### set of training data and use them to represent context
########### vectors in the SenseClusters native order 2 methodology.
###########
########### By Ted Pedersen, October 2006
###########

########### DATA PREPARATION

## root directory
set HOMEDIR = /home/ted/Web

## where test files are, in sval2 (xml) format
set TESTDIR = $HOMEDIR/Test

# where training data resides, in plain text format
set TRAINDIR = $HOMEDIR/TrainNYT

# make sure test and training directories are really there!

if (! -e $TESTDIR) then
    echo "No Test Dir <$TESTDIR>"
    exit 1;
endif

if (! -e $TRAINDIR) then
    echo "No Train Dir <$TRAINDIR>"
    exit 1;
endif

# run through several different combinations of corpora and settings...

foreach CORPUS (25 75)
foreach STAT (leftFisher ll pmi odds)
foreach REMOVE (5 10 20 50)
foreach MEASURE (pk2 pk3 gap)
foreach TEST (alston2.xml connor2.xml miller3.xml collins4.xml pedersen4.xml)

set TRAIN = nyt-$CORPUS-$REMOVE.$STAT

echo "---------running $TRAIN $MEASURE--------"

########### CREATE FEATURE MATCH PATTERNS

nsp2regex.pl $TRAINDIR/$TRAIN > $TRAINDIR/$TRAIN.regex

########### SECOND ORDER CONTEXT REPRESENTATION

# create order 2 vec with bigram features

wordvec.pl $TRAINDIR/$TRAIN --feats $TRAIN.feats > $TRAIN.wordvec
nsp2regex.pl $TRAIN.feats > $TRAIN.regex.feats

order2vec.pl --rclass $TRAIN.rclass --rlabel $TRAIN.rlabel $TESTDIR/$TEST $TRAIN.wordvec $TRAIN.regex.feats > $TRAIN.vector

echo "order2vec done"

########### CLUSTERSTOPPING AND CLUSTERING

clusterstopping.pl $TRAIN.vector --prefix $TRAIN > $TRAIN.prediction

if (! -e $TRAIN.prediction) then
    echo "No Cluster Prediction, Assume 2"
    set CLUSTERS = 2
else
    set CLUSTERS = `cat $TRAIN.prediction`
    echo "Predict $CLUSTERS"
endif

vcluster -rclass $TRAIN.rclass -rlabel $TRAIN.rlabel $TRAIN.vector $CLUSTERS -clustfile $TRAIN.cluto.out > $TRAIN.cluto.report

########### EVALUATION

format_clusters.pl $TRAIN.cluto.out $TRAIN.rlabel --context $TESTDIR/$TEST > $TRAIN.clusters.context
clusterlabeling.pl $TRAIN.clusters.context > $TRAIN.clusterlabeling

cluto2label.pl $TRAIN.cluto.out key*key > $TRAIN.prelabel
label.pl $TRAIN.prelabel > $TRAIN.label
report.pl $TRAIN.label $TRAIN.prelabel > $TRAIN.report

mkdir $TEST-$TRAIN-$MEASURE
mv $TRAIN* $TEST-$TRAIN-$MEASURE

rm -fr key*
rm -fr expr*

end
end
end
end
end
From: ted p. <tpederse@d.umn.edu> - 2006-10-15 18:50:14
Archival post. This script was used to create external training data for the CICLING 2007 submission.

=================================================================

#!/bin/csh

# This script was used to create statistic files for different measures
# to be used as features for some other set of test/evaluation data.

# by ted pedersen, october 2006

set STOPLIST = /home/ted/Web/StopLists
# nyt-25.stop
# nyt-75.stop

set TRAINDATA = /home/CICLING/Train
# nyt-plain-clean-25-tr.txt
# nyt-plain-clean-75-tr.txt

foreach CORPUS (1 25 75)
foreach REMOVE (5 10 20 50)

set PREFIX = nyt-$CORPUS-$REMOVE

echo "running $PREFIX count"

count.pl --ngram 2 \
    --token token.regex \
    --remove $REMOVE \
    --stop $STOPLIST/nyt-$CORPUS.stop \
    $PREFIX.cnt2 \
    $TRAINDATA/nyt-plain-clean-$CORPUS-tr.txt

foreach STAT (ll leftFisher pmi odds)

    echo "running $PREFIX $STAT statistic"

    if ($STAT == ll) then
        set SCORE = 3.84
    else if ($STAT == leftFisher) then
        set SCORE = 0.95
    else if ($STAT == pmi) then
        set SCORE = 5.00
    else if ($STAT == odds) then
        set SCORE = 10000.00
    else
        echo "statistic error"
        exit
    endif

    statistic.pl $STAT --precision 4 --score $SCORE $PREFIX.$STAT $PREFIX.cnt2

end
end
end
From: ted p. <tpederse@d.umn.edu> - 2006-10-06 15:55:40
---------- Forwarded message ----------
Date: Fri, 06 Oct 2006 11:39:50 -0400
From: Anagha Kulkarni <an...@cs...>
To: Zori Kozareva <zko...@dl...>
Cc: ted pedersen <tpederse@d.umn.edu>, zko...@gm...
Subject: some more information regarding the encoding issue

Hi Zori,

A few links that I think I had used:

http://perldoc.perl.org/functions/binmode.html
http://rf.net/~james/perli18n.html
http://perldoc.perl.org/utf8.html
http://groups.google.com/group/comp.lang.perl.misc/browse_thread/thread/4e1800f6eac52650/86cf1b6ba0841e1f%2386cf1b6ba0841e1f?sa=X&oi=groupsr&start=1&num=3
http://perldoc.perl.org/perllocale.html#NAME

------------------------------------------------------------------------

Below is a more elaborate version of the senseclusters note:

I tried using locale and setting it to various different locales, but it does not help; all it does is ignore the accented characters. As I thought about it with the help of this mailing-list entry:

http://groups.google.com/group/comp.lang.perl.misc/browse_thread/thread/4e1800f6eac52650/86cf1b6ba0841e1f%2386cf1b6ba0841e1f?sa=X&oi=groupsr&start=1&num=3

(sorry for the length of the link!) I think I understand why binmode works for us and not locale, whereas locale worked for the NSP user. In our case the file was created using a different encoding (and locale) than our system's encoding, so binmode helps. In the case of the NSP user, I guess his file encoding and the system's encoding must have been the same.

I am reproducing a small part of the conversation from the above link, which explains when to use binmode:

"if the file contains the Operating System's definition of "text", then you *don't* have to use binmode. If you have a file which contains utf8 text, and the Operating System's definition of text is utf8, then you don't need binmode. If you have a file which contains latin1 text, and the Operating System's definition of text is latin1, then you don't need binmode. But if the Operating System's definition of text is utf8, and a file contains latin1 text, or vice versa, then binmode is needed."

Secondly, changing the $ENV{LANG} variable is not recommended, because the original value of this setting (in our case, on Redhat: utf8) is the "system's definition of text". This means that Redhat claims that all the utilities provided by them are utf8 compatible, and by changing it we would be breaking this assurance. (Source of information: the above mailing-list entry.)

Next I tried the iconv utility, which converts files from one encoding to another, so I ran the following command:

iconv -f latin1 -t utf8 spanish.stoplist >& converted

and tried count.pl (the original, without binmode) on this "converted" file, and it executed without any "Malformatted" error message. But the accented characters were again just ignored and were not present in the output.

So as I see it we have 2 options:

1. Add the binmode(FH, ":encoding(latin1)") statements to the programs that will handle the latin1 encoded data (just for the Spanish experiments), and then get the output with proper accented characters.

2. Convert the Spanish stoplist and the input data to utf8 format using iconv, which will save us the pain of modifying the programs, but at the cost of ignoring the accented characters.

---------------------------------------------------------------------------

I hope this helps and does not just add to your reading.

Thanks,
Anagha
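For reference, a minimal sketch of option 1, assuming a latin1-encoded file on a utf8 system (the file name follows the iconv example above):

    use strict;
    use warnings;

    open( my $fh, '<', 'spanish.stoplist' ) or die "cannot open stoplist: $!";
    binmode( $fh, ':encoding(latin1)' );    # decode latin1 bytes into Perl characters
    binmode( STDOUT, ':encoding(utf8)' );   # write utf8 on the way out
    while ( my $line = <$fh> ) {
        print $line;                        # accented characters survive the round trip
    }
    close $fh;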
From: ted p. <tpederse@d.umn.edu> - 2006-09-08 00:57:33
Archiving note from Mahesh re: senseclusters and kernels...

---------- Forwarded message ----------
Date: Thu, 07 Sep 2006 14:20:53 -0400
From: Mahesh Joshi <mah...@cs...>

The similarity matrices generated by simat.pl in SenseClusters are the kernel matrices. But there is one crucial difference between the methodology SC uses and the one we used in my thesis.

SC *always* uses training data (if separate from test) for feature selection, and feature selection only. It never creates a matrix representation of the training data - except in the second order case of wordvec.pl, actually (since those vectors will be based on the bigrams/co-occurrences found in the training data, and their scores).

The kernel.pl script (which is a modification of discriminate.pl) that I used for my thesis experiments *always* creates a matrix representation out of the training data (which is a required parameter). So both order1vec and wordvec create a matrix from training data. This matrix representation is then used by order2vec to represent the test contexts and to find the similarity matrix for the test data, which serves as the kernel for the SVMs. So essentially the test data is represented in terms of a matrix found from training data, thus (hopefully) giving additional knowledge about the test contexts apart from what they themselves contain.

Now, unfortunately, I don't see an easy way to do this directly in SC without making use of the kernel.pl script (which changes the discriminate.pl flow somewhat radically). After kernel.pl (which produces a .simat file, which should be converted to a dense format if it is not already), the SVM Light wrapper that I had written takes over. It takes an arff file as input along with the similarity matrix. Note that the instances in the arff file and the rows of the square similarity matrix correspond one-to-one, in both number and order: for 100 test instances, the .arff file should contain 100 instances, and the simat file should contain a 100x100 dense matrix with the 100 contexts/instances in the same order as the arff file (which is in turn the same as the order in the test sval2 file). This wrapper calls a modified version of SVM Light that I had created (to handle the similarity matrix input file).

=========================================================
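As background on the terminology, offered as a general fact about SVMs rather than a description of kernel.pl internals: a precomputed kernel is simply the Gram matrix of the instances,

    $K_{ij} = \langle \phi(c_i), \phi(c_j) \rangle$

for some feature map $\phi$. When the .simat file holds cosine similarities of the order 2 context vectors, $\phi$ is the length-normalized context vector itself (so $K$ is positive semi-definite), which is why the rows of the matrix must match the instances in the arff file in both number and order.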
From: Anagha K. <an...@cs...> - 2006-08-31 02:50:46
Hi Ted,

> So, will testXML.n.xml either be empty (if the XML file is well
> formed) or contain an error message?

Yes, that is correct.

> Could it possibly contain anything else?

No. It should either be empty or should contain the error message(s).

Thanks,
Anagha
From: ted p. <tpederse@d.umn.edu> - 2006-08-31 02:17:10
Hi Anagha,

Thanks for clarifying this. I didn't realize the files were supposed to be empty!

So, will testXML.n.xml either be empty (if the XML file is well formed) or contain an error message? Could it possibly contain anything else?

Thanks,
Ted

On Wed, 30 Aug 2006 an...@cs... wrote:

> Hi Ted,
>
> Sorry to learn that XML::Simple gave you a hard time.
>
> > The good news is that we aren't getting those XML::Simple errors
> > any more. The bad news, I think, is that the testXML.n.out files
> > are empty, as they are in talisker
>
> No, this is good news! :) In callwrap.pl's words (with a typo :), "if the
> xml is Not well-formed or not parsable then the output file not be empty".
>
> So, the empty files indicate that the generated xml file is well-formed,
> and thus the xml version of the file should be linked (and the txt
> version need not be used.)
>
> Thanks,
> Anagha

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: <an...@cs...> - 2006-08-31 01:13:38
Hi Ted,

Sorry to learn that XML::Simple gave you a hard time.

> The good news is that we aren't getting those XML::Simple errors
> any more. The bad news, I think, is that the testXML.n.out files
> are empty, as they are in talisker

No, this is good news! :) In callwrap.pl's words (with a typo :), "if the xml is Not well-formed or not parsable then the output file not be empty".

So, the empty files indicate that the generated xml file is well-formed, and thus the xml version of the file should be linked (and the txt version need not be used.)

Thanks,
Anagha
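A minimal sketch of the convention described here - an .out file that is empty exactly when the xml parsed cleanly. This is not callwrap.pl itself, and the file names are hypothetical:

    use strict;
    use warnings;
    use XML::Simple;

    my ( $xmlfile, $outfile ) = ( 'testXML.1.xml', 'testXML.1.out' );
    open( my $out, '>', $outfile ) or die "cannot open $outfile: $!";
    eval { XMLin($xmlfile) };      # XML::Simple dies on malformed or unparsable input
    print {$out} $@ if $@;         # so an empty .out file means the xml was well-formed
    close $out;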
From: ted p. <tpederse@d.umn.edu> - 2006-08-30 15:31:11
I resolved the XML::Simple test errors by installing XML::SAX::Expat, which is now being used as the default XML parser (rather than XML::SAX::PurePerl, which was being used before). The other option would have been to back off from 0.14 of XML::SAX to 0.12, but moving backwards with versions to fix problems always seems to create more problems down the road, so I didn't really want to do that.

The good news is that we aren't getting those XML::Simple errors any more. The bad news, I think, is that the testXML.n.out files are empty, as they are in talisker:

http://marimba.d.umn.edu/SC-htdocs/user1156951411/

So, what are we hoping to see in these files, and why are they empty? I am guessing that the XML parser previously didn't handle empty files gracefully, and that is why we were seeing those errors.

BTW, talisker is using 0.12 of XML::SAX, so this is consistent with what I describe above. There was apparently a bug in 0.13 of XML::SAX::PurePerl that might still exist in 0.14, based on what we are seeing, so switching over to XML::SAX::Expat seems reasonable, I think. Let me know if that poses a concern of some sort.

Description of the bug in XML::SAX is here:
http://www.cpanforum.com/threads/1473

Thanks!
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
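For reference, a hedged sketch of pinning the parser choice in code rather than relying on whatever XML::SAX registered at install time; $XML::Simple::PREFERRED_PARSER is a documented XML::Simple knob, and the file name here is hypothetical:

    use strict;
    use warnings;
    use XML::Simple;

    # ask XML::Simple to use the Expat-backed SAX parser instead of
    # whatever default is registered (e.g. XML::SAX::PurePerl)
    $XML::Simple::PREFERRED_PARSER = 'XML::SAX::Expat';

    my $ref = XMLin('testXML.1.xml');   # hypothetical file name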
From: ted p. <tpederse@d.umn.edu> - 2006-08-30 13:05:11
Hi Anagha,

I realized I tried to do target word discrimination with a headless file. That seems to cause all sorts of pretty crazy looking errors depending on the options selected, so at some point it would probably be good to check and make sure we have a head word in the data when target word discrimination is selected. The reverse case is no problem; that is, if we have a head word tag in data that is being processed as headless.

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-08-26 17:46:49
We are pleased to announce the release of SenseClusters version 0.95.

SenseClusters is a freely available package that allows you to cluster similar contexts, or to cluster words that occur in similar contexts. It is fully unsupervised, and can automatically discover the optimal number of clusters in your text.

As of version 0.95, we now fully support Latent Semantic Analysis for context and word clustering, and we continue to improve the native SenseClusters methods, which include the ability to cluster first and second order representations of context.

SenseClusters can be downloaded from:

http://senseclusters.sourceforge.net/

You can also try out SenseClusters via our web interface:

http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi

In both native and LSA modes, SenseClusters relies on lexical features (such as unigrams, bigrams, and co-occurrences) that can be identified in raw text. The tokenization is very flexible - a user can define this via Perl regular expressions - so it is possible to work with many other languages besides English, and you can easily work with tokenization schemes other than white-space separated words, such as character based tokens, like 2 letter sequences, etc.

The native SenseClusters methods support traditional first order context clustering, where you identify a feature set and then determine which of those features occur in the contexts you are clustering. The native methods also support second order context clustering, where each word is represented by a vector of the words with which it co-occurs. All the words in a context to be clustered are replaced by their associated vectors, and these vectors are averaged together to represent that context. Note that you can also cluster the word vectors to identify sets of related words.

Latent Semantic Analysis differs from the native SenseClusters methods in that each feature is represented by a vector that shows the contexts in which that feature occurs. Then, all the features in a context to be clustered are replaced by their associated vectors, and these are averaged together to represent the context. Note that you can also cluster the feature vectors directly to identify sets of related features.

This release represents a major step forward in the functionality of SenseClusters. Much of the work in providing LSA support was carried out by Mahesh Joshi this past spring and summer. And as has always been the case over the last two years, Anagha Kulkarni played a large role in this release; she has included many improvements to automated cluster stopping and other areas in 0.95.

Please give this a try, and let us know if you have any comments or questions! If you aren't certain whether your problem can be approached using SenseClusters, please let us know what you would like to do and maybe we can help you get started.

Cordially,
Ted, Anagha, and Mahesh

====================================================================

ChangeLog:
http://www.d.umn.edu/~tpederse/Code/Changelog.SenseClusters-v0.95.txt

Installation Instructions:
http://www.d.umn.edu/~tpederse/Code/SenseClusters-v0.95-INSTALL.txt

Related Publications (includes links to data you can use):
http://www.d.umn.edu/~tpederse/senseclusters-pubs.html

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
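To make the second order averaging described in the announcement concrete, here is a tiny illustrative sketch with toy vectors (not SenseClusters code):

    use strict;
    use warnings;

    # hypothetical word-by-word co-occurrence vectors
    my %wordvec = (
        apple  => [ 1, 0, 2 ],
        orange => [ 0, 1, 2 ],
    );
    my @context = qw(apple orange apple);   # the context to be represented

    my @sum  = (0) x 3;                     # dimension matches the vectors
    my $used = 0;
    for my $w (@context) {
        next unless exists $wordvec{$w};    # words without vectors are skipped
        $sum[$_] += $wordvec{$w}[$_] for 0 .. $#sum;
        $used++;
    }
    my @avg = map { $_ / $used } @sum;      # the order 2 context vector
    printf "context vector: %.2f %.2f %.2f\n", @avg;   # 0.67 0.33 2.00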
From: ted p. <tpederse@d.umn.edu> - 2006-07-15 05:03:06
Greetings all,

I am pleased to report that Anagha has finished her MS thesis, which means she is now officially a Master of Science! :) Congratulations on a job very well done!

Her thesis is entitled "Unsupervised Context Discrimination and Cluster Stopping" and is available from:

http://www.d.umn.edu/~tpederse/senseclusters-pubs.html

or

http://www.d.umn.edu/~tpederse/masters.html

This is the most complete (and best) description of the automatic cluster stopping methods that are now available in SenseClusters. It also contains a great deal of other significant content, including a new and impressive set of experiments on newsgroup data, name conflate data, word sense data, and manually annotated web search data! (All of this data is available at http://www.d.umn.edu/~tpederse/Data/anagha-thesis-data.zip btw.)

So, please do check this out, and also join me in wishing Anagha well as she finishes her work here at UMD and prepares to move on to CMU!!

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-07-12 15:22:30
Greetings all,

I wanted to mention that there will be two SenseClusters related events at AAAI in Boston next week.

First, I will be presenting a tutorial called "Language Independent Methods of Clustering Similar Contexts (with applications)" that will take place on Monday, July 17, from 2-6 pm. This is meant to be a general overview of the methodology that underlies SenseClusters. You can see the material from this tutorial (and previous ones) at:

http://www.d.umn.edu/~tpederse/SCTutorial.html

Second, Anagha Kulkarni will be presenting a poster entitled "How many different "John Smiths", and who are they?", which is all about name discrimination and how we have tackled that with SenseClusters. The poster will be presented on Wednesday evening, July 19, as a part of the demo/poster session. Here is the paper that accompanies the poster:

http://www.d.umn.edu/~tpederse/Pubs/aaai06-anagha-poster.pdf

So, if you are in Boston for AAAI, please do check these out, and stop by and say hi!

Cordially,
Ted and Anagha

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-07-08 05:55:49
We are very pleased to announce the release of SenseClusters version 0.93. This version marks our first steps towards supporting Latent Semantic Analysis in addition to our native SenseClusters methods.

In this version we now support word clustering (feature clustering really, as it is not limited to just unigrams or single words) that is based on a feature by context representation. In other words, features are clustered based on the contexts in which they occur. These matrices can optionally be reduced with SVD prior to clustering. We refer to this as LSA feature clustering. These feature by context representations are what we believe characterizes LSA, and what makes it different from our native SenseClusters methods. We have supported a form of word clustering prior to this release, and it is based on a word by word representation; that is, words are clustered based on the words with which they occur.

You can download version 0.93 from sourceforge:

http://sourceforge.net/projects/senseclusters/

As a preview, in version 0.95 we will have support for doing context discrimination "the LSA way". The features found in the contexts to be discriminated will be represented by vectors that show which contexts those features occur in, thus providing a second way of doing order 2 representations. At present our native SenseClusters order 2 methodology is based on replacing the words in the contexts to be clustered with vectors showing the words with which they occur.

There are some other significant changes in version 0.93, among them that SenseClusters now requires the use of Perl 5.8.5 or better. The most current version of Perl is now 5.8.8, and 5.8.5 is several years old, so it is probably time to upgrade anyway if you are running something less than 5.8.5.

Also, we have attempted to clarify the installation instructions further. We will continue to work on that in 0.95, hopefully making SenseClusters much easier to install. We think the instructions are quite a bit better now, so please check them out:

http://www.d.umn.edu/~tpederse/Code/SenseClusters-v0.93-INSTALL.txt

The more detailed ChangeLog for 0.93 can be found here:

http://www.d.umn.edu/~tpederse/Code/Changelog.SenseClusters-v0.93.txt

Please let us know if there are any questions, and please do plan on upgrading to 0.93, or try it out on the web interface:

http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi

We would be happy to answer any questions or receive any comments you might have.

Enjoy,
Ted, Mahesh, and Anagha

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
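A compact way to state the distinction, under the usual matrix conventions (a paraphrase, not text from the release note): if $A \in \mathbb{R}^{c \times f}$ is the context-by-feature matrix, the pre-0.93 word clustering works from a word-by-word co-occurrence matrix, while LSA feature clustering clusters the rows of $A^{T}$, so each feature is described by the vector of contexts it occurs in, optionally after smoothing $A^{T}$ with a rank-k SVD, $A^{T} \approx U_k \Sigma_k V_k^{T}$.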
From: ted p. <tpederse@d.umn.edu> - 2006-07-06 03:18:36
The following is from Anagha, in a note of June 13, 2006 that included other material and a different subject header. I thought I would resend this portion of the note with a new subject, so that it would be easier to find in the future, and it might be useful now as we contemplate 0.93. Note that even though we aren't adding any new scripts to the toolkit, most of the below is still relevant, I think.

-------------------------------------------------------------------------

With respect to the ripple effect, whenever we add a new script to SenseClusters (more specifically to Toolkit) I typically do the following things (some of the points below are obvious but I went ahead and included them anyway) - let me know if you find anything that should be in this list but is not:

1. if applicable, update Docs/Flows/flowchart.*
2. add a FILE.html documentation file to Docs/HTML/Toolkit_Docs/DIR
3. update Docs/HTML/SenseClusters-Code-README.* to link the html file added in 2. above
4. update Docs/HTML/discriminate.html
5. copy the new Docs/HTML/discriminate.html as Web/SC-htdocs/help.html
6. update Makefile.PL
7. create a new folder under Testing/ for the new script and add test cases
8. modify the web-interface
9. update the Changes/Changelog-v*.txt

Thanks,
Anagha
From: ted p. <tpederse@d.umn.edu> - 2006-07-02 18:25:40
Thanks Anagha, I appreciate your comments on this, and I am glad it is sounding like a reasonable design.

> I have a very minor point about the mode of the resulting stoplist -
> maybe we can have the "OR" mode as the default and provide an option
> for the user to set the mode - then one can use the stoplist that comes
> out of this program as is.

Agreed. I think the idea is to have something that can be used immediately as an NSP stop list. So... we might also want to consider having an option that would allow a user to indicate if they want a case sensitive list or not. In other words, /\bin\b/ versus /\b[Ii]n\b/ or maybe even /\b[Ii][Nn]\b/, remembering that NSP does not support the use of the /in/i directive. I think this is important actually, so I would suggest we default to case sensitive, and let the user turn that off with a flag like --caseinsensitive if they wish...

Now, this does introduce the issue of possible duplicates in the stoplist: if we find In and in as stopwords, and then ask for the list to be case-insensitive, we will end up with two equivalent entries. I do not think checking for duplicates will be too difficult though.

> Another point - a speculative one - although this script would not
> support the GigaWord format, if we were to use such files as input to
> this program with --inpformat as plain text, then we should expect all
> the meta-tokens along with the function words in the generated
> stoplist. Maybe we can use this as a test case for the script.

Agreed. Good idea! And in fact, I think the GigaWord corpus raises a few other interesting issues. For example, we could use --nontoken with a regex like /\<.*\>/ in order to disregard all of the meta characters from the stoplist. Now, there is still some content surrounded by the metacharacters that would be included (like title perhaps), but I think that is ok.

I was tempted in fact to include support for the GigaWord format, but I realized that most of the articles are about the same size, I think, and if we used the plain format, along with --nontoken for metacharacters, and a context size of approximately 200, we could probably get a pretty good stoplist. Now, you raise a nice point above, in that all these metacharacters will show up as stop words, so maybe we don't even need to worry about --nontoken.

The great thing about stoplist generation, I think, is that it's ok for it to be a fairly noisy process. Stop words should stand out blatantly in corpora as you look for them, and those that are on the borderline are best left as real words. So the fact that using --nontoken and an assumed context size might miss some stop words seems ok to me; we would rather err on the side of missing. I do not think there is any way the approach I describe above would err on the side of including too many stop words. But it will be great fun to experiment with.

Further thoughts and comments are of course welcome. Stoplist creation is an interesting and important issue I think, and one that is badly neglected. We all just download the SMART list and use that. :)

Thanks,
Ted
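A minimal sketch of the case handling discussed above (a hypothetical helper, not part of SenseClusters or NSP): since NSP does not honor /i, each letter is expanded into a [Xx] character class, and equivalent entries such as "In" and "in" collapse to one:

    use strict;
    use warnings;

    my @stopwords        = qw(In in The of);   # hypothetical input list
    my $case_insensitive = 1;                  # analogous to a --caseinsensitive flag

    my %seen;
    for my $word (@stopwords) {
        my $pattern = $word;
        if ($case_insensitive) {
            # expand each letter into a class, e.g. in -> [Ii][Nn]
            $pattern = join '',
                map { /[a-zA-Z]/ ? '[' . uc($_) . lc($_) . ']' : $_ }
                split //, $word;
        }
        my $regex = "/\\b$pattern\\b/";
        print "$regex\n" unless $seen{$regex}++;   # skip duplicate entries
    }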
From: Anagha K. <kulka020@d.umn.edu> - 2006-07-02 16:36:35
Hi Ted,

This looks like a good design!

I have a very minor point about the mode of the resulting stoplist - maybe we can have the "OR" mode as the default and provide an option for the user to set the mode - then one can use the stoplist that comes out of this program as is.

Another point - a speculative one - although this script would not support the GigaWord format, if we were to use such files as input to this program with --inpformat as plain text, then we should expect all the meta-tokens along with the function words in the generated stoplist. Maybe we can use this as a test case for the script.

Thanks,
Anagha

ted pedersen wrote:
> Here are some thoughts on the design of a stoplist generating script.
> [...]
From: ted p. <tpederse@d.umn.edu> - 2006-07-02 15:16:33
Here are some thoughts on the design of a stoplist generating script. At this point, as a practical matter, I think it should be a stand-alone script dedicated to stoplist generation. In theory one might incorporate this with count.pl or put it within SenseClusters somehow, but those are somewhat more time consuming options.

I am also quite convinced that these stoplists will be very useful indeed, and will result in better performance for SenseClusters and perhaps the vector measure in WordNet-Similarity. We have previously seen in both cases a great impact on overall results as the stoplist is adjusted.

So, in some respects I think this should be somewhat like nameconflate, in that it should handle two different formats of text, and should be able to control the size of the context it is working with, at least in the case of plain text.

The goal of this program is to take as input either plain text, or text formatted in the senseval2 format. The output would be an NSP compatible stoplist based on tf.idf. I also think we need a trace mode (so to speak) that shows the tf.idf, tf and idf values (so that a user can see the actual values if they are unsure of what is happening).

Now, clearly we don't really have documents here, so we need to redefine that a little.

If the text is senseval2 formatted, then each context defines a "document" for purposes of computing tf.idf. If the text is plain, then the user must input a value that defines the size of their context. I would suggest a default of 100 tokens. The idea would be that the program would chop the input text into blocks of 100 tokens, and consider those to be documents. The user could reset this size of course, based on whatever they think might be most useful. Also, if the text is plain, I think we should allow the user the option of saying that each line of plain text constitutes a context.

Now, above I mention tokens and not words, which implies that the program must support tokenization in the NSP style, which means supporting --token and --nontoken. This is important for supporting other languages, and then for controlling things like whether or not numeric values should be included in the stop list (they could be removed via --nontoken).

There are quite a few variations on how to compute tf.idf, and I think we probably ought to just pick a standard version. What is described here: http://www.answers.com/topic/tf-idf-1 strikes me as a pretty reasonable formulation. We could use that, and of course describe it in the perldoc.

To summarize, here are the options that I think need to be supported...

  --inpformat FMT       The format of the input file(s).
                        FMT = plain (default) / sval2

  --linecontext         Only valid for plain mode; one line per context.
                        Do not use with --contextsize or sval2.

  --token               Same as NSP.

  --nontoken            Same as NSP.

  --contextsize WINDOW  How large contexts/documents are (only valid
                        for plain text; default 100 tokens).

  --score REAL          The tf.idf score that acts as a cutoff for
                        stopwords. Should be set to some default; I am
                        not sure what this should be.

  --trace               Display the tf, idf, and tf.idf values for
                        each token that has tf.idf above --score.

What do we think? Is anything missing or misguided in the above?

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
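To make the scoring concrete, here is a small self-contained sketch of one standard tf.idf formulation, tf.idf(t,d) = tf(t,d) * log(N/df(t)). The tokenization and variable names are illustrative only, not a SenseClusters design decision:

    use strict;
    use warnings;

    # each context block is treated as one "document"
    my @documents = (
        "the cat sat on the mat",
        "the dog sat on the log",
    );

    my ( %df, @tf );
    for my $doc (@documents) {
        my %counts;
        $counts{$_}++ for split ' ', $doc;   # naive whitespace tokens
        $df{$_}++ for keys %counts;          # document frequency
        push @tf, \%counts;
    }

    my $N = scalar @documents;
    for my $i ( 0 .. $#documents ) {
        for my $t ( sort keys %{ $tf[$i] } ) {
            my $idf = log( $N / $df{$t} );   # 0 for tokens in every document
            printf "doc %d  %-5s tf=%d idf=%.3f tf.idf=%.3f\n",
                $i, $t, $tf[$i]{$t}, $idf, $tf[$i]{$t} * $idf;
        }
    }

Note how "the", which occurs in every document, gets an idf of 0 and thus a tf.idf of 0 - the signature that a stoplist generator would look for.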
From: ted p. <tpederse@d.umn.edu> - 2006-06-27 02:03:25
Hi Anagha and Mahesh,

As we contemplate the incorporation of Latent Semantic Analysis into SenseClusters, there is in fact a rather difficult naming issue we must deal with. SenseClusters right now refers rather generically to headed and headless clustering, and word clustering. Now, when we include Latent Semantic Analysis, we probably want to view that as being a part of SenseClusters, which is the name of the package really, but not the methodology. In fact, this points out that we don't really have a name for the methodology! And indeed, we should be able to do headed, headless, and word clustering with Latent Semantic Analysis, or with the (unnamed) SenseClusters methodology.

For the first page of the web interface, I am imagining a layout something like this (note a small bit of rewriting that we should think about):

  SenseClusters Web Interface
  Clusters contexts based on their similarity (?)

  (unnamed SenseClusters methodology)
    target word (headed) clustering (e.g., word sense discrimination)
    headless clustering (e.g., email categorization)
    word clustering (e.g., synonym finding)

  Latent Semantic Analysis
    target word (headed) clustering
    headless clustering
    feature clustering

[We will of course add this functionality in two stages; the first stage (0.93) will add the feature clustering for LSA, and then the next stage will add the headed and headless clustering.]

Now, the essential difference, I think, is in whether or not we are dealing with feature by context representations (LSA) or context by feature representations (SC). But while that is a good explanation, it doesn't lead to a colorful or interesting name. :)

Unfortunately, the terms first order and second order representation become a bit ambiguous too, since we will have second order LSA (where features are replaced by a feature by context vector). Now, I guess we will not have a first order LSA version of target word clustering, so perhaps first order refers "uniquely" to one of our methods. But second order and word clustering do not...

So, this is something to think about. :) Please note that the design of the first page above makes the difference between LSA and our methods seem rather stark, when in fact they are quite closely related. However, I think it is best to keep them stark like this, especially since LSA has such high name recognition. That said, if you think there is another organization of the main page that makes more sense, and still makes the availability of LSA clear to the casual user, then I am very interested.

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-06-18 00:29:17
We are pleased to announce the release of SenseClusters version 0.91. This release includes a number of significant improvements to our web interface, and hopefully simplifies the setup of the web interface if you would like to run your own version of it. You can download this new version of SenseClusters at:

http://sourceforge.net/projects/senseclusters/

BTW, please note that you do not need to install the Web interface if you don't want to; ours is always available at:

http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi

The main change to the web interface visible to users is that it now provides plots (as pdf files) that illustrate the cluster stopping decision making process, showing essentially the change in the criterion function values and where our different measures elect to stop clustering.

Also note that we have cleaned up our FAQ a little bit, and would welcome new questions to include in it.

You can find the more detailed ChangeLog below. Please let us know if you have any questions or comments!

Enjoy,
Ted and Anagha

==================================================================

Changes made in Sense-Clusters version 0.89 during version 0.91

Ted Pedersen tpederse@d.umn.edu
Anagha Kulkarni kulka020@d.umn.edu

1. Added config.txt under the SC-cgi dir; the settings for PATH, PERL5LIB, the complete paths to SC-cgi and SC-htdocs, and the name of the cgi dir are now read by second.cgi, fifth.cgi and callwrap.pl from this single file. - Anagha

2. Modified fourth.cgi to include the missing case for the --cluststop "gap" option setting. - Anagha

3. Included the plot generation scripts under the SC-cgi dir and updated callwrap.pl accordingly. - Anagha

4. Modified /Web/README.Web.pod to indicate the following pre-requisites for the plot generation: gnuplot, latex and ps2pdf. - Anagha

5. Updated /Web/README.Web.pod for the new config.txt related changes. - Anagha

6. Updated Docs/FAQs.pod. - Anagha

7. Added FAQs.html to the Docs/HTML dir. - Anagha

(Changelog-v0.89to0.91 Last Updated on 06/16/2006 by Anagha)

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: Anagha K. <kulka020@d.umn.edu> - 2006-06-14 16:12:44
Hi Ted,

> Is it true that word clustering only allows input as test data? If so,
> this does not allow one natural thing, and that would be to find word
> clusters in plain text (training data that is not in senseval-2 format).

Yes, currently word clustering accepts senseval2 formatted test data only.

Thanks,
Anagha
From: ted p. <tpederse@d.umn.edu> - 2006-06-14 00:08:19
Hi Anagha and Mahesh,

Very good. Thanks to both of you for your comments. So I think I am ready to say that we should go ahead and use the --lsa convention that has been previously described. I was thinking about this some more today, and really did not find an option or set of options that I liked better. I also do not want to change --wordclust, for the reasons Mahesh has mentioned. I think the distinction between word clustering and context clustering is fairly intuitive, and we can point out that our word clustering is in fact more generic than that, and can be considered feature clustering. I do not think that will be too confusing.

Thanks for the interesting discussion. I think we have made some good decisions here, although if there are problems that we did not expect because of this, let's raise them immediately. I will start to compose a note or two to the users list, describing our plans.

Thanks,
Ted

On Tue, 13 Jun 2006, Mahesh Joshi wrote:

> Hi Ted,
>
> I too think the idea of having a "--lsa" option is better. It does
> give a convenient switch internally for programming purposes (rather
> than multiple option values to handle) and also maintains the
> backwards compatibility, which would have been a concern otherwise.
>
> As you mention, let us stick to "--wordclust" as the option name for
> feature clustering for now, with the understanding that it also
> provides feature clustering, and we will have explicit and visible
> documentation mentioning the same (for the option itself, in the
> CHANGELOG and any other places). This has the further advantage of
> maintaining absolute backwards compatibility (not even renaming the
> option).
>
> I do understand that training data does not make sense for feature
> clustering, however I am not sure about the headed/headless issue -
> so I will not comment on that for now.
>
> Thanks,
> Mahesh
From: ted p. <tpederse@d.umn.edu> - 2006-06-14 00:03:30
Hi Anagha,

> Went back and looked at our correspondence regarding this issue of
> performing word clustering only with headless data, and more or less
> the summary is that we did not want to restrict word clustering to
> finding words similar to some specific target word, but wanted to
> cluster as many open-class words as possible into sets of related
> words.
>
> So I would like to take back what I had suggested regarding feature
> clustering and the type of data. Thus I think we should carry the
> restriction of using only headless data with word clustering forward
> to feature clustering too.

Very good, thanks for this clarification. I agree. I think "word" clustering should mean taking all the features that are found in a given set of test data and clustering those. So the input to word clustering should be headless (no target words) contexts formatted in the senseval-2 format.

Is it true that word clustering only allows input as test data? If so, this does not allow one natural thing, and that would be to find word clusters in plain text (training data that is not in senseval-2 format). I do not think this is a huge problem, and I am not too worried about fixing this now, but it is something we want to be clear about.

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse