senseclusters-developers Mailing List for SenseClusters (Page 2)
Archive activity by month:

2003: Nov (2)
2004: Jan (6), Apr (2), Jun (1), Jul (1), Aug (2), Sep (5), Oct (30), Nov (7), Dec (11)
2005: Jan (51), Feb (8), Mar (3), Apr (2), May (2), Jun (2), Jul (5), Aug (20), Sep (5), Oct (2), Nov (2)
2006: Jan (8), Feb (2), Mar (7), Apr (2), May (4), Jun (16), Jul (7), Aug (6), Sep (1), Oct (4), Nov (1)
2007: Mar (1)
2008: Mar (2), Apr (10), Jun (1), Jul (1), Dec (2)
2009: Aug (2)
2010: May (2)
2013: Jun (3)
2015: May (1), Oct (1)
From: Ted P. <tpederse@d.umn.edu> - 2007-03-21 22:18:49
This is a low priority item, but SenseClusters should probably have more aggressive checking of Senseval-2 input formats, or format_clusters or preceding programs should fail more gracefully. For some reason I confused myself about the Senseval-2 input format, and instead of creating input formatted like this:

<instance id="7">
<answer instance="7" senseid="1"/>
<context>
The Mahatma <,> or <``> great souled one <, ''> instigated several campaigns of passive resistance against the British government in India <.> Unfortunately <,> according to Webster <'> s Biographical Dictionary <, ``> His policies went beyond his control and resulted <...> in riots and disturbances <''> and later a renewed campaign of civil disobedience <``> resulted in rioting and a second imprisonment <. ''> I am not a proponent of everything Gandhi did <,> but some of his law breaking was justified because India was then under occupation by a foreign power <,> and Indians were not able to participate fully in decisions that vitally <head> affected </head> them <.> It is difficult <,> however <,> to justify civil disobedience <,> non <-> violent or not <,> where citizens have full recourse to the ballot box to effect change <.> Where truly representative governments are safeguarded by constitutional protections of human rights and an independent judiciary to construe those rights <,> there is no excuse for breaking the law because some individual or group disagrees with it <.>
</context>
</instance>

I created something that looks like this:

<instance id="7"/>
<answer instance="7" senseid="1"/>
<context>
The Mahatma <,> or <``> great souled one <, ''> instigated several campaigns of passive resistance against the British government in India <.> Unfortunately <,> according to Webster <'> s Biographical Dictionary <, ``> His policies went beyond his control and resulted <...> in riots and disturbances <''> and later a renewed campaign of civil disobedience <``> resulted in rioting and a second imprisonment <. ''> I am not a proponent of everything Gandhi did <,> but some of his law breaking was justified because India was then under occupation by a foreign power <,> and Indians were not able to participate fully in decisions that vitally <head> affected </head> them <.> It is difficult <,> however <,> to justify civil disobedience <,> non <-> violent or not <,> where citizens have full recourse to the ballot box to effect change <.> Where truly representative governments are safeguarded by constitutional protections of human rights and an independent judiciary to construe those rights <,> there is no excuse for breaking the law because some individual or group disagrees with it <.>
</context>

The only real difference is in the <instance> tag, and while both are valid XML (I think), only the first is valid Senseval-2 format. However, when I tried to process the second one via the web interface, I ended up getting a huge number of errors/warnings from format_clusters, and the web interface was essentially hung. These errors appeared in the logfile in /usr/local/apache2/cgi-bin/SC-cgi/user_data:

Use of uninitialized value in pattern match (m//) at /space/SC095/tools/bin/format_clusters.pl line 318, <SCON> line 231.
Use of uninitialized value in pattern match (m//) at /space/SC095/tools/bin/format_clusters.pl line 318, <SCON> line 231.
Use of uninitialized value in pattern match (m//) at /space/SC095/tools/bin/format_clusters.pl line 318, <SCON> line 231.

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
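A minimal sketch of the kind of input check suggested above (not part of SenseClusters; the script is hypothetical). It flags self-closing <instance .../> tags, which are well-formed XML but not valid Senseval-2:

    use strict;
    use warnings;

    my $file = shift @ARGV or die "usage: $0 sval2-file\n";
    open( my $fh, '<', $file ) or die "cannot open $file: $!";
    while ( my $line = <$fh> ) {
        # a self-closing instance tag has no separate </instance> close
        if ( $line =~ m{<instance\b[^>]*/>} ) {
            warn "$file:$.: self-closing <instance/> tag is not valid Senseval-2\n";
        }
    }
    close $fh;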
From: Richard W. <rwi...@sw...> - 2006-11-29 19:41:17
Hi Ted and other SenseClusters folks,

I've updated the svdpackout.pl file so that I can extract the component U, S, and V' (V-transpose) matrices rather than only being able to extract the reconstructed matrix or the rows only. I've attached my version to this message. Feel free to use it as you see fit.

Here's a brief changelog:

* Added a feature to output the component U, S, and V' matrices.
* Added a new command-line option "--output" with three values:
    reconstruct - reconstructs the rank-k matrix (default)
    rowonly     - same as --rowonly
    components  - outputs the U, S, V' matrices to U.txt, S.txt, VT.txt
* Added a new command-line option "--negatives": allows negative values; otherwise all negative values are set to 0 (except in component output).
* The new options maintain backward compatibility.
* Updated the documentation.
* Passes all tests (testA1-A4, B1-B2).

Hope this helps you -- it helped my students!

As an aside -- and I'd be happy to post this to the main newsgroup if you'd rather -- what is the purpose of the "rowonly" feature? Why do you multiply U by the sqrt of the S values? Is there some theoretical reason to do this?

Thanks!
-Rich

--
Richard Wicentowski
Assistant Professor
Computer Science Department
Swarthmore College
(610) 690-5643
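One common rationale for this scaling in the LSA/SVD literature, offered here as background and not necessarily the reasoning behind svdpackout.pl: with a rank-k truncated SVD $A \approx U_k \Sigma_k V_k^T$, the row-to-row dot products satisfy

    $A A^T \approx (U_k \Sigma_k)(U_k \Sigma_k)^T$,

so representing each row by the corresponding row of $U_k \Sigma_k$ preserves those similarities exactly within the rank-k approximation. Splitting the singular values symmetrically, $\Sigma_k = \Sigma_k^{1/2} \Sigma_k^{1/2}$, describes rows by $U_k \Sigma_k^{1/2}$ and columns by $V_k \Sigma_k^{1/2}$, a compromise that weights the row and column spaces equally; several LSI implementations use that convention.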
From: ted p. <tpederse@d.umn.edu> - 2006-10-15 19:09:19
Archival Post. Script used to run CICLING 2007 experiments where the number of clusters is specified ahead of time. A related script uses cluster stopping instead.

=========================================================================

#!/bin/csh

########### This script shows how to acquire features from a separate
########### set of training data and use them to represent context
########### vectors in the SenseClusters native order 2 methodology.
###########
########### By Ted Pedersen, October 2006
###########

########### DATA PREPARATION

## root directory
set HOMEDIR = /home/ted/Web

## where test files are, in sval2 (xml) format
set TESTDIR = $HOMEDIR/Test

# where training data resides, in plain text format
set TRAINDIR = $HOMEDIR/TrainNYT

# make sure test and training directories are really there!

if (! -e $TESTDIR) then
    echo "No Test Dir <$TESTDIR>"
    exit 1;
endif

if (! -e $TRAINDIR) then
    echo "No Train Dir <$TRAINDIR>"
    exit 1;
endif

# run through several different combinations of corpora and settings...

foreach CORPUS (25 75)
foreach STAT (leftFisher ll pmi odds)
foreach REMOVE (5 10 20 50)
foreach MEASURE (pk2 pk3 gap)
foreach TEST (alston2.xml connor2.xml miller3.xml collins4.xml pedersen4.xml)

set TRAIN = nyt-$CORPUS-$REMOVE.$STAT

echo "---------running $TRAIN $MEASURE--------"

########### CREATE FEATURE MATCH PATTERNS

nsp2regex.pl $TRAINDIR/$TRAIN > $TRAINDIR/$TRAIN.regex

########### SECOND ORDER CONTEXT REPRESENTATION

# create order 2 vec with bigram features

wordvec.pl $TRAINDIR/$TRAIN --feats $TRAIN.feats > $TRAIN.wordvec
nsp2regex.pl $TRAIN.feats > $TRAIN.regex.feats

order2vec.pl --rclass $TRAIN.rclass --rlabel $TRAIN.rlabel $TESTDIR/$TEST $TRAIN.wordvec $TRAIN.regex.feats > $TRAIN.vector

echo "order2vec done"

########### best case, set number of clusters exactly

if ($TEST == "alston2.xml") then
    set CLUSTERS = 2
else if ($TEST == "connor2.xml") then
    set CLUSTERS = 2
else if ($TEST == "miller3.xml") then
    set CLUSTERS = 3
else if ($TEST == "collins4.xml") then
    set CLUSTERS = 4
else if ($TEST == "pedersen4.xml") then
    set CLUSTERS = 4
else
    echo "cluster setting error"
    exit
endif

vcluster -rclass $TRAIN.rclass -rlabel $TRAIN.rlabel $TRAIN.vector $CLUSTERS -clustfile $TRAIN.cluto.out > $TRAIN.cluto.report

########### EVALUATION

format_clusters.pl $TRAIN.cluto.out $TRAIN.rlabel --context $TESTDIR/$TEST > $TRAIN.clusters.context
clusterlabeling.pl $TRAIN.clusters.context > $TRAIN.clusterlabeling

cluto2label.pl $TRAIN.cluto.out key*key > $TRAIN.prelabel
label.pl $TRAIN.prelabel > $TRAIN.label
report.pl $TRAIN.label $TRAIN.prelabel > $TRAIN.report

mkdir $TEST-$TRAIN-$MEASURE
mv $TRAIN* $TEST-$TRAIN-$MEASURE

rm -fr key*
rm -fr expr*

end
end
end
end
end
From: ted p. <tpederse@d.umn.edu> - 2006-10-15 18:51:22
Archival Post. This script was used to run experiments for the CICLING 2007 submission where features were generated from external training data.

=======================================================================

#!/bin/csh

########### This script shows how to acquire features from a separate
########### set of training data and use them to represent context
########### vectors in the SenseClusters native order 2 methodology.
###########
########### By Ted Pedersen, October 2006
###########

########### DATA PREPARATION

## root directory
set HOMEDIR = /home/ted/Web

## where test files are, in sval2 (xml) format
set TESTDIR = $HOMEDIR/Test

# where training data resides, in plain text format
set TRAINDIR = $HOMEDIR/TrainNYT

# make sure test and training directories are really there!

if (! -e $TESTDIR) then
    echo "No Test Dir <$TESTDIR>"
    exit 1;
endif

if (! -e $TRAINDIR) then
    echo "No Train Dir <$TRAINDIR>"
    exit 1;
endif

# run through several different combinations of corpora and settings...

foreach CORPUS (25 75)
foreach STAT (leftFisher ll pmi odds)
foreach REMOVE (5 10 20 50)
foreach MEASURE (pk2 pk3 gap)
foreach TEST (alston2.xml connor2.xml miller3.xml collins4.xml pedersen4.xml)

set TRAIN = nyt-$CORPUS-$REMOVE.$STAT

echo "---------running $TRAIN $MEASURE--------"

########### CREATE FEATURE MATCH PATTERNS

nsp2regex.pl $TRAINDIR/$TRAIN > $TRAINDIR/$TRAIN.regex

########### SECOND ORDER CONTEXT REPRESENTATION

# create order 2 vec with bigram features

wordvec.pl $TRAINDIR/$TRAIN --feats $TRAIN.feats > $TRAIN.wordvec
nsp2regex.pl $TRAIN.feats > $TRAIN.regex.feats

order2vec.pl --rclass $TRAIN.rclass --rlabel $TRAIN.rlabel $TESTDIR/$TEST $TRAIN.wordvec $TRAIN.regex.feats > $TRAIN.vector

echo "order2vec done"

########### CLUSTERSTOPPING AND CLUSTERING

clusterstopping.pl $TRAIN.vector --prefix $TRAIN > $TRAIN.prediction

if (! -e $TRAIN.prediction) then
    echo "No Cluster Prediction, Assume 2"
    set CLUSTERS = 2
else
    set CLUSTERS = `cat $TRAIN.prediction`
    echo "Predict $CLUSTERS"
endif

vcluster -rclass $TRAIN.rclass -rlabel $TRAIN.rlabel $TRAIN.vector $CLUSTERS -clustfile $TRAIN.cluto.out > $TRAIN.cluto.report

########### EVALUATION

format_clusters.pl $TRAIN.cluto.out $TRAIN.rlabel --context $TESTDIR/$TEST > $TRAIN.clusters.context
clusterlabeling.pl $TRAIN.clusters.context > $TRAIN.clusterlabeling

cluto2label.pl $TRAIN.cluto.out key*key > $TRAIN.prelabel
label.pl $TRAIN.prelabel > $TRAIN.label
report.pl $TRAIN.label $TRAIN.prelabel > $TRAIN.report

mkdir $TEST-$TRAIN-$MEASURE
mv $TRAIN* $TEST-$TRAIN-$MEASURE

rm -fr key*
rm -fr expr*

end
end
end
end
end
From: ted p. <tpederse@d.umn.edu> - 2006-10-15 18:50:14
Archival post. This script was used to create external training data for the CICLING 2007 submission.

=================================================================

#!/bin/csh

# This script was used to create statistic files for different measures
# to be used as features for some other set of test/evaluation data.

# by ted pedersen, october 2006

set STOPLIST = /home/ted/Web/StopLists
# nyt-25.stop
# nyt-75.stop

set TRAINDATA = /home/CICLING/Train
# nyt-plain-clean-25-tr.txt
# nyt-plain-clean-75-tr.txt

foreach CORPUS (1 25 75)
foreach REMOVE (5 10 20 50)

set PREFIX = nyt-$CORPUS-$REMOVE

echo "running $PREFIX count"

count.pl --ngram 2 \
    --token token.regex \
    --remove $REMOVE \
    --stop $STOPLIST/nyt-$CORPUS.stop \
    $PREFIX.cnt2 \
    $TRAINDATA/nyt-plain-clean-$CORPUS-tr.txt

foreach STAT (ll leftFisher pmi odds)

    echo "running $PREFIX $STAT statistic"

    if ($STAT == ll) then
        set SCORE = 3.84
    else if ($STAT == leftFisher) then
        set SCORE = 0.95
    else if ($STAT == pmi) then
        set SCORE = 5.00
    else if ($STAT == odds) then
        set SCORE = 10000.00
    else
        echo "statistic error"
        exit
    endif

    statistic.pl $STAT --precision 4 --score $SCORE $PREFIX.$STAT $PREFIX.cnt2

end
end
end
From: ted p. <tpederse@d.umn.edu> - 2006-10-06 15:55:40
---------- Forwarded message ----------
Date: Fri, 06 Oct 2006 11:39:50 -0400
From: Anagha Kulkarni <an...@cs...>
To: Zori Kozareva <zko...@dl...>
Cc: ted pedersen <tpederse@d.umn.edu>, zko...@gm...
Subject: some more information regarding the encoding issue

Hi Zori,

A few links that I think I had used:

http://perldoc.perl.org/functions/binmode.html
http://rf.net/~james/perli18n.html
http://perldoc.perl.org/utf8.html
http://groups.google.com/group/comp.lang.perl.misc/browse_thread/thread/4e1800f6eac52650/86cf1b6ba0841e1f%2386cf1b6ba0841e1f?sa=X&oi=groupsr&start=1&num=3
http://perldoc.perl.org/perllocale.html#NAME

------------------------------------------------------------------------

Below is a more elaborate version of the senseclusters note:

I tried using locale and setting it to various different locales, but it does not help; all it does is ignore the accented characters. As I thought about it with the help of this mailing-list entry:

http://groups.google.com/group/comp.lang.perl.misc/browse_thread/thread/4e1800f6eac52650/86cf1b6ba0841e1f%2386cf1b6ba0841e1f?sa=X&oi=groupsr&start=1&num=3

(sorry for the length of the link!) I think I understand why binmode works for us and not locale, whereas locale worked for the NSP user. In our case the file was created using a different encoding (and locale) than our system's encoding, so binmode helps. In the case of the NSP user, I guess his file encoding and the system's encoding must have been the same.

I am reproducing a small part of the conversation from the above link, which explains when to use binmode:

"if the file contains the Operating System's definition of "text", then you *don't* have to use binmode. If you have a file which contains utf8 text, and the Operating System's definition of text is utf8, then you don't need binmode. If you have a file which contains latin1 text, and the Operating System's definition of text is latin1, then you don't need binmode. But if the Operating System's definition of text is utf8, and a file contains latin1 text, or vice versa, then binmode is needed."

Secondly, changing the $ENV{LANG} variable is not recommended, because the original value of this setting (in our case, on Redhat: utf8) is the "system's definition of text". This means that Redhat claims that all the utilities provided by them are utf8 compatible, and by changing it we would be breaking this assurance. (Source of information: the above mailing-list entry.)

Next I tried the iconv utility, which converts files from one encoding to another, so I ran the following command:

iconv -f latin1 -t utf8 spanish.stoplist >& converted

and tried count.pl (the original, without binmode) on this "converted" file, and it executed without any "Malformatted" error message. But the accented characters were again just ignored and were not present in the output.

So as I see it we have 2 options:

1. Add the binmode(FH, ":encoding(latin1)") statements to the programs that will handle the latin1 encoded data (just for the Spanish experiments), and then get the output with proper accented characters.

2. Convert the Spanish stoplist and the input data to utf8 format using iconv, which will save us the pain of modifying the programs, but at the cost of ignoring the accented characters.

---------------------------------------------------------------------------

I hope this helps and does not just add to your reading.

Thanks,
Anagha
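For reference, a minimal sketch of option 1, assuming a latin1-encoded file on a utf8 system (the file name follows the iconv example above):

    use strict;
    use warnings;

    open( my $fh, '<', 'spanish.stoplist' ) or die "cannot open stoplist: $!";
    binmode( $fh, ':encoding(latin1)' );    # decode latin1 bytes into Perl characters
    binmode( STDOUT, ':encoding(utf8)' );   # write utf8 on the way out
    while ( my $line = <$fh> ) {
        print $line;                        # accented characters survive the round trip
    }
    close $fh;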
From: ted p. <tpederse@d.umn.edu> - 2006-09-08 00:57:33
Archiving note from Mahesh re: senseclusters and kernels...

---------- Forwarded message ----------
Date: Thu, 07 Sep 2006 14:20:53 -0400
From: Mahesh Joshi <mah...@cs...>

The similarity matrices generated by simat.pl in SenseClusters are the kernel matrices. But there is one crucial difference between the methodology SC uses and the one we used in my thesis.

SC *always* uses training data (if separate from test) for feature selection, and feature selection only. It never creates a matrix representation of the training data - except in the second order case of wordvec.pl, actually (since those vectors will be based on the bigrams/co-occurrences found in the training data, and their scores).

The kernel.pl script (which is a modification of discriminate.pl) that I used for my thesis experiments *always* creates a matrix representation out of the training data (which is a required parameter). So both order1vec and wordvec create a matrix from training data. This matrix representation is then used by order2vec to represent the test contexts and to find the similarity matrix for the test data, which serves as the kernel for the SVMs. So essentially the test data is represented in terms of a matrix found from training data, thus (hopefully) giving additional knowledge about the test contexts apart from what they themselves contain.

Now, unfortunately, I don't see an easy way to do this directly in SC without making use of the kernel.pl script (which changes the discriminate.pl flow somewhat radically). After kernel.pl (which produces a .simat file, which should be converted to a dense format if it is not already), the SVM Light wrapper that I had written takes over. It takes an arff file as input along with the similarity matrix. Note that the instances in the arff file and the rows of the square similarity matrix correspond one-to-one, in both number and order: for 100 test instances, the .arff file should contain 100 instances, and the simat file should contain a 100x100 dense matrix with the 100 contexts/instances in the same order as the arff file (which is in turn the same as the order in the test sval2 file). This wrapper calls a modified version of SVM Light that I had created (to handle the similarity matrix input file).

=========================================================
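As background on the terminology, offered as a general fact about SVMs rather than a description of kernel.pl internals: a precomputed kernel is simply the Gram matrix of the instances,

    $K_{ij} = \langle \phi(c_i), \phi(c_j) \rangle$

for some feature map $\phi$. When the .simat file holds cosine similarities of the order 2 context vectors, $\phi$ is the length-normalized context vector itself (so $K$ is positive semi-definite), which is why the rows of the matrix must match the instances in the arff file in both number and order.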
From: Anagha K. <an...@cs...> - 2006-08-31 02:50:46
Hi Ted,

> So, will testXML.n.xml either be empty (if the XML file is well
> formed) or contain an error message?

Yes, that is correct.

> Could it possibly contain anything else?

No. It should either be empty or should contain the error message(s).

Thanks,
Anagha
From: ted p. <tpederse@d.umn.edu> - 2006-08-31 02:17:10
Hi Anagha,

Thanks for clarifying this. I didn't realize the files were supposed to be empty!

So, will testXML.n.xml either be empty (if the XML file is well formed) or contain an error message? Could it possibly contain anything else?

Thanks,
Ted

On Wed, 30 Aug 2006 an...@cs... wrote:

> Hi Ted,
>
> Sorry to learn that XML::Simple gave you a hard time.
>
> > The good news is that we aren't getting those XML::Simple errors
> > any more. The bad news, I think, is that the testXML.n.out files
> > are empty, as they are in talisker
>
> No, this is good news! :) In callwrap.pl's words (with a typo :), "if the
> xml is Not well-formed or not parsable then the output file not be empty".
>
> So, the empty files indicate that the generated xml file is well-formed,
> and thus the xml version of the file should be linked (and the txt
> version need not be used.)
>
> Thanks,
> Anagha

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: <an...@cs...> - 2006-08-31 01:13:38
Hi Ted,

Sorry to learn that XML::Simple gave you a hard time.

> The good news is that we aren't getting those XML::Simple errors
> any more. The bad news, I think, is that the testXML.n.out files
> are empty, as they are in talisker

No, this is good news! :) In callwrap.pl's words (with a typo :), "if the xml is Not well-formed or not parsable then the output file not be empty".

So, the empty files indicate that the generated xml file is well-formed, and thus the xml version of the file should be linked (and the txt version need not be used.)

Thanks,
Anagha
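A minimal sketch of the convention described here - an .out file that is empty exactly when the xml parsed cleanly. This is not callwrap.pl itself, and the file names are hypothetical:

    use strict;
    use warnings;
    use XML::Simple;

    my ( $xmlfile, $outfile ) = ( 'testXML.1.xml', 'testXML.1.out' );
    open( my $out, '>', $outfile ) or die "cannot open $outfile: $!";
    eval { XMLin($xmlfile) };      # XML::Simple dies on malformed or unparsable input
    print {$out} $@ if $@;         # so an empty .out file means the xml was well-formed
    close $out;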
From: ted p. <tpederse@d.umn.edu> - 2006-08-30 15:31:11
I resolved the XML::Simple test errors by installing XML::SAX::Expat, which is now being used as the default XML parser (rather than XML::SAX::PurePerl, which was being used before). The other option would have been to back off from 0.14 of XML::SAX to 0.12, but moving backwards with versions to fix problems always seems to create more problems down the road, so I didn't really want to do that.

The good news is that we aren't getting those XML::Simple errors any more. The bad news, I think, is that the testXML.n.out files are empty, as they are in talisker:

http://marimba.d.umn.edu/SC-htdocs/user1156951411/

So, what are we hoping to see in these files, and why are they empty? I am guessing that the XML parser previously didn't handle empty files gracefully, and that is why we were seeing those errors.

BTW, talisker is using 0.12 of XML::SAX, so this is consistent with what I describe above. There was apparently a bug in 0.13 of XML::SAX::PurePerl that might still exist in 0.14, based on what we are seeing, so switching over to XML::SAX::Expat seems reasonable, I think. Let me know if that poses a concern of some sort.

Description of the bug in XML::SAX is here:
http://www.cpanforum.com/threads/1473

Thanks!
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
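For reference, a hedged sketch of pinning the parser choice in code rather than relying on whatever XML::SAX registered at install time; $XML::Simple::PREFERRED_PARSER is a documented XML::Simple knob, and the file name here is hypothetical:

    use strict;
    use warnings;
    use XML::Simple;

    # ask XML::Simple to use the Expat-backed SAX parser instead of
    # whatever default is registered (e.g. XML::SAX::PurePerl)
    $XML::Simple::PREFERRED_PARSER = 'XML::SAX::Expat';

    my $ref = XMLin('testXML.1.xml');   # hypothetical file name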
From: ted p. <tpederse@d.umn.edu> - 2006-08-30 13:05:11
Hi Anagha,

I realized I tried to do target word discrimination with a headless file. That seems to cause all sorts of pretty crazy looking errors depending on the options selected, so at some point it would probably be good to check and make sure we have a head word in the data when target word discrimination is selected. The reverse case is no problem; that is, if we have a head word tag in data that is being processed as headless.

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-08-26 17:46:49
We are pleased to announce the release of SenseClusters version 0.95.

SenseClusters is a freely available package that allows you to cluster similar contexts, or to cluster words that occur in similar contexts. It is fully unsupervised, and can automatically discover the optimal number of clusters in your text.

As of version 0.95, we now fully support Latent Semantic Analysis for context and word clustering, and we continue to improve the native SenseClusters methods, which include the ability to cluster first and second order representations of context.

SenseClusters can be downloaded from:

http://senseclusters.sourceforge.net/

You can also try out SenseClusters via our web interface:

http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi

In both native and LSA modes, SenseClusters relies on lexical features (such as unigrams, bigrams, and co-occurrences) that can be identified in raw text. The tokenization is very flexible - a user can define this via Perl regular expressions - so it is possible to work with many other languages besides English, and you can easily work with tokenization schemes other than white-space separated words, such as character based tokens, like 2 letter sequences, etc.

The native SenseClusters methods support traditional first order context clustering, where you identify a feature set and then determine which of those features occur in the contexts you are clustering. The native methods also support second order context clustering, where each word is represented by a vector of the words with which it co-occurs. All the words in a context to be clustered are replaced by their associated vectors, and these vectors are averaged together to represent that context. Note that you can also cluster the word vectors to identify sets of related words.

Latent Semantic Analysis differs from the native SenseClusters methods in that each feature is represented by a vector that shows the contexts in which that feature occurs. Then, all the features in a context to be clustered are replaced by their associated vectors, and these are averaged together to represent the context. Note that you can also cluster the feature vectors directly to identify sets of related features.

This release represents a major step forward in the functionality of SenseClusters. Much of the work in providing LSA support was carried out by Mahesh Joshi this past spring and summer. And as has always been the case over the last two years, Anagha Kulkarni played a large role in this release; she has included many improvements to automated cluster stopping and other areas in 0.95.

Please give this a try, and let us know if you have any comments or questions! If you aren't certain whether your problem can be approached using SenseClusters, please let us know what you would like to do and maybe we can help you get started.

Cordially,
Ted, Anagha, and Mahesh

====================================================================

ChangeLog:
http://www.d.umn.edu/~tpederse/Code/Changelog.SenseClusters-v0.95.txt

Installation Instructions:
http://www.d.umn.edu/~tpederse/Code/SenseClusters-v0.95-INSTALL.txt

Related Publications (includes links to data you can use):
http://www.d.umn.edu/~tpederse/senseclusters-pubs.html

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
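To make the second order averaging described in the announcement concrete, here is a tiny illustrative sketch with toy vectors (not SenseClusters code):

    use strict;
    use warnings;

    # hypothetical word-by-word co-occurrence vectors
    my %wordvec = (
        apple  => [ 1, 0, 2 ],
        orange => [ 0, 1, 2 ],
    );
    my @context = qw(apple orange apple);   # the context to be represented

    my @sum  = (0) x 3;                     # dimension matches the vectors
    my $used = 0;
    for my $w (@context) {
        next unless exists $wordvec{$w};    # words without vectors are skipped
        $sum[$_] += $wordvec{$w}[$_] for 0 .. $#sum;
        $used++;
    }
    my @avg = map { $_ / $used } @sum;      # the order 2 context vector
    printf "context vector: %.2f %.2f %.2f\n", @avg;   # 0.67 0.33 2.00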
From: ted p. <tpederse@d.umn.edu> - 2006-07-15 05:03:06
Greetings all,

I am pleased to report that Anagha has finished her MS thesis, which means she is now officially a Master of Science! :) Congratulations on a job very well done!

Her thesis is entitled "Unsupervised Context Discrimination and Cluster Stopping" and is available from:

http://www.d.umn.edu/~tpederse/senseclusters-pubs.html

or

http://www.d.umn.edu/~tpederse/masters.html

This is the most complete (and best) description of the automatic cluster stopping methods that are now available in SenseClusters. It also contains a great deal of other significant content, including a new and impressive set of experiments on newsgroup data, name conflate data, word sense data, and manually annotated web search data! (All of this data is available at http://www.d.umn.edu/~tpederse/Data/anagha-thesis-data.zip btw.)

So, please do check this out, and also join me in wishing Anagha well as she finishes her work here at UMD and prepares to move on to CMU!!

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-07-12 15:22:30
Greetings all,

I wanted to mention that there will be two SenseClusters related events at AAAI in Boston next week.

First, I will be presenting a tutorial called "Language Independent Methods of Clustering Similar Contexts (with applications)" that will take place on Monday, July 17, from 2-6 pm. This is meant to be a general overview of the methodology that underlies SenseClusters. You can see the material from this tutorial (and previous ones) at:

http://www.d.umn.edu/~tpederse/SCTutorial.html

Second, Anagha Kulkarni will be presenting a poster entitled "How many different "John Smiths", and who are they?", which is all about name discrimination and how we have tackled that with SenseClusters. The poster will be presented on Wednesday evening, July 19, as a part of the demo/poster session. Here is the paper that accompanies the poster:

http://www.d.umn.edu/~tpederse/Pubs/aaai06-anagha-poster.pdf

So, if you are in Boston for AAAI, please do check these out, and stop by and say hi!

Cordially,
Ted and Anagha

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-07-08 05:55:49
We are very pleased to announce the release of SenseClusters version 0.93. This version marks our first steps towards supporting Latent Semantic Analysis in addition to our native SenseClusters methods.

In this version we now support word clustering (feature clustering really, as it is not limited to just unigrams or single words) that is based on a feature by context representation. In other words, features are clustered based on the contexts in which they occur. These matrices can optionally be reduced with SVD prior to clustering. We refer to this as LSA feature clustering. These feature by context representations are what we believe characterizes LSA, and what makes it different from our native SenseClusters methods. We have supported a form of word clustering prior to this release, and it is based on a word by word representation; that is, words are clustered based on the words with which they occur.

You can download version 0.93 from sourceforge:

http://sourceforge.net/projects/senseclusters/

As a preview, in version 0.95 we will have support for doing context discrimination "the LSA way". The features found in the contexts to be discriminated will be represented by vectors that show which contexts those features occur in, thus providing a second way of doing order 2 representations. At present our native SenseClusters order 2 methodology is based on replacing the words in the contexts to be clustered with vectors showing the words with which they occur.

There are some other significant changes in version 0.93, among them that SenseClusters now requires the use of Perl 5.8.5 or better. The most current version of Perl is now 5.8.8, and 5.8.5 is several years old, so it is probably time to upgrade anyway if you are running something less than 5.8.5.

Also, we have attempted to clarify the installation instructions further. We will continue to work on that in 0.95, hopefully making SenseClusters much easier to install. We think the instructions are quite a bit better now, so please check them out:

http://www.d.umn.edu/~tpederse/Code/SenseClusters-v0.93-INSTALL.txt

The more detailed ChangeLog for 0.93 can be found here:

http://www.d.umn.edu/~tpederse/Code/Changelog.SenseClusters-v0.93.txt

Please let us know if there are any questions, and please do plan on upgrading to 0.93, or try it out on the web interface:

http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi

We would be happy to answer any questions or receive any comments you might have.

Enjoy,
Ted, Mahesh, and Anagha

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
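A compact way to state the distinction, under the usual matrix conventions (a paraphrase, not text from the release note): if $A \in \mathbb{R}^{c \times f}$ is the context-by-feature matrix, the pre-0.93 word clustering works from a word-by-word co-occurrence matrix, while LSA feature clustering clusters the rows of $A^{T}$, so each feature is described by the vector of contexts it occurs in, optionally after smoothing $A^{T}$ with a rank-k SVD, $A^{T} \approx U_k \Sigma_k V_k^{T}$.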
From: ted p. <tpederse@d.umn.edu> - 2006-07-06 03:18:36
The following is from Anagha, in a note of June 13, 2006 that included other material and a different subject header. I thought I would resend this portion of the note with a new subject, so that it would be easier to find in the future, and it might be useful now as we contemplate 0.93. Note that even though we aren't adding any new scripts to the toolkit, most of the below is still relevant, I think.

-------------------------------------------------------------------------

With respect to the ripple effect, whenever we add a new script to SenseClusters (more specifically to Toolkit) I typically do the following things (some of the points below are obvious but I went ahead and included them anyway) - let me know if you find anything that should be in this list but is not:

1. if applicable, update Docs/Flows/flowchart.*
2. add a FILE.html documentation file to Docs/HTML/Toolkit_Docs/DIR
3. update Docs/HTML/SenseClusters-Code-README.* to link the html file added in 2. above
4. update Docs/HTML/discriminate.html
5. copy the new Docs/HTML/discriminate.html as Web/SC-htdocs/help.html
6. update Makefile.PL
7. create a new folder under Testing/ for the new script and add test cases
8. modify the web-interface
9. update the Changes/Changelog-v*.txt

Thanks,
Anagha
From: ted p. <tpederse@d.umn.edu> - 2006-07-02 18:25:40
Thanks Anagha, I appreciate your comments on this, and I am glad it is sounding like a reasonable design.

> I have a very minor point about the mode of the resulting stoplist -
> maybe we can have the "OR" mode as the default and provide an option
> for the user to set the mode - then one can use the stoplist that comes
> out of this program as is.

Agreed. I think the idea is to have something that can be used immediately as an NSP stop list. So... we might also want to consider having an option that would allow a user to indicate if they want a case sensitive list or not. In other words, /\bin\b/ versus /\b[Ii]n\b/ or maybe even /\b[Ii][Nn]\b/, remembering that NSP does not support the use of the /in/i directive. I think this is important actually, so I would suggest we default to case sensitive, and let the user turn that off with a flag like --caseinsensitive if they wish...

Now, this does introduce the issue of possible duplicates in the stoplist: if we find In and in as stopwords, and then ask for the list to be case-insensitive, we will end up with two equivalent entries. I do not think checking for duplicates will be too difficult though.

> Another point - a speculative one - although this script would not
> support the GigaWord format, if we were to use such files as input to
> this program with --inpformat as plain text, then we should expect all
> the meta-tokens along with the function words in the generated
> stoplist. Maybe we can use this as a test case for the script.

Agreed. Good idea! And in fact, I think the GigaWord corpus raises a few other interesting issues. For example, we could use --nontoken with a regex like /\<.*\>/ in order to disregard all of the meta characters from the stoplist. Now, there is still some content surrounded by the metacharacters that would be included (like title perhaps), but I think that is ok.

I was tempted in fact to include support for the GigaWord format, but I realized that most of the articles are about the same size, I think, and if we used the plain format, along with --nontoken for metacharacters, and a context size of approximately 200, we could probably get a pretty good stoplist. Now, you raise a nice point above, in that all these metacharacters will show up as stop words, so maybe we don't even need to worry about --nontoken.

The great thing about stoplist generation, I think, is that it's ok for it to be a fairly noisy process. Stop words should stand out blatantly in corpora as you look for them, and those that are on the borderline are best left as real words. So the fact that using --nontoken and an assumed context size might miss some stop words seems ok to me; we would rather err on the side of missing. I do not think there is any way the approach I describe above would err on the side of including too many stop words. But it will be great fun to experiment with.

Further thoughts and comments are of course welcome. Stoplist creation is an interesting and important issue I think, and one that is badly neglected. We all just download the SMART list and use that. :)

Thanks,
Ted
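A minimal sketch of the case handling discussed above (a hypothetical helper, not part of SenseClusters or NSP): since NSP does not honor /i, each letter is expanded into a [Xx] character class, and equivalent entries such as "In" and "in" collapse to one:

    use strict;
    use warnings;

    my @stopwords        = qw(In in The of);   # hypothetical input list
    my $case_insensitive = 1;                  # analogous to a --caseinsensitive flag

    my %seen;
    for my $word (@stopwords) {
        my $pattern = $word;
        if ($case_insensitive) {
            # expand each letter into a class, e.g. in -> [Ii][Nn]
            $pattern = join '',
                map { /[a-zA-Z]/ ? '[' . uc($_) . lc($_) . ']' : $_ }
                split //, $word;
        }
        my $regex = "/\\b$pattern\\b/";
        print "$regex\n" unless $seen{$regex}++;   # skip duplicate entries
    }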
From: Anagha K. <kulka020@d.umn.edu> - 2006-07-02 16:36:35
Hi Ted,

This looks like a good design!

I have a very minor point about the mode of the resulting stoplist - maybe we can have the "OR" mode as the default and provide an option for the user to set the mode - then one can use the stoplist that comes out of this program as is.

Another point - a speculative one - although this script would not support the GigaWord format, if we were to use such files as input to this program with --inpformat as plain text, then we should expect all the meta-tokens along with the function words in the generated stoplist. Maybe we can use this as a test case for the script.

Thanks,
Anagha

ted pedersen wrote:
> Here are some thoughts on the design of a stoplist generating script.
> [...]
From: ted p. <tpederse@d.umn.edu> - 2006-07-02 15:16:33
Here are some thoughts on the design of a stoplist generating script. At this point, as a practical matter, I think it should be a stand-alone script dedicated to stoplist generation. In theory one might incorporate this with count.pl or put it within SenseClusters somehow, but those are somewhat more time consuming options.

I am also quite convinced that these stoplists will be very useful indeed, and will result in better performance for SenseClusters and perhaps the vector measure in WordNet-Similarity. We have previously seen in both cases a great impact on overall results as the stoplist is adjusted.

So, in some respects I think this should be somewhat like nameconflate, in that it should handle two different formats of text, and should be able to control the size of the context it is working with, at least in the case of plain text.

The goal of this program is to take as input either plain text, or text formatted in the senseval2 format. The output would be an NSP compatible stoplist based on tf.idf. I also think we need a trace mode (so to speak) that shows the tf.idf, tf and idf values (so that a user can see the actual values if they are unsure of what is happening).

Now, clearly we don't really have documents here, so we need to redefine that a little.

If the text is senseval2 formatted, then each context defines a "document" for purposes of computing tf.idf. If the text is plain, then the user must input a value that defines the size of their context. I would suggest a default of 100 tokens. The idea would be that the program would chop the input text into blocks of 100 tokens, and consider those to be documents. The user could reset this size of course, based on whatever they think might be most useful. Also, if the text is plain, I think we should allow the user the option of saying that each line of plain text constitutes a context.

Now, above I mention tokens and not words, which implies that the program must support tokenization in the NSP style, which means supporting --token and --nontoken. This is important for supporting other languages, and then for controlling things like whether or not numeric values should be included in the stop list (they could be removed via --nontoken).

There are quite a few variations on how to compute tf.idf, and I think we probably ought to just pick a standard version. What is described here: http://www.answers.com/topic/tf-idf-1 strikes me as a pretty reasonable formulation. We could use that, and of course describe it in the perldoc.

To summarize, here are the options that I think need to be supported...

  --inpformat FMT       The format of the input file(s).
                        FMT = plain (default) / sval2

  --linecontext         Only valid for plain mode; one line per context.
                        Do not use with --contextsize or sval2.

  --token               Same as NSP.

  --nontoken            Same as NSP.

  --contextsize WINDOW  How large contexts/documents are (only valid
                        for plain text; default 100 tokens).

  --score REAL          The tf.idf score that acts as a cutoff for
                        stopwords. Should be set to some default; I am
                        not sure what this should be.

  --trace               Display the tf, idf, and tf.idf values for
                        each token that has tf.idf above --score.

What do we think? Is anything missing or misguided in the above?

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
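To make the scoring concrete, here is a small self-contained sketch of one standard tf.idf formulation, tf.idf(t,d) = tf(t,d) * log(N/df(t)). The tokenization and variable names are illustrative only, not a SenseClusters design decision:

    use strict;
    use warnings;

    # each context block is treated as one "document"
    my @documents = (
        "the cat sat on the mat",
        "the dog sat on the log",
    );

    my ( %df, @tf );
    for my $doc (@documents) {
        my %counts;
        $counts{$_}++ for split ' ', $doc;   # naive whitespace tokens
        $df{$_}++ for keys %counts;          # document frequency
        push @tf, \%counts;
    }

    my $N = scalar @documents;
    for my $i ( 0 .. $#documents ) {
        for my $t ( sort keys %{ $tf[$i] } ) {
            my $idf = log( $N / $df{$t} );   # 0 for tokens in every document
            printf "doc %d  %-5s tf=%d idf=%.3f tf.idf=%.3f\n",
                $i, $t, $tf[$i]{$t}, $idf, $tf[$i]{$t} * $idf;
        }
    }

Note how "the", which occurs in every document, gets an idf of 0 and thus a tf.idf of 0 - the signature that a stoplist generator would look for.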
From: ted p. <tpederse@d.umn.edu> - 2006-06-27 02:03:25
Hi Anagha and Mahesh,

As we contemplate the incorporation of Latent Semantic Analysis into SenseClusters, there is in fact a rather difficult naming issue we must deal with. SenseClusters right now refers rather generically to headed and headless clustering, and word clustering. Now, when we include Latent Semantic Analysis, we probably want to view that as being a part of SenseClusters, which is the name of the package really, but not the methodology. In fact, this points out that we don't really have a name for the methodology! And indeed, we should be able to do headed, headless, and word clustering with Latent Semantic Analysis, or with the (unnamed) SenseClusters methodology.

For the first page of the web interface, I am imagining a layout something like this (note a small bit of rewriting that we should think about):

  SenseClusters Web Interface
  Clusters contexts based on their similarity (?)

  (unnamed SenseClusters methodology)
    target word (headed) clustering (e.g., word sense discrimination)
    headless clustering (e.g., email categorization)
    word clustering (e.g., synonym finding)

  Latent Semantic Analysis
    target word (headed) clustering
    headless clustering
    feature clustering

[We will of course add this functionality in two stages; the first stage (0.93) will add the feature clustering for LSA, and then the next stage will add the headed and headless clustering.]

Now, the essential difference, I think, is in whether or not we are dealing with feature by context representations (LSA) or context by feature representations (SC). But while that is a good explanation, it doesn't lead to a colorful or interesting name. :)

Unfortunately, the terms first order and second order representation become a bit ambiguous too, since we will have second order LSA (where features are replaced by a feature by context vector). Now, I guess we will not have a first order LSA version of target word clustering, so perhaps first order refers "uniquely" to one of our methods. But second order and word clustering do not...

So, this is something to think about. :) Please note that the design of the first page above makes the difference between LSA and our methods seem rather stark, when in fact they are quite closely related. However, I think it is best to keep them stark like this, especially since LSA has such high name recognition. That said, if you think there is another organization of the main page that makes more sense, and still makes the availability of LSA clear to the casual user, then I am very interested.

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-06-18 00:29:17
We are pleased to announce the release of SenseClusters version 0.91. This release includes a number of significant improvements to our web interface, and hopefully simplifies the setup of the web interface if you would like to run your own version of it. You can download this new version of SenseClusters at:

http://sourceforge.net/projects/senseclusters/

BTW, please note that you do not need to install the Web interface if you don't want to; ours is always available at:

http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi

The main change to the web interface visible to users is that it now provides plots (as pdf files) that illustrate the cluster stopping decision making process, showing essentially the change in the criterion function values and where our different measures elect to stop clustering.

Also note that we have cleaned up our FAQ a little bit, and would welcome new questions to include in it.

You can find the more detailed ChangeLog below. Please let us know if you have any questions or comments!

Enjoy,
Ted and Anagha

==================================================================

Changes made in Sense-Clusters version 0.89 during version 0.91

Ted Pedersen tpederse@d.umn.edu
Anagha Kulkarni kulka020@d.umn.edu

1. Added config.txt under the SC-cgi dir; the settings for PATH, PERL5LIB, the complete paths to SC-cgi and SC-htdocs, and the name of the cgi dir are now read by second.cgi, fifth.cgi and callwrap.pl from this single file. - Anagha

2. Modified fourth.cgi to include the missing case for the --cluststop "gap" option setting. - Anagha

3. Included the plot generation scripts under the SC-cgi dir and updated callwrap.pl accordingly. - Anagha

4. Modified /Web/README.Web.pod to indicate the following pre-requisites for the plot generation: gnuplot, latex and ps2pdf. - Anagha

5. Updated /Web/README.Web.pod for the new config.txt related changes. - Anagha

6. Updated Docs/FAQs.pod. - Anagha

7. Added FAQs.html to the Docs/HTML dir. - Anagha

(Changelog-v0.89to0.91 Last Updated on 06/16/2006 by Anagha)

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: Anagha K. <kulka020@d.umn.edu> - 2006-06-14 16:12:44
Hi Ted,

> Is it true that word clustering only allows input as test data? If so,
> this does not allow one natural thing, and that would be to find word
> clusters in plain text (training data that is not in senseval-2 format).

Yes, currently word clustering accepts senseval2 formatted test data only.

Thanks,
Anagha
From: ted p. <tpederse@d.umn.edu> - 2006-06-14 00:08:19
Hi Anagha and Mahesh,

Very good. Thanks to both of you for your comments. So I think I am ready to say that we should go ahead and use the --lsa convention that has been previously described. I was thinking about this some more today, and really did not find an option or set of options that I liked better. I also do not want to change --wordclust, for the reasons Mahesh has mentioned. I think the distinction between word clustering and context clustering is fairly intuitive, and we can point out that our word clustering is in fact more generic than that, and can be considered feature clustering. I do not think that will be too confusing.

Thanks for the interesting discussion. I think we have made some good decisions here, although if there are problems that we did not expect because of this, let's raise them immediately. I will start to compose a note or two to the users list, describing our plans.

Thanks,
Ted

On Tue, 13 Jun 2006, Mahesh Joshi wrote:

> Hi Ted,
>
> I too think the idea of having a "--lsa" option is better. It does
> give a convenient switch internally for programming purposes (rather
> than multiple option values to handle) and also maintains the
> backwards compatibility, which would have been a concern otherwise.
>
> As you mention, let us stick to "--wordclust" as the option name for
> feature clustering for now, with the understanding that it also
> provides feature clustering, and we will have explicit and visible
> documentation mentioning the same (for the option itself, in the
> CHANGELOG and any other places). This has the further advantage of
> maintaining absolute backwards compatibility (not even renaming the
> option).
>
> I do understand that training data does not make sense for feature
> clustering, however I am not sure about the headed/headless issue -
> so I will not comment on that for now.
>
> Thanks,
> Mahesh
From: ted p. <tpederse@d.umn.edu> - 2006-06-14 00:03:30
Hi Anagha,

> Went back and looked at our correspondence regarding this issue of
> performing word clustering only with headless data, and more or less
> the summary is that we did not want to restrict word clustering to
> finding words similar to some specific target word, but wanted to
> cluster as many open-class words as possible into sets of related
> words.
>
> So I would like to take back what I had suggested regarding feature
> clustering and the type of data. Thus I think we should carry the
> restriction of using only headless data with word clustering forward
> to feature clustering too.

Very good, thanks for this clarification. I agree. I think "word" clustering should mean taking all the features that are found in a given set of test data and clustering those. So the input to word clustering should be headless (no target words) contexts formatted in the senseval-2 format.

Is it true that word clustering only allows input as test data? If so, this does not allow one natural thing, and that would be to find word clusters in plain text (training data that is not in senseval-2 format). I do not think this is a huge problem, and I am not too worried about fixing this now, but it is something we want to be clear about.

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse