senseclusters-developers Mailing List for SenseClusters (Page 3)
From: Anagha K. <kulka020@d.umn.edu> - 2006-06-13 18:05:58
Hi Ted,

> Also, we should think about --space similarity. Are there any issues
> associated with that we should be aware of? In my mind we should still
> have the ability to create similarity spaces as we do now, since
> I think the similarity matrices are created *after* the context
> representation is created. But, we should of course check that and make
> sure everything will continue to work (and that simat and bitsimat
> will work on the results of our --lsa mode).

I was pretty sure that the flow of control in similarity space is as you have described above: creation of context/word vectors -> creation of the similarity matrix -> clustering in similarity space. However, I went ahead and verified this, and it is as we expected it to be. So in short, we are fine!

Thanks,
Anagha
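[To make the order of those steps concrete, here is a minimal Perl sketch of the similarity-matrix stage: pairwise cosine similarity computed over context vectors that have already been built. It only illustrates the flow Anagha describes; it is not the actual simat or bitsimat code, and the toy vectors are invented.]

    use strict;
    use warnings;

    # cosine similarity between two dense vectors
    sub cosine {
        my ( $u, $v ) = @_;
        my ( $dot, $nu, $nv ) = ( 0, 0, 0 );
        for my $i ( 0 .. $#$u ) {
            $dot += $u->[$i] * $v->[$i];
            $nu  += $u->[$i]**2;
            $nv  += $v->[$i]**2;
        }
        return $nu && $nv ? $dot / sqrt( $nu * $nv ) : 0;
    }

    # toy context vectors, as produced by the representation step
    my @contexts = ( [ 1, 0, 2 ], [ 0, 1, 2 ], [ 1, 1, 0 ] );

    # the similarity matrix is built only after the vectors exist
    for my $a ( 0 .. $#contexts ) {
        for my $b ( 0 .. $#contexts ) {
            printf "%5.3f ", cosine( $contexts[$a], $contexts[$b] );
        }
        print "\n";
    }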
From: Mahesh J. <joshi031@d.umn.edu> - 2006-06-13 17:55:38
Hi Ted,

I too think the idea of having a "--lsa" option is better. It gives a convenient switch internally for programming purposes (rather than multiple option values to handle) and also maintains backwards compatibility, which would otherwise have been a concern.

As you mention, let us stick to "--wordclust" as the option name for feature clustering for now, with the understanding that it also provides feature clustering, and we will have explicit and visible documentation saying so (for the option itself, in the CHANGELOG, and any other places). This has the further advantage of maintaining absolute backwards compatibility (not even renaming the option).

I do understand that training data does not make sense for feature clustering; however, I am not sure about the headed/headless issue, so I will not comment on that for now.

Thanks,
Mahesh

On Tuesday, Jun 13, 2006, at 9:42 AM, ted pedersen wrote:

> Hi Anagha,
>
> Thanks for your comments and suggestions.
>
>> I like your second idea of adding a new option "--lsa" - it looks
>> cleaner.
>
> Yes, I found myself liking the fact that it makes the lsa connection
> explicit, which I think will help avoid option overload.
>
>> For the issue with using the option name "--wordclust" for both
>> word-clustering and feature-clustering - maybe you could use something
>> more generic like "--termclust" ?
>
> Mahesh and I discussed --termclust a little, but I was not crazy about
> the idea because "term" has a specific meaning, and I don't think it
> will include all of our different bigrams or co-occurrences, for example.
>
> One option would be the more accurate --featclust, which would imply
> feature clustering. This is perhaps a better option than --wordclust,
> which really clearly says/means "word" clustering, and while that is
> what we support now, in future what we support will be more generic...
>
> Of course, we might also want to be consistent with respect to how we
> specify context clustering. We simply say
>
> --context o1
> --context o2
>
> That is actually quite nice I think, as it is clear and relatively
> clean. Unfortunately an option like
>
> --feature
> --feature --lsa
>
> is too vague and it's sort of confusing. Mahesh and I had talked about
> the idea of an option like
>
> --rowclustering
>
> instead of --wordclust, but there are some options for svd and cluto
> that start with --row, so I'm a little concerned about overloading that.
>
> In some respects I would like to find an alternative to wordclust,
> which is both a little awkward and also going to be inaccurate. Ideally
> it would be somewhat "symmetric" to the --context option...
>
> For now, I think I prefer --wordclust to --featclust and --termclust,
> but I am not sure that I am convinced it is the best possible option
> name...
>
> I will admit that I am growing relatively fond of the --lsa convention,
> but am still open to other ideas.
>
>> As to how the current restrictions will translate to the new "lsa"
>> mode - i think, headed or headless either type of data should be fine.
>> But the restriction on no-training data would persist, i think.
>
> Just to clarify that, I think for word clustering we still do not want
> to allow training data (it doesn't really make sense), but for context
> clustering it should of course be ok to have training data.
>
> For word clustering, I am not sure about the issue of headed or
> headless data. Right now we only allow headless data, I think. So
> perhaps we would want to retain that distinction?
>
> Thanks,
> Ted
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
From: Anagha K. <kulka020@d.umn.edu> - 2006-06-13 17:50:21
> Just to clarify that, I think for word clustering we still do not want to
> allow training data (it doesn't really make sense), but for context
> clustering it should of course be ok to have training data.

Yes, I agree.

> For word clustering, I am not sure about the issue of headed or headless
> data. Right now we only allow headless data, I think. So perhaps we would
> want to retain that distinction?

I went back and looked at our correspondence regarding this issue of performing word clustering only with headless data, and more or less the summary is that we did not want to restrict word clustering to finding words similar to some specific target word, but wanted to cluster as many open-class words as possible into sets of related words.

So I would like to take back what I had suggested regarding feature clustering and the type of data. I think we should carry the restriction of using only headless data with word clustering forward to feature clustering too.

Thanks,
Anagha
From: ted p. <tpederse@d.umn.edu> - 2006-06-13 15:09:20
Hi Anagha and Mahesh,

Just a few thoughts here on the --lsa option. I think when we use that option we are saying two things...

1) represent features with respect to the contexts in which they occur. This will require the use of order1vec, which will figure out which contexts include the feature, and produce a context by feature matrix.

2) transpose that context by feature matrix created in 1).

Now, 1) is a little confusing since when we use --context o2 --lsa we are asking for an order 2 context representation, but we will create it using order1vec. We will create a context by feature representation with order1vec, transpose it, and then use the resulting feature by context representation as input to order2vec to build the representation of the context vectors to be clustered. Actually, that isn't so confusing...

If we do --wordclust --lsa then we are simply saying create a context by feature representation, again with order1vec, then transpose that, and take the resulting feature by context matrix and cluster that.

Note that in both of the above cases we should be able to use svd after the transpose step.

Also, we should think about --space similarity. Are there any issues associated with that we should be aware of? In my mind we should still have the ability to create similarity spaces as we do now, since I think the similarity matrices are created *after* the context representation is created. But, we should of course check that and make sure everything will continue to work (and that simat and bitsimat will work on the results of our --lsa mode).

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
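[The transpose in step 2) is the pivotal operation here. Below is a minimal Perl sketch of that step alone, reading a small dense context by feature matrix (one row per context, whitespace-separated values) and printing the feature by context transpose. The input format is assumed for illustration; order1vec's real output format is handled by the package itself.]

    #!/usr/bin/perl -w
    use strict;

    # slurp the context by feature matrix, one context per row
    my @matrix;
    while (<>) {
        chomp;
        push @matrix, [ split ];
    }

    my $rows = scalar @matrix;
    my $cols = $rows ? scalar @{ $matrix[0] } : 0;

    # print the feature by context transpose, one feature per row
    for my $j ( 0 .. $cols - 1 ) {
        print join( ' ', map { $matrix[$_][$j] } 0 .. $rows - 1 ), "\n";
    }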
From: ted p. <tpederse@d.umn.edu> - 2006-06-13 14:45:07
Hi Anagha,

Thanks for your comments and suggestions.

> I like your second idea of adding a new option "--lsa" - it looks cleaner.

Yes, I found myself liking the fact that it makes the lsa connection explicit, which I think will help avoid option overload.

> For the issue with using the option name "--wordclust" for both
> word-clustering and feature-clustering - maybe you could use something
> more generic like "--termclust" ?

Mahesh and I discussed --termclust a little, but I was not crazy about the idea because "term" has a specific meaning, and I don't think it will include all of our different bigrams or co-occurrences, for example.

One option would be the more accurate --featclust, which would imply feature clustering. This is perhaps a better option than --wordclust, which really clearly says/means "word" clustering, and while that is what we support now, in future what we support will be more generic...

Of course, we might also want to be consistent with respect to how we specify context clustering. We simply say

--context o1
--context o2

That is actually quite nice I think, as it is clear and relatively clean. Unfortunately an option like

--feature
--feature --lsa

is too vague and it's sort of confusing. Mahesh and I had talked about the idea of an option like

--rowclustering

instead of --wordclust, but there are some options for svd and cluto that start with --row, so I'm a little concerned about overloading that.

In some respects I would like to find an alternative to wordclust, which is both a little awkward and also going to be inaccurate. Ideally it would be somewhat "symmetric" to the --context option...

For now, I think I prefer --wordclust to --featclust and --termclust, but I am not sure that I am convinced it is the best possible option name...

I will admit that I am growing relatively fond of the --lsa convention, but am still open to other ideas.

> As to how the current restrictions will translate to the new "lsa" mode
> - i think, headed or headless either type of data should be fine. But
> the restriction on no-training data would persist, i think.

Just to clarify that, I think for word clustering we still do not want to allow training data (it doesn't really make sense), but for context clustering it should of course be ok to have training data.

For word clustering, I am not sure about the issue of headed or headless data. Right now we only allow headless data, I think. So perhaps we would want to retain that distinction?

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-06-13 13:37:18
On Tue, 13 Jun 2006, Anagha Kulkarni wrote:

Thanks Anagha,

> I would like to clarify just one point - if a user requests feature
> clustering and svd then svd will be applied to the transposed matrix
> (feature by context) and not to the context by feature matrix, right?

Yes. SVD should always be done on the representation that we are going to cluster. So we should be able to do SVD with word/feature clustering and context clustering.

So, we should be able to do the following... btw, while I am not sure if the --lsa option is the best way to go, I do find it convenient as a shorthand.

--context o1
--context o1 --svd
--context o2
--context o2 --svd
--context o2 --lsa
--context o2 --lsa --svd

--wordclust
--wordclust --svd
--wordclust --lsa
--wordclust --lsa --svd

So this does suggest a slight possible confusion with the --lsa option, in that it does not imply svd is being used; svd must still be requested. That is ok, I think. So in effect, --lsa means that we want the feature by context representation, and we may optionally apply svd to that.

> With respect to the ripple effect, whenever we add a new script to
> SenseClusters (more specifically to Toolkit) I typically do the
> following things (some of the points below are obvious but I went ahead
> and included them anyway) - let me know if you find anything that
> should be in this list but is not:
> 1. if applicable, update Docs/Flows/flowchart.*

Yes, the --lsa changes will require flowchart updates.

> 2. add FILE.html documentation file to Docs/HTML/Toolkit_Docs/DIR
> 3. update Docs/HTML/SenseClusters-Code-README.* to link the html file
>    added in 2. above
> 4. update Docs/HTML/discriminate.html
> 5. copy the new Docs/HTML/discriminate.html as Web/SC-htdocs/help.html
> 6. update Makefile.PL
> 7. create a new folder under Testing/ for the new script and add
>    test-cases
> 8. modify the web-interface
> 9. update the Changes/Changelog-v*.txt

In addition, in this case I think the overall documentation of the package (the main README) will need some revising, to reflect the fact that we are now supporting LSA and that we have added a new sort of representation to the package. It is probably time to revisit our overall documentation anyway, so this can be a part of that.

Thanks!
Ted
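[Those combinations suggest a simple validity check at option-parsing time. The Perl sketch below shows one hypothetical way to enforce them with Getopt::Long; it is not the actual discriminate.pl code, and the option names are simply the ones proposed in this thread.]

    use strict;
    use warnings;
    use Getopt::Long;

    my ( $context, $wordclust, $lsa, $svd );
    GetOptions(
        'context=s' => \$context,      # o1 or o2
        'wordclust' => \$wordclust,
        'lsa'       => \$lsa,
        'svd'       => \$svd,
    ) or die "could not parse options\n";

    # --lsa is only valid with --context o2 or --wordclust, and it
    # does not imply svd; --svd must be requested separately
    if ( $lsa
        and not( $wordclust or ( defined $context and $context eq 'o2' ) ) )
    {
        die "--lsa requires --context o2 or --wordclust\n";
    }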
From: Anagha K. <kulka020@d.umn.edu> - 2006-06-13 07:36:57
Hi Ted,

I like your second idea of adding a new option "--lsa" - it looks cleaner.

For the issue with using the option name "--wordclust" for both word-clustering and feature-clustering - maybe you could use something more generic like "--termclust" ?

As to how the current restrictions will translate to the new "lsa" mode - I think headed or headless, either type of data, should be fine. But the restriction on no training data would persist, I think.

Thanks,
Anagha

ted pedersen wrote:

> Hi Mahesh,
>
> We have been discussing the naming conventions and terminology that
> we should use for "word clustering" versus context clustering, and how
> in general lsa support should be incorporated into discriminate.pl
>
> One important point that we've made is that order1 and order2 only apply
> to context clustering. order1 refers to representing a context with
> a vector that shows the features that occur in that context, and order2
> refers to representing a context with a vector that is an average of
> the vectors that represent the words or features in the contexts.
>
> Now, with our support for feature by context representation that is in
> the works, we will introduce a new type of order2 representation. Rather
> than representing words in the contexts to be clustered with vectors
> consisting of other words (the co-occurrences of the words) we will be
> able to represent the contexts to be clustered by averaging together
> vectors of features that represent the contexts in which those features
> occur. So we will have a word by word representation (current o2) and a
> feature by context representation (new order 2).
>
> Right now we have in discriminate.pl the option
>
> --context o1
> or
> --context o2
>
> we need something that indicates our new order 2, that is the one that
> uses the feature by context vectors to represent the context to be
> clustered.
>
> One idea might be to simply create a new value for context, like...
>
> --context o2_lsa
>
> Another idea would be to create a new "switch" that would turn on "lsa"
> style processing, which would mean rather than using a word by word
> representation, we would use feature by context...
>
> --context o2 --lsa
>
> The idea here would be that the --lsa switch could also be applied to
> our --wordclust option, to essentially change the word clustering option
> from word by word to feature by context (and thereby cluster features).
>
> --wordclust --lsa
>
> This plan of attack *might* have the benefit of minimizing the changes
> required in discriminate.pl, but I am not sure of that. The possible
> drawback is that --wordclust means "word clustering" and --wordclust
> --lsa actually means feature clustering rather than word clustering...
>
> Now, the advantage of this is that --lsa makes it very clear where we
> are using lsa and where we are not, and I think that is a good thing,
> since I want that to be clear when we introduce this functionality.
>
> So these would be the main "modes" of operation in SenseClusters after
> the inclusion of the LSA support.
>
> --context o1
> --context o2
> --context o2 --lsa
> --wordclust
> --wordclust --lsa
>
> The --lsa option would only be allowed with --context and --wordclust,
> and it would not be valid with --context o1.
>
> BTW, --wordclust should only allow for headless data, and it should not
> be possible to use training data. These are the current restrictions,
> and I think they remain valid for --lsa mode.
>
> So this is one idea that I was kicking around. There are more, but I
> wanted to get the discussion started sooner than later.
>
> Any drawbacks to the above that are apparent?
>
> Thanks,
> Ted
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-06-13 02:46:43
Hi Mahesh,

We have been discussing the naming conventions and terminology that we should use for "word clustering" versus context clustering, and how in general lsa support should be incorporated into discriminate.pl

One important point that we've made is that order1 and order2 only apply to context clustering. order1 refers to representing a context with a vector that shows the features that occur in that context, and order2 refers to representing a context with a vector that is an average of the vectors that represent the words or features in the contexts.

Now, with our support for feature by context representation that is in the works, we will introduce a new type of order2 representation. Rather than representing words in the contexts to be clustered with vectors consisting of other words (the co-occurrences of the words) we will be able to represent the contexts to be clustered by averaging together vectors of features that represent the contexts in which those features occur. So we will have a word by word representation (current o2) and a feature by context representation (new order 2).

Right now we have in discriminate.pl the option

--context o1
or
--context o2

we need something that indicates our new order 2, that is the one that uses the feature by context vectors to represent the context to be clustered.

One idea might be to simply create a new value for context, like...

--context o2_lsa

Another idea would be to create a new "switch" that would turn on "lsa" style processing, which would mean rather than using a word by word representation, we would use feature by context...

--context o2 --lsa

The idea here would be that the --lsa switch could also be applied to our --wordclust option, to essentially change the word clustering option from word by word to feature by context (and thereby cluster features).

--wordclust --lsa

This plan of attack *might* have the benefit of minimizing the changes required in discriminate.pl, but I am not sure of that. The possible drawback is that --wordclust means "word clustering" and --wordclust --lsa actually means feature clustering rather than word clustering...

Now, the advantage of this is that --lsa makes it very clear where we are using lsa and where we are not, and I think that is a good thing, since I want that to be clear when we introduce this functionality.

So these would be the main "modes" of operation in SenseClusters after the inclusion of the LSA support.

--context o1
--context o2
--context o2 --lsa
--wordclust
--wordclust --lsa

The --lsa option would only be allowed with --context and --wordclust, and it would not be valid with --context o1.

BTW, --wordclust should only allow for headless data, and it should not be possible to use training data. These are the current restrictions, and I think they remain valid for --lsa mode.

So this is one idea that I was kicking around. There are more, but I wanted to get the discussion started sooner than later.

Any drawbacks to the above that are apparent?

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
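[To pin down what the new order 2 means, here is a toy Perl computation of it: a context is represented by averaging the feature by context vectors of the features it contains. The feature names and vectors are invented for illustration; this is a sketch of the idea, not of order2vec itself.]

    use strict;
    use warnings;

    # toy feature by context vectors: each feature maps to a profile
    # over four contexts in the feature-selection data
    my %feature_vec = (
        stocks => [ 1, 0, 1, 0 ],
        bonds  => [ 1, 1, 0, 0 ],
        river  => [ 0, 0, 1, 1 ],
    );

    # the new order 2: average the vectors of the features found in
    # the context being represented
    my @features_in_context = qw(stocks bonds);
    my @avg = (0) x 4;
    for my $f (@features_in_context) {
        $avg[$_] += $feature_vec{$f}[$_] for 0 .. 3;
    }
    $_ /= @features_in_context for @avg;

    print "@avg\n";    # the order-2 representation of this context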
From: ted p. <tpe...@ma...> - 2006-06-02 03:18:52
Hi Mahesh,

I just wanted to record a few thoughts and ideas from our meeting of Wed May 31 before they slipped away from me. I have cc'd Anagha on this since we get into some issues relating to SenseClusters changes and versions, and I want to make sure we don't clash. I also copied this to the SenseClusters developers list, as that provides a nice archiving mechanism.

We discussed a plan of action that orders things more or less like this:

1) vector/matrix CPAN module, to perform transpose of order1vec output (which is context by feature).

2) perform feature clustering based on transposed output of order1vec. This represents a second kind of word clustering; the one we currently support (via --wordclust) uses a word by word representation.

In thinking about both of these methods of word clustering, in some sense they both represent first order methods. The word by word representation clusters words based on the words they co-occur with, and the feature by context representation clusters features based on the contexts in which they occur. So one is a word co-occurrence based method (--wordclust, the existing method), while the new method is more a context co-occurrence technique (features will be clustered together if they occur in similar contexts). It is not yet clear to me how to exactly articulate or phrase this, but I don't think calling them first and second order word clusters is quite right.

We discussed whether or not we should extend the word by word matrices now used to create 2nd order representations to be feature by feature. We decided that this was probably not too essential at this point, in that if someone really wanted to cluster features they could use our new feature by context representation.

After 1 and 2 are completed, we will release a new version of SenseClusters that includes the new word clustering method. This new version should include support in discriminate.pl for this, test cases, and support in the web interface.

The ordering of the points below is a little less clear. 1 and 2 are clearly sequential; we need to think a little more about the points below before ordering, I think.

3) add support for "feature matching" to order2vec.pl; currently it just matches words (unigrams) in the context with those in the word by word matrix. This will allow us to create second order representations of contexts where features are replaced with a vector of contexts. These vectors would of course be created by point 2 above.

So we will have two ways to create second order representations. The first is what we now provide, where words are replaced with vectors of word co-occurrences. The second (the new way) will be to replace features with vectors of context co-occurrences. Both our new way of clustering features (feature by context) and our new way of second order representation (replace features in context with vector of context co-occurrences) are very similar to LSA.

4) In discussing feature matching, you proposed a very interesting idea that makes sense to me. Rather than using the xml2arff methodology, which matches the contexts to be represented with regular expressions, you proposed that we run NSP without any cutoffs (frequency scores, etc.) on the test data (as well as the training data) and then find out if the candidate features identified in the test data actually are features according to our feature selection data. This has the potential to be much faster, and if so we would want to do this with both order1vec.pl and order2vec.pl. order1vec.pl is currently based on xml2arff, and it is very slow. The new order2vec.pl, that would match features, would also likely be based on xml2arff, and as such could also be very slow. So this method of matching features might allow us to speed up the existing order1vec, and extend order2vec to features without making it slower.

5) add support for the automatic generation of stoplists. We have several options here; one is to create a standalone utility that would generate a list of stopwords based on something like tf/idf (see the sketch after this message). We could also do this internally in SenseClusters, where we look at the feature by context representation, and remove those features that occur in "too many" contexts. We would also like to be able to provide tf/idf scores in our feature by context representation, which suggests that order1vec.pl would need to be extended to output these values (right now it supports binary values and frequency counts).

The standalone idea would result in a stoplist that would simply be input exactly like the stoplists we now use. We would not be able to use the tf/idf scores internal to SenseClusters, but we would be able to quickly derive stoplists for domain specific corpora or other languages. In NSP we have a mode for count where each line is considered to represent a context. We could use that when the data is formatted like that; otherwise we could simply define a value N that tells us how big a context is, and then we go through a corpus of plain text and figure out tf/idf based on that assumption. Note that instead of documents here we are talking about contexts.

Of the above, I would rate 3 as essential, and 4 and 5 as highly desirable.

So, the above is pretty much taken off the top of my head, and so it is possible I have missed some important points, or said things poorly. Please do add any comments, additions, or disagreements you may have. I think it is important to hammer out a plan for 3, 4, and 5 as soon as possible, since that will help us plan the rest of the summer pretty well. The most important thing is to try and anticipate all the changes we need or want to make now, rather than adding them later; that doesn't tend to work too well.

Anagha, any comments or observations you have are of course welcome. If you have any concerns about any of the above being feasible or possibly clashing with some of your work, please do raise that asap so we can plan accordingly.

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
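[A minimal Perl sketch of the standalone stoplist idea in point 5, assuming one context per line of input and an arbitrary idf cutoff. Both assumptions are for illustration only; this is not an actual SenseClusters utility.]

    #!/usr/bin/perl -w
    use strict;

    my $idf_cutoff = 1.0;    # hypothetical threshold, tune as needed

    my %context_count;       # number of contexts each word occurs in
    my $num_contexts = 0;

    # treat each input line as one context
    while (<>) {
        $num_contexts++;
        my %seen;
        for my $word ( split /\W+/, lc $_ ) {
            next if $word eq '' or $seen{$word}++;
            $context_count{$word}++;
        }
    }

    # words that occur in "too many" contexts have low idf and
    # become candidate stopwords
    for my $word ( sort keys %context_count ) {
        my $idf = log( $num_contexts / $context_count{$word} );
        print "$word\n" if $idf < $idf_cutoff;
    }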
From: ted p. <tpe...@ma...> - 2006-06-01 03:16:39
Final list of changes for version 0.89 of the Knoppix CD.

---------- Forwarded message ----------
Date: Wed, 31 May 2006 18:29:59 -0500
From: Anagha Kulkarni <kulka020@d.umn.edu>
To: ted pedersen <tpederse@d.umn.edu>
Subject: Knoppix CD

Hi Ted,

Following is the list of things that I found might need some changing in the latest Knoppix CD:

* Time zone.
* Icons for SC Data folder and SC Live! browser overlap.
* Some more information about SC on the homepage - maybe from README.SC.pod's synopsis/introduction??
* The transition from the first paragraph to the note that follows feels a bit abrupt - maybe we can put the note in [] or use a smaller and different font.
* Use the README.SC.html from http://senseclusters.sourceforge.net/README.SC.html (The current one has unparsed items)
* Move the (Docs/HTML) html pages (and the Toolkit_Docs dir) under the "documentation" link to htdocs
* FAQ - remove the question about ClusterStopping
* FAQ - the question about email needs updating - headless mode of SC. (Looks like the FAQ document in general might need some updating.)
* Link to SenseClusters on the Publication page is the external link - change to the local link
* Web-interface:
  - Increase the font size of the text "SenseClusters Web Interface" in the banner
  - Change the SC external link to the local link in the banner
  - Copy SC/Docs/HTML/discriminate.html to htdocs/SC-htdocs/help.html

Thanks,
Anagha
From: Anagha K. <kulka020@d.umn.edu> - 2006-06-01 00:16:45
* Update FAQs document
* Web/SC-htdocs/help.html not updated to the latest Docs/HTML/discriminate.html
* Web-interface: if an experiment fails, the reason/error is logged into the logfile but does not get displayed in the browser.
* Web-interface: when experimenting with word-clustering, if the option of setting the #clusters manually is selected, then on the final screen the specified #clusters is not displayed.
* Add README.SC.html to the distribution.
* If an input file is split into training and test data and both scopes (train and test) are specified, then the train-scope gets applied to the test data instead of being applied to the train data.
From: ted p. <tpederse@d.umn.edu> - 2006-05-28 04:38:53
We are pleased to announce the release of SenseClusters version 0.89. This includes a small but important fix to 0.87, which itself included a small but important fix to 0.85. So, you probably want to make sure you are running 0.89 to avoid these small but important problems or discrepancies that we found in the earlier releases!

You can download this version from:

http://senseclusters.sourceforge.net/
or
http://www.d.umn.edu/~tpederse/senseclusters.html

Here are the Changelogs for both 0.89 and 0.87.

First, in 0.87:

Changes made in SenseClusters version 0.85 during version 0.87

Ted Pedersen tpederse@d.umn.edu
Anagha Kulkarni kulka020@d.umn.edu

1. Fixed a bug in clusterstopping.pl related to the case of an empty column, i.e., when a feature does not occur in any of the contexts/instances. -Anagha

2. Updated INSTALL and Makefile.PL to require v0.03 of Algorithm::RandomMatrixGeneration. -Anagha

(Changelog-v0.85to0.87 Last Updated on 05/16/2006 by Anagha)

------------------------------------------------------------------------

And then in 0.89:

Changes made in SenseClusters version 0.87 during version 0.89

Ted Pedersen tpederse@d.umn.edu
Anagha Kulkarni kulka020@d.umn.edu

1. Modified the Makefile.PL and INSTALL document to require v0.04 of Algorithm::RandomMatrixGeneration instead of 0.03 -Anagha

2. Changed the default precision from 4 to 6 in discriminate.pl and Web/SC-cgi/first.cgi -Anagha

(Changelog-v0.87to0.89 Last Updated on 05/27/2006 by Anagha)

------------------------------------------------------------------------

Let us know if you have any questions, comments, or requests!

Enjoy!
Ted and Anagha

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: Anagha K. <kulka020@d.umn.edu> - 2006-05-10 05:56:16
Notes on how to statistically assess/compare the performance of different algorithms/experimental settings when using the same datasets with each of the algorithms. From "Empirical Methods for Artificial Intelligence" by Paul R. Cohen, more specifically from Chapter 4 - "Hypothesis Testing and Estimation" and largely Chapter 6 - "Performance Assessment".

--------------------------------------------------------------------------

What is it exactly that we wish to show/prove?

Given results such as those below, we want to primarily show that the 5 settings are not all performing equally. We also want to show:

- as a sanity check, that A is significantly different from E (and B is significantly different from E)
- A is better than C (and B is better than D)
- and in this case, that A and B are not significantly different.

Note: The numbers in brackets are the #clusters used by that experimental setting.

       A          B          C          D          E
    Order1(2)  Order2(2)  Order1(6)  Order2(6)  Baseline
    -----------------------------------------------------
    94.88      96.22      61.44      76.17      55.45
    60.11      59.16      51.37      51.89      50.00
    68.42      70.26      54.37      57.57      50.00
    53.09      68.95      51.23      63.39      51.41
    89.15      91.03      60.12      54.37      50.45

To show all the above we start with hypothesis testing, and thus define the hypotheses:

- the null hypothesis (H0): all five settings are performing equally.
- the alternative hypothesis (H1): the five settings are not equal.

Now we will analyze the variance in the above results, i.e., we will perform "analysis of variance" to show which of the differences in the performances are statistically significant and which are not.

Note: Henceforth I will be referring to various terms and computations from the worksheet named "analysis of variance" in the attached excel sheet.

Rows 1-6 are the above data (x(i,j)) where i=5 (#rows) and j=5 (#cols).
Row 8 is the total of individual columns (settings/groups).
Row 9 is the mean/average (m(j)).
Row 10 is the standard deviation, computed as:

    s(j) = Sqrt(SummationOver_i((x(i,j) - m(j))^2)/(i-1))

Row 12 gives the Grand Mean (gm), which is computed by merging all the data (rows and columns) into a single sample of size N = 25 experiments (5 * 5 = 25) and then computing the mean as usual.
Row 13 gives the standard deviation (gs) for this sample of size N.

Rows 16 and 17 are intermediate calculations.

Note: Henceforth, the "within" term refers to computations performed over individual groups, i.e., columns/settings, while the "between" term refers to computations performed across groups.

    Row 16: w(j) = SummationOver_i((x(i,j) - m(j))^2)
    Row 17: b(j) = (m(j) - gm)^2

Rows 20-23 are the final table for the analysis of variance.

Row 21: The "between" group calculations:
- The degrees of freedom are computed as: j - 1 (#settings - 1).
- The Sum of Squared deviations = SummationOver_j(b(j)) * #experiments = SummationOver_j(b(j)) * i = SummationOver_j(b(j)) * 5
- Mean Square deviation: MS-between = SS-between / df-between

Row 22: The "within" group calculations:
- The degrees of freedom are computed as: N - j
- The Sum of Squared deviations = SummationOver_j(w(j))
- Mean Square deviation: MS-within = SS-within / df-within

F-value = MS-between / MS-within

Once we have the F-value we can look up the critical value in the F-distribution table (for different levels: 0.05, 0.01, etc.) with df-between (column index in the F-table) and df-within (row index in the F-table). In our example I have looked up the F-table for the 0.05 level, and the critical value for df-between = 4 and df-within = 20 was 2.87. Now since our F-value (4.38) is greater than the critical value found (2.87), if we were to reject the null hypothesis then the probability of being wrong would be less than 0.05! The value in the cell below the p-value (0.01050) is the exact p-value for the F-value of 4.38 with dfs of 4 and 20.

Thus we have statistically shown, using analysis of variance, that there is significant variability between groups in the above results and thus they are not equal. However, analysis of variance does not tell us which groups did better - for this we do pairwise comparisons. We do 2 such pairwise comparisons on each pair we are interested in: the Scheffe test and the Least Significant Difference (LSD) test. The Scheffe test is conservative while the LSD test is less stringent.

Note: Please refer to the worksheet named "Scheffe Tests".

    Scheffe test statistic for groups a & b
      = (m(a) - m(b))^2 / (MS-within * (1/#a + 1/#b) * (j-1))

Row 3 gives the m(j) values from the previous sheet.
Row 5 gives the MS-within value from the previous sheet.
Row 6 gives the degrees of freedom for between and within from the previous sheet.
Rows 9-17 give the Scheffe test statistic for various pairs and their corresponding p-values.

Note: Please refer to the worksheet named "LSD Tests".

    LSD test statistic for groups a & b
      = (m(a) - m(b))^2 / (MS-within * (1/#a + 1/#b))

Row 3 gives the m(j) values from the 1st sheet.
Row 5 gives the MS-within value from the 1st sheet.
Row 6: the degrees of freedom for this test are different from those used by the analysis of variance and by Scheffe's test. This test uses 1 and N - j (20) as the degrees of freedom.
Rows 9-17 give the LSD test statistic for various pairs and their corresponding p-values.

--------------------------------------------------------------------------
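[For a self-contained check of these numbers, here is a short Perl sketch that computes the one-way analysis of variance directly from the definitions above, using the accuracy table from this note. It reproduces the worksheet quantities (the F-value comes out near 4.38); it is an illustration, not part of SenseClusters.]

    #!/usr/bin/perl -w
    use strict;

    # the 5x5 accuracy table: rows are experiments, columns are
    # settings A-E
    my @data = (
        [ 94.88, 96.22, 61.44, 76.17, 55.45 ],
        [ 60.11, 59.16, 51.37, 51.89, 50.00 ],
        [ 68.42, 70.26, 54.37, 57.57, 50.00 ],
        [ 53.09, 68.95, 51.23, 63.39, 51.41 ],
        [ 89.15, 91.03, 60.12, 54.37, 50.45 ],
    );

    my $groups    = 5;                      # j, settings (columns)
    my $per_group = 5;                      # i, experiments per setting
    my $n         = $groups * $per_group;   # N = 25

    # group means m(j) and grand mean gm
    my @mean  = (0) x $groups;
    my $grand = 0;
    for my $j ( 0 .. $groups - 1 ) {
        $mean[$j] += $data[$_][$j] for 0 .. $per_group - 1;
        $mean[$j] /= $per_group;
        $grand    += $mean[$j];
    }
    $grand /= $groups;

    # SS-within = sum of w(j); SS-between = sum of b(j) * i
    my ( $ss_within, $ss_between ) = ( 0, 0 );
    for my $j ( 0 .. $groups - 1 ) {
        $ss_within  += ( $data[$_][$j] - $mean[$j] )**2
            for 0 .. $per_group - 1;
        $ss_between += ( $mean[$j] - $grand )**2 * $per_group;
    }

    my $df_between = $groups - 1;     # 4
    my $df_within  = $n - $groups;    # 20
    my $f = ( $ss_between / $df_between ) / ( $ss_within / $df_within );

    printf "F(%d,%d) = %.2f\n", $df_between, $df_within, $f;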
From: ted p. <tpederse@d.umn.edu> - 2006-05-09 01:07:36
We are pleased to announce the release of version 0.85 of SenseClusters. This release features our adaptation of the Gap Statistic, a state-of-the-art method for automatically identifying the number of clusters in a given set of data.

You can download this version from the links provided at:

http://senseclusters.sourceforge.net/
or
http://www.d.umn.edu/~tpederse/senseclusters.html

You can also find the web interface to version 0.85 available at these links.

With the Gap Statistic, there are now 4 different methods of finding the number of clusters automatically in SenseClusters. We will be presenting a demo of all of these at NAACL in New York City on June 6. You can see the paper that describes what we are demoing here:

Automatic Cluster Stopping with Criterion Functions and the Gap Statistic (Pedersen and Kulkarni), Appears in the Proceedings of the Demonstration Session of the Human Language Technology Conference and the Sixth Annual Meeting of the North American Chapter of the Association for Computational Linguistics, June 6, 2006, New York City.
http://www.d.umn.edu/~tpederse/Pubs/naacl06-demo.pdf

So, please check out this new version, and if you are at NAACL please visit our demo! We will also have Knoppix CDs available with SenseClusters already installed, so you can run it on your own PC without having to install anything.

Please let us know if you have any questions or comments!

Enjoy,
Ted and Anagha

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-05-07 22:45:26
Thanks! I will do this!

Ted

On Sun, 7 May 2006, Anagha Kulkarni wrote:

> Hi Ted,
>
> Along with SenseClusters-Code-README.html you will also have to upload
> the following files to sf:
> 1. discriminate.html present at SC/Docs/HTML
> 2. clusterstopping.html present at SC/Docs/HTML/Toolkit_Docs/clusterstop
>
> Sorry for missing this earlier.
>
> Thanks,
> Anagha
>
> ted pedersen wrote:
> > Hi Anagha,
> >
> > I am in the process of releasing SC. I have renamed the tar file as
> > SenseClusters-v0.85.tar.gz and the top level directory as
> > SenseClusters_v0.85, but otherwise made no changes to the release.
> >
> > I have updated the index.html and README.SC.html pages on sf, but
> > not this one:
> > http://senseclusters.sourceforge.net/SenseClusters-Code-README.html
> >
> > I think we have dealt with this issue before, and I will scan through
> > my email to see how we create that (I think just running a script),
> > but if you happen to know off the top of your head that would be great.
> >
> > If you could poke around my web pages (starting from my home page)
> > and make sure everything looks in order, that would be great. If you
> > could try to download and unpack the distribution, that would be
> > good too... let me know if anything looks amiss, and then I will
> > announce things. I am planning to announce on the corpora list also,
> > since it has been a while since we have done that....
> >
> > Thanks!
> > Ted
> >
> > On Sat, 6 May 2006, Anagha Kulkarni wrote:
> >
> >> Hi Ted,
> >>
> >> I have copied 2 files, namely SC_0.85.tar.gz & readme.tar.gz, to
> >> http://marimba.d.umn.edu/SC_0.85/
> >>
> >> The SC_0.85.tar.gz file contains the v0.85 distribution (I have removed
> >> the CVS files from all the directories) and the readme.tar.gz contains
> >> the file README.SC.html
> >>
> >> I will shortly switch the web-interface from v0.83 to v0.85 and will let
> >> you know once that is ready.
> >>
> >> Please let me know if you see any problem or would like any help.
> >>
> >> Thanks!
> >> Anagha
> >
> > --
> > Ted Pedersen
> > http://www.d.umn.edu/~tpederse
>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: Anagha K. <kulka020@d.umn.edu> - 2006-04-30 03:29:39
Hi Ted,

I took a look at this problem and, as you suspected, it was a minor bug in our 0.83 version. This is now fixed in v0.85.

Thanks for bringing this to my notice.

Anagha

ted pedersen wrote:
> Hi Anagha,
>
> When I run the toolkit.sh demo, I get a warning about something
> for discriminate.pl. It would be good if we looked at that
> sometime, just to make sure it's nothing horrible. I don't think
> it is, but since I noticed it I thought I would mention it.
>
> Here's what it is...
>
> In similarity space, I think:
>
> Use of uninitialized value in concatenation (.) or string at
> /usr/local/bin/discriminate.pl line 2031
>
> This was in authority.n.co.o2.similarity.rbr, but occurred
> in others too.
>
> I copied that manually since it was running on knoppix. But,
> I think I'm accurate!
>
> Thanks,
> Ted
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-04-17 07:59:04
Here are some E1 values based on data where there are exactly 8 clusters. What we can see is that the E1 values stabilize after that number of clusters is found, since at that point increasing the number of clusters does not change the overall inter-cluster similarity (since there are only 8 types of contexts in the data, further subdivisions are dividing contexts that are already the same...)

1-way clustering: [E1=2.26e+05] [800 of 800]
2-way clustering: [E1=1.95e+05] [800 of 800]
3-way clustering: [E1=1.67e+05] [800 of 800]
4-way clustering: [E1=1.42e+05] [800 of 800]
5-way clustering: [E1=1.20e+05] [800 of 800]
6-way clustering: [E1=1.02e+05] [800 of 800]
7-way clustering: [E1=8.83e+04] [800 of 800]
8-way clustering: [E1=8.00e+04] [800 of 800]
9-way clustering: [E1=8.00e+04] [800 of 800]
10-way clustering: [E1=8.00e+04] [800 of 800]
11-way clustering: [E1=8.00e+04] [800 of 800]
12-way clustering: [E1=8.00e+04] [800 of 800]
13-way clustering: [E1=8.00e+04] [800 of 800]
14-way clustering: [E1=8.00e+04] [800 of 800]
15-way clustering: [E1=8.00e+04] [800 of 800]
16-way clustering: [E1=8.00e+04] [800 of 800]
17-way clustering: [E1=8.00e+04] [800 of 800]
18-way clustering: [E1=8.00e+04] [800 of 800]
19-way clustering: [E1=8.00e+04] [800 of 800]
20-way clustering: [E1=8.00e+04] [800 of 800]
21-way clustering: [E1=8.00e+04] [800 of 800]
22-way clustering: [E1=8.00e+04] [800 of 800]
23-way clustering: [E1=8.00e+04] [800 of 800]

You'll notice in the perfect case that the E1 value at k=23 is 80,000, and we hit that value at k=8. That means that the inter-cluster similarity is no longer changing at that point, since we have perfectly separated data.

Now, if we look at some random data with the same marginal totals, we see a different situation...

1-way clustering: [E1=1.95e+05] [800 of 800]
2-way clustering: [E1=1.42e+05] [800 of 800]
3-way clustering: [E1=1.24e+05] [800 of 800]
4-way clustering: [E1=1.07e+05] [800 of 800]
5-way clustering: [E1=9.98e+04] [800 of 800]
6-way clustering: [E1=9.41e+04] [800 of 800]
7-way clustering: [E1=8.92e+04] [800 of 800]
8-way clustering: [E1=8.44e+04] [800 of 800]
9-way clustering: [E1=8.20e+04] [800 of 800]
10-way clustering: [E1=8.01e+04] [800 of 800]
11-way clustering: [E1=7.82e+04] [800 of 800]
12-way clustering: [E1=7.65e+04] [800 of 800]
13-way clustering: [E1=7.47e+04] [800 of 800]
14-way clustering: [E1=7.31e+04] [800 of 800]
15-way clustering: [E1=7.21e+04] [800 of 800]
16-way clustering: [E1=7.10e+04] [800 of 800]
17-way clustering: [E1=7.01e+04] [800 of 800]
18-way clustering: [E1=6.91e+04] [800 of 800]
19-way clustering: [E1=6.82e+04] [800 of 800]
20-way clustering: [E1=6.74e+04] [800 of 800]
21-way clustering: [E1=6.66e+04] [800 of 800]
22-way clustering: [E1=6.59e+04] [800 of 800]
23-way clustering: [E1=6.52e+04] [800 of 800]

We start at 195,000 (k=1) and then arrive at 65,200 (k=23), which makes sense, in that the inter-cluster similarity is very high at k=1 (there is only one cluster, and we are measuring the distance of the centroid of that cluster to the centroid of the data, which are essentially the same thing). So the similarity between clusters steadily decreases, while the internal similarity increases.

No conclusions here, just some raw data to think about...

Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
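[Reading a stabilization point off such a sequence can be automated. The Perl sketch below applies a deliberately naive rule to the perfectly separated E1 values above: report the smallest k after which E1 stops changing by more than a small relative epsilon. It only illustrates the shape of the raw data; it is not one of the cluster stopping measures SenseClusters actually implements, and the epsilon is arbitrary.]

    #!/usr/bin/perl -w
    use strict;

    # E1 values for k = 1 .. 10 from the perfectly separated data
    my @e1 = ( 2.26e5, 1.95e5, 1.67e5, 1.42e5, 1.20e5,
               1.02e5, 8.83e4, 8.00e4, 8.00e4, 8.00e4 );

    my $epsilon = 0.001;    # hypothetical relative-change cutoff

    for my $k ( 1 .. $#e1 ) {
        my $change = abs( $e1[$k] - $e1[ $k - 1 ] ) / $e1[ $k - 1 ];
        if ( $change < $epsilon ) {
            print "E1 stabilizes at k = $k\n";    # prints k = 8
            last;
        }
    }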
From: ted p. <tpederse@d.umn.edu> - 2006-03-28 16:07:22
Oh no! I forgot to include nameconflate!! I was thinking we should do that, so that if people have plain text at least they can create new data. Hmmmmm... There are a few small things I would like to resolve, so I might try to fix this too. Otherwise, I'd like to make sure to fix this for the naacl/aaai release. I think if we can include some public domain corpora too that would help (not gigaword, but there are others). Ah, there is always something.

It turns out to be easy to do things like removing buttons and toolbars. You simply boot your knoppix version, make those changes as a user would, and then you have an option to save a configuration file. You save that to a usb device, and then reboot into your "master copy" that you are creating on the hard drive, and copy the configuration file to /home/knoppix/KNOPPIX. So, I removed the Mozilla button, the Office button, those German toolbars :), and renamed the shortcut to SenseClusters Live!. So that was a nice surprise (easier than expected).

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-03-28 08:18:43
--
Ted Pedersen
http://www.d.umn.edu/~tpederse

---------- Forwarded message ----------
Date: Mon, 27 Mar 2006 19:40:58 -0600
From: Anagha Kulkarni <kulka020@d.umn.edu>
To: ted pedersen <tpederse@d.umn.edu>
Subject: Re: senseclusters live! testing

Hi Ted,

I am testing your SenseClusters Live! CD on talisker. It is just amazing! Very, very neat and nicely done!

I have a few observations below (I hope apart from what you have had):

1. Would the CDs have a paper cover? If yes, could and should we say something of the sort that "The system will take some time to boot and stabilize, please wait until you see the SenseClusters web-page." Because I *think* that is the period when the user might wonder about the progress.
2. Link to Sample Data missing??
3. Could we change the README file to the version that the SenseClusters homepage points to (http://senseclusters.sourceforge.net/README.SC.html)? Because if you look at the introduction section of the README on the CD you will see an unparsed =head1 tag.
4. I think the solution to the Browse problem is adding a forward-slash at the end of the directory name. This worked for me.

The following are all very minor points:

1. Could we remove the Bookmark toolbar from the Konqueror window?
2. In places a space (" ") is present between "v" and "0.83".
3. Should we remove the Firefox icon from the Start Panel? Or else should we also set the home-page to SenseClusters for Firefox too?

This is all I could catch! Very impressive - really! Please let me know if I can help.

Thanks,
Anagha

ted pedersen wrote:
> Hi Anagha,
>
> I've been doing some testing of the cd today, and this is what I found. I
> just wanted to make sure that if you saw one of these you would
> realize I saw it too. :) These are also notes to self to some degree...
>
> Things to be added:
>
> 1) explanation of the photo, and credit to the UMD photographer. Put in
> the acknowledgements of the main intro page to SC.
>
> 2) explanation that apache is configured to run standalone,
> so links to external sites don't work, but still might be
> useful (to have the url) or if someone reconfigures. Put on the main
> intro page to SC.
>
> Things to fix:
>
> 1) Not possible to browse results, file /localhost/SC-htdocs/userxxxxxxx
>
> all other links are ok, and the .tar file is created ok, so not sure why
> this is a problem. Permissions are set to rwxr-xr-x I think, so maybe
> that is the problem? needs to be rwxrwxrwx? Will also check if the owner
> matters. Current owner is www-data while we are usually running as
> knoppix. So maybe changing permissions or changing the owner will
> resolve it?
>
> 2) Rename the Knoppix Icon on the Desktop to SenseClusters (not sure how
> to do this).
>
> 3) Put Data in a more convenient location, maybe on the desktop.
> Currently in /usr/lib/htdocs/
>
> 4) The main SC web page is a little messy; the type for running from the
> command line is awkward and looks bad. Make the page more "clean".
>
> Maybe to do?
>
> 1) Add a stop of apachectl/httpd at runlevel 6 (which happens during
> shutdown). This is probably not necessary, but maybe nice.
>
> So, that is what I have found. I will keep testing, but generally
> speaking I feel like things are running ok.
>
> Thanks!
> Ted
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-03-27 19:51:01
Hi Anagha,

I think we have talked about this before but I can't remember the resolution. Sometimes the web interface will fail in an apparently mysterious way (and say something like "error opening file xxx"). But, if you look at the logfile the error is completely explainable. Like, you can't do evaluation on unannotated data, or you don't have enough features. Would it be possible to have the web interface output the more descriptive logfile error in addition to the general failure message?

This seems very familiar to me, so maybe we have already been through this, but I'm afraid I just can't remember!!

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpe...@cs...> - 2006-03-27 08:23:45
Hi Anagha,

There is an assumption made regarding the naming of the cgi-bin and htdocs directories in apache that is not necessarily always true. I am using apache 1.3 for the knoppix cd (since this is what knoppix uses), and there the cgi-bin directory defaults to /usr/lib/cgi-bin, and DocumentRoot (what we call htdocs) is /var/www. This poses a small problem for callwrap.pl since it assumes that cgi-bin and htdocs (named as that) will be on the same "level", as in /usr/lib/cgi-bin and /usr/lib/htdocs.

I have fixed this by configuring apache slightly differently than is usual for 1.3 (with the names mentioned above), and I think it is working ok. We might want to document this a bit more clearly, and say that we expect this particular arrangement and these names, and if anything else is used, even if Apache understands it, our programs will not.

BTW, I was using a newer version of apache earlier, but that seemed to cause some glitches in other packages (not senseclusters) so I decided to drop back to the version of apache included in Knoppix. I am not sure why they use such an old version; perhaps it is smaller, or something. Anyway, that's how this came up... this is probably peculiar to 1.3.

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
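[For concreteness, the kind of assumption described above amounts to a one-line path substitution. The fragment below is hypothetical (it is not the actual callwrap.pl code): it derives the htdocs directory from the script's own cgi-bin location and fails when the two are not siblings.]

    use strict;
    use warnings;
    use FindBin qw($Bin);    # directory of the running script,
                             # e.g. /usr/lib/cgi-bin

    # assume htdocs sits next to cgi-bin at the same "level"
    ( my $htdocs = $Bin ) =~ s{/cgi-bin$}{/htdocs};
    die "no htdocs directory next to cgi-bin at $Bin\n"
        unless -d $htdocs;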
From: ted p. <tpederse@d.umn.edu> - 2006-03-27 08:03:23
SenseClusters Live! CD (Knoppix) Mastering, March 2006

These are some notes that I created while mastering a Knoppix Linux distribution that features SenseClusters. I call it SenseClusters Live! (version 0.83). These are hardly complete, but hopefully they mention at least a few high points in the creation of this CD. Note that the creation is not quite yet finished, but I think it's close, and I am waiting for one of the longish steps this process has, so it seemed like a good time to try and write things down.

If you are not familiar with Knoppix Linux, the key innovation is that it allows you to create a CD that you can boot in order to run Linux and SenseClusters without having to install on the hard drive. This strikes us as a good idea since SenseClusters really only runs completely on Linux or Solaris, and installation of Linux on a hard drive is more than some folks want to attempt, and then installation of SenseClusters on top of that has a few wrinkles.

I have been doing my remastering using Ubuntu 5.10. I don't think that matters much, except that Ubuntu supported wireless networking out of the box, so to speak, and that has been a big help. The way you do this in general is to create a partition on a Linux distribution (Ubuntu) and then copy your Knoppix distribution CD to that partition. You get one CD worth of compressed data to build your distribution, so that limits you to 700MB. Knoppix is at approximately 700MB, so whatever you wish to add must be offset by packages in Knoppix that you remove.

I have been using Knoppix 4.0.2. I am generally following hints and tips found in the books "Knoppix Hacks" by Kyle Rankin (O'Reilly) and "Hacking Knoppix" by Scott Granneman (Wiley). These are both good, although Hacking Knoppix seems a bit more current, and is based on 4.0.2, while the Knoppix Hacks book is based on 3.4.0. I am also finding the following Howto to be very helpful:

http://www.knoppix.net/wiki/Knoppix_Remastering_Howto

The uncompressed Knoppix version 4.0.2 takes up 2.0 GB, and it compresses down to 700MB. The data and papers that are included with SenseClusters take up about 440MB uncompressed, and then SenseClusters and affiliated tools take up about 60MB more, so I needed to remove about 500MB uncompressed from the 2.0 GB that I started with. That's about 25%! So, with apt-get (the debian based package manager) I removed the following:

openoffice-de-en (300 mb!)
i18n files (100 mb)
mozilla-thunderbird (32 mb)
xboing (5 mb)
chromium (5 mb)
enigma and enigma-data (22 mb)
gaim and gaim-data (12 mb)
kpilot (5 mb)
kstars and kstars-data (20 mb)
gimp and gimp-data (27 mb)
various games...

The removing was trickier than I expected, because I think a few times I removed things I didn't realize were important, and then once I went through the long process of creating the iso file system and burning the CD, nothing worked too well. :) So I got very careful about this, and probably learned a lot about Linux packages as a result!

Then I installed the following with apt-get:

pdl (22 mb)
perl-doc (12.5 mb)

Note that I used apt-get for installation and removal when I could, since that is the Knoppix way. apt-get is a debian tool, and knoppix is derived from debian. But, there were some things not available as Debian packages - I also installed various CPAN modules, using the interactive cpan installer:

text-nsp
bit-vector
set-scalar
sparse
algorithm-munkres
XML::Simple (to display web interface output)

Then, I installed a few packages "by hand", which means compiling, making, etc... and copying the executables to a system directory (in my case I put all executables in /usr/local/bin):

cluto (scluster and vcluster)
svdpackc (las2)
SenseClusters (v0.83) (many programs, mostly .pl)

After all this, the total size of everything is 2.0 gb uncompressed. After compressing, it is 683mb, which just fits onto a CD.

In addition to the package installation, there was some configuration that needed to be done, perhaps most trickily Apache. For whatever reason Knoppix uses Apache 1.3 (whereas the current version is 2.2.0), so I needed to make a few small changes to the apache configuration to make sure our code would work. The biggest change was probably the default location of Scripts and DocumentRoot. So, I set the scripts directory to /usr/lib/cgi-bin and the DocumentRoot directory to /usr/lib/htdocs (in Apache 1.3 DocumentRoot defaults to /var/www). Then, I set the Listen and Port values as follows:

Listen 127.0.0.1:3279
Port 3280

If you happen to set the ports to the same value, nothing works!!!!

knoppix does not automatically start apache, so I added a small startup script to /etc/rc[2-5].d and rc.local

One small Perl issue... I needed to add a symbolic link at /usr/local/bin/perl pointing to /usr/bin/perl, since /usr/local/bin/perl did not exist and that is what is referenced in the web interface scripts:

ln -s /usr/bin/perl /usr/local/bin/perl

Other setup tasks...

- modify /etc/profile to include the path to NSP measures (due to a quirk in how NSP searches for measures, using the PATH rather than @INC)
- modify /tmp to be rwx for all users?? (not sure I really had to do this or not, but at some point I did and it seemed to help)
- change boot.msg to indicate this is SenseClusters Live!
- change background.jpg to a lovely picture of Duluth :)
- change resolv.conf for networking, at least while remastering. Then change it back, or you'll distribute something that has your ip address, etc. in it.

There was a bit of work done in setting up local web pages for presenting easy to use links to the web interface, and to the data and papers that we also make available on this cd. Nothing there was so unique that it bears mentioning here; just remember that some of that is found in /home/linuxiso/KNOPPIX and some of it is found in /usr/lib/htdocs.

So that's some of what I did. It's a little more complex than I expected, but fortunately most everything seems to be working now! I will update this if there are new significant pieces of information that I might wish to remember in a few months' time!

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-03-27 04:55:23
Hi Anagha,

When I run the toolkit.sh demo, I get a warning about something for discriminate.pl. It would be good if we looked at that sometime, just to make sure it's nothing horrible. I don't think it is, but since I noticed it I thought I would mention it.

Here's what it is...

In similarity space, I think:

Use of uninitialized value in concatenation (.) or string at /usr/local/bin/discriminate.pl line 2031

This was in authority.n.co.o2.similarity.rbr, but occurred in others too.

I copied that manually since it was running on knoppix. But, I think I'm accurate!

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-03-22 03:29:03
I will be attending EACL in Trento, Italy, April 3-7, and I will be doing three different presentations that revolve around SenseClusters. Please plan on attending any or all of these. There is one paper, one tutorial, and one demo, so you get a little bit of everything.

First, on April 3 I will present the following paper at the Cross-Language Knowledge Induction Workshop (http://www.site.uottawa.ca/~diana/eacl2006-clki-workshop.html):

Improving Name Discrimination: A Language Salad Approach (Pedersen, Kulkarni, Angheluta, Kozareva, and Solorio), Appears in the Proceedings of the EACL 2006 Workshop on Cross-Language Knowledge Induction, April 3, 2006, Trento, Italy.
http://www.d.umn.edu/~tpederse/Pubs/eacl2006-salad.pdf

This is very fun work that I like very much, where we have mixed together English with Bulgarian, Romanian and Spanish in order to improve name discrimination. As crazy as that sounds, it works pretty well. :)

Second, the next day (April 4) I will present a tutorial that focuses on the methods that are implemented in SenseClusters. This tutorial will also feature the unveiling and debut of our new SenseClusters Live! CD. This is a Knoppix based Linux distribution that includes SenseClusters and lots of data, and you can run it from the CD without having to install Linux or SenseClusters on your hard drive. I will have extra CDs available, so even if you don't come to the tutorial you can get one, and we will also have an iso version of this posted, so if you aren't at EACL you can download it and burn it onto a CD, just like you do for Linux. Here's a short description of the tutorial, which will be on the afternoon of April 4:

http://eacl06.itc.it/tutorials/tutorial.htm#TU03

Third, the *next* day (April 5) I will present a demo of the new cluster stopping techniques found in SenseClusters. Those are described in the following paper:

Selecting the "Right" Number of Senses Based on Clustering Criterion Functions (Pedersen and Kulkarni), Appears in the Proceedings of the Posters and Demo Program of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics, April 5-7, 2006, Trento, Italy.
http://www.d.umn.edu/~tpederse/Pubs/eacl2006-demo.pdf

If you haven't already gotten your SenseClusters Live! CD by this time, please stop by and see the demo and get a CD. I will be in Demo Session 2 on April 5, and it looks like there are quite a few demos of interest at all the sessions, so please plan on visiting several of them.

http://eacl06.itc.it/posters-demos/posters.htm

So, if you are at EACL please do come to some or all of these events. They are all really different so you won't get bored (I promise :). Your questions or comments on any of the above are of course most welcome.

See you in Trento!

Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-02-09 00:00:28
We are pleased to announce the release of version 0.83 of SenseClusters. We have made a larger version increment than usual (from 0.73 to 0.83) to make the point that there is significant new functionality in the package as of 0.83. You can download this new version from:

http://www.d.umn.edu/~tpederse/senseclusters.html
or
http://senseclusters.sourceforge.net

In particular, we have incorporated support for automatically identifying the number of clusters in a given data set. There are three methods provided, and they are described more completely in the following paper that will appear at EACL (in conjunction with a demo) in April:

Selecting the "Right" Number of Senses Based on Clustering Criterion Functions (Pedersen and Kulkarni), To appear in the Proceedings of the Posters and Demo Program of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics, April 3-7, 2006, Trento, Italy.
http://www.d.umn.edu/~tpederse/Pubs/eacl2006-demo.pdf

You can also try out this new functionality on our web interface, available at:

http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi

Please do give this a try. This is a very significant enhancement to the package. Your comments are particularly welcome as we seek to improve and expand our ability to automatically identify the number of clusters in a given set of data.

Enjoy,
Ted and Anagha

========================================================================

Below is a copy of the ChangeLog for Version 0.83.

1. Added Toolkit/clusterstop/clusterstopping.pl -Anagha
2. Integrated clusterstopping.pl with discriminate.pl -Anagha
3. Added test-cases for clusterstopping.pl -Anagha
4. Modified the web-interface to support clusterstopping -Anagha
5. Modified/added documentation for cluster stopping: README.SC.pod, README.Toolkit.pod, discriminate.html, clusterstopping.html -Anagha
6. Removed /svd/pdlsvd.pl and related threads -Anagha
7. Fixed a bug with pattern matching in format_clusters.pl -Anagha

--
Ted Pedersen
http://www.d.umn.edu/~tpederse