senseclusters-developers Mailing List for SenseClusters (Page 3)
From: Anagha K. <kulka020@d.umn.edu> - 2006-06-13 18:05:58
Hi Ted,

> Also, we should think about --space similarity. Are there any issues
> associated with that we should be aware of? In my mind we should still
> have the ability to create similarity spaces as we do now, since
> I think the similarity matrices are created *after* the context
> representation is created. But, we should of course check that and make
> sure everything will continue to work (and that simat and bitsimat
> will work on the results of our --lsa mode).

I was pretty sure that the flow of control in similarity space is as you have described above: creation of context/word vectors -> creation of the similarity matrix -> clustering in similarity space. However, I went ahead and verified this, and it is as we expected it to be. So in short, we are fine!

Thanks,
Anagha
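[To make the order of those steps concrete, here is a minimal Perl sketch of the similarity-matrix stage: pairwise cosine similarity computed over context vectors that have already been built. It only illustrates the flow Anagha describes; it is not the actual simat or bitsimat code, and the toy vectors are invented.]

    use strict;
    use warnings;

    # cosine similarity between two dense vectors
    sub cosine {
        my ( $u, $v ) = @_;
        my ( $dot, $nu, $nv ) = ( 0, 0, 0 );
        for my $i ( 0 .. $#$u ) {
            $dot += $u->[$i] * $v->[$i];
            $nu  += $u->[$i]**2;
            $nv  += $v->[$i]**2;
        }
        return $nu && $nv ? $dot / sqrt( $nu * $nv ) : 0;
    }

    # toy context vectors, as produced by the representation step
    my @contexts = ( [ 1, 0, 2 ], [ 0, 1, 2 ], [ 1, 1, 0 ] );

    # the similarity matrix is built only after the vectors exist
    for my $a ( 0 .. $#contexts ) {
        for my $b ( 0 .. $#contexts ) {
            printf "%5.3f ", cosine( $contexts[$a], $contexts[$b] );
        }
        print "\n";
    }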
From: Mahesh J. <joshi031@d.umn.edu> - 2006-06-13 17:55:38
Hi Ted,

I too think the idea of having a "--lsa" option is better. It gives a convenient switch internally for programming purposes (rather than multiple option values to handle) and also maintains backwards compatibility, which would otherwise have been a concern.

As you mention, let us stick to "--wordclust" as the option name for feature clustering for now, with the understanding that it also provides feature clustering, and we will have explicit and visible documentation saying so (for the option itself, in the CHANGELOG, and any other places). This has the further advantage of maintaining absolute backwards compatibility (not even renaming the option).

I do understand that training data does not make sense for feature clustering; however, I am not sure about the headed/headless issue, so I will not comment on that for now.

Thanks,
Mahesh

On Tuesday, Jun 13, 2006, at 9:42 AM, ted pedersen wrote:

> Hi Anagha,
>
> Thanks for your comments and suggestions.
>
>> I like your second idea of adding a new option "--lsa" - it looks
>> cleaner.
>
> Yes, I found myself liking the fact that it makes the lsa connection
> explicit, which I think will help avoid option overload.
>
>> For the issue with using the option name "--wordclust" for both
>> word-clustering and feature-clustering - maybe you could use something
>> more generic like "--termclust" ?
>
> Mahesh and I discussed --termclust a little, but I was not crazy about
> the idea because "term" has a specific meaning, and I don't think it
> will include all of our different bigrams or co-occurrences, for example.
>
> One option would be the more accurate --featclust, which would imply
> feature clustering. This is perhaps a better option than --wordclust,
> which really clearly says/means "word" clustering, and while that is
> what we support now, in future what we support will be more generic...
>
> Of course, we might also want to be consistent with respect to how we
> specify context clustering. We simply say
>
> --context o1
> --context o2
>
> That is actually quite nice I think, as it is clear and relatively
> clean. Unfortunately an option like
>
> --feature
> --feature --lsa
>
> is too vague and it's sort of confusing. Mahesh and I had talked about
> the idea of an option like
>
> --rowclustering
>
> instead of --wordclust, but there are some options for svd and cluto
> that start with --row, so I'm a little concerned about overloading that.
>
> In some respects I would like to find an alternative to wordclust,
> which is both a little awkward and also going to be inaccurate. Ideally
> it would be somewhat "symmetric" to the --context option...
>
> For now, I think I prefer --wordclust to --featclust and --termclust,
> but I am not sure that I am convinced it is the best possible option
> name...
>
> I will admit that I am growing relatively fond of the --lsa convention,
> but am still open to other ideas.
>
>> As to how the current restrictions will translate to the new "lsa"
>> mode - i think, headed or headless either type of data should be fine.
>> But the restriction on no-training data would persist, i think.
>
> Just to clarify that, I think for word clustering we still do not want
> to allow training data (it doesn't really make sense), but for context
> clustering it should of course be ok to have training data.
>
> For word clustering, I am not sure about the issue of headed or
> headless data. Right now we only allow headless data, I think. So
> perhaps we would want to retain that distinction?
>
> Thanks,
> Ted
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
From: Anagha K. <kulka020@d.umn.edu> - 2006-06-13 17:50:21
> Just to clarify that, I think for word clustering we still do not want to
> allow training data (it doesn't really make sense), but for context
> clustering it should of course be ok to have training data.

Yes, I agree.

> For word clustering, I am not sure about the issue of headed or headless
> data. Right now we only allow headless data, I think. So perhaps we would
> want to retain that distinction?

I went back and looked at our correspondence regarding this issue of performing word clustering only with headless data, and more or less the summary is that we did not want to restrict word clustering to finding words similar to some specific target word, but wanted to cluster as many open-class words as possible into sets of related words.

So I would like to take back what I had suggested regarding feature clustering and the type of data. I think we should carry the restriction of using only headless data with word clustering forward to feature clustering too.

Thanks,
Anagha
From: ted p. <tpederse@d.umn.edu> - 2006-06-13 15:09:20
Hi Anagha and Mahesh,

Just a few thoughts here on the --lsa option. I think when we use that option we are saying two things...

1) represent features with respect to the contexts in which they occur. This will require the use of order1vec, which will figure out which contexts include the feature, and produce a context by feature matrix.

2) transpose that context by feature matrix created in 1).

Now, 1) is a little confusing since when we use --context o2 --lsa we are asking for an order 2 context representation, but we will create it using order1vec. We will create a context by feature representation with order1vec, transpose it, and then use the resulting feature by context representation as input to order2vec to build the representation of the context vectors to be clustered. Actually, that isn't so confusing...

If we do --wordclust --lsa then we are simply saying create a context by feature representation, again with order1vec, then transpose that, and take the resulting feature by context matrix and cluster that.

Note that in both of the above cases we should be able to use svd after the transpose step.

Also, we should think about --space similarity. Are there any issues associated with that we should be aware of? In my mind we should still have the ability to create similarity spaces as we do now, since I think the similarity matrices are created *after* the context representation is created. But, we should of course check that and make sure everything will continue to work (and that simat and bitsimat will work on the results of our --lsa mode).

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
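[The transpose in step 2) is the pivotal operation here. Below is a minimal Perl sketch of that step alone, reading a small dense context by feature matrix (one row per context, whitespace-separated values) and printing the feature by context transpose. The input format is assumed for illustration; order1vec's real output format is handled by the package itself.]

    #!/usr/bin/perl -w
    use strict;

    # slurp the context by feature matrix, one context per row
    my @matrix;
    while (<>) {
        chomp;
        push @matrix, [ split ];
    }

    my $rows = scalar @matrix;
    my $cols = $rows ? scalar @{ $matrix[0] } : 0;

    # print the feature by context transpose, one feature per row
    for my $j ( 0 .. $cols - 1 ) {
        print join( ' ', map { $matrix[$_][$j] } 0 .. $rows - 1 ), "\n";
    }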
From: ted p. <tpederse@d.umn.edu> - 2006-06-13 14:45:07
Hi Anagha,

Thanks for your comments and suggestions.

> I like your second idea of adding a new option "--lsa" - it looks cleaner.

Yes, I found myself liking the fact that it makes the lsa connection explicit, which I think will help avoid option overload.

> For the issue with using the option name "--wordclust" for both
> word-clustering and feature-clustering - maybe you could use something
> more generic like "--termclust" ?

Mahesh and I discussed --termclust a little, but I was not crazy about the idea because "term" has a specific meaning, and I don't think it will include all of our different bigrams or co-occurrences, for example.

One option would be the more accurate --featclust, which would imply feature clustering. This is perhaps a better option than --wordclust, which really clearly says/means "word" clustering, and while that is what we support now, in future what we support will be more generic...

Of course, we might also want to be consistent with respect to how we specify context clustering. We simply say

--context o1
--context o2

That is actually quite nice I think, as it is clear and relatively clean. Unfortunately an option like

--feature
--feature --lsa

is too vague and it's sort of confusing. Mahesh and I had talked about the idea of an option like

--rowclustering

instead of --wordclust, but there are some options for svd and cluto that start with --row, so I'm a little concerned about overloading that.

In some respects I would like to find an alternative to wordclust, which is both a little awkward and also going to be inaccurate. Ideally it would be somewhat "symmetric" to the --context option...

For now, I think I prefer --wordclust to --featclust and --termclust, but I am not sure that I am convinced it is the best possible option name...

I will admit that I am growing relatively fond of the --lsa convention, but am still open to other ideas.

> As to how the current restrictions will translate to the new "lsa" mode
> - i think, headed or headless either type of data should be fine. But
> the restriction on no-training data would persist, i think.

Just to clarify that, I think for word clustering we still do not want to allow training data (it doesn't really make sense), but for context clustering it should of course be ok to have training data.

For word clustering, I am not sure about the issue of headed or headless data. Right now we only allow headless data, I think. So perhaps we would want to retain that distinction?

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-06-13 13:37:18
On Tue, 13 Jun 2006, Anagha Kulkarni wrote:

Thanks Anagha,

> I would like to clarify just one point - if a user requests feature
> clustering and svd then svd will be applied to the transposed matrix
> (feature by context) and not to the context by feature matrix, right?

Yes. SVD should always be done on the representation that we are going to cluster. So we should be able to do SVD with word/feature clustering and context clustering.

So, we should be able to do the following... btw, while I am not sure if the --lsa option is the best way to go, I do find it convenient as a shorthand.

--context o1
--context o1 --svd
--context o2
--context o2 --svd
--context o2 --lsa
--context o2 --lsa --svd

--wordclust
--wordclust --svd
--wordclust --lsa
--wordclust --lsa --svd

So this does suggest a slight possible confusion with the --lsa option, in that it does not imply svd is being used; svd must still be requested. That is ok, I think. So in effect, --lsa means that we want the feature by context representation, and we may optionally apply svd to that.

> With respect to the ripple effect, whenever we add a new script to
> SenseClusters (more specifically to Toolkit) I typically do the
> following things (some of the points below are obvious but I went ahead
> and included them anyway) - let me know if you find anything that
> should be in this list but is not:
> 1. if applicable, update Docs/Flows/flowchart.*

Yes, the --lsa changes will require flowchart updates.

> 2. add FILE.html documentation file to Docs/HTML/Toolkit_Docs/DIR
> 3. update Docs/HTML/SenseClusters-Code-README.* to link the html file
>    added in 2. above
> 4. update Docs/HTML/discriminate.html
> 5. copy the new Docs/HTML/discriminate.html as Web/SC-htdocs/help.html
> 6. update Makefile.PL
> 7. create a new folder under Testing/ for the new script and add
>    test-cases
> 8. modify the web-interface
> 9. update the Changes/Changelog-v*.txt

In addition, in this case I think the overall documentation of the package (the main README) will need some revising, to reflect the fact that we are now supporting LSA and that we have added a new sort of representation to the package. It is probably time to revisit our overall documentation anyway, so this can be a part of that.

Thanks!
Ted
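[Those combinations suggest a simple validity check at option-parsing time. The Perl sketch below shows one hypothetical way to enforce them with Getopt::Long; it is not the actual discriminate.pl code, and the option names are simply the ones proposed in this thread.]

    use strict;
    use warnings;
    use Getopt::Long;

    my ( $context, $wordclust, $lsa, $svd );
    GetOptions(
        'context=s' => \$context,      # o1 or o2
        'wordclust' => \$wordclust,
        'lsa'       => \$lsa,
        'svd'       => \$svd,
    ) or die "could not parse options\n";

    # --lsa is only valid with --context o2 or --wordclust, and it
    # does not imply svd; --svd must be requested separately
    if ( $lsa
        and not( $wordclust or ( defined $context and $context eq 'o2' ) ) )
    {
        die "--lsa requires --context o2 or --wordclust\n";
    }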
From: Anagha K. <kulka020@d.umn.edu> - 2006-06-13 07:36:57
Hi Ted,

I like your second idea of adding a new option "--lsa" - it looks cleaner.

For the issue with using the option name "--wordclust" for both word-clustering and feature-clustering - maybe you could use something more generic like "--termclust" ?

As to how the current restrictions will translate to the new "lsa" mode - I think headed or headless, either type of data, should be fine. But the restriction on no training data would persist, I think.

Thanks,
Anagha

ted pedersen wrote:

> Hi Mahesh,
>
> We have been discussing the naming conventions and terminology that
> we should use for "word clustering" versus context clustering, and how
> in general lsa support should be incorporated into discriminate.pl
>
> One important point that we've made is that order1 and order2 only apply
> to context clustering. order1 refers to representing a context with
> a vector that shows the features that occur in that context, and order2
> refers to representing a context with a vector that is an average of
> the vectors that represent the words or features in the contexts.
>
> Now, with our support for feature by context representation that is in
> the works, we will introduce a new type of order2 representation. Rather
> than representing words in the contexts to be clustered with vectors
> consisting of other words (the co-occurrences of the words) we will be
> able to represent the contexts to be clustered by averaging together
> vectors of features that represent the contexts in which those features
> occur. So we will have a word by word representation (current o2) and a
> feature by context representation (new order 2).
>
> Right now we have in discriminate.pl the option
>
> --context o1
> or
> --context o2
>
> we need something that indicates our new order 2, that is the one that
> uses the feature by context vectors to represent the context to be
> clustered.
>
> One idea might be to simply create a new value for context, like...
>
> --context o2_lsa
>
> Another idea would be to create a new "switch" that would turn on "lsa"
> style processing, which would mean rather than using a word by word
> representation, we would use feature by context...
>
> --context o2 --lsa
>
> The idea here would be that the --lsa switch could also be applied to
> our --wordclust option, to essentially change the word clustering option
> from word by word to feature by context (and thereby cluster features).
>
> --wordclust --lsa
>
> This plan of attack *might* have the benefit of minimizing the changes
> required in discriminate.pl, but I am not sure of that. The possible
> drawback is that --wordclust means "word clustering" and --wordclust
> --lsa actually means feature clustering rather than word clustering...
>
> Now, the advantage of this is that --lsa makes it very clear where we
> are using lsa and where we are not, and I think that is a good thing,
> since I want that to be clear when we introduce this functionality.
>
> So these would be the main "modes" of operation in SenseClusters after
> the inclusion of the LSA support.
>
> --context o1
> --context o2
> --context o2 --lsa
> --wordclust
> --wordclust --lsa
>
> The --lsa option would only be allowed with --context and --wordclust,
> and it would not be valid with --context o1.
>
> BTW, --wordclust should only allow for headless data, and it should not
> be possible to use training data. These are the current restrictions,
> and I think they remain valid for --lsa mode.
>
> So this is one idea that I was kicking around. There are more, but I
> wanted to get the discussion started sooner than later.
>
> Any drawbacks to the above that are apparent?
>
> Thanks,
> Ted
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-06-13 02:46:43
Hi Mahesh,

We have been discussing the naming conventions and terminology that we should use for "word clustering" versus context clustering, and how in general lsa support should be incorporated into discriminate.pl

One important point that we've made is that order1 and order2 only apply to context clustering. order1 refers to representing a context with a vector that shows the features that occur in that context, and order2 refers to representing a context with a vector that is an average of the vectors that represent the words or features in the contexts.

Now, with our support for feature by context representation that is in the works, we will introduce a new type of order2 representation. Rather than representing words in the contexts to be clustered with vectors consisting of other words (the co-occurrences of the words) we will be able to represent the contexts to be clustered by averaging together vectors of features that represent the contexts in which those features occur. So we will have a word by word representation (current o2) and a feature by context representation (new order 2).

Right now we have in discriminate.pl the option

--context o1
or
--context o2

we need something that indicates our new order 2, that is the one that uses the feature by context vectors to represent the context to be clustered.

One idea might be to simply create a new value for context, like...

--context o2_lsa

Another idea would be to create a new "switch" that would turn on "lsa" style processing, which would mean rather than using a word by word representation, we would use feature by context...

--context o2 --lsa

The idea here would be that the --lsa switch could also be applied to our --wordclust option, to essentially change the word clustering option from word by word to feature by context (and thereby cluster features).

--wordclust --lsa

This plan of attack *might* have the benefit of minimizing the changes required in discriminate.pl, but I am not sure of that. The possible drawback is that --wordclust means "word clustering" and --wordclust --lsa actually means feature clustering rather than word clustering...

Now, the advantage of this is that --lsa makes it very clear where we are using lsa and where we are not, and I think that is a good thing, since I want that to be clear when we introduce this functionality.

So these would be the main "modes" of operation in SenseClusters after the inclusion of the LSA support.

--context o1
--context o2
--context o2 --lsa
--wordclust
--wordclust --lsa

The --lsa option would only be allowed with --context and --wordclust, and it would not be valid with --context o1.

BTW, --wordclust should only allow for headless data, and it should not be possible to use training data. These are the current restrictions, and I think they remain valid for --lsa mode.

So this is one idea that I was kicking around. There are more, but I wanted to get the discussion started sooner than later.

Any drawbacks to the above that are apparent?

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
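[To pin down what the new order 2 means, here is a toy Perl computation of it: a context is represented by averaging the feature by context vectors of the features it contains. The feature names and vectors are invented for illustration; this is a sketch of the idea, not of order2vec itself.]

    use strict;
    use warnings;

    # toy feature by context vectors: each feature maps to a profile
    # over four contexts in the feature-selection data
    my %feature_vec = (
        stocks => [ 1, 0, 1, 0 ],
        bonds  => [ 1, 1, 0, 0 ],
        river  => [ 0, 0, 1, 1 ],
    );

    # the new order 2: average the vectors of the features found in
    # the context being represented
    my @features_in_context = qw(stocks bonds);
    my @avg = (0) x 4;
    for my $f (@features_in_context) {
        $avg[$_] += $feature_vec{$f}[$_] for 0 .. 3;
    }
    $_ /= @features_in_context for @avg;

    print "@avg\n";    # the order-2 representation of this context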
From: ted p. <tpe...@ma...> - 2006-06-02 03:18:52
Hi Mahesh,

I just wanted to record a few thoughts and ideas from our meeting of Wed May 31 before they slipped away from me. I have cc'd Anagha on this since we get into some issues relating to SenseClusters changes and versions, and I want to make sure we don't clash. I also copied this to the SenseClusters developers list, as that provides a nice archiving mechanism.

We discussed a plan of action that orders things more or less like this:

1) vector/matrix CPAN module, to perform transpose of order1vec output (which is context by feature).

2) perform feature clustering based on transposed output of order1vec. This represents a second kind of word clustering; the one we currently support (via --wordclust) uses a word by word representation.

In thinking about both of these methods of word clustering, in some sense they both represent first order methods. The word by word representation clusters words based on the words they co-occur with, and the feature by context representation clusters features based on the contexts in which they occur. So one is a word co-occurrence based method (--wordclust, the existing method), while the new method is more a context co-occurrence technique (features will be clustered together if they occur in similar contexts). It is not yet clear to me how to exactly articulate or phrase this, but I don't think calling them first and second order word clusters is quite right.

We discussed whether or not we should extend the word by word matrices now used to create 2nd order representations to be feature by feature. We decided that this was probably not too essential at this point, in that if someone really wanted to cluster features they could use our new feature by context representation.

After 1 and 2 are completed, we will release a new version of SenseClusters that includes the new word clustering method. This new version should include support in discriminate.pl for this, test cases, and support in the web interface.

The ordering of the points below is a little less clear. 1 and 2 are clearly sequential; we need to think a little more about the points below before ordering, I think.

3) add support for "feature matching" to order2vec.pl; currently it just matches words (unigrams) in the context with those in the word by word matrix. This will allow us to create second order representations of contexts where features are replaced with a vector of contexts. These vectors would of course be created by point 2 above.

So we will have two ways to create second order representations. The first is what we now provide, where words are replaced with vectors of word co-occurrences. The second (the new way) will be to replace features with vectors of context co-occurrences. Both our new way of clustering features (feature by context) and our new way of second order representation (replace features in context with vector of context co-occurrences) are very similar to LSA.

4) In discussing feature matching, you proposed a very interesting idea that makes sense to me. Rather than using the xml2arff methodology, which matches the contexts to be represented with regular expressions, you proposed that we run NSP without any cutoffs (frequency scores, etc.) on the test data (as well as the training data) and then find out if the candidate features identified in the test data actually are features according to our feature selection data. This has the potential to be much faster, and if so we would want to do this with both order1vec.pl and order2vec.pl. order1vec.pl is currently based on xml2arff, and it is very slow. The new order2vec.pl, that would match features, would also likely be based on xml2arff, and as such could also be very slow. So this method of matching features might allow us to speed up the existing order1vec, and extend order2vec to features without making it slower.

5) add support for the automatic generation of stoplists. We have several options here; one is to create a standalone utility that would generate a list of stopwords based on something like tf/idf (see the sketch after this message). We could also do this internally in SenseClusters, where we look at the feature by context representation, and remove those features that occur in "too many" contexts. We would also like to be able to provide tf/idf scores in our feature by context representation, which suggests that order1vec.pl would need to be extended to output these values (right now it supports binary values and frequency counts).

The standalone idea would result in a stoplist that would simply be input exactly like the stoplists we now use. We would not be able to use the tf/idf scores internal to SenseClusters, but we would be able to quickly derive stoplists for domain specific corpora or other languages. In NSP we have a mode for count where each line is considered to represent a context. We could use that when the data is formatted like that; otherwise we could simply define a value N that tells us how big a context is, and then we go through a corpus of plain text and figure out tf/idf based on that assumption. Note that instead of documents here we are talking about contexts.

Of the above, I would rate 3 as essential, and 4 and 5 as highly desirable.

So, the above is pretty much taken off the top of my head, and so it is possible I have missed some important points, or said things poorly. Please do add any comments, additions, or disagreements you may have. I think it is important to hammer out a plan for 3, 4, and 5 as soon as possible, since that will help us plan the rest of the summer pretty well. The most important thing is to try and anticipate all the changes we need or want to make now, rather than adding them later; that doesn't tend to work too well.

Anagha, any comments or observations you have are of course welcome. If you have any concerns about any of the above being feasible or possibly clashing with some of your work, please do raise that asap so we can plan accordingly.

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
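[A minimal Perl sketch of the standalone stoplist idea in point 5, assuming one context per line of input and an arbitrary idf cutoff. Both assumptions are for illustration only; this is not an actual SenseClusters utility.]

    #!/usr/bin/perl -w
    use strict;

    my $idf_cutoff = 1.0;    # hypothetical threshold, tune as needed

    my %context_count;       # number of contexts each word occurs in
    my $num_contexts = 0;

    # treat each input line as one context
    while (<>) {
        $num_contexts++;
        my %seen;
        for my $word ( split /\W+/, lc $_ ) {
            next if $word eq '' or $seen{$word}++;
            $context_count{$word}++;
        }
    }

    # words that occur in "too many" contexts have low idf and
    # become candidate stopwords
    for my $word ( sort keys %context_count ) {
        my $idf = log( $num_contexts / $context_count{$word} );
        print "$word\n" if $idf < $idf_cutoff;
    }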
From: ted p. <tpe...@ma...> - 2006-06-01 03:16:39
Final list of changes for version 0.89 of the Knoppix CD.

---------- Forwarded message ----------
Date: Wed, 31 May 2006 18:29:59 -0500
From: Anagha Kulkarni <kulka020@d.umn.edu>
To: ted pedersen <tpederse@d.umn.edu>
Subject: Knoppix CD

Hi Ted,

Following is the list of things that I found might need some changing in the latest Knoppix CD:

* Time zone.
* Icons for SC Data folder and SC Live! browser overlap.
* Some more information about SC on the homepage - maybe from README.SC.pod's synopsis/introduction??
* The transition from the first paragraph to the note that follows feels a bit abrupt - maybe we can put the note in [] or use a smaller and different font.
* Use the README.SC.html from http://senseclusters.sourceforge.net/README.SC.html (The current one has unparsed items)
* Move the (Docs/HTML) html pages (and the Toolkit_Docs dir) under the "documentation" link to htdocs
* FAQ - remove the question about ClusterStopping
* FAQ - the question about email needs updating - headless mode of SC. (Looks like the FAQ document in general might need some updating.)
* Link to SenseClusters on the Publication page is the external link - change to the local link
* Web-interface:
  - Increase the font size of the text "SenseClusters Web Interface" in the banner
  - Change the SC external link to the local link in the banner
  - Copy SC/Docs/HTML/discriminate.html to htdocs/SC-htdocs/help.html

Thanks,
Anagha
From: Anagha K. <kulka020@d.umn.edu> - 2006-06-01 00:16:45
* Update FAQs document
* Web/SC-htdocs/help.html not updated to the latest Docs/HTML/discriminate.html
* Web-interface: if an experiment fails, the reason/error is logged into the logfile but does not get displayed in the browser.
* Web-interface: when experimenting with word-clustering, if the option of setting the #clusters manually is selected, then on the final screen the specified #clusters is not displayed.
* Add README.SC.html to the distribution.
* If an input file is split into training and test data and both scopes (train and test) are specified, then the train-scope gets applied to the test data instead of being applied to the train data.
From: ted p. <tpederse@d.umn.edu> - 2006-05-28 04:38:53
We are pleased to announce the release of SenseClusters version 0.89. This includes a small but important fix to 0.87, which itself included a small but important fix to 0.85. So, you probably want to make sure you are running 0.89 to avoid these small but important problems or discrepancies that we found in the earlier releases!

You can download this version from:

http://senseclusters.sourceforge.net/
or
http://www.d.umn.edu/~tpederse/senseclusters.html

Here are the Changelogs for both 0.89 and 0.87.

First, in 0.87:

Changes made in SenseClusters version 0.85 during version 0.87

Ted Pedersen tpederse@d.umn.edu
Anagha Kulkarni kulka020@d.umn.edu

1. Fixed a bug in clusterstopping.pl related to the case of an empty column, i.e., when a feature does not occur in any of the contexts/instances. -Anagha

2. Updated INSTALL and Makefile.PL to require v0.03 of Algorithm::RandomMatrixGeneration. -Anagha

(Changelog-v0.85to0.87 Last Updated on 05/16/2006 by Anagha)

------------------------------------------------------------------------

And then in 0.89:

Changes made in SenseClusters version 0.87 during version 0.89

Ted Pedersen tpederse@d.umn.edu
Anagha Kulkarni kulka020@d.umn.edu

1. Modified the Makefile.PL and INSTALL document to require v0.04 of Algorithm::RandomMatrixGeneration instead of 0.03 -Anagha

2. Changed the default precision from 4 to 6 in discriminate.pl and Web/SC-cgi/first.cgi -Anagha

(Changelog-v0.87to0.89 Last Updated on 05/27/2006 by Anagha)

------------------------------------------------------------------------

Let us know if you have any questions, comments, or requests!

Enjoy!
Ted and Anagha

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: Anagha K. <kulka020@d.umn.edu> - 2006-05-10 05:56:16
Notes on how to statistically assess/compare the performance of different algorithms/experimental settings when using the same datasets with each of the algorithms. From "Empirical Methods for Artificial Intelligence" by Paul R. Cohen, more specifically from Chapter 4 - "Hypothesis Testing and Estimation" and largely Chapter 6 - "Performance Assessment".

--------------------------------------------------------------------------

What is it exactly that we wish to show/prove?

Given results such as those below, we want to primarily show that the 5 settings are not all performing equally. We also want to show:

- as a sanity check, that A is significantly different from E (and B is significantly different from E)
- A is better than C (and B is better than D)
- and in this case, that A and B are not significantly different.

Note: The numbers in brackets are the #clusters used by that experimental setting.

       A          B          C          D          E
    Order1(2)  Order2(2)  Order1(6)  Order2(6)  Baseline
    -----------------------------------------------------
    94.88      96.22      61.44      76.17      55.45
    60.11      59.16      51.37      51.89      50.00
    68.42      70.26      54.37      57.57      50.00
    53.09      68.95      51.23      63.39      51.41
    89.15      91.03      60.12      54.37      50.45

To show all the above we start with hypothesis testing, and thus define the hypotheses:

- the null hypothesis (H0): all five settings are performing equally.
- the alternative hypothesis (H1): the five settings are not equal.

Now we will analyze the variance in the above results, i.e., we will perform "analysis of variance" to show which of the differences in the performances are statistically significant and which are not.

Note: Henceforth I will be referring to various terms and computations from the worksheet named "analysis of variance" in the attached excel sheet.

Rows 1-6 are the above data (x(i,j)) where i=5 (#rows) and j=5 (#cols).
Row 8 is the total of individual columns (settings/groups).
Row 9 is the mean/average (m(j)).
Row 10 is the standard deviation, computed as:

    s(j) = Sqrt(SummationOver_i((x(i,j) - m(j))^2)/(i-1))

Row 12 gives the Grand Mean (gm), which is computed by merging all the data (rows and columns) into a single sample of size N = 25 experiments (5 * 5 = 25) and then computing the mean as usual.
Row 13 gives the standard deviation (gs) for this sample of size N.

Rows 16 and 17 are intermediate calculations.

Note: Henceforth, the "within" term refers to computations performed over individual groups, i.e., columns/settings, while the "between" term refers to computations performed across groups.

    Row 16: w(j) = SummationOver_i((x(i,j) - m(j))^2)
    Row 17: b(j) = (m(j) - gm)^2

Rows 20-23 are the final table for the analysis of variance.

Row 21: The "between" group calculations:
- The degrees of freedom are computed as: j - 1 (#settings - 1).
- The Sum of Squared deviations = SummationOver_j(b(j)) * #experiments = SummationOver_j(b(j)) * i = SummationOver_j(b(j)) * 5
- Mean Square deviation: MS-between = SS-between / df-between

Row 22: The "within" group calculations:
- The degrees of freedom are computed as: N - j
- The Sum of Squared deviations = SummationOver_j(w(j))
- Mean Square deviation: MS-within = SS-within / df-within

F-value = MS-between / MS-within

Once we have the F-value we can look up the critical value in the F-distribution table (for different levels: 0.05, 0.01, etc.) with df-between (column index in the F-table) and df-within (row index in the F-table). In our example I have looked up the F-table for the 0.05 level, and the critical value for df-between = 4 and df-within = 20 was 2.87. Now since our F-value (4.38) is greater than the critical value found (2.87), if we were to reject the null hypothesis then the probability of being wrong would be less than 0.05! The value in the cell below the p-value (0.01050) is the exact p-value for the F-value of 4.38 with dfs of 4 and 20.

Thus we have statistically shown, using analysis of variance, that there is significant variability between groups in the above results and thus they are not equal. However, analysis of variance does not tell us which groups did better - for this we do pairwise comparisons. We do 2 such pairwise comparisons on each pair we are interested in: the Scheffe test and the Least Significant Difference (LSD) test. The Scheffe test is conservative while the LSD test is less stringent.

Note: Please refer to the worksheet named "Scheffe Tests".

    Scheffe test statistic for groups a & b
      = (m(a) - m(b))^2 / (MS-within * (1/#a + 1/#b) * (j-1))

Row 3 gives the m(j) values from the previous sheet.
Row 5 gives the MS-within value from the previous sheet.
Row 6 gives the degrees of freedom for between and within from the previous sheet.
Rows 9-17 give the Scheffe test statistic for various pairs and their corresponding p-values.

Note: Please refer to the worksheet named "LSD Tests".

    LSD test statistic for groups a & b
      = (m(a) - m(b))^2 / (MS-within * (1/#a + 1/#b))

Row 3 gives the m(j) values from the 1st sheet.
Row 5 gives the MS-within value from the 1st sheet.
Row 6: the degrees of freedom for this test are different from those used by the analysis of variance and by Scheffe's test. This test uses 1 and N - j (20) as the degrees of freedom.
Rows 9-17 give the LSD test statistic for various pairs and their corresponding p-values.

--------------------------------------------------------------------------
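[For a self-contained check of these numbers, here is a short Perl sketch that computes the one-way analysis of variance directly from the definitions above, using the accuracy table from this note. It reproduces the worksheet quantities (the F-value comes out near 4.38); it is an illustration, not part of SenseClusters.]

    #!/usr/bin/perl -w
    use strict;

    # the 5x5 accuracy table: rows are experiments, columns are
    # settings A-E
    my @data = (
        [ 94.88, 96.22, 61.44, 76.17, 55.45 ],
        [ 60.11, 59.16, 51.37, 51.89, 50.00 ],
        [ 68.42, 70.26, 54.37, 57.57, 50.00 ],
        [ 53.09, 68.95, 51.23, 63.39, 51.41 ],
        [ 89.15, 91.03, 60.12, 54.37, 50.45 ],
    );

    my $groups    = 5;                      # j, settings (columns)
    my $per_group = 5;                      # i, experiments per setting
    my $n         = $groups * $per_group;   # N = 25

    # group means m(j) and grand mean gm
    my @mean  = (0) x $groups;
    my $grand = 0;
    for my $j ( 0 .. $groups - 1 ) {
        $mean[$j] += $data[$_][$j] for 0 .. $per_group - 1;
        $mean[$j] /= $per_group;
        $grand    += $mean[$j];
    }
    $grand /= $groups;

    # SS-within = sum of w(j); SS-between = sum of b(j) * i
    my ( $ss_within, $ss_between ) = ( 0, 0 );
    for my $j ( 0 .. $groups - 1 ) {
        $ss_within  += ( $data[$_][$j] - $mean[$j] )**2
            for 0 .. $per_group - 1;
        $ss_between += ( $mean[$j] - $grand )**2 * $per_group;
    }

    my $df_between = $groups - 1;     # 4
    my $df_within  = $n - $groups;    # 20
    my $f = ( $ss_between / $df_between ) / ( $ss_within / $df_within );

    printf "F(%d,%d) = %.2f\n", $df_between, $df_within, $f;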
From: ted p. <tpederse@d.umn.edu> - 2006-05-09 01:07:36
We are pleased to announce the release of version 0.85 of SenseClusters. This release features our adaptation of the Gap Statistic, a state-of-the-art method for automatically identifying the number of clusters in a given set of data.

You can download this version from the links provided at:

http://senseclusters.sourceforge.net/
or
http://www.d.umn.edu/~tpederse/senseclusters.html

You can also find the web interface to version 0.85 available at these links.

With the Gap Statistic, there are now 4 different methods of finding the number of clusters automatically in SenseClusters. We will be presenting a demo of all of these at NAACL in New York City on June 6. You can see the paper that describes what we are demoing here:

Automatic Cluster Stopping with Criterion Functions and the Gap Statistic (Pedersen and Kulkarni), Appears in the Proceedings of the Demonstration Session of the Human Language Technology Conference and the Sixth Annual Meeting of the North American Chapter of the Association for Computational Linguistics, June 6, 2006, New York City.
http://www.d.umn.edu/~tpederse/Pubs/naacl06-demo.pdf

So, please check out this new version, and if you are at NAACL please visit our demo! We will also have Knoppix CDs available with SenseClusters already installed, so you can run it on your own PC without having to install anything.

Please let us know if you have any questions or comments!

Enjoy,
Ted and Anagha

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-05-07 22:45:26
Thanks! I will do this!

Ted

On Sun, 7 May 2006, Anagha Kulkarni wrote:

> Hi Ted,
>
> Along with SenseClusters-Code-README.html you will also have to upload
> the following files to sf:
> 1. discriminate.html present at SC/Docs/HTML
> 2. clusterstopping.html present at SC/Docs/HTML/Toolkit_Docs/clusterstop
>
> Sorry for missing this earlier.
>
> Thanks,
> Anagha
>
> ted pedersen wrote:
> > Hi Anagha,
> >
> > I am in the process of releasing SC. I have renamed the tar file as
> > SenseClusters-v0.85.tar.gz and the top level directory as
> > SenseClusters_v0.85, but otherwise made no changes to the release.
> >
> > I have updated the index.html and README.SC.html pages on sf, but
> > not this one:
> > http://senseclusters.sourceforge.net/SenseClusters-Code-README.html
> >
> > I think we have dealt with this issue before, and I will scan through
> > my email to see how we create that (I think just running a script),
> > but if you happen to know off the top of your head that would be great.
> >
> > If you could poke around my web pages (starting from my home page)
> > and make sure everything looks in order, that would be great. If you
> > could try to download and unpack the distribution, that would be
> > good too... let me know if anything looks amiss, and then I will
> > announce things. I am planning to announce on the corpora list also,
> > since it has been a while since we have done that....
> >
> > Thanks!
> > Ted
> >
> > On Sat, 6 May 2006, Anagha Kulkarni wrote:
> >
> >> Hi Ted,
> >>
> >> I have copied 2 files, namely SC_0.85.tar.gz & readme.tar.gz, to
> >> http://marimba.d.umn.edu/SC_0.85/
> >>
> >> The SC_0.85.tar.gz file contains the v0.85 distribution (I have removed
> >> the CVS files from all the directories) and the readme.tar.gz contains
> >> the file README.SC.html
> >>
> >> I will shortly switch the web-interface from v0.83 to v0.85 and will let
> >> you know once that is ready.
> >>
> >> Please let me know if you see any problem or would like any help.
> >>
> >> Thanks!
> >> Anagha
> >
> > --
> > Ted Pedersen
> > http://www.d.umn.edu/~tpederse
>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: Anagha K. <kulka020@d.umn.edu> - 2006-04-30 03:29:39
Hi Ted,

I took a look at this problem and, as you suspected, it was a minor bug in our 0.83 version. This is now fixed in v0.85.

Thanks for bringing this to my notice.

Anagha

ted pedersen wrote:
> Hi Anagha,
>
> When I run the toolkit.sh demo, I get a warning about something
> for discriminate.pl. It would be good if we looked at that
> sometime, just to make sure it's nothing horrible. I don't think
> it is, but since I noticed it I thought I would mention it.
>
> Here's what it is...
>
> In similarity space, I think:
>
> Use of uninitialized value in concatenation (.) or string at
> /usr/local/bin/discriminate.pl line 2031
>
> This was in authority.n.co.o2.similarity.rbr, but occurred
> in others too.
>
> I copied that manually since it was running on knoppix. But,
> I think I'm accurate!
>
> Thanks,
> Ted
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-04-17 07:59:04
Here are some E1 values based on data where there are exactly 8 clusters. What we can see is that the E1 values stabilize after that number of clusters is found, since at that point increasing the number of clusters does not change the overall inter-cluster similarity (since there are only 8 types of contexts in the data, further subdivisions are dividing contexts that are already the same...)

1-way clustering: [E1=2.26e+05] [800 of 800]
2-way clustering: [E1=1.95e+05] [800 of 800]
3-way clustering: [E1=1.67e+05] [800 of 800]
4-way clustering: [E1=1.42e+05] [800 of 800]
5-way clustering: [E1=1.20e+05] [800 of 800]
6-way clustering: [E1=1.02e+05] [800 of 800]
7-way clustering: [E1=8.83e+04] [800 of 800]
8-way clustering: [E1=8.00e+04] [800 of 800]
9-way clustering: [E1=8.00e+04] [800 of 800]
10-way clustering: [E1=8.00e+04] [800 of 800]
11-way clustering: [E1=8.00e+04] [800 of 800]
12-way clustering: [E1=8.00e+04] [800 of 800]
13-way clustering: [E1=8.00e+04] [800 of 800]
14-way clustering: [E1=8.00e+04] [800 of 800]
15-way clustering: [E1=8.00e+04] [800 of 800]
16-way clustering: [E1=8.00e+04] [800 of 800]
17-way clustering: [E1=8.00e+04] [800 of 800]
18-way clustering: [E1=8.00e+04] [800 of 800]
19-way clustering: [E1=8.00e+04] [800 of 800]
20-way clustering: [E1=8.00e+04] [800 of 800]
21-way clustering: [E1=8.00e+04] [800 of 800]
22-way clustering: [E1=8.00e+04] [800 of 800]
23-way clustering: [E1=8.00e+04] [800 of 800]

You'll notice in the perfect case that the E1 value at k=23 is 80,000, and we hit that value at k=8. That means that the inter-cluster similarity is no longer changing at that point, since we have perfectly separated data.

Now, if we look at some random data with the same marginal totals, we see a different situation...

1-way clustering: [E1=1.95e+05] [800 of 800]
2-way clustering: [E1=1.42e+05] [800 of 800]
3-way clustering: [E1=1.24e+05] [800 of 800]
4-way clustering: [E1=1.07e+05] [800 of 800]
5-way clustering: [E1=9.98e+04] [800 of 800]
6-way clustering: [E1=9.41e+04] [800 of 800]
7-way clustering: [E1=8.92e+04] [800 of 800]
8-way clustering: [E1=8.44e+04] [800 of 800]
9-way clustering: [E1=8.20e+04] [800 of 800]
10-way clustering: [E1=8.01e+04] [800 of 800]
11-way clustering: [E1=7.82e+04] [800 of 800]
12-way clustering: [E1=7.65e+04] [800 of 800]
13-way clustering: [E1=7.47e+04] [800 of 800]
14-way clustering: [E1=7.31e+04] [800 of 800]
15-way clustering: [E1=7.21e+04] [800 of 800]
16-way clustering: [E1=7.10e+04] [800 of 800]
17-way clustering: [E1=7.01e+04] [800 of 800]
18-way clustering: [E1=6.91e+04] [800 of 800]
19-way clustering: [E1=6.82e+04] [800 of 800]
20-way clustering: [E1=6.74e+04] [800 of 800]
21-way clustering: [E1=6.66e+04] [800 of 800]
22-way clustering: [E1=6.59e+04] [800 of 800]
23-way clustering: [E1=6.52e+04] [800 of 800]

We start at 195,000 (k=1) and then arrive at 65,200 (k=23), which makes sense, in that the inter-cluster similarity is very high at k=1 (there is only one cluster, and we are measuring the distance of the centroid of that cluster to the centroid of the data, which are essentially the same thing). So the similarity between clusters steadily decreases, while the internal similarity increases.

No conclusions here, just some raw data to think about...

Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
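[Reading a stabilization point off such a sequence can be automated. The Perl sketch below applies a deliberately naive rule to the perfectly separated E1 values above: report the smallest k after which E1 stops changing by more than a small relative epsilon. It only illustrates the shape of the raw data; it is not one of the cluster stopping measures SenseClusters actually implements, and the epsilon is arbitrary.]

    #!/usr/bin/perl -w
    use strict;

    # E1 values for k = 1 .. 10 from the perfectly separated data
    my @e1 = ( 2.26e5, 1.95e5, 1.67e5, 1.42e5, 1.20e5,
               1.02e5, 8.83e4, 8.00e4, 8.00e4, 8.00e4 );

    my $epsilon = 0.001;    # hypothetical relative-change cutoff

    for my $k ( 1 .. $#e1 ) {
        my $change = abs( $e1[$k] - $e1[ $k - 1 ] ) / $e1[ $k - 1 ];
        if ( $change < $epsilon ) {
            print "E1 stabilizes at k = $k\n";    # prints k = 8
            last;
        }
    }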
From: ted p. <tpederse@d.umn.edu> - 2006-03-28 16:07:22
Oh no! I forgot to include nameconflate!! I was thinking we should do that, so that if people have plain text at least they can create new data. Hmmmmm... There are a few small things I would like to resolve, so I might try to fix this too. Otherwise, I'd like to make sure to fix this for the naacl/aaai release. I think if we can include some public domain corpora too that would help (not gigaword, but there are others). Ah, there is always something.

It turns out to be easy to do things like removing buttons and toolbars. You simply boot your knoppix version, make those changes as a user would, and then you have an option to save a configuration file. You save that to a usb device, and then reboot into your "master copy" that you are creating on the hard drive, and copy the configuration file to /home/knoppix/KNOPPIX. So, I removed the Mozilla button, the Office button, those German toolbars :), and renamed the shortcut to SenseClusters Live!. So that was a nice surprise (easier than expected).

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-03-28 08:18:43
--
Ted Pedersen
http://www.d.umn.edu/~tpederse

---------- Forwarded message ----------
Date: Mon, 27 Mar 2006 19:40:58 -0600
From: Anagha Kulkarni <kulka020@d.umn.edu>
To: ted pedersen <tpederse@d.umn.edu>
Subject: Re: senseclusters live! testing

Hi Ted,

I am testing your SenseClusters Live! CD on talisker. It is just amazing! Very, very neat and nicely done!

I have a few observations below (I hope apart from what you have had):

1. Would the CDs have a paper cover? If yes, could and should we say something of the sort that "The system will take some time to boot and stabilize, please wait until you see the SenseClusters web-page." Because I *think* that is the period when the user might wonder about the progress.
2. Link to Sample Data missing??
3. Could we change the README file to the version that the SenseClusters homepage points to (http://senseclusters.sourceforge.net/README.SC.html)? Because if you look at the introduction section of the README on the CD you will see an unparsed =head1 tag.
4. I think the solution to the Browse problem is adding a forward-slash at the end of the directory name. This worked for me.

The following are all very minor points:

1. Could we remove the Bookmark toolbar from the Konqueror window?
2. In places a space (" ") is present between "v" and "0.83".
3. Should we remove the Firefox icon from the Start Panel? Or else should we also set the home-page to SenseClusters for Firefox too?

This is all I could catch! Very impressive - really! Please let me know if I can help.

Thanks,
Anagha

ted pedersen wrote:
> Hi Anagha,
>
> I've been doing some testing of the cd today, and this is what I found. I
> just wanted to make sure that if you saw one of these you would
> realize I saw it too. :) These are also notes to self to some degree...
>
> Things to be added:
>
> 1) explanation of the photo, and credit to the UMD photographer. Put in
> the acknowledgements of the main intro page to SC.
>
> 2) explanation that apache is configured to run standalone,
> so links to external sites don't work, but still might be
> useful (to have the url) or if someone reconfigures. Put on the main
> intro page to SC.
>
> Things to fix:
>
> 1) Not possible to browse results, file /localhost/SC-htdocs/userxxxxxxx
>
> all other links are ok, and the .tar file is created ok, so not sure why
> this is a problem. Permissions are set to rwxr-xr-x I think, so maybe
> that is the problem? needs to be rwxrwxrwx? Will also check if the owner
> matters. Current owner is www-data while we are usually running as
> knoppix. So maybe changing permissions or changing the owner will
> resolve it?
>
> 2) Rename the Knoppix Icon on the Desktop to SenseClusters (not sure how
> to do this).
>
> 3) Put Data in a more convenient location, maybe on the desktop.
> Currently in /usr/lib/htdocs/
>
> 4) The main SC web page is a little messy; the type for running from the
> command line is awkward and looks bad. Make the page more "clean".
>
> Maybe to do?
>
> 1) Add a stop of apachectl/httpd at runlevel 6 (which happens during
> shutdown). This is probably not necessary, but maybe nice.
>
> So, that is what I have found. I will keep testing, but generally
> speaking I feel like things are running ok.
>
> Thanks!
> Ted
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-03-27 19:51:01
Hi Anagha,

I think we have talked about this before but I can't remember the resolution. Sometimes the web interface will fail in an apparently mysterious way (and say something like "error opening file xxx"). But, if you look at the logfile the error is completely explainable. Like, you can't do evaluation on unannotated data, or you don't have enough features. Would it be possible to have the web interface output the more descriptive logfile error in addition to the general failure message?

This seems very familiar to me, so maybe we have already been through this, but I'm afraid I just can't remember!!

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpe...@cs...> - 2006-03-27 08:23:45
Hi Anagha,

There is an assumption made regarding the naming of the cgi-bin and htdocs directories in apache that is not necessarily always true. I am using apache 1.3 for the knoppix cd (since this is what knoppix uses), and there the cgi-bin directory defaults to /usr/lib/cgi-bin, and DocumentRoot (what we call htdocs) is /var/www. This poses a small problem for callwrap.pl since it assumes that cgi-bin and htdocs (named as that) will be on the same "level", as in /usr/lib/cgi-bin and /usr/lib/htdocs.

I have fixed this by configuring apache slightly differently than is usual for 1.3 (with the names mentioned above), and I think it is working ok. We might want to document this a bit more clearly, and say that we expect this particular arrangement and these names, and if anything else is used, even if Apache understands it, our programs will not.

BTW, I was using a newer version of apache earlier, but that seemed to cause some glitches in other packages (not senseclusters) so I decided to drop back to the version of apache included in Knoppix. I am not sure why they use such an old version; perhaps it is smaller, or something. Anyway, that's how this came up... this is probably peculiar to 1.3.

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
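[For concreteness, the kind of assumption described above amounts to a one-line path substitution. The fragment below is hypothetical (it is not the actual callwrap.pl code): it derives the htdocs directory from the script's own cgi-bin location and fails when the two are not siblings.]

    use strict;
    use warnings;
    use FindBin qw($Bin);    # directory of the running script,
                             # e.g. /usr/lib/cgi-bin

    # assume htdocs sits next to cgi-bin at the same "level"
    ( my $htdocs = $Bin ) =~ s{/cgi-bin$}{/htdocs};
    die "no htdocs directory next to cgi-bin at $Bin\n"
        unless -d $htdocs;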
From: ted p. <tpederse@d.umn.edu> - 2006-03-27 08:03:23
SenseClusters Live! CD (Knoppix) Mastering, March 2006

These are some notes that I created while mastering a Knoppix Linux distribution that features SenseClusters. I call it SenseClusters Live! (version 0.83). These are hardly complete, but hopefully they mention at least a few high points in the creation of this CD. Note that the creation is not quite yet finished, but I think it's close, and I am waiting for one of the longish steps this process has, so it seemed like a good time to try and write things down.

If you are not familiar with Knoppix Linux, the key innovation is that it allows you to create a CD that you can boot in order to run Linux and SenseClusters without having to install on the hard drive. This strikes us as a good idea since SenseClusters really only runs completely on Linux or Solaris, and installation of Linux on a hard drive is more than some folks want to attempt, and then installation of SenseClusters on top of that has a few wrinkles.

I have been doing my remastering using Ubuntu 5.10. I don't think that matters much, except that Ubuntu supported wireless networking out of the box, so to speak, and that has been a big help. The way you do this in general is to create a partition on a Linux distribution (Ubuntu) and then copy your Knoppix distribution CD to that partition. You get one CD worth of compressed data to build your distribution, so that limits you to 700MB. Knoppix is at approximately 700MB, so whatever you wish to add must be offset by packages in Knoppix that you remove.

I have been using Knoppix 4.0.2. I am generally following hints and tips found in the books "Knoppix Hacks" by Kyle Rankin (O'Reilly) and "Hacking Knoppix" by Scott Granneman (Wiley). These are both good, although Hacking Knoppix seems a bit more current, and is based on 4.0.2, while the Knoppix Hacks book is based on 3.4.0. I am also finding the following Howto to be very helpful:

http://www.knoppix.net/wiki/Knoppix_Remastering_Howto

The uncompressed Knoppix version 4.0.2 takes up 2.0 GB, and it compresses down to 700MB. The data and papers that are included with SenseClusters take up about 440MB uncompressed, and then SenseClusters and affiliated tools take up about 60MB more, so I needed to remove about 500MB uncompressed from the 2.0 GB that I started with. That's about 25%! So, with apt-get (the debian based package manager) I removed the following:

openoffice-de-en (300 mb!)
i18n files (100 mb)
mozilla-thunderbird (32 mb)
xboing (5 mb)
chromium (5 mb)
enigma and enigma-data (22 mb)
gaim and gaim-data (12 mb)
kpilot (5 mb)
kstars and kstars-data (20 mb)
gimp and gimp-data (27 mb)
various games...

The removing was trickier than I expected, because I think a few times I removed things I didn't realize were important, and then once I went through the long process of creating the iso file system and burning the CD, nothing worked too well. :) So I got very careful about this, and probably learned a lot about Linux packages as a result!

Then I installed the following with apt-get:

pdl (22 mb)
perl-doc (12.5 mb)

Note that I used apt-get for installation and removal when I could, since that is the Knoppix way. apt-get is a debian tool, and knoppix is derived from debian. But, there were some things not available as Debian packages - I also installed various CPAN modules, using the interactive cpan installer:

text-nsp
bit-vector
set-scalar
sparse
algorithm-munkres
XML::Simple (to display web interface output)

Then, I installed a few packages "by hand", which means compiling, making, etc... and copying the executables to a system directory (in my case I put all executables in /usr/local/bin):

cluto (scluster and vcluster)
svdpackc (las2)
SenseClusters (v0.83) (many programs, mostly .pl)

After all this, the total size of everything is 2.0 gb uncompressed. After compressing, it is 683mb, which just fits onto a CD.

In addition to the package installation, there was some configuration that needed to be done, perhaps most trickily Apache. For whatever reason Knoppix uses Apache 1.3 (whereas the current version is 2.2.0), so I needed to make a few small changes to the apache configuration to make sure our code would work. The biggest change was probably the default location of Scripts and DocumentRoot. So, I set the scripts directory to /usr/lib/cgi-bin and the DocumentRoot directory to /usr/lib/htdocs (in Apache 1.3 DocumentRoot defaults to /var/www). Then, I set the Listen and Port values as follows:

Listen 127.0.0.1:3279
Port 3280

If you happen to set the ports to the same value, nothing works!!!!

knoppix does not automatically start apache, so I added a small startup script to /etc/rc[2-5].d and rc.local

One small Perl issue... I needed to add a symbolic link at /usr/local/bin/perl pointing to /usr/bin/perl, since /usr/local/bin/perl did not exist and that is what is referenced in the web interface scripts:

ln -s /usr/bin/perl /usr/local/bin/perl

Other setup tasks...

- modify /etc/profile to include the path to NSP measures (due to a quirk in how NSP searches for measures, using the PATH rather than @INC)
- modify /tmp to be rwx for all users?? (not sure I really had to do this or not, but at some point I did and it seemed to help)
- change boot.msg to indicate this is SenseClusters Live!
- change background.jpg to a lovely picture of Duluth :)
- change resolv.conf for networking, at least while remastering. Then change it back, or you'll distribute something that has your ip address, etc. in it.

There was a bit of work done in setting up local web pages for presenting easy to use links to the web interface, and to the data and papers that we also make available on this cd. Nothing there was so unique that it bears mentioning here; just remember that some of that is found in /home/linuxiso/KNOPPIX and some of it is found in /usr/lib/htdocs.

So that's some of what I did. It's a little more complex than I expected, but fortunately most everything seems to be working now! I will update this if there are new significant pieces of information that I might wish to remember in a few months' time!

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-03-27 04:55:23
Hi Anagha,

When I run the toolkit.sh demo, I get a warning about something for discriminate.pl. It would be good if we looked at that sometime, just to make sure it's nothing horrible. I don't think it is, but since I noticed it I thought I would mention it.

Here's what it is...

In similarity space, I think:

Use of uninitialized value in concatenation (.) or string at /usr/local/bin/discriminate.pl line 2031

This was in authority.n.co.o2.similarity.rbr, but occurred in others too.

I copied that manually since it was running on knoppix. But, I think I'm accurate!

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-03-22 03:29:03
I will be attending EACL in Trento, Italy, April 3-7, and I will be doing three different presentations that revolve around SenseClusters. Please plan on attending any or all of these. There is one paper, one tutorial, and one demo, so you get a little bit of everything.

First, on April 3 I will present the following paper at the Cross-Language Knowledge Induction Workshop (http://www.site.uottawa.ca/~diana/eacl2006-clki-workshop.html):

Improving Name Discrimination: A Language Salad Approach (Pedersen, Kulkarni, Angheluta, Kozareva, and Solorio), Appears in the Proceedings of the EACL 2006 Workshop on Cross-Language Knowledge Induction, April 3, 2006, Trento, Italy.
http://www.d.umn.edu/~tpederse/Pubs/eacl2006-salad.pdf

This is very fun work that I like very much, where we have mixed together English with Bulgarian, Romanian and Spanish in order to improve name discrimination. As crazy as that sounds, it works pretty well. :)

Second, the next day (April 4) I will present a tutorial that focuses on the methods that are implemented in SenseClusters. This tutorial will also feature the unveiling and debut of our new SenseClusters Live! CD. This is a Knoppix based Linux distribution that includes SenseClusters and lots of data, and you can run it from the CD without having to install Linux or SenseClusters on your hard drive. I will have extra CDs available, so even if you don't come to the tutorial you can get one, and we will also have an iso version of this posted, so if you aren't at EACL you can download it and burn it onto a CD, just like you do for Linux. Here's a short description of the tutorial, which will be on the afternoon of April 4:

http://eacl06.itc.it/tutorials/tutorial.htm#TU03

Third, the *next* day (April 5) I will present a demo of the new cluster stopping techniques found in SenseClusters. Those are described in the following paper:

Selecting the "Right" Number of Senses Based on Clustering Criterion Functions (Pedersen and Kulkarni), Appears in the Proceedings of the Posters and Demo Program of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics, April 5-7, 2006, Trento, Italy.
http://www.d.umn.edu/~tpederse/Pubs/eacl2006-demo.pdf

If you haven't already gotten your SenseClusters Live! CD by this time, please stop by and see the demo and get a CD. I will be in Demo Session 2 on April 5, and it looks like there are quite a few demos of interest at all the sessions, so please plan on visiting several of them.

http://eacl06.itc.it/posters-demos/posters.htm

So, if you are at EACL please do come to some or all of these events. They are all really different so you won't get bored (I promise :). Your questions or comments on any of the above are of course most welcome.

See you in Trento!

Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: ted p. <tpederse@d.umn.edu> - 2006-02-09 00:00:28
We are pleased to announce the release of version 0.83 of SenseClusters. We have made a larger version increment than usual (from 0.73 to 0.83) to make the point that there is significant new functionality in the package as of 0.83. You can download this new version from:

http://www.d.umn.edu/~tpederse/senseclusters.html
or
http://senseclusters.sourceforge.net

In particular, we have incorporated support for automatically identifying the number of clusters in a given data set. There are three methods provided, and they are described more completely in the following paper that will appear at EACL (in conjunction with a demo) in April:

Selecting the "Right" Number of Senses Based on Clustering Criterion Functions (Pedersen and Kulkarni), To appear in the Proceedings of the Posters and Demo Program of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics, April 3-7, 2006, Trento, Italy.
http://www.d.umn.edu/~tpederse/Pubs/eacl2006-demo.pdf

You can also try out this new functionality on our web interface, available at:

http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi

Please do give this a try. This is a very significant enhancement to the package. Your comments are particularly welcome as we seek to improve and expand our ability to automatically identify the number of clusters in a given set of data.

Enjoy,
Ted and Anagha

========================================================================

Below is a copy of the ChangeLog for Version 0.83.

1. Added Toolkit/clusterstop/clusterstopping.pl -Anagha
2. Integrated clusterstopping.pl with discriminate.pl -Anagha
3. Added test-cases for clusterstopping.pl -Anagha
4. Modified the web-interface to support clusterstopping -Anagha
5. Modified/added documentation for cluster stopping: README.SC.pod, README.Toolkit.pod, discriminate.html, clusterstopping.html -Anagha
6. Removed /svd/pdlsvd.pl and related threads -Anagha
7. Fixed a bug with pattern matching in format_clusters.pl -Anagha

--
Ted Pedersen
http://www.d.umn.edu/~tpederse