From: bthirion <bertrand.thirion@in...>  2012-10-31 19:06:55

> Hi,
>
> Thanks for this - yes I think I see that now. (The values do indeed
> differ by n_dim * n_samples * log(scale), but no 0.5 here.)
>
> I guess in a way the issue is that we typically evaluate point
> likelihoods, rather than e.g. integrals within some bounds of certainty
> of the measurement. If doing the latter, then the size of that 'box'
> would also vary with my scaling factor, and should compensate.

Not sure I get your point: the expectation of the log-likelihood (i.e. the negative differential entropy) also shifts linearly with the log of the dilation factor (indeed without the 1/2). However, this has little impact in e.g. model selection problems, since the global scaling factor is fixed with the data, and thus is the same for all models tested.

Best,

Bertrand
From: Dan Stowell <dan.stowell@ee...>  2012-10-31 18:23:27

On 31/10/12 16:09, bthirion wrote:
> On 10/31/2012 04:50 PM, Dan Stowell wrote:
>> Hi all,
>>
>> I'm still getting odd results using mixture.GMM depending on data
>> scaling. In the following code example, I change the overall scaling but
>> I do NOT change the relative scaling of the dimensions. Yet under the
>> three different scaling settings I get completely different results:
>>
>> from sklearn.mixture import GMM
>> from numpy import array, shape
>> from numpy.random import randn
>> from random import choice
>>
>> # centroids will be normally-distributed around zero:
>> truelumps = randn(20, 5) * 10
>>
>> # data randomly sampled from the centroids:
>> data = array([choice(truelumps) + randn(5) for _ in xrange(1000)])
>>
>> for scaler in [0.01, 1, 100]:
>>     scdata = data * scaler
>>     thegmm = GMM(n_components=10)
>>     thegmm.fit(scdata, n_iter=1000)
>>     ll = thegmm.score(scdata)
>>     print sum(ll)
>>
>> Here's the output I get:
>>
>> GMM(cvtype='diag', n_components=10)
>> 7094.87886779
>> GMM(cvtype='diag', n_components=10)
>> 14681.566456
>> GMM(cvtype='diag', n_components=10)
>> 37576.4496656
>>
>> In principle, I don't think the overall data scaling should matter, but
>> maybe there's an implementation issue I'm overlooking?
>>
>> Thanks
>> Dan
> Hi Dan,
>
> But even if the solution is the same, you expect the likelihood value to
> change, i.e. it is offset by something like 0.5 * n_dim * n_samples *
> log(scale). I'm not surprised by your result.

Hi,

Thanks for this - yes I think I see that now. (The values do indeed differ by n_dim * n_samples * log(scale), but no 0.5 here.)

I guess in a way the issue is that we typically evaluate point likelihoods, rather than e.g. integrals within some bounds of certainty of the measurement. If doing the latter, then the size of that 'box' would also vary with my scaling factor, and should compensate.

Thanks
Dan
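Dan's remark about the 'box' can be checked numerically. The following is only a sketch for a single 1-D Gaussian (not a fitted mixture), and the names `normal_cdf` and `box_prob` are made up for this illustration: the probability mass inside an interval around a measurement stays the same when the data, the model and the interval are all rescaled together.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a 1-D Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def box_prob(x, half_width, mean, std):
    """Probability mass in [x - half_width, x + half_width]."""
    return normal_cdf(x + half_width, mean, std) - normal_cdf(x - half_width, mean, std)

# Made-up measurement, measurement uncertainty and model parameters.
x, half_width, mean, std = 0.8, 0.05, 0.2, 1.5
p = box_prob(x, half_width, mean, std)

# Rescale everything by the same factor, as in the experiment above.
scaled = {
    scale: box_prob(scale * x, scale * half_width, scale * mean, scale * std)
    for scale in (0.01, 1.0, 100.0)
}
print(p, scaled)  # all values should agree up to rounding
```

The point density, by contrast, picks up a factor of 1/scale per dimension, which is where the log(scale) offset in the summed log-likelihood comes from.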
From: Martin Fergie <mfergie@gm...>  2012-10-31 18:14:47

Hi Dan,

I would have thought that it is the relative scaling that is important, not the overall scaling, i.e. each feature of your data set should have zero mean and unit variance.

Martin

On 31 October 2012 16:09, bthirion <bertrand.thirion@...> wrote:
> On 10/31/2012 04:50 PM, Dan Stowell wrote:
> > Hi all,
> >
> > I'm still getting odd results using mixture.GMM depending on data
> > scaling. In the following code example, I change the overall scaling but
> > I do NOT change the relative scaling of the dimensions. Yet under the
> > three different scaling settings I get completely different results:
> >
> > from sklearn.mixture import GMM
> > from numpy import array, shape
> > from numpy.random import randn
> > from random import choice
> >
> > # centroids will be normally-distributed around zero:
> > truelumps = randn(20, 5) * 10
> >
> > # data randomly sampled from the centroids:
> > data = array([choice(truelumps) + randn(5) for _ in xrange(1000)])
> >
> > for scaler in [0.01, 1, 100]:
> >     scdata = data * scaler
> >     thegmm = GMM(n_components=10)
> >     thegmm.fit(scdata, n_iter=1000)
> >     ll = thegmm.score(scdata)
> >     print sum(ll)
> >
> > Here's the output I get:
> >
> > GMM(cvtype='diag', n_components=10)
> > 7094.87886779
> > GMM(cvtype='diag', n_components=10)
> > 14681.566456
> > GMM(cvtype='diag', n_components=10)
> > 37576.4496656
> >
> > In principle, I don't think the overall data scaling should matter, but
> > maybe there's an implementation issue I'm overlooking?
> >
> > Thanks
> > Dan
> Hi Dan,
>
> But even if the solution is the same, you expect the likelihood value to
> change, i.e. it is offset by something like 0.5 * n_dim * n_samples *
> log(scale). I'm not surprised by your result.
>
> Bertrand
From: Olivier Grisel <olivier.grisel@en...>  2012-10-31 17:51:11

2012/10/31 Alexandre Gramfort <alexandre.gramfort@...>:
> fine with me but do you push the logic further to any linear estimator ?
> For example in Ridge we also have normalize=False by default.
>
> I would say that LassoLars is more the exception than the norm.

Indeed, yet another tricky mission for the Consistency Brigade...

I think I would still be +1 for setting normalize=True everywhere, as I think it's quite likely for users to get bad results and get upset because they are unaware of the normalization parameter.

The semantic change for the next release is tricky in the short term for existing users, but it is a good change in the long term for future users.

Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
From: Alexandre Gramfort <alexandre.gramfort@in...>  2012-10-31 17:10:40

Fine with me, but do you push the logic further to any linear estimator?
For example in Ridge we also have normalize=False by default.

I would say that LassoLars is more the exception than the norm.

Alex

On Wed, Oct 31, 2012 at 11:53 AM, Jaques Grobler <jaquesgrobler@...> wrote:
> It makes sense to me to make the change - however the scikit-learn users
> would just need to be warned about this. Perhaps for now we can just add
> a warning that the API will be changing, so as to make users well aware
> (before actually changing the API), and that they must manually set it up
> in the meanwhile so that the default setting change doesn't affect them.
> That might lessen the 'breakage' when the change is implemented.
> Anyway, rant aside, I think it makes good sense to set it to True by default
>
> 2012/10/31 Olivier Grisel <olivier.grisel@...>
>>
>> 2012/10/31 Gael Varoquaux <gael.varoquaux@...>:
>> >
>> > I want to change this (warning backward compatibility breakage :$ ). I
>> > want to change Lasso to have normalize=True, because in my experience
>> > this is a sane behavior. This would imply, for consistency, changing
>> > ElasticNet to also have normalize=True. We would have to put the usual
>> > warnings.
>> >
>> > What do people think? In one sense this change can trick people and break
>> > in a subtle way the code that they are currently running. However, the
>> > current situation also breaks in a subtle way people's expectations.
>>
>> +1
>>
>> Olivier
>> http://twitter.com/ogrisel - http://github.com/ogrisel
From: bthirion <bertrand.thirion@in...>  2012-10-31 16:09:15

On 10/31/2012 04:50 PM, Dan Stowell wrote:
> Hi all,
>
> I'm still getting odd results using mixture.GMM depending on data
> scaling. In the following code example, I change the overall scaling but
> I do NOT change the relative scaling of the dimensions. Yet under the
> three different scaling settings I get completely different results:
>
> from sklearn.mixture import GMM
> from numpy import array, shape
> from numpy.random import randn
> from random import choice
>
> # centroids will be normally-distributed around zero:
> truelumps = randn(20, 5) * 10
>
> # data randomly sampled from the centroids:
> data = array([choice(truelumps) + randn(5) for _ in xrange(1000)])
>
> for scaler in [0.01, 1, 100]:
>     scdata = data * scaler
>     thegmm = GMM(n_components=10)
>     thegmm.fit(scdata, n_iter=1000)
>     ll = thegmm.score(scdata)
>     print sum(ll)
>
> Here's the output I get:
>
> GMM(cvtype='diag', n_components=10)
> 7094.87886779
> GMM(cvtype='diag', n_components=10)
> 14681.566456
> GMM(cvtype='diag', n_components=10)
> 37576.4496656
>
> In principle, I don't think the overall data scaling should matter, but
> maybe there's an implementation issue I'm overlooking?
>
> Thanks
> Dan

Hi Dan,

But even if the solution is the same, you expect the likelihood value to change, i.e. it is offset by something like 0.5 * n_dim * n_samples * log(scale). I'm not surprised by your result.

Bertrand
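The size of this offset can be illustrated without fitting anything. The sketch below assumes a single Gaussian with independent dimensions rather than a fitted GMM, and the helper names `gaussian_logpdf` and `total_loglik` are made up for the illustration. It scales the data and the model parameters together (the "same solution" in new units) and compares the change in summed log-likelihood with n_dim * n_samples * log(scale), i.e. the form without the 0.5 that is confirmed later in the thread.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log-density of a 1-D Gaussian evaluated at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def total_loglik(data, means, stds):
    """Summed log-density over all samples, with independent dimensions."""
    return sum(
        gaussian_logpdf(x, m, s)
        for row in data
        for x, m, s in zip(row, means, stds)
    )

# Made-up data: 3 samples, 2 dimensions.
data = [[0.3, -1.2], [1.1, 0.4], [-0.7, 2.0]]
means, stds = [0.2, 0.4], [1.0, 1.5]
scale = 100.0

ll = total_loglik(data, means, stds)
# Scale the data and the model together.
ll_scaled = total_loglik(
    [[scale * x for x in row] for row in data],
    [scale * m for m in means],
    [scale * s for s in stds],
)

n_samples, n_dim = len(data), len(data[0])
expected_offset = n_dim * n_samples * math.log(scale)
print(ll - ll_scaled, expected_offset)  # the two numbers should agree
```

The same argument carries over to a mixture, since every component density is multiplied by the same factor when all dimensions are rescaled by the same amount.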
From: Dan Stowell <dan.stowell@ee...>  2012-10-31 15:50:36

Hi all,

I'm still getting odd results using mixture.GMM depending on data scaling. In the following code example, I change the overall scaling but I do NOT change the relative scaling of the dimensions. Yet under the three different scaling settings I get completely different results:

from sklearn.mixture import GMM
from numpy import array, shape
from numpy.random import randn
from random import choice

# centroids will be normally-distributed around zero:
truelumps = randn(20, 5) * 10

# data randomly sampled from the centroids:
data = array([choice(truelumps) + randn(5) for _ in xrange(1000)])

for scaler in [0.01, 1, 100]:
    scdata = data * scaler
    thegmm = GMM(n_components=10)
    thegmm.fit(scdata, n_iter=1000)
    ll = thegmm.score(scdata)
    print sum(ll)

Here's the output I get:

GMM(cvtype='diag', n_components=10)
7094.87886779
GMM(cvtype='diag', n_components=10)
14681.566456
GMM(cvtype='diag', n_components=10)
37576.4496656

In principle, I don't think the overall data scaling should matter, but maybe there's an implementation issue I'm overlooking?

Thanks
Dan

On 02/10/12 15:51, Dan Stowell wrote:
> On 02/10/12 13:58, Alexandre Passos wrote:
>> On Tue, Oct 2, 2012 at 7:48 AM, Dan Stowell
>> <dan.stowell@...> wrote:
>>>
>>> Hi all,
>>>
>>> I'm using the GMM class as part of a larger system, and something is
>>> misbehaving. Can someone confirm please: the results of using GMM.fit()
>>> shouldn't have a strong dependence on the data ranges, should they? For
>>> example, if one variable has a range 0-1000, while the other has a range
>>> 0-1, that difference shouldn't have much bearing?
>>
>> This dependence is expected, and the variable with a range 0-1000 will
>> dominate all others in your model unless you use a full covariance
>> matrix, and even then you should expect some bias. In general it's
>> good to mean-center and normalize everything before fitting a mixture
>> model.
>
> Aha - yes, and it does indeed make a difference in my case. I was using
> full covariance and had thought it would cope without normalisation, but
> no.
>
> Thanks
> Dan
>

Dan Stowell
Postdoctoral Research Assistant
Centre for Digital Music
Queen Mary, University of London
Mile End Road, London E1 4NS
http://www.elec.qmul.ac.uk/digitalmusic/people/dans.htm
http://www.mcld.co.uk/
From: Andreas Mueller <amueller@ai...>  2012-10-31 13:19:57

Hi Vlad.

This is definitely a good question. I run into this often when representing an image as bags of keypoints / features.

Why is it not a good solution to have X be a list of arrays / lists? Which algorithms do you want to use such samples in?

The text feature extraction sort of deals with this by using a list, right?

Cheers,
Andy

On 10/31/2012 01:13 PM, Vlad Niculae wrote:
> Hello,
>
> It seems I have reached again the need for something that became
> apparent when working with image patches last summer. Sometimes we
> don't have a 1 to 1 correspondence between samples (rows in X) and the
> actual documents we are interested in scoring over. Instead, each
> document consists of a (different) number of samples.
>
> This can be implemented either as an extra masking array that says,
> for each sample, what document it belongs to, by grouping `y` into
> a list of lists (cumbersome, and fails for the unsupervised case), or
> by more clever / space-efficient methods.
>
> The question is: did you need this? If so, how did you implement it?
> Are you aware of other general purpose libraries that provide such
> an API? Because I'm not. Next question is, what can we do about it?
>
> Example applications:
>
> - Image classification:
>   first, from each image we extract k-by-k image patches, then we
>   transform them by sparse coding, and finally we feed them into a
>   classifier. This classifies each patch individually but in the end
>   we would want to group the results within each image and compute
>   "local" scores, or just take the max, for example.
>
>   If using something like CIFAR where images have the same size, the
>   problem is simplified because each image will be split into the exact
>   same number of patches. If images have different shapes, or in the
>   next examples, this assumption cannot be made.
>
> - Coreference resolution:
>   A successful model for this problem is based on the mention pair
>   structure. The goal is to identify clusters of noun phrases that
>   refer to the same real-world entity. For each document (e.g. news
>   article), the possible mentions (NPs, pronouns) are identified.
>   The feature extraction then builds "samples" in the form of all
>   possible pairs of these (sometimes we filter out pairs that are
>   obviously not coreferent, e.g. he / she, but this is disputable).
>
>   Evaluating such systems requires averaging over document-level
>   scores, because the document-level scores typically used do not
>   distribute over averaging. [1]
>
> - Hyphenation:
>   This is just something I'm currently working on but the same
>   situation might occur more often. Documents are words, and
>   samples are positions between letters within each word.
>   Labels are whether it's correct to add a hyphen there or not.
>   In the end, sklearn can easily report how many hyphens were
>   correctly identified over the whole dictionary available.
>   However a more realistic score would be: how many words were
>   fully hyphenated correctly? This is because a sequence model
>   can be smart enough to know that it's not frequent to
>   insert three hyphens one after the other, for example a pattern
>   ...xxxxxx..., because of its global document-level awareness.
>   It would be interesting to see how much this brings over a
>   local SVM classifier that only sees one position at a time.
>
> Objects that should be aware of this:
>
> - score functions / metrics,
> - some transformers
> - resamplers / shufflers: we either want to keep documents together,
>   or make sure that when reshuffling, document membership is not lost.
>
> Best,
> Vlad
>
> Vlad N.
> http://vene.ro
From: Vlad Niculae <vlad@ve...>  2012-10-31 13:14:10

Hello,

It seems I have reached again the need for something that became apparent when working with image patches last summer. Sometimes we don't have a 1 to 1 correspondence between samples (rows in X) and the actual documents we are interested in scoring over. Instead, each document consists of a (different) number of samples.

This can be implemented either as an extra masking array that says, for each sample, what document it belongs to, by grouping `y` into a list of lists (cumbersome, and fails for the unsupervised case), or by more clever / space-efficient methods.

The question is: did you need this? If so, how did you implement it? Are you aware of other general purpose libraries that provide such an API? Because I'm not. Next question is, what can we do about it?

Example applications:

- Image classification:
  first, from each image we extract k-by-k image patches, then we transform them by sparse coding, and finally we feed them into a classifier. This classifies each patch individually but in the end we would want to group the results within each image and compute "local" scores, or just take the max, for example.

  If using something like CIFAR where images have the same size, the problem is simplified because each image will be split into the exact same number of patches. If images have different shapes, or in the next examples, this assumption cannot be made.

- Coreference resolution:
  A successful model for this problem is based on the mention pair structure. The goal is to identify clusters of noun phrases that refer to the same real-world entity. For each document (e.g. news article), the possible mentions (NPs, pronouns) are identified. The feature extraction then builds "samples" in the form of all possible pairs of these (sometimes we filter out pairs that are obviously not coreferent, e.g. he / she, but this is disputable).

  Evaluating such systems requires averaging over document-level scores, because the document-level scores typically used do not distribute over averaging. [1]

- Hyphenation:
  This is just something I'm currently working on but the same situation might occur more often. Documents are words, and samples are positions between letters within each word. Labels are whether it's correct to add a hyphen there or not. In the end, sklearn can easily report how many hyphens were correctly identified over the whole dictionary available. However a more realistic score would be: how many words were fully hyphenated correctly? This is because a sequence model can be smart enough to know that it's not frequent to insert three hyphens one after the other, for example a pattern ...xxxxxx..., because of its global document-level awareness. It would be interesting to see how much this brings over a local SVM classifier that only sees one position at a time.

Objects that should be aware of this:

- score functions / metrics,
- some transformers
- resamplers / shufflers: we either want to keep documents together, or make sure that when reshuffling, document membership is not lost.

Best,
Vlad

Vlad N.
http://vene.ro
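The "extra array that says, for each sample, what document it belongs to" could look roughly like the sketch below. This is only an illustration of the hyphenation scoring idea with made-up labels; `doc_ids` and both function names are hypothetical and not part of any scikit-learn API.

```python
from collections import defaultdict

def per_sample_accuracy(y_true, y_pred):
    """Fraction of individual samples (hyphen positions) predicted correctly."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

def per_document_exact_match(y_true, y_pred, doc_ids):
    """Fraction of documents (words) whose samples are ALL predicted correctly."""
    all_correct = defaultdict(lambda: True)
    for t, p, d in zip(y_true, y_pred, doc_ids):
        all_correct[d] = all_correct[d] and (t == p)
    return sum(all_correct.values()) / len(all_correct)

# Made-up example: 3 words with 3, 2 and 4 candidate hyphen positions.
doc_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_true = [0, 1, 0, 1, 0, 0, 0, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 0, 1, 0]  # one mistake, in word 1

sample_score = per_sample_accuracy(y_true, y_pred)
word_score = per_document_exact_match(y_true, y_pred, doc_ids)
print(sample_score, word_score)
```

A single wrong position barely moves the per-sample score but invalidates a whole word, which is the gap between the two metrics described above.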
From: Andreas Mueller <amueller@ai...>  2012-10-31 11:51:45

On 10/31/2012 11:45 AM, Joseph Turian wrote:
>> As far as I understand, we are not really sure what is the best way to
>> build the trees (masks / no masks, presorting / lazy sorting..).
> Are you talking about efficiency in training time, or generalization accuracy?
>

Training time.
From: Nicolas Rochet <nicolas.rochet@et...>  2012-10-31 11:47:48

Dear scikit-learn devs,

I'm concerned about a potential problem in the example you give in section 8.17.3.1 for adjusted_mutual_information.

If I understand correctly, that score is useful for comparing two partitions (according to the Wikipedia entry). So the partitions have to be complete, and the elements of each partition must be pairwise disjoint.

In the example, the clusterings [0,0,0,0] and [0,1,2,3] are compared but, as far as I understand, they cannot be partitions!

Maybe I'm missing a point, but that seems to be a problem...

Best regards,

Nicolas
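One possible reading of the example, offered here as an editorial sketch rather than as the documentation authors' explanation: [0,0,0,0] and [0,1,2,3] are vectors of cluster labels, one label per sample, and each such vector induces a partition of the sample indices. The helper `labels_to_partition` below is hypothetical and only makes that correspondence explicit.

```python
from collections import defaultdict

def labels_to_partition(labels):
    """Group sample indices by cluster label; returns a list of disjoint sets."""
    blocks = defaultdict(set)
    for index, label in enumerate(labels):
        blocks[label].add(index)
    # Order the blocks by their smallest member for a stable result.
    return sorted(blocks.values(), key=min)

one_cluster = labels_to_partition([0, 0, 0, 0])
singletons = labels_to_partition([0, 1, 2, 3])
print(one_cluster)  # a single block holding all four samples
print(singletons)   # four blocks with one sample each
```

Under this reading both label vectors describe complete partitions with pairwise disjoint blocks: the first puts all samples in one cluster, the second puts every sample in its own cluster.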
From: Joseph Turian <joseph@me...>  2012-10-31 11:46:28

> As far as I understand, we are not really sure what is the best way to
> build the trees (masks / no masks, presorting / lazy sorting..).

Are you talking about efficiency in training time, or generalization accuracy?

Best,
Joseph
From: Jaques Grobler <jaquesgrobler@gm...>  2012-10-31 10:53:16

It makes sense to me to make the change - however the scikit-learn users would just need to be warned about this. Perhaps for now we can just add a warning that the API will be changing, so as to make users well aware (before actually changing the API), and that they must manually set it up in the meanwhile so that the default setting change doesn't affect them. That might lessen the 'breakage' when the change is implemented.

Anyway, rant aside, I think it makes good sense to set it to True by default

2012/10/31 Olivier Grisel <olivier.grisel@...>
> 2012/10/31 Gael Varoquaux <gael.varoquaux@...>:
> >
> > I want to change this (warning backward compatibility breakage :$ ). I
> > want to change Lasso to have normalize=True, because in my experience
> > this is a sane behavior. This would imply, for consistency, changing
> > ElasticNet to also have normalize=True. We would have to put the usual
> > warnings.
> >
> > What do people think? In one sense this change can trick people and break
> > in a subtle way the code that they are currently running. However, the
> > current situation also breaks in a subtle way people's expectations.
>
> +1
>
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
From: Lars Buitinck <L.J.Buitinck@uv...>  2012-10-31 10:51:57

2012/10/31 Olivier Grisel <olivier.grisel@...>:
>>> Can we have a vote on this?

+1

Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
From: Olivier Grisel <olivier.grisel@en...>  2012-10-31 10:44:36

2012/10/31 Andreas Mueller <amueller@...>:
> On 10/31/2012 10:09 AM, Gael Varoquaux wrote:
>> On Tue, Oct 30, 2012 at 03:40:48PM +0100, Lars Buitinck wrote:
>>> Agree with David, int->float conversion should be expected to produce
>>> larger arrays.
>> Can we have a vote on this?

+1 too

Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
From: Andreas Mueller <amueller@ai...>  2012-10-31 10:32:09

On 10/31/2012 10:09 AM, Gael Varoquaux wrote:
> On Tue, Oct 30, 2012 at 03:40:48PM +0100, Lars Buitinck wrote:
>> Agree with David, int->float conversion should be expected to produce
>> larger arrays.
> Can we have a vote on this?
>
> I am +0 on int->float conversion always giving float64 (np.float).
>

I'm +1
From: Olivier Grisel <olivier.grisel@en...>  2012-10-31 10:29:28

2012/10/31 Gael Varoquaux <gael.varoquaux@...>:
>
> I want to change this (warning backward compatibility breakage :$ ). I
> want to change Lasso to have normalize=True, because in my experience
> this is a sane behavior. This would imply, for consistency, changing
> ElasticNet to also have normalize=True. We would have to put the usual
> warnings.
>
> What do people think? In one sense this change can trick people and break
> in a subtle way the code that they are currently running. However, the
> current situation also breaks in a subtle way people's expectations.

+1

Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
From: Gael Varoquaux <gael.varoquaux@no...>  2012-10-31 10:21:55

* First some background:

LarsLasso and Lasso are two different algorithms to solve the same problem (l1-penalized linear model). As with all linear models, they have a 'normalize' parameter that can be turned on so that regressors are normalized. This is useful because the 'good' penalty on each weight is most likely to be proportional to the standard deviation of the corresponding regressor. It is not done via a preprocessing transform, because the coefs are automatically rescaled based on the normalization so that the linear model always holds.

* Now the problem and question:

In the scikit, for historical reasons, 'normalize' is True in LarsLasso and False in Lasso. This just tricked Fabian when writing some small demo code.

I want to change this (warning backward compatibility breakage :$ ). I want to change Lasso to have normalize=True, because in my experience this is a sane behavior. This would imply, for consistency, changing ElasticNet to also have normalize=True. We would have to put the usual warnings.

What do people think? In one sense this change can trick people and break in a subtle way the code that they are currently running. However, the current situation also breaks in a subtle way people's expectations.

G
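The remark that "the coefs are automatically rescaled based on the normalization so that the linear model always holds" can be illustrated with plain arithmetic. The sketch below is not the scikit-learn implementation: it assumes each column is simply divided by its l2 norm (no centering, no intercept), and the coefficient values are hypothetical rather than fitted.

```python
import math

# Made-up design matrix whose two columns live on very different scales.
X = [[1.0, 200.0], [2.0, 100.0], [3.0, 400.0]]
n_features = len(X[0])

# Assumption: each column is divided by its l2 norm.
norms = [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(n_features)]
X_norm = [[row[j] / norms[j] for j in range(n_features)] for row in X]

# Hypothetical coefficients, as if fitted on the normalized regressors.
coef_norm = [0.5, -2.0]
# Rescale them so that they apply to the original, unnormalized regressors.
coef = [coef_norm[j] / norms[j] for j in range(n_features)]

def predict(rows, weights):
    """Linear model without intercept: one dot product per row."""
    return [sum(x * w for x, w in zip(row, weights)) for row in rows]

pred_norm = predict(X_norm, coef_norm)
pred_orig = predict(X, coef)
print(pred_norm, pred_orig)  # identical up to rounding
```

Because the rescaled coefficients reproduce the same predictions on the raw data, the user never has to apply the normalization by hand at prediction time; only the penalty effectively sees the normalized regressors.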
From: Gael Varoquaux <gael.varoquaux@no...>  2012-10-31 10:09:49

On Tue, Oct 30, 2012 at 03:40:48PM +0100, Lars Buitinck wrote:
> Agree with David, int->float conversion should be expected to produce
> larger arrays.

Can we have a vote on this?

I am +0 on int->float conversion always giving float64 (np.float).

G
From: Olivier Grisel <olivier.grisel@en...>  2012-10-31 09:50:57

2012/10/31 Afik Cohen <afik@...>:
>
> Hah, thanks for the explanation :) But yes, the accuracy was terrible. In fact,
> we just ran another cross-validated k=3 run with our current data, and got these
> results:
>
> Training LogisticRegression(C=1.0, class_weight=None, dual=False,
> fit_intercept=True, intercept_scaling=1, penalty=l2, tol=0.0001)
> Running Cross-Validated accuracy testing with 3 folds.
> done [4276.551s]
> Results: Accuracy: 0.639312 (+/- 0.003300)
> Training time: 4276.55051398
> Input Data: (10480, 405562)
> Labels: 1144
>
> As you can see, 63% accuracy with 10480 document vectors with 405562 features.
> Pretty awful compared to LinearSVC which gives us upwards of 95%.

You need to find the optimal value for 'C' using grid search for both LinearSVC and LogisticRegression to be able to compare their respective performance and be able to tell that one of them yields significantly better predictions than the other.

See the examples linked from the documentation for more details:

http://scikit-learn.org/stable/modules/grid_search.html

Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
From: Peter Prettenhofer <peter.prettenhofer@gm...>  2012-10-31 09:40:20

2012/10/31 Andreas Mueller <amueller@...>:
> Hey everybody.
> I noticed mahout also has random forest algorithms. Has anyone tried those?
> Has anyone done any timing comparisons?
> As far as I understand, we are not really sure what is the best way to build
> the trees (masks / no masks, presorting / lazy sorting..).
> I thought it might be a good idea to have a look at how the mahout people
> are doing it.

Hi Andy,

I looked at it a while ago - I wasn't terribly impressed but I didn't look too much into it. It builds the trees depth first - they don't use a sample mask but sort the attribute on EACH split.

See
http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/builder/DecisionTreeBuilder.java?view=markup
Lines 251 and 252

and for the best split:
http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/classifier/df/split/RegressionSplit.java?view=markup
Line 93

best,
Peter

>
> Wdyt?
>
> Cheers,
> Andy

Peter Prettenhofer
From: Andreas Mueller <amueller@ai...>  2012-10-31 08:23:44

Hey everybody.

I noticed mahout also has random forest algorithms. Has anyone tried those? Has anyone done any timing comparisons?

As far as I understand, we are not really sure what is the best way to build the trees (masks / no masks, presorting / lazy sorting..). I thought it might be a good idea to have a look at how the mahout people are doing it.

Wdyt?

Cheers,
Andy
From: Afik Cohen <afik@tr...>  20121031 01:00:52

Whoops, my previous reply got munged up, so I'm resubmitting it. Please ignore
my previous messed-up email.

> 2012/10/30 Afik Cohen <afik@...>:
> >> Do you know what they are doing? I would expect they just do a softmax.
> > I don't. :) But according to the LIBLINEAR FAQ: "If you really would like to
> > have probability outputs for SVM in LIBLINEAR, you can consider using the
> > simple probability model of logistic regression. Simply modify the following
> > subroutine in linear.cpp." So I assume that it's using the same method
> > logistic regression is using to determine probability estimates.
>
> Judging from your code, it is, which is why I suggested copying over
> the new predict_proba. It does exactly the same thing as Liblinear's
> code for logistic regression probability predictions.

That is good to know, we'll try that. However, now you've got us thinking that
maybe this method is unreliable and we're somewhat less confident using it in
production code...

> > I think it's worth doing, because as I mentioned, we seemed to be getting
> > meaningful results. We also compared probability outputs from LinearSVC()
> > and OneVsRestClassifier(LinearSVC()); in the former, we would get N
> > probabilities that the input belonged to each class, and in the latter we
> > would get N "IS" and "IS NOT" pairs, showing for each class the probability
> > that the input was closer to that class or to the rest of the classes.
> > Again, these probability estimates did not seem like meaningless noise!
>
> Those "probabilities" are guaranteed to be >.5 for positive predictions and
> <=.5 for negative ones in binary classification. They will always sum to one
> for each sample in multiclass classification. That's still the case with the
> hack I suggested. However, you could also have used the output from
> decision_function; that produces a more appropriate confidence score for
> linear SVMs. It's not a number between zero and one, though, but either
> positive or negative and farther away from zero as the confidence increases.

Yes, we have noticed that the estimates are >.5 for positive predictions and
<=.5 for negative ones. For example, here is some example output from
LinearSVC.predict_proba():

With a single classifier (a single LinearSVC() instance trained on all
classes):

    0.00091845721710660235, 0.00091952391997811766, 0.00092169857946579239,
    0.00092723763293324924, 0.00093133854468835234, 0.001014397289942081,
    0.0010818874768571949, 0.0018864265035381381, 0.00091323156582493283,
    0.00091434117232201174, 0.00091437125286051744, 0.00091637654884632082,
    ...

Here, 0.0018864265035381381 is the highest probability, so it's the chosen
class. This happens to be the correct prediction.

With a OneVsRest strategy - fitting with OneVsRestClassifier(LinearSVC()):

    [array([[ 0.74510559, 0.25489441]]), array([[ 0.43768196, 0.56231804]]),
     array([[ 0.73616065, 0.26383935]]), array([[ 0.73083986, 0.26916014]]),
     array([[ 0.73569696, 0.26430304]]), array([[ 0.73282635, 0.26717365]]),
     array([[ 0.72934341, 0.27065659]]) ...]

You can interpret this as each classifier's probabilities shown as
array([[ "IS_NOT"% , "IS"% ]]). The array with the largest IS percentage is
array([[ 0.43768196, 0.56231804]]), so that's the class that was picked.

Testing this on an email we know does not belong to any existing class produces
something like this:

    [array([[ 0.73209442, 0.26790558]]), array([[ 0.69946787, 0.30053213]]),
     array([[ 0.73971788, 0.26028212]]), array([[ 0.73583213, 0.26416787]]),
     array([[ 0.73277501, 0.26722499]])]

The highest "IS" percentage is array([[ 0.69946787, 0.30053213]]), so that is
the returned class, but note the low percentage. This could mean that this is a
reliable way of establishing confidence thresholds, i.e. determining a point
below which a returned match could be considered 'low confidence' and thus
probably not to be trusted.

> I don't get the remark about OneVsRestClassifier. What do you mean by "is" and
> "is not" pairs? What does your target vector y look like?

I should have been clearer; I've seen this nomenclature before in other machine
learning papers. "IS" means the % probability that the input belongs to this
class, whereas "IS NOT" means the % probability that it doesn't.

> >> Do you have strong reasons not to use logistic regression?
> > Correct me if I've misunderstood, but regression is meant for fitting to a
> > continuous variable or something, not classifying inputs to discrete
> > classes, right? We're classifying emails into ~1200 distinct classes, so
> > Logistic Regression is meaningless for us (in fact, when we tried it, it
> > achieved a hilarious 48% cross-validated k=3 accuracy. LinearSVC achieves
> > 95% accuracy.)
>
> Actually, logistic regression *is* a classification model, it just has a very
> unfortunate name ("In the terminology of statistics, this model is known as
> logistic regression, although it should be emphasized that this is a model for
> classification rather than regression" -- C.M. Bishop, Pattern Recognition
> and Machine Learning, p. 205).
>
> 48% accuracy is extreme compared to 95%, though. How were you applying
> logistic regression?

Hah, thanks for the explanation :) But yes, the accuracy was terrible. In fact,
we just ran another cross-validated k=3 run with our current data, and got
these results:

    Training LogisticRegression(C=1.0, class_weight=None, dual=False,
        fit_intercept=True, intercept_scaling=1, penalty=l2, tol=0.0001)
    Running Cross-Validated accuracy testing with 3 folds.
    done [4276.551s]
    Results:
    Accuracy: 0.639312 (+/- 0.003300)
    Training time: 4276.55051398
    Input Data: (10480, 405562)
    Labels: 1144

As you can see, about 64% accuracy with 10480 document vectors with 405562
features. Pretty awful compared to LinearSVC, which gives us upwards of 95%.

Afik
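The confidence-threshold idea described in this message can be sketched in a few lines. This is only an illustration, not code from the thread: the helper name pick_class and the 0.5 cutoff are assumptions, and the "IS" values are copied from the two OneVsRest outputs quoted above. In practice a cutoff would have to be tuned on held-out data.

```python
def pick_class(is_scores, threshold=0.5):
    """Return (class_index, score); class_index is None below the threshold.

    `is_scores` holds one "IS" value per one-vs-rest classifier.
    `threshold` is a hypothetical cutoff, not a recommended value.
    """
    best = max(range(len(is_scores)), key=lambda i: is_scores[i])
    score = is_scores[best]
    if score < threshold:
        return None, score  # low confidence: treat as "no known class"
    return best, score


# "IS" columns taken from the two example outputs quoted in the message.
known_email = [0.25489441, 0.56231804, 0.26383935, 0.26916014,
               0.26430304, 0.26717365, 0.27065659]
unknown_email = [0.26790558, 0.30053213, 0.26028212, 0.26416787,
                 0.26722499]

print(pick_class(known_email))    # accepted: index 1
print(pick_class(unknown_email))  # rejected: best score is below 0.5
```

If the "IS" value is a logistic transform of the SVM decision value, a cutoff of 0.5 corresponds to a decision value of zero, so the same rule could be applied directly to the output of decision_function.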
From: Afik Cohen <afik@tr...>  20121031 00:44:07

From: Lars Buitinck <L.J.Buitinck@uv...>  20121030 22:55:31

2012/10/30 Afik Cohen <afik@...>:
>> Do you know what they are doing? I would expect they just do a softmax.
> I don't. :) But according to the LIBLINEAR FAQ: "If you really would like to
> have probability outputs for SVM in LIBLINEAR, you can consider using the simple
> probability model of logistic regression. Simply modify the following subroutine
> in linear.cpp." So I assume that it's using the same method logistic regression
> is using to determine probability estimates.

Judging from your code, it is, which is why I suggested copying over the new
predict_proba. It does exactly the same thing as Liblinear's code for logistic
regression probability predictions.

> I think it's worth doing, because as I mentioned, we seemed to be getting
> meaningful results. We also compared probability outputs from LinearSVC() and
> OneVsRestClassifier(LinearSVC()); in the former, we would get N probabilities
> that the input belonged to each class, and in the latter we would get N "IS" and
> "IS NOT" pairs, showing for each class the probability that the input was
> closer to that class or to the rest of the classes. Again, these probability
> estimates did not seem like meaningless noise!

Those "probabilities" are guaranteed to be >.5 for positive predictions and
<=.5 for negative ones in binary classification. They will always sum to one
for each sample in multiclass classification. That's still the case with the
hack I suggested. However, you could also have used the output from
decision_function; that produces a more appropriate confidence score for
linear SVMs. It's not a number between zero and one, though, but either
positive or negative and farther away from zero as the confidence increases.

I don't get the remark about OneVsRestClassifier. What do you mean by "is" and
"is not" pairs? What does your target vector y look like?

>> Do you have strong reasons not to use logistic regression?
> Correct me if I've misunderstood, but regression is meant for fitting to a
> continuous variable or something, not classifying inputs to discrete classes,
> right? We're classifying emails into ~1200 distinct classes, so Logistic
> Regression is meaningless for us (in fact, when we tried it, it achieved a
> hilarious 48% cross-validated k=3 accuracy. LinearSVC achieves 95% accuracy.)

Actually, logistic regression *is* a classification model, it just has a very
unfortunate name ("In the terminology of statistics, this model is known as
logistic regression, although it should be emphasized that this is a model for
classification rather than regression" -- C.M. Bishop, Pattern Recognition
and Machine Learning, p. 205).

48% accuracy is extreme compared to 95%, though. How were you applying
logistic regression?

--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
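The two guarantees mentioned in this message (above .5 exactly when the decision value is positive, and summing to one in the multiclass case) follow if the scores are a logistic transform of the decision values. The sketch below is an assumption about that mechanism written in plain Python, not a copy of Liblinear's or scikit-learn's code; the function names and the example decision values are hypothetical.

```python
import math


def sigmoid(d):
    """Logistic function: maps a signed decision value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-d))


def binary_scores(decision_value):
    """Return (p_negative, p_positive) for a single binary decision value."""
    p_pos = sigmoid(decision_value)
    return (1.0 - p_pos, p_pos)


def multiclass_scores(decision_values):
    """Squash each per-class decision value, then rescale to sum to one."""
    squashed = [sigmoid(d) for d in decision_values]
    total = sum(squashed)
    return [s / total for s in squashed]


# A positive decision value gives a score above 0.5, a negative one below.
print(binary_scores(0.25))
print(binary_scores(-1.0))

# Illustrative only: 1143 rejecting classes and one accepting class.
# The decision values -1.0 and 0.25 are made up for this example.
scores = multiclass_scores([-1.0] * 1143 + [0.25])
print(max(scores), min(scores))
```

Under this assumption, normalising over roughly 1144 classes pushes every score down to the order of 1/1144, which may be why the single-classifier values reported in this thread all sit near 0.0009 even for the winning class. The ranking is unchanged by the transform, so the raw decision values carry the same ordering without the rescaling.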