From: Alexandre Gramfort <alexandre.gramfort@in...> - 2011-03-31 21:13:42

> To denoise I would like to use the following model:
> argmin_w 0.5 ||y - w||^2 + alpha ||w||_1 so as you can see this is
> exactly LASSO.

The solution to this problem is known as soft-thresholding and can be computed in 2 or 3 lines of numpy.

> As far as I understand documentation typical way of using LASSO is:
>
> models = lasso_path(X, y, alphas=[1])
>
> However I don't want to pass a matrix X but a function's handler, for instance:
> models = lasso_path(lambda x: x, y, alphas=[1])
> I prefer functions because storing big matrices is not always possible.

Could you use a memmap array to work with such big data?

We could pass callables but we would need to pass both X and X.T. It would make sense when X is an FFT dictionary or a wavelet transform, but this is not compatible with the current implementation.

Note that if X is an orthogonal transform (like FFT) the solution can also be obtained with a soft thresholding.

You might want to look at the iterative solvers based on proximal operators, as in our paper [1]. They are very common these days in the field of signal/image processing. Coordinate descent or LARS are maybe not the best candidates when X is not a real array.

Hope this helps,
Alex

[1] http://www.ncbi.nlm.nih.gov/pubmed/21317080?dopt=Abstract
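For readers of the archive, the soft-thresholding mentioned in this reply can be sketched as follows. This is a minimal illustration, not code from scikits.learn; the function name soft_threshold is an assumption chosen for the example. It relies on the fact that the objective 0.5 ||y - w||^2 + alpha ||w||_1 separates per coordinate, with minimizer w_i = sign(y_i) * max(|y_i| - alpha, 0).

```python
import numpy as np


def soft_threshold(y, alpha):
    """Minimize 0.5 * ||y - w||^2 + alpha * ||w||_1 over w, elementwise.

    Hypothetical helper for illustration; works on vectors and images alike.
    """
    y = np.asarray(y, dtype=float)
    # Shrink every coefficient toward zero by alpha; small ones become zero.
    return np.sign(y) * np.maximum(np.abs(y) - alpha, 0.0)


y = np.array([3.0, -0.5, -2.0, 0.2])
w = soft_threshold(y, alpha=1.0)
```

Because the operation is elementwise, a noisy image can be passed directly as a 2D array without flattening and without ever forming the identity matrix.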
From: Mateusz Malinowski <m4linka@gm...> - 2011-03-31 17:36:31

Sorry for being too vague.

Here is my model:
y = w + eps
where y is the observation, w is the 'true image' and eps is noise.

To denoise I would like to use the following model:
argmin_w 0.5 ||y - w||^2 + alpha ||w||_1
so as you can see this is exactly LASSO.

As far as I understand the documentation, the typical way of using LASSO is:

    models = lasso_path(X, y, alphas=[1])

However I don't want to pass a matrix X but a function's handler, for instance:

    models = lasso_path(lambda x: x, y, alphas=[1])

I prefer functions because storing big matrices is not always possible. If you could also pass a function's handler then LASSO would become a suitable model for a bunch of image reconstruction problems where the observation-true_value relationship is modeled as y = f(w) + eps.

Here is the snippet:

begin_snippet
    NOISY_AMPLIFIER = 10.0
    LENA_PATH = '../data/lena256.bmp'
    print("Reconstruction from noisy measurements!")
    # take original image
    img = plt.imread(LENA_PATH)
    # take noisy measurements
    img_noisy = img + NOISY_AMPLIFIER * np.random.randn(img.shape[0], img.shape[1])
    observation = np.asanyarray(img_noisy)
    lasso_path(lambda x: x, observation.flatten(), alphas=[0.1])
end_snippet

error message:

    Traceback (most recent call last):
      File "D:\Aptana_Workspace\Image_Reconstruction\src\denoising.py", line 31, in <module>
        lasso_path(lambda x: x, observation.flatten(), alphas=[0.1])
      File "D:\Programowanie\Python2_7\lib\site-packages\scikits\learn\linear_model\coordinate_descent.py", line 230, in lasso_path
        fit_intercept=fit_intercept, verbose=verbose, **fit_params)
      File "D:\Programowanie\Python2_7\lib\site-packages\scikits\learn\linear_model\coordinate_descent.py", line 272, in enet_path
        X, y, Xmean, ymean = LinearModel._center_data(X, y, fit_intercept)
      File "D:\Programowanie\Python2_7\lib\site-packages\scikits\learn\linear_model\base.py", line 61, in _center_data
        Xmean = X.mean(axis=0)
    AttributeError: 'function' object has no attribute 'mean'

Best,
M.M.

P.S.
Probably in my example with image denoising you can take advantage of the sparsity of the matrix X, and storing such big matrices wouldn't be a problem. However, this is not always possible, since the matrix may not be sparse and yet there is a fast implementation in terms of function application (for instance X = Fourier transform, FFT - fast implementation of X * w).
I should also add that image denoising is not my goal (only a toy example to show you my problem), but if lasso_path could take function's handlers and solve it then I could implement what I want based on it.

On Thu, Mar 31, 2011 at 4:45 PM, Alexandre Gramfort <alexandre.gramfort@...> wrote:
> Hi Mateusz,
>
>> I would like to write python script which tackles the image denoising
>> problem. The model is very similar to LASSO
>> (http://scikitlearn.sourceforge.net/modules/linear_model.html#lasso)
>> but in my case X is the identity matrix.
>
> can you be more specific? I guess you mean that you have no convolution
> kernel. Can you write the cost function you want to minimize?
>
>> Although I can easily work with vectors instead of images, the
>> existence of the matrix is problematic
>> due to the size of such matrix. Therefore instead of working with
>> matrices I would prefer to work with function's handlers.
>> Is it possible to work with function's handlers instead of matrices
>> within the SciKits.Learn framework?
>
> can you paste a code snippet of what you have in mind?
>
>> Is there any example of the usage of the framework regarding to image
>> denoising/deblurring problem?
>
> We've used the scikit with Vincent Michel and Gael Varoquaux for TV
> denoising [1]
> which amounts to a regression task with images. But I am not sure of what you
> have in mind.
>
> maybe the scikits.image folks [2] can give you more insights.
>
> Alex
>
> [1] http://www.ncbi.nlm.nih.gov/pubmed/21317080?dopt=Abstract
> [2] https://github.com/stefanv/scikits.image
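The request in this message (an operator given as functions rather than as a stored matrix) can be illustrated with a generic proximal-gradient loop, often called ISTA. The sketch below is an assumption-laden illustration, not the scikits.learn implementation and not necessarily the solver of the paper cited in the thread: the names ista, forward and adjoint are hypothetical, forward(w) is assumed to compute X w, adjoint(r) is assumed to compute X.T r, and the step size is assumed to be at most 1 / L, where L is the largest eigenvalue of X.T X.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ista(forward, adjoint, y, alpha, step=1.0, n_iter=100):
    """Proximal gradient for 0.5 * ||y - X w||^2 + alpha * ||w||_1.

    X is never stored: forward(w) returns X w and adjoint(r) returns X.T r.
    """
    w = np.zeros_like(adjoint(y), dtype=float)
    for _ in range(n_iter):
        # Gradient of the smooth term 0.5 * ||y - X w||^2.
        grad = adjoint(forward(w) - y)
        # Gradient step followed by the l1 proximal step.
        w = soft_threshold(w - step * grad, step * alpha)
    return w


y = np.array([3.0, -0.5, -2.0])

# X = identity, as in the denoising toy problem: L = 1, so step = 1.
w_identity = ista(lambda w: w, lambda r: r, y, alpha=1.0, step=1.0)

# X = 2 * identity, still given only as functions: L = 4, so step = 0.25.
w_scaled = ista(lambda w: 2.0 * w, lambda r: 2.0 * r, y, alpha=1.0, step=0.25)
```

With the identity operator the loop reproduces plain soft-thresholding of y, which matches the remark made in the reply to this message.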
From: Peter Prettenhofer <peter.prettenhofer@gm...> - 2011-03-31 16:51:54

2011/3/31 Alexandre Passos <alexandre.tp@...>:
> [..]
>
> Or I can fork from your branch, whichever way is easier.

This would be even better - then you could review the changes (API, default settings, etc.) :)

best,
Peter

--
Peter Prettenhofer
From: Alexandre Passos <alexandre.tp@gm...> - 2011-03-31 16:10:54

On Thu, Mar 31, 2011 at 12:20, Peter Prettenhofer <peter.prettenhofer@...> wrote:
> Wow - that sounds interesting indeed - now I know why Vowpal Wabbit has
> such an odd formula for the weight update - multiplying the gradient
> by the weight was too simple to be true...
>
> But before you integrate it into the cython code we should merge my
> learning-rate branch which introduces constant and inverse scaling
> learning rates.

Or I can fork from your branch, whichever way is easier.

--
- Alexandre
From: Peter Prettenhofer <peter.prettenhofer@gm...> - 2011-03-31 15:20:42

Wow - that sounds interesting indeed - now I know why Vowpal Wabbit has such an odd formula for the weight update - multiplying the gradient by the weight was too simple to be true...

But before you integrate it into the cython code we should merge my learning-rate branch which introduces constant and inverse scaling learning rates.

best,
Peter

2011/3/31 Alexandre Passos <alexandre.tp@...>:
> On Thu, Mar 31, 2011 at 11:28, Mathieu Blondel <mathieu@...> wrote:
>> [..]
>>
>> I was thinking of user-defined losses for example to address cost
>> sensitive learning (e.g., incur a stronger loss for some classes than
>> others).
>
> I think it's simpler to adapt all standard losses to be
> cost-sensitive (with an optional parameter which is a vector with the
> cost of getting each example wrong) than it is to handle general
> losses. There is even a way to adapt common online/non-online SGD to
> fit importance weights better than just multiplying things:
> http://arxiv.org/pdf/1011.1576 .
>
> I can implement a variant of this for the batch sgd tomorrow in the sprint
> if you're interested.
> --
> - Alexandre

--
Peter Prettenhofer
From: Alexandre Gramfort <alexandre.gramfort@in...> - 2011-03-31 15:00:28

Hi Piotr,

On Wed, Mar 30, 2011 at 7:17 AM, Piotr Gawron <gawron@...> wrote:
> I have following problem.
>
> I would like integrate ndLDA and Higher Order SVD algorithms with
> scikits.learn.
> The problem is that the data are no longer vectors therefore passing
> 2d arrays (samples x data) to predictor.predict() and
> transformer.transform() interfaces is not suitable.

What would be the shapes of y and X?

Let's say X is 3D. Could you handle it with a shape parameter that works like this:

    X = X_2d.reshape(shape)

and you would call

    estimator.fit(X_2d, y, shape)

Would that work for you?

Alex
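The shape-parameter idea in this reply can be sketched as below. Everything here is hypothetical: ShapeAwareEstimator is not a scikits.learn class, its "model" is only a placeholder mean, and shape is assumed to be the full N-d shape including the samples axis, as the expression X_2d.reshape(shape) suggests.

```python
import numpy as np


class ShapeAwareEstimator(object):
    """Hypothetical estimator: takes 2D input plus the N-d shape to restore."""

    def fit(self, X_2d, y, shape):
        # Restore the N-d view of the (samples x features) array.
        X = np.asarray(X_2d).reshape(shape)
        self.shape_ = X.shape
        # Placeholder "model": the mean sample, kept in its N-d form.
        self.mean_ = X.mean(axis=0)
        return self


n_samples, height, width = 4, 2, 3
X_nd = np.arange(n_samples * height * width, dtype=float).reshape(
    n_samples, height, width)
# Flatten each sample so the data follows the 2D (samples x features) layout.
X_2d = X_nd.reshape(n_samples, -1)
y = np.array([0, 1, 0, 1])

est = ShapeAwareEstimator().fit(X_2d, y, shape=(n_samples, height, width))
```

Keeping the data 2D at the interface means generic utilities can still slice samples along the first axis, while the estimator alone knows about the tensor structure.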
From: Alexandre Passos <alexandre.tp@gm...> - 2011-03-31 14:49:05

On Thu, Mar 31, 2011 at 11:28, Mathieu Blondel <mathieu@...> wrote:
> Peter,
>
> On Thu, Mar 31, 2011 at 9:36 PM, Peter Prettenhofer
> <peter.prettenhofer@...> wrote:
>> I haven't thought about this... a quick and dirty way to solve it is
>> to specify the number of classes as a constructor argument (similar to
>> the TheanoSGD classifier in jaberg's imagepatch branch).
>
> I thought about this too but it implies that we have two distinct
> classes: one for batch and one for online. It would make things
> simpler if we could have fit and partial_fit in the same class (or
> fit_iterable if we decide to go this way).
>
>> From a technical point of view it's possible to inject custom loss
>> functions (pure python loss functions) just by inheriting the
>> extension type "LossFunction", however, this feature is currently not
>> available to the scikit.learn user. The reason for that is convenience
>> - the user does not need to instantiate a loss function object
>> directly.
>
> I was thinking of user-defined losses for example to address cost
> sensitive learning (e.g., incur a stronger loss for some classes than
> others).

I think it's simpler to adapt all standard losses to be cost-sensitive (with an optional parameter which is a vector with the cost of getting each example wrong) than it is to handle general losses. There is even a way to adapt common online/non-online SGD to fit importance weights better than just multiplying things: http://arxiv.org/pdf/1011.1576 .

I can implement a variant of this for the batch sgd tomorrow in the sprint if you're interested.

--
- Alexandre
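As a point of reference for the proposal above, here is a minimal sketch of the simple baseline: a standard hinge loss whose gradient is multiplied by a per-example cost. It is explicitly not the importance-weight update of the arXiv paper linked in the message, and the function names (hinge_loss, hinge_dloss, weighted_sgd_step) are hypothetical, not part of the scikits.learn SGD module.

```python
def hinge_loss(p, y):
    """Standard hinge loss for a prediction p and a label y in {-1, +1}."""
    return max(0.0, 1.0 - p * y)


def hinge_dloss(p, y):
    """Derivative of the hinge loss with respect to the prediction p."""
    return -y if p * y < 1.0 else 0.0


def weighted_sgd_step(w, x, y, cost, eta):
    """One SGD step where the gradient is scaled by the per-example cost."""
    p = sum(wi * xi for wi, xi in zip(w, x))
    g = cost * hinge_dloss(p, y)
    return [wi - eta * g * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0]
x = [1.0, 2.0]
# The same example with two different costs of getting it wrong.
w_cheap = weighted_sgd_step(w, x, +1, cost=1.0, eta=0.1)
w_costly = weighted_sgd_step(w, x, +1, cost=5.0, eta=0.1)
```

A large cost simply produces a proportionally larger step here, which is the behaviour the message describes as "just multiplying things".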
From: Alexandre Gramfort <alexandre.gramfort@in...> - 2011-03-31 14:46:13

Hi Mateusz,

> I would like to write python script which tackles the image denoising
> problem. The model is very similar to LASSO
> (http://scikitlearn.sourceforge.net/modules/linear_model.html#lasso)
> but in my case X is the identity matrix.

Can you be more specific? I guess you mean that you have no convolution kernel. Can you write the cost function you want to minimize?

> Although I can easily work with vectors instead of images, the
> existence of the matrix is problematic
> due to the size of such matrix. Therefore instead of working with
> matrices I would prefer to work with function's handlers.
> Is it possible to work with function's handlers instead of matrices
> within the SciKits.Learn framework?

Can you paste a code snippet of what you have in mind?

> Is there any example of the usage of the framework regarding to image
> denoising/deblurring problem?

We've used the scikit with Vincent Michel and Gael Varoquaux for TV denoising [1], which amounts to a regression task with images. But I am not sure of what you have in mind.

Maybe the scikits.image folks [2] can give you more insights.

Alex

[1] http://www.ncbi.nlm.nih.gov/pubmed/21317080?dopt=Abstract
[2] https://github.com/stefanv/scikits.image
From: Mathieu Blondel <mathieu@mb...> - 2011-03-31 14:28:29

Peter,

On Thu, Mar 31, 2011 at 9:36 PM, Peter Prettenhofer <peter.prettenhofer@...> wrote:
> I haven't thought about this... a quick and dirty way to solve it is
> to specify the number of classes as a constructor argument (similar to
> the TheanoSGD classifier in jaberg's imagepatch branch).

I thought about this too but it implies that we have two distinct classes: one for batch and one for online. It would make things simpler if we could have fit and partial_fit in the same class (or fit_iterable if we decide to go this way).

> From a technical point of view it's possible to inject custom loss
> functions (pure python loss functions) just by inheriting the
> extension type "LossFunction", however, this feature is currently not
> available to the scikit.learn user. The reason for that is convenience
> - the user does not need to instantiate a loss function object
> directly.

I was thinking of user-defined losses, for example to address cost-sensitive learning (e.g., incur a stronger loss for some classes than others).

> I think the two scenarios: pure online learning and large-scale
> learning are quite different. If we only want to support the latter
> I'd suggest to reject partial_fit and handle this stuff in a dataset
> object. This object supports iteration over examples and SGD just
> invokes its __iter__ method to access training data. The advantage is
> that this also works if the object happens to be a numpy array or a
> scipy sparse matrix (in the first case it yields dense row vectors, in
> the latter sparse row vectors). It will also work with a memory mapped
> array or Pytables.

Could we expect better performance if the iterable outputs a subset of the data (X, y) rather than an individual instance (x, y)?

> I think it will be difficult - if not impossible - to deliver an
> implementation that is easy to use* and, on the one hand, supports
> pure online learning and, on the other hand, is efficient for
> large-scale learning. What do you think? What are your/scikit.learn's
> priorities?

My priority is definitely the ability to handle large-scale datasets over the pure online setting. The reason partial_fit(X, y) was proposed, if I remember correctly, was that it is consistent with what has been done before: use numpy arrays (or sparse matrices) to represent the data.

In terms of ease of use, I find the snippet above quite nice. But there was also the idea of creating an OnlineMixin which would implement fit_iterable(iter) in terms of partial_fit.

    reader = SvmlightReader("file.txt", block_size=10000)
    clf.fit_iterable(reader, n_iter=10)

The main issue of partial_fit seems to be that the notion of iteration over the dataset is lost (which can be a problem for some learning rate decays?).

Mathieu
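The OnlineMixin idea from this message could look roughly like the sketch below. Only the names OnlineMixin, fit_iterable and partial_fit come from the thread; the method body, the toy CountingClassifier and the assumption that the iterable can be traversed several times (a list, or a reader that rewinds itself when iterated again) are illustrative guesses.

```python
class OnlineMixin(object):
    """Sketch: provide fit_iterable on top of an estimator's partial_fit."""

    def fit_iterable(self, iterable, n_iter=1):
        for _ in range(n_iter):
            # Assumes the iterable can be traversed again on every pass.
            for X, y in iterable:
                self.partial_fit(X, y)
        return self


class CountingClassifier(OnlineMixin):
    """Toy estimator that only records what partial_fit has seen."""

    def __init__(self):
        self.n_samples_seen_ = 0
        self.classes_ = set()

    def partial_fit(self, X, y):
        # A real estimator would update its weights here.
        self.n_samples_seen_ += len(X)
        self.classes_.update(y)
        return self


# Two blocks of (X, y); the second block introduces a previously unseen class.
blocks = [([[0.0], [1.0]], [0, 1]), ([[2.0]], [2])]
clf = CountingClassifier().fit_iterable(blocks, n_iter=10)
```

Since the mixin drives the passes itself, it knows the iteration count, which addresses the concern that plain partial_fit loses the notion of iterating over the dataset.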
From: Peter Prettenhofer <peter.prettenhofer@gm...> - 2011-03-31 12:36:38

Hi Mathieu,

thanks for taking this up. You'll find some comments below:

2011/3/31 Mathieu Blondel <mathieu@...>:
> [..]
>
> The first one is that the vector y may contain only a subset of the
> classes (or in the extreme case, only one class). This is a problem
> since SGD preallocate the coef_ matrix (n_classes x n_features). The
> obvious solution is to use a dictionary to store the weight vectors of
> each class instead of a numpy 2d-array. For compatibility with other
> classifiers, we can implement coef_ as a property.

I haven't thought about this... a quick and dirty way to solve it is to specify the number of classes as a constructor argument (similar to the TheanoSGD classifier in jaberg's imagepatch branch). Anyway, complete online multi-class classification requires a serious refactoring of the current SGD code base!

>
> The second potential problem is about the learning schedules. The
> routines written in Cython need a n_iter argument. If the user makes
> several passes over the dataset (see below) and call partial_fit
> repeatedly, we would need to save the state of the learning rate?

That's true - the learning rate has to be stored.

>
> Peter, what areas of the code do you think need to be changed and do
> you have ideas how to factor as much code as possible?

When you look at the current cython code you will notice that it pretty much relies on numpy ndarrays or scipy's sparse matrices. However, we could change the loop over the training examples from row indices [1] to something which returns a pair of x and y [2], where x may be an ndarray for the dense case, or a sparse matrix with a single row or a recarray (as in bolt) for the sparse case. This would require just minor refactorings and would make the current code a little bit slower (factor of 2).

[1] https://github.com/pprett/scikitlearn/blob/master/scikits/learn/linear_model/sgd_fast.pyx#L310
[2] https://github.com/pprett/bolt/blob/master/bolt/trainer/sgd.pyx#L428

>
> Another thing I was wondering: is it possible to extract reusable
> utils from the SGD module such as dense-sparse dot product,
> dense-sparse addition etc? (I suppose we would need a pyd header
> file?) I was wondering about that because of custom loss functions
> too.

From a technical point of view it's possible to inject custom loss functions (pure python loss functions) just by inheriting the extension type "LossFunction", however, this feature is currently not available to the scikit.learn user. The reason for that is convenience - the user does not need to instantiate a loss function object directly.

When it comes to "reusing" functions from the sgd module I'd suggest that we consider it as soon as we have a concrete example that will benefit from it - it will require some "cython import kung-fu" - something which I haven't mastered yet.

>
> Also to put partial_fit into more context: although partial_fit can
> potentially be used in a pure online setting, the plan was mainly to
> use it for large scale datasets, i.e. make several iterations over the
> datasets but load the data by blocks. The plan was to create an
> iterator object which can be reset:
>
> reader = SvmlightReader("file.txt", block_size=10000)
> for n in range(n_iter):
>     for X, y in reader:
>         clf.partial_fit(X, y)
>     reader.reset()
>
> It could also be useful to have a method to generate a mini-batch
> block randomly:
> X, y = reader.random_minibatch(blocksize=1000)
>
> A text-based file format like Svmlight's doesn't offer a direct way to
> quickly retrieve a random line. We would need to build a "line => byte
> offset" index (can be produced in memory when needed).
>
> # All in all, this made me think that if we want to start playing with
> an online API, it would probably be easier to start with a good old
> averaged perceptron at first than trying to modify the current SGD
> module.
>

I think the two scenarios, pure online learning and large-scale learning, are quite different. If we only want to support the latter I'd suggest rejecting partial_fit and handling this stuff in a dataset object. This object supports iteration over examples and SGD just invokes its __iter__ method to access training data. The advantage is that this also works if the object happens to be a numpy array or a scipy sparse matrix (in the first case it yields dense row vectors, in the latter sparse row vectors). It will also work with a memory mapped array or Pytables.

I think it will be difficult - if not impossible - to deliver an implementation that is easy to use* and, on the one hand, supports pure online learning and, on the other hand, is efficient for large-scale learning. What do you think? What are your/scikit.learn's priorities?

best,
Peter

* scikit.learn's main attribute

--
Peter Prettenhofer
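The point that "the learning rate has to be stored" can be made concrete with a small sketch. The class below is hypothetical (InverseScalingSchedule and next_rates are illustrative names, not the Cython code discussed in the thread), and it assumes an inverse scaling rule of the form eta_t = eta0 / t ** power_t, with the update counter t kept on the object so that successive partial_fit-style calls continue the decay instead of restarting it.

```python
class InverseScalingSchedule(object):
    """eta_t = eta0 / t ** power_t, with t preserved between calls."""

    def __init__(self, eta0=0.1, power_t=0.5):
        self.eta0 = eta0
        self.power_t = power_t
        self.t_ = 0  # number of updates performed so far

    def next_rates(self, n_updates):
        """Return the learning rates for the next n_updates weight updates."""
        rates = []
        for _ in range(n_updates):
            self.t_ += 1
            rates.append(self.eta0 / self.t_ ** self.power_t)
        return rates


schedule = InverseScalingSchedule(eta0=1.0, power_t=1.0)
first = schedule.next_rates(2)   # as if from a first partial_fit call
second = schedule.next_rates(2)  # continues where the first call stopped
```

If the counter were reset on every call, each block of data would be learned with the large initial rates again, which is the state-loss problem raised in the quoted message.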
From: Mateusz Malinowski <m4linka@gm...> - 2011-03-31 12:31:35

I would like to write a python script which tackles the image denoising problem. The model is very similar to LASSO (http://scikitlearn.sourceforge.net/modules/linear_model.html#lasso) but in my case X is the identity matrix.

Although I can easily work with vectors instead of images, the existence of the matrix is problematic due to the size of such a matrix. Therefore instead of working with matrices I would prefer to work with function's handlers. Is it possible to work with function's handlers instead of matrices within the SciKits.Learn framework?

Is there any example of the usage of the framework for an image denoising/deblurring problem?

Best,
Mateusz Malinowski
From: Mathieu Blondel <mathieu@mb...> - 2011-03-31 11:39:40

As you may remember from a thread on the mailing-list (back a few months ago), there was an agreement that online algorithms should implement a partial_fit(X, y) method. The reason for adding a new method was mainly a matter of semantics: partial_fit makes it clear that the previous model is not erased when partial_fit is called again.

I started to look into adding partial_fit to the SGD module. My original idea was to rename the fit method in BaseSGD to _fit, add a partial=True|False option and initialize the model parameters only when partial=False or the parameters are not present yet. This way, fit and partial_fit could easily be implemented in terms of _fit. However, it is more difficult than I thought and I found potential issues.

The first one is that the vector y may contain only a subset of the classes (or in the extreme case, only one class). This is a problem since SGD preallocates the coef_ matrix (n_classes x n_features). The obvious solution is to use a dictionary to store the weight vectors of each class instead of a numpy 2d-array. For compatibility with other classifiers, we can implement coef_ as a property.

The second potential problem is about the learning schedules. The routines written in Cython need an n_iter argument. If the user makes several passes over the dataset (see below) and calls partial_fit repeatedly, we would need to save the state of the learning rate?

Peter, what areas of the code do you think need to be changed and do you have ideas how to factor as much code as possible?

Another thing I was wondering: is it possible to extract reusable utils from the SGD module such as dense-sparse dot product, dense-sparse addition etc? (I suppose we would need a pyd header file?) I was wondering about that because of custom loss functions too.

Also to put partial_fit into more context: although partial_fit can potentially be used in a pure online setting, the plan was mainly to use it for large scale datasets, i.e. make several iterations over the datasets but load the data by blocks. The plan was to create an iterator object which can be reset:

    reader = SvmlightReader("file.txt", block_size=10000)
    for n in range(n_iter):
        for X, y in reader:
            clf.partial_fit(X, y)
        reader.reset()

It could also be useful to have a method to generate a mini-batch block randomly:

    X, y = reader.random_minibatch(blocksize=1000)

A text-based file format like Svmlight's doesn't offer a direct way to quickly retrieve a random line. We would need to build a "line => byte offset" index (can be produced in memory when needed).

# All in all, this made me think that if we want to start playing with an online API, it would probably be easier to start with a good old averaged perceptron at first than trying to modify the current SGD module.

Mathieu
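The "line => byte offset" index mentioned at the end of this message can be sketched with the standard library alone. The helper names (build_line_index, read_line, random_minibatch) are hypothetical and are not methods of the SvmlightReader proposed in the thread; the sketch returns raw text lines and leaves the parsing of the Svmlight format aside.

```python
import os
import random
import tempfile


def build_line_index(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)
    return offsets


def read_line(path, offsets, i):
    """Fetch line i directly by seeking to its recorded offset."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return f.readline().decode("ascii").rstrip("\n")


def random_minibatch(path, offsets, blocksize, rng=random):
    """Sample blocksize distinct lines without reading the whole file."""
    chosen = rng.sample(range(len(offsets)), blocksize)
    return [read_line(path, offsets, i) for i in chosen]


# Small Svmlight-style file written to a temporary location for the demo.
lines = ["+1 1:0.5 3:1.0", "-1 2:2.0", "+1 1:1.5 2:0.1", "-1 3:0.7"]
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write("\n".join(lines).encode("ascii") + b"\n")

index = build_line_index(path)
third = read_line(path, index, 2)
batch = random_minibatch(path, index, blocksize=2)
os.remove(path)
```

The index costs one integer per line, so it can be rebuilt in memory on demand as the message suggests, at the price of one sequential pass over the file.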
From: Vlad Niculae <vlad@ve...> - 2011-03-31 06:10:24

Hey,

Of course I'll wait, what I meant is that I'll try to get as much done early on.

Thanks,
Vlad

On Thu, Mar 31, 2011 at 8:39 AM, Gael Varoquaux <gael.varoquaux@...> wrote:
> On Wed, Mar 30, 2011 at 02:12:29PM +0300, Vlad Niculae wrote:
>> I have started to complete my proposal. I used what has been written
>> in the description on your wiki, but I could use some help regarding
>> the goals and the deliverables that I am fixing. Of course I know that
>> the decision about what algorithms to implement etc. will be made
>> later, and that the goals will be updated, but I would like to know
>> how the initial form should look like.
>
>> I aim to submit the proposal as soon as possible after friday, when
>> hopefully I will have some code merged upstream.
>
>> The link is here:
>> https://github.com/vene/scikitlearn/wiki/VladNiculaestudentproposal
>
> Hey Vlad,
>
> Could you wait a bit longer than right after Friday to submit. I'd like
> to give this proposal a bit more thought, and I guess other possible
> mentors would too. I am fighting to find time. The deadline is still a
> week after.
>
> Best,
>
> G
From: Gael Varoquaux <gael.varoquaux@no...> - 2011-03-31 05:39:52

On Wed, Mar 30, 2011 at 02:12:29PM +0300, Vlad Niculae wrote:
> I have started to complete my proposal. I used what has been written
> in the description on your wiki, but I could use some help regarding
> the goals and the deliverables that I am fixing. Of course I know that
> the decision about what algorithms to implement etc. will be made
> later, and that the goals will be updated, but I would like to know
> how the initial form should look like.

> I aim to submit the proposal as soon as possible after friday, when
> hopefully I will have some code merged upstream.

> The link is here:
> https://github.com/vene/scikitlearn/wiki/VladNiculaestudentproposal

Hey Vlad,

Could you wait a bit longer than right after Friday to submit? I'd like to give this proposal a bit more thought, and I guess other possible mentors would too. I am fighting to find time. The deadline is still a week after.

Best,

G
From: Piotr Gawron <gawron@ii...> - 2011-03-30 11:17:25

I have the following problem.

I would like to integrate ndLDA and Higher Order SVD algorithms with scikits.learn. The problem is that the data are no longer vectors, therefore passing 2d arrays (samples x data) to the predictor.predict() and transformer.transform() interfaces is not suitable.

Do you have any ideas how I should pass the information about the shape of the data to those functions in order to be in agreement with scikits.learn coding standards?

Sincerely,
Piotr Gawron
From: Vlad Niculae <vlad@ve...> - 2011-03-30 11:12:37

Hi,

I have started to complete my proposal. I used what has been written in the description on your wiki, but I could use some help regarding the goals and the deliverables that I am fixing. Of course I know that the decision about what algorithms to implement etc. will be made later, and that the goals will be updated, but I would like to know what the initial form should look like.

I aim to submit the proposal as soon as possible after Friday, when hopefully I will have some code merged upstream.

The link is here:
https://github.com/vene/scikitlearn/wiki/VladNiculaestudentproposal

Thank you!
Vlad

On Thu, Feb 24, 2011 at 11:21 PM, Alexandre Gramfort <alexandre.gramfort@...> wrote:
> Hi,
>
> following this old thread I've put up a list of ideas for a GSOC:
>
> https://github.com/scikitlearn/scikitlearn/wiki/AlistoftopicsforaGooglesummerofcode(GSOC)2011
>
> do not hesitate to add ideas and propose yourself as a mentor :)
>
> Alex
>
> On Mon, Jan 31, 2011 at 1:29 PM, Robert Kern <robert.kern@...> wrote:
>> On Mon, Jan 31, 2011 at 12:26, Mathieu Blondel <mathieu@...> wrote:
>>> On Tue, Feb 1, 2011 at 2:48 AM, Gael Varoquaux
>>> <gael.varoquaux@...> wrote:
>>>
>>>> Scipy cannot sponsort. It would be PSF via Scipy, or we could try others
>>>
>>> Why not? Does an organization needs to legally exist (e.g., as a
>>> foundation) to be eligible?
>>
>> There is no fundamental reason, but every time we have applied as a
>> separate mentoring organization, Google has told us to direct mentors
>> and students to apply under the PSF.
>>
>> --
>> Robert Kern
>>
>> "I have come to believe that the whole world is an enigma, a harmless
>> enigma that is made terrible by our own mad attempt to interpret it as
>> though it had an underlying truth."
>>  - Umberto Eco
From: Gael Varoquaux <gael.varoquaux@no...> - 2011-03-30 04:49:59

On Wed, Mar 30, 2011 at 01:07:22AM +0300, Vlad Niculae wrote:
> I have indeed removed the bunch object in favour of the internal
> named_tuple construct,

That's definitely what you would want to do in an ideal world, but the named tuple is new to Python 2.6, and we are still trying to support Python 2.5. You can try and use it, and fall back to the bunch if it's not present.

> > By the way, are you still interested in submitting a Google summer of
> > code proposal? In which case, it would be good to start working on it,
> > as the deadline for proposals is nearing (this applies to anybody
> > wanting to do the GSoC on scikits.learn).

> I am very interested indeed in the GSoC. I was waiting for updates
> regarding the status of scikits-learn as a mentoring organization etc.

We are accepting students via the PSF:
http://wiki.python.org/moin/SummerOfCode/2011

> But I probably should get all the documents in order anyway.

Yes, you should work on your proposal, and work to meet the PSF requirements.

Good luck,

Gaël
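The fallback suggested in this reply (use the named tuple when it exists, otherwise a bunch) might be sketched as follows. The Bunch class below is only a minimal stand-in written for this example, not the actual class from scikits.learn.datasets.base, and make_result is a hypothetical helper name.

```python
try:
    from collections import namedtuple  # available from Python 2.6 on
except ImportError:                      # e.g. Python 2.5
    namedtuple = None


class Bunch(dict):
    """Minimal stand-in for a bunch: a dict with attribute access."""

    def __init__(self, **kwargs):
        dict.__init__(self, kwargs)
        self.__dict__ = self


def make_result(**fields):
    """Return a named tuple when possible, otherwise a Bunch."""
    if namedtuple is not None:
        # Sort the names so the field order is deterministic.
        Result = namedtuple("Result", sorted(fields))
        return Result(**fields)
    return Bunch(**fields)


res = make_result(components=[[1.0, 0.0]], n_iter=12)
fallback = Bunch(components=[[1.0, 0.0]], n_iter=12)
```

Calling code that only uses attribute access (res.n_iter, res.components) behaves the same with either container, which is what makes the fallback transparent.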
From: Mathieu Blondel <mathieu@mb...> - 2011-03-30 03:18:07

On Wed, Mar 30, 2011 at 5:41 AM, Vlad Niculae <vlad@...> wrote:
> I will also participate via IRC. My main tasks are in matrix
> factorization: finishing the tests and writing examples for NMF, maybe
> begin work on sparse PCA or kernel PCA or something on the list in my
> thread. If not, I'll help the other tasks as much as I can.

I have an implementation of Kernel PCA (just need to polish the documentation). I will try to put together a pull request so you can review it during the sprint.

Mathieu
From: Vlad Niculae <vlad@ve...>  20110329 22:07:30

On Wed, Mar 30, 2011 at 12:03 AM, Gael Varoquaux <gael.varoquaux@...> wrote:
> On Tue, Mar 29, 2011 at 11:41:48PM +0300, Vlad Niculae wrote:
>> I will also participate via IRC. My main tasks are in matrix
>> factorization: finishing the tests and writing examples for NMF, maybe
>> begin work on sparse PCA or kernel PCA or something on the list in my
>> thread. If not, I'll help the other tasks as much as I can.
>
> Hey Vlad,
>
> I have a lot of interest in matrix factorization techniques, so I have
> been trying to review your branch for the last week, when I had spare
> cycles. Unfortunately, I haven't found time to dig deep in, and I can't
> do it now. Let me give a few very general remarks about the cro, which is
> what I had time to look at most. I don't love the use of the bunch object.
> I think that I would prefer a dictionary, because it is a standard Python
> object. Or you could use the Bunch as it is defined in
> scikits.learn.datasets.base, that way we don't have 2 bunch objects.

I have indeed removed the bunch object in favour of the internal
named_tuple construct, I'm not sure whether I pushed that change yet.
Among other unpushed changes are sparsity constraints in the nmf fit! I
can't wait to do some benchmarks and examples.

> Ideally, I'd like a more 'flat' data structure, simply based on lists,
> but I can see why you had to do what you did. Also, I think that I'd
> prefer a bit more functionality pushed in functions rather than methods,
> as I don't think that the methods that you have always need the structure
> of an object.

I will indeed try to refactor into functions. I am just more used to
this style.

> It's also often easier to test. I should delve into this
> more, but...
>
>> Should I announce this on the wiki and if so, in what form?
>
> What you have done is good.
>
> By the way, are you still interested in submitting a Google summer of
> code proposal? In which case, it would be good to start working on it,
> as the deadline for proposals is nearing (this applies to anybody
> wanting to do the GSoC on scikits.learn).

I am very interested indeed in the GSoC. I was waiting for updates
regarding the status of scikits.learn as a mentoring organization etc.
But I probably should get all the documents in order anyway.

Thanks!
Vlad

> G
>
> Enable your software for Intel(R) Active Management Technology to meet the
> growing manageability and security demands of your customers. Businesses
> are taking advantage of Intel(R) vPro (TM) technology - will your software
> be a part of the solution? Download the Intel(R) Manageability Checker
> today! http://p.sf.net/sfu/inteldev2devmar
> _______________________________________________
> Scikitlearngeneral mailing list
> Scikitlearngeneral@...
> https://lists.sourceforge.net/lists/listinfo/scikitlearngeneral
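
For readers following the container discussion in this message (a custom bunch object, a plain dictionary, or a named tuple), here is a minimal sketch of the last two options using only the standard library. The container name `FactorizationResult` and its fields `W`, `H` and `n_iter` are hypothetical; they are not taken from the NMF branch under review.

```python
from collections import namedtuple

# Hypothetical result container; the field names are made up for illustration.
FactorizationResult = namedtuple("FactorizationResult", ["W", "H", "n_iter"])


def as_dict(result):
    """Convert the named tuple to a plain dict, the standard Python mapping."""
    return dict(result._asdict())


result = FactorizationResult(W=[[1.0, 0.0]], H=[[0.5], [0.5]], n_iter=3)

# Fields are available by name and by position, and the tuple is immutable.
print(result.n_iter)
print(as_dict(result)["n_iter"])
```

A named tuple keeps attribute access like a bunch object while remaining a standard, flat Python type, and it converts to a dictionary when a mapping is preferred.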
From: Gael Varoquaux <gael.varoquaux@no...>  20110329 21:31:44

On Tue, Mar 29, 2011 at 11:03:20PM +0200, helge.reikeras@... wrote:
> It would certainly be possible to do fit(X) etc. In fact, this is
> pretty much how the interface behaves already. However, the structure
> of X is not a simple numarray. E.g. we use an empty list to indicate
> latent nodes and a non-empty list for evidence nodes. For the latter
> the number of list elements corresponds to the dimensionality of the
> random variable associated with the observed node. We then nest these
> lists into a list of observations indexed by the nodes. In the common
> case that you have multiple observations per node such a list is
> created for each observation and then joined into a new list that will
> eventually become the training data X. I.e.
> X[i][j][k]
> is the ith observation of the kth dimension of the jth node.

OK, so this can be seen as a list of sparse matrices in LIL (list of
lists) representation. This is quite close to what we would do, but I
guess we would separate the observations from the graphical structure.
One example would be
https://github.com/agramfort/scikitlearn/blob/hcluster2/examples/cluster/plot_lena_ward_segmentation.py#L42
where the data, 'X', is clustered in a way compatible with a Markov
structure on the image, which is specified by the 'connectivity' matrix
(a sparse matrix in this example).

We have a convention in the scikit that the observations, i.e. the data
to fit, should always be a 2D array (or sparse matrix, if the
associations are sparse) with the first axis being the samples direction,
and the second the features direction. This convention enables
interchangeability of methods, and it also makes it possible for
utilities such as cross-validation to work on the data uniformly,
without understanding the nature of the data.

With regard to preferring a sparse matrix rather than a list of lists as
an input for the graph structure, this stems from the fact that there are
different ways of representing a graph: list of lists, list of edges (COO
in sparse matrix terms), ... The best structure to represent the graph
depends on the algorithms that you are going to run on it. Sparse
matrices are pretty much graphs that use a few numpy arrays to store
their structure. They are available in different storage formats with
already-written code to convert from one to the other. And finally, there
is a lot of code available to do graph operations on them.

> Note that the current implementation only deals with nodes that can be
> represented as finite probability tables, i.e. discrete or observed
> continuous variables. We can not do inference for latent continuous
> variables.

Fair enough, it limits the problem, and thus makes it easier to
implement.

Cheers,

G
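
To make the two graph representations mentioned in this message concrete, here is a small dependency-free sketch that builds the 4-neighbour connectivity of a tiny image as a list of lists and converts it to a list of edges (COO style), while keeping the observations `X` in a separate samples-by-features table. The helper names `grid_adjacency` and `to_coo` are invented for this illustration and are not part of the scikit.

```python
def grid_adjacency(n_rows, n_cols):
    """List-of-lists graph: neighbors[p] holds the 4-connected neighbors of pixel p."""
    neighbors = [[] for _ in range(n_rows * n_cols)]
    for r in range(n_rows):
        for c in range(n_cols):
            p = r * n_cols + c
            if c + 1 < n_cols:  # edge to the right neighbor
                neighbors[p].append(p + 1)
                neighbors[p + 1].append(p)
            if r + 1 < n_rows:  # edge to the bottom neighbor
                neighbors[p].append(p + n_cols)
                neighbors[p + n_cols].append(p)
    return neighbors


def to_coo(neighbors):
    """List-of-edges (COO) form: parallel row and column index lists."""
    rows, cols = [], []
    for i, adjacent in enumerate(neighbors):
        for j in sorted(adjacent):
            rows.append(i)
            cols.append(j)
    return rows, cols


# Observations: one row per sample (here, a pixel), one column per feature.
X = [[0.1], [0.2], [0.9], [1.0]]
# Graph structure, kept separate from the observations.
neighbors = grid_adjacency(2, 2)
rows, cols = to_coo(neighbors)
print(neighbors)
print(list(zip(rows, cols)))
```

The `rows` and `cols` lists, together with a list of ones as data, have the form that `scipy.sparse.coo_matrix((data, (rows, cols)), shape=(4, 4))` accepts, which is one way to obtain a sparse connectivity matrix from such a graph.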
From: <helge.reikeras@gm...>  20110329 21:03:26

Hi Gael

Thanks for your reply.

On Tue, Mar 29, 2011 at 10:05 PM, Gael Varoquaux <gael.varoquaux@...> wrote:
> On Tue, Mar 29, 2011 at 08:14:16PM +0200, helge.reikeras@... wrote:
>> During my master's I helped author a GM Python toolkit [2]. The code
>> never really made it out of 'research mode' and still lacks a bit
>> w.r.t. documentation, tests, optimization, etc. However, it does work.
>> Some parts are essentially ports of the Bayes Nets toolbox for Matlab
>> by K.Murphy. Perhaps there is something that could be reused in the
>> SciKit.
>
> There would definitely be some interest. Out of curiosity, can you
> implement a simple interface with learners (we call them estimators)
> implementing a fit(X) method and models described with a small set of
> general objects? For instance, we like to use sparse matrices to
> describe a graph. The reason that I am asking this, is that if you need a
> different API and layout than the rest of the scikit, it might make more
> sense keeping the packages separate. On the other hand, if you can 'slot
> in', it would be great, as one of the goals of the scikit is to make it
> easy to combine and compare methods.

It would certainly be possible to do fit(X) etc. In fact, this is
pretty much how the interface behaves already. However, the structure
of X is not a simple numarray. E.g. we use an empty list to indicate
latent nodes and a non-empty list for evidence nodes. For the latter
the number of list elements corresponds to the dimensionality of the
random variable associated with the observed node. We then nest these
lists into a list of observations indexed by the nodes. In the common
case that you have multiple observations per node such a list is
created for each observation and then joined into a new list that will
eventually become the training data X. I.e.
X[i][j][k]
is the ith observation of the kth dimension of the jth node.

The interface is fairly flexible in how it deals with missing data. If
the node is latent then X[i][j] = [].

We already have the option of sparse adjacency matrices.

Note that the current implementation only deals with nodes that can be
represented as finite probability tables, i.e. discrete or observed
continuous variables. We can not do inference for latent continuous
variables.

>
> Thanks for your input,
>
> Gael

Regards,
Helge
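
As an illustration of the nested layout described in this message, the sketch below builds a small `X[i][j][k]` structure and flattens it into a 2D samples-by-features table. It is only a sketch under stated assumptions: the function name `flatten_observations`, the `node_dims` argument and the choice of NaN to mark latent nodes are all invented here and do not come from the toolkit being discussed.

```python
import math


def flatten_observations(X_nested, node_dims):
    """Flatten X_nested[i][j][k] into rows of length sum(node_dims).

    Latent nodes (empty lists) are filled with NaN, an assumption made
    for this sketch. Returns the 2D list and, for each node, the
    (start, stop) column range it occupies.
    """
    columns = []
    start = 0
    for dim in node_dims:
        columns.append((start, start + dim))
        start += dim

    X_flat = []
    for observation in X_nested:
        row = []
        for values, dim in zip(observation, node_dims):
            if values:  # evidence node: copy its observed values
                if len(values) != dim:
                    raise ValueError("node dimensionality mismatch")
                row.extend(float(v) for v in values)
            else:  # latent node: no evidence in this observation
                row.extend([math.nan] * dim)
        X_flat.append(row)
    return X_flat, columns


# Two observations of three nodes: node 0 is 1-D, node 1 is latent,
# node 2 is 2-D.
X_nested = [
    [[1], [], [0.5, 2.0]],
    [[0], [], [1.5, 3.0]],
]
X_flat, columns = flatten_observations(X_nested, node_dims=[1, 1, 2])
print(columns)
print(X_flat)
```

Here `X_nested[1][2][0]` is the first dimension of node 2 in the second observation, and the same value sits at row 1, column 2 of the flattened table.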
From: Gael Varoquaux <gael.varoquaux@no...>  20110329 21:03:25

On Tue, Mar 29, 2011 at 11:41:48PM +0300, Vlad Niculae wrote:
> I will also participate via IRC. My main tasks are in matrix
> factorization: finishing the tests and writing examples for NMF, maybe
> begin work on sparse PCA or kernel PCA or something on the list in my
> thread. If not, I'll help the other tasks as much as I can.

Hey Vlad,

I have a lot of interest in matrix factorization techniques, so I have
been trying to review your branch for the last week, when I had spare
cycles. Unfortunately, I haven't found time to dig deep in, and I can't
do it now. Let me give a few very general remarks about the cro, which is
what I had time to look at most. I don't love the use of the bunch object.
I think that I would prefer a dictionary, because it is a standard Python
object. Or you could use the Bunch as it is defined in
scikits.learn.datasets.base, that way we don't have 2 bunch objects.
Ideally, I'd like a more 'flat' data structure, simply based on lists,
but I can see why you had to do what you did. Also, I think that I'd
prefer a bit more functionality pushed in functions rather than methods,
as I don't think that the methods that you have always need the structure
of an object. It's also often easier to test. I should delve into this
more, but...

> Should I announce this on the wiki and if so, in what form?

What you have done is good.

By the way, are you still interested in submitting a Google summer of
code proposal? In which case, it would be good to start working on it,
as the deadline for proposals is nearing (this applies to anybody
wanting to do the GSoC on scikits.learn).

G
From: didier vila <viladidier@ho...>  20110329 20:43:56

All,

As I saw that you will organize a sprint this Friday, I just wanted to
mention the topic of time series prediction. For example, it would be
fantastic if Linear Dynamical Systems could be developed, or it would be
good to give more structural flexibility to the HMM (second order, third
order, or flexible structure).

At this stage, I have learnt how to use the GaussianHMM (done) and the
Gaussian Process (in progress) in order to make predictions. I am not a
programmer, just a user of scikits.python (at least at this stage) who is
developing his Python skills, with a quantitative background.

I will be more than happy to participate in validation work as an end
user on these topics or closely related ones, as I am keen to learn the
theory and check the implementation against the theory (as I will be
travelling on Friday for work, I will not attend). Just let me know if
you need people involved in a validation process.

Regards
Didier
From: Vlad Niculae <vlad@ve...>  20110329 20:41:57

Hello

I will also participate via IRC. My main tasks are in matrix
factorization: finishing the tests and writing examples for NMF, maybe
begin work on sparse PCA or kernel PCA or something on the list in my
thread. If not, I'll help the other tasks as much as I can.

Should I announce this on the wiki and if so, in what form?

I will send a pull request for the NMF code at least the night before.
Anyway I'll be on IRC as soon as I get home on Friday (is 9am UTC+1 Paris
time?), see you then!

Vlad

On Mon, Mar 28, 2011 at 12:04 PM, Vincent Michel <vm.michel@...> wrote:
> Hi,
>
> The sprint will start at 9am, and will finish around 7pm.
> Location in Paris is:
>
> Logilab, 104 boulevard Louis-Auguste Blanqui, 75013
> Metro 6 - Glacière
> More details here : http://www.logilab.fr/contact
>
> See you there !
>
> Vincent
>
> 2011/3/26 Gael Varoquaux <gael.varoquaux@...>
>>
>> This is just a reminder to everybody that we are having a sprint on the
>> scikit next Friday. It will be hosted at Logilab, in Paris, and in Boston
>> (AlexG do you have a location?). We will also be on line on IRC for
>> remote participation.
>>
>> The sprint is the best time to merge in a feature that you have had
>> half-implemented, or scratch a new itch. It is also a great way to learn
>> the best practices for contributing to the scikit, as core developers
>> will be available for quick interaction.
>>
>> We will try to merge in many of the long-awaited pull requests. If you
>> have a pull request waiting for a merge and you are around to answer
>> questions, it will make the work easier.
>>
>> Efficient sprinting comes with preparation. It would be great if
>> everybody could edit the corresponding wiki page
>> https://github.com/scikitlearn/scikitlearn/wiki/Upcomingevents to give
>> practical information: who is going to be there, what are people going to
>> work on, what tasks are there to be done and where can newcomers find
>> information to achieve these tasks.
>>
>> See y'all soon!
>>
>> Gaël
From: Gael Varoquaux <gael.varoquaux@no...>  20110329 20:05:17

On Tue, Mar 29, 2011 at 08:14:16PM +0200, helge.reikeras@... wrote:
> I see from [1] that one of the planned features for scikits.learn is
> graphical models. I'm curious what the status regarding this feature
> is?

Not much is done :). I do some work on Gaussian graphical models for my
own research (for instance
http://hal.inria.fr/inria00512451/PDF/paper.pdf), and regularized
covariance learning is of general interest in machine learning. In this
regard, it is clear that some Gaussian graphical model code will land in
the scikit at some point (for instance, I'd love to find the time to
implement http://books.nips.cc/papers/files/nips23/NIPS2010_0109.pdf ).

> During my master's I helped author a GM Python toolkit [2]. The code
> never really made it out of 'research mode' and still lacks a bit
> w.r.t. documentation, tests, optimization, etc. However, it does work.
> Some parts are essentially ports of the Bayes Nets toolbox for Matlab
> by K.Murphy. Perhaps there is something that could be reused in the
> SciKit.

There would definitely be some interest. Out of curiosity, can you
implement a simple interface with learners (we call them estimators)
implementing a fit(X) method and models described with a small set of
general objects? For instance, we like to use sparse matrices to
describe a graph. The reason that I am asking this, is that if you need a
different API and layout than the rest of the scikit, it might make more
sense keeping the packages separate. On the other hand, if you can 'slot
in', it would be great, as one of the goals of the scikit is to make it
easy to combine and compare methods.

Thanks for your input,

Gael
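
The "slot in" idea from this message, where every learner exposes a `fit(X)` method on samples-by-features data so that methods can be swapped and compared, can be sketched with two toy estimators. The classes `ColumnMean` and `ColumnRange` are hypothetical and exist only for this illustration; returning `self` from `fit` and the trailing underscore on learned attributes are choices made here.

```python
class ColumnMean:
    """Toy estimator: learns the per-feature mean of a 2D dataset."""

    def fit(self, X):
        n_samples = len(X)
        n_features = len(X[0])
        self.mean_ = [sum(row[j] for row in X) / n_samples
                      for j in range(n_features)]
        return self  # returning self allows chaining calls


class ColumnRange:
    """Toy estimator: learns the per-feature (min, max) of a 2D dataset."""

    def fit(self, X):
        n_features = len(X[0])
        self.range_ = [(min(row[j] for row in X), max(row[j] for row in X))
                       for j in range(n_features)]
        return self


# 2 samples (rows) and 2 features (columns).
X = [[1.0, 10.0], [3.0, 30.0]]

# Because both objects share the same fit(X) signature, generic code can
# loop over them without knowing what each one learns.
fitted = [estimator.fit(X) for estimator in (ColumnMean(), ColumnRange())]
print(fitted[0].mean_)
print(fitted[1].range_)
```

Generic utilities only need the shared `fit(X)` signature, which is what makes it possible to combine and compare methods without knowing the nature of each model.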