From: David Marek <h4wk.cz@gm...>  2012-05-14 22:13:01

Hi,

I have worked on a multi-layer perceptron and I've got a basic implementation working. You can see it at https://github.com/davidmarek/scikitlearn/tree/gsoc_mlp The most important part is the SGD implementation, which can be found here: https://github.com/davidmarek/scikitlearn/blob/gsoc_mlp/sklearn/mlp/mlp_fast.pyx

I have encountered a few problems and I would like to know your opinion.

1) There are classes like SequentialDataset and WeightVector which are used in SGD for linear_model, but I am not sure if I should use them here as well. I have to do more with samples and weights than just multiply and add them together. I wouldn't be able to use numpy functions like tanh and do batch updates, would I? What do you think? Am I missing something that would help me do everything I need with SequentialDataset? I implemented my own LossFunction because I need a vectorized version; I think that is the same problem.

2) I used Andreas' implementation as an inspiration and I am not sure I understand some parts of it:
* Shouldn't the bias vector be initialized with ones instead of zeros? I guess there is no difference.
* I am not sure why the bias is updated with:
    bias_output += lr * np.mean(delta_o, axis=0)
  shouldn't it be:
    bias_output += lr / batch_size * np.mean(delta_o, axis=0)?
* Shouldn't the backward step for computing delta_h be:
    delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden)
  where hidden.doutput is the derivative of the activation function for the hidden layer?

I hope my questions are not too stupid. Thank you.

David
From: Andreas Mueller <amueller@ai...>  2012-05-15 07:37:48

Hi David. I'll have a look at your code later today. Let me first answer your questions about my code.

On 05/15/2012 12:12 AM, David Marek wrote:
> 2) I used Andreas' implementation as an inspiration and I am not sure
> I understand some parts of it:
> * Shouldn't the bias vector be initialized with ones instead of
> zeros? I guess there is no difference.

I am always initializing it with zeros. If you initialize it with ones, you might get out of the linear part of the nonlinearity. At the beginning, you definitely want to stay close to the linear part to have meaningful derivatives. What would be the reason to initialize with ones? Btw, there is a paper by Bengio's group on how to initialize the weights in a "good" way. You should have a look at that, but I don't have the reference at the moment.

> * I am not sure why is the bias updated with:
> bias_output += lr * np.mean(delta_o, axis=0)
> shouldn't it be:
> bias_output += lr / batch_size * np.mean(delta_o, axis=0)?

By doing the mean, the batch_size doesn't have an influence on the size of the gradient, if I'm not mistaken.

> * Shouldn't the backward step for computing delta_h be:
> delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden)
> where hidden.doutput is a derivation of the activation function for
> hidden layer?

Yes, it should be. For softmax and maximum entropy loss, loads of stuff gets canceled and the derivative wrt the output is linear. Try Wolfram Alpha if you don't believe me ;) I haven't really found a place with a good derivation for this. It is not very obvious to me.

> I hope my questions are not too stupid. Thank you.

Not at all.

Cheers,
Andy
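The cancellation Andy describes can be checked numerically; a throwaway numpy sketch (illustrative only, not from the GSoC branch) verifying that the gradient of the cross-entropy of a softmax with respect to the pre-activations reduces to softmax(z) - y:

```python
import numpy as np

def softmax(z):
    # shift by the max for numerical stability
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    # y is a one-hot target vector
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.RandomState(0)
z = rng.randn(5)
y = np.zeros(5)
y[2] = 1.0

# Analytic gradient after the cancellation: dL/dz = softmax(z) - y
analytic = softmax(z) - y

# Central-difference numerical gradient
eps = 1e-6
numeric = np.empty_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

This is why the backward step at the output layer is linear: delta_o is just (output - target), with no explicit softmax derivative in sight.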
From: David Warde-Farley <wardefar@ir...>  2012-05-15 14:58:58

On Tue, May 15, 2012 at 12:12:34AM +0200, David Marek wrote:
> 1) There are classes like SequentialDataset and WeightVector which are
> used in sgd for linear_model, but I am not sure if I should use them
> here as well. I have to do more with samples and weights than just
> multiply and add them together. I wouldn't be able to use numpy
> functions like tanh and do batch updates, would I? What do you think?

I haven't had a look at these classes myself but I think working with raw NumPy arrays is a better idea in terms of efficiency.

> 2) I used Andreas' implementation as an inspiration and I am not sure
> I understand some parts of it:
> * Shouldn't the bias vector be initialized with ones instead of
> zeros? I guess there is no difference.

If the training set is mean-centered, then absolutely, yes.

Otherwise the biases in the hidden layer should be initialized to the mean over the training set of Wx, where W are the initial weights. This ensures that the activation function is near its linear regime.

> * I am not sure why is the bias updated with:
> bias_output += lr * np.mean(delta_o, axis=0)
> shouldn't it be:
> bias_output += lr / batch_size * np.mean(delta_o, axis=0)?

As Andy said, the former allows you to set the learning rate without taking into account the batch size, which makes things a little simpler.
> * Shouldn't the backward step for computing delta_h be:
> delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden)
> where hidden.doutput is a derivation of the activation function for
> hidden layer?

Offhand that sounds right. You can use Theano as a sanity check for your implementation.

David
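The bias-initialization suggestion above can be sketched in a few lines; a hypothetical numpy illustration (names and shapes invented here, and note it uses b = -mean(XW) so the mean pre-activation is cancelled) of keeping tanh units near their linear regime on non-centered data:

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy, deliberately non-centered training data (hypothetical shapes)
X = rng.rand(100, 20) + 3.0          # 100 samples, 20 features, mean far from 0
n_hidden = 10

# Small random initial input-to-hidden weights
W = rng.uniform(-0.1, 0.1, size=(20, n_hidden))

# Cancel the mean pre-activation over the training set: b = -mean(X W)
b = -np.mean(np.dot(X, W), axis=0)

# Pre-activations are now centered, so tanh starts in its linear part
pre_activation = np.dot(X, W) + b
print(np.allclose(pre_activation.mean(axis=0), 0.0))  # True
```

With b = 0 instead, the mean pre-activation here would sit well away from zero and the tanh units would start out partially saturated.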
From: David Marek <h4wk.cz@gm...>  2012-05-16 10:16:27

On Tue, May 15, 2012 at 4:59 PM, David Warde-Farley <wardefar@...> wrote:
>> * Shouldn't the bias vector be initialized with ones instead of
>> zeros? I guess there is no difference.
>
> If the training set is mean-centered, then absolutely, yes.
>
> Otherwise the biases in the hidden layer should be initialized to
> the mean over the training set of Wx, where W are the initial weights.
> This ensures that the activation function is near its linear regime.

Ok, the rule of thumb is that the bias should be initialized so the activation function starts in the linear regime.
>> * I am not sure why is the bias updated with:
>> bias_output += lr * np.mean(delta_o, axis=0)
>> shouldn't it be:
>> bias_output += lr / batch_size * np.mean(delta_o, axis=0)?
>
> As Andy said, the former allows you to set the learning rate without taking
> into account the batch size, which makes things a little simpler.

I see, it's pretty obvious when I look at it now.

>> * Shouldn't the backward step for computing delta_h be:
>> delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden)
>> where hidden.doutput is a derivation of the activation function for
>> hidden layer?
>
> Offhand that sounds right. You can use Theano as a sanity check for your
> implementation.

Thank you David and Andreas for answering my questions. I will look at Theano.

David
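The batch-size point discussed above fits in one line of numpy; a throwaway check (hypothetical values) that the mean update already carries the 1/batch_size factor, so the extra division would divide twice:

```python
import numpy as np

rng = np.random.RandomState(0)
batch_size = 32
delta_o = rng.randn(batch_size, 10)   # per-sample output-layer deltas
lr = 0.1

# np.mean already divides the summed gradient by batch_size ...
mean_update = lr * np.mean(delta_o, axis=0)
sum_update = lr / batch_size * np.sum(delta_o, axis=0)
print(np.allclose(mean_update, sum_update))  # True

# ... so lr / batch_size * np.mean(...) would apply 1/batch_size twice.
```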
From: Andreas Mueller <amueller@ai...>  2012-05-16 10:22:29

Hi David. Did you also see this mail: http://permalink.gmane.org/gmane.comp.python.scikitlearn/3071 For some reason it doesn't show up in my inbox and you didn't quote it. So just making sure.

Cheers,
Andy

> Thank you David and Andreas for answering my questions. I will look at Theano.
From: David Marek <h4wk.cz@gm...>  2012-05-16 10:29:41

Hi,

Yes, I did. I am using Gmail so I just quote one mail; I didn't want to answer each mail separately when they are so similar. Sorry, I will try to be more specific in quoting.

David

On 16. 5. 2012, at 12:22, Andreas Mueller <amueller@...> wrote:
> Hi David.
> Did you also see this mail:
> http://permalink.gmane.org/gmane.comp.python.scikitlearn/3071
> For some reason it doesn't show up in my inbox and you didn't quote it.
> So just making sure.
>
> Cheers,
> Andy
>
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@...
> https://lists.sourceforge.net/lists/listinfo/scikitlearngeneral
From: Justin Bayer <bayer.justin@go...>  2012-05-17 08:56:56

>>> * Shouldn't the backward step for computing delta_h be:
>>> delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden)
>>> where hidden.doutput is a derivation of the activation function for
>>> hidden layer?
>>
>> Offhand that sounds right. You can use Theano as a sanity check for your
>> implementation.
>
> Thank you David and Andreas for answering my questions. I will look at Theano.

Alternatively, you can just check it numerically. Scipy already comes with an implementation [1] for scalar-to-scalar mappings, which you can use with a double for loop for vector-to-vector functions. It is much more straightforward to add this to unit tests than Theano (obviously, because of no additional dependency) and less hassle than writing out the derivatives by hand.

[1] http://docs.scipy.org/doc/scipy/reference/generated/scipy.misc.derivative.html

> David

--
Dipl. Inf. Justin Bayer
Lehrstuhl für Robotik und Echtzeitsysteme, Technische Universität München
http://www6.in.tum.de/Main/Bayerj
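The double-for-loop idea can be sketched without the scipy helper; a minimal, hypothetical example (plain numpy, not from the GSoC branch) that checks a tanh-hidden-layer backprop gradient against central differences:

```python
import numpy as np

rng = np.random.RandomState(42)

# Tiny network: 4 inputs -> 3 tanh hidden units, squared-error loss
x = rng.randn(4)
t = rng.randn(3)                      # target
W = rng.randn(4, 3) * 0.1             # input-to-hidden weights

def loss(W):
    h = np.tanh(np.dot(x, W))
    return 0.5 * np.sum((h - t) ** 2)

# Analytic gradient via backprop
h = np.tanh(np.dot(x, W))
delta_h = (h - t) * (1.0 - h ** 2)    # tanh' = 1 - tanh^2
grad_analytic = np.outer(x, delta_h)

# Numerical gradient with a double for loop (central differences)
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grad_numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```

Putting a check like this in a unit test needs no extra dependency, which is exactly the advantage over verifying against Theano.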
From: Mathieu Blondel <mathieu@mb...>  2012-05-15 15:16:28

On Tue, May 15, 2012 at 11:59 PM, David Warde-Farley <wardefar@...> wrote:
> I haven't had a look at these classes myself but I think working with raw
> NumPy arrays is a better idea in terms of efficiency.

Since it abstracts away the data representation, SequentialDataset is useful if you want to support both dense and sparse representations in your MLP implementation.

Mathieu
From: David Warde-Farley <wardefar@ir...>  2012-05-15 15:44:22

On Wed, May 16, 2012 at 12:16:21AM +0900, Mathieu Blondel wrote:
> Since it abstracts away the data representation, SequentialDataset is
> useful if you want to support both dense and sparse representations in
> your MLP implementation.

Ah, ok. As long as there are sufficient ways to avoid lots of large temporaries being allocated, that seems like a good idea.

David
From: Peter Prettenhofer <peter.prettenhofer@gm...>  2012-05-16 07:14:51

2012/5/15 Mathieu Blondel <mathieu@...>:
> Since it abstracts away the data representation, SequentialDataset is useful
> if you want to support both dense and sparse representations in your MLP
> implementation.

Hi everybody,

sorry for my late reply. Mathieu is correct: the only purpose of SequentialDataset is to create a common interface to both dense and sparse representations. It is pretty much tailored to the needs of the SGD module (as Andy already pointed out). I think if you want to support both dense and sparse data you'll have to think about such an abstraction eventually. Maybe it's a good idea to start with a dense implementation and then we could try to refactor it to support both dense and sparse inputs using a suitable abstraction.

best,
Peter
From: Andreas Mueller <amueller@ai...>  2012-05-15 19:24:17

On 05/15/2012 05:16 PM, Mathieu Blondel wrote:
> Since it abstracts away the data representation, SequentialDataset is
> useful if you want to support both dense and sparse representations in
> your MLP implementation.

I am not sure if we want to support sparse data. I have no experience with using MLPs on sparse data. Could this be done efficiently? The weight vector would need to be represented explicitly and densely, I guess.

Any ideas?
From: David Warde-Farley <wardefar@ir...>  2012-05-15 20:06:13

On 2012-05-15, at 3:23 PM, Andreas Mueller <amueller@...> wrote:
> I am not sure if we want to support sparse data. I have no experience with
> using MLPs on sparse data. Could this be done efficiently? The weight
> vector would need to be represented explicitly and densely, I guess.
>
> Any ideas?

People can and do use neural nets with sparse inputs; dense-sparse products aren't usually too bad in my experience. Careful regularization and/or lots of data (a decent number of examples where each feature is nonzero) will be necessary to get good results, but this goes for basically any parametric model operating on sparse inputs.

Aside: there was interesting work on autoencoder-based pretraining of MLPs with sparse (binary, I think) inputs done by my colleagues here in Montreal. They showed that in the reconstruction step, you can get away with reconstructing the nonzero entries in the original input and a small random sample of the zero entries, and it works just as well as doing the (much more expensive, when the input is high-dimensional) exhaustive reconstruction. Neat stuff.

David
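The dense-sparse products mentioned here are directly supported by scipy; a small sketch (hypothetical shapes, not from the branch under discussion) of a hidden-layer forward pass for a sparse minibatch against dense weights:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)

# Sparse minibatch: 8 samples, 1000 features, ~1% nonzero
X = sp.random(8, 1000, density=0.01, format='csr', random_state=rng)

# Dense hidden-layer weights and bias
W = rng.randn(1000, 50) * 0.01
b = np.zeros(50)

# Sparse-times-dense gives a dense result; only the nonzero entries of X
# contribute, so the cost scales with nnz rather than with n_features.
hidden = np.tanh(X.dot(W) + b)
print(hidden.shape)  # (8, 50)
```

The weight matrix stays dense, as Andy guessed; only the input side benefits from sparsity.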
From: Andreas Mueller <amueller@ai...>  2012-05-15 20:31:29

On 05/15/2012 10:06 PM, David Warde-Farley wrote:
> People can and do use neural nets with sparse inputs, dense-sparse
> products aren't usually too bad in my experience. Careful regularization
> and/or lots of data (a decent number of examples where each feature is
> nonzero) will be necessary to get good results, but this goes for
> basically any parametric model operating on sparse inputs.

Looking at the SequentialDataset implementation and the algorithms again, I tend to agree with David (M.), in that using numpy arrays might be better. If we want to support a sparse version, we'd need another implementation (of the low-level functions).

The SequentialDataset was made for vector x vector operations. Depending on whether we do minibatch or online learning, the MLP needs vector x matrix or matrix x matrix operations. In particular matrix x matrix is probably not feasible with the SequentialDataset, though I think even vector x matrix might be ugly and possibly slow, though I'm not sure there.

What do you think Mathieu (and the others)?

On the same topic: I'm not sure if we decided whether we want minibatch, batch and online learning. I have the feeling that it might be possible to do particular optimizations for online learning, and this is the algorithm that I favor the most.

Comments? David M., what do you think?

Btw, two comments on your current code: I think this looks pretty good already. At the moment, the tests are failing, though. Also, I feel like using squared error for classification is a very bad habit that for some reason survived the last 20 years in some dark corner. Did you compare timings and results against my implementation?
Once you are pretty sure that the code is correct, you should disable the boundscheck in Cython, as this can improve speed a lot :)

Cheers,
Andy
From: David Marek <h4wk.cz@gm...>  2012-05-16 11:21:12

On Tue, May 15, 2012 at 10:31 PM, Andreas Mueller <amueller@...> wrote:
> On the same topic: I'm not sure if we decided whether we want minibatch,
> batch and online learning. I have the feeling that it might be possible
> to do particular optimizations for online learning, and this is the
> algorithm that I favor the most.
>
> Comments? David M., what do you think?

Well, I am not sure what optimizations could be done for online learning yet. At first I thought it would be possible to use SequentialDataset for online learning, but now I don't think it's a good idea to reimplement matrix operations that will be needed, when we have numpy. If we find optimizations that would make online learning faster than the other options, then I'd vote for it. But so far I think the batch_size argument is ok.

> Btw, two comments on your current code: I think this looks pretty good
> already. Atm, the tests are failing, though. Also, I feel like using
> squared error for classification is a very bad habit that for some
> reason survived the last 20 years in some dark corner.

Well, the first test should not fail, it's just XOR; the second one is recognizing handwritten numbers and I don't expect it to be 100% successful, I am just using it as a simple benchmark. Thank you for pointing out what I thought about my Neural Networks course in university, they teach 20-year-old things :D

> Did you compare timings and results against my implementation?
> Once you are pretty sure that the code is correct, you should disable
> the boundscheck in cython, as this can improve speed a lot :)

I haven't yet, will look at it. I have seen boundscheck and other options used in sgd_fast, will have to try them.

Thanks

David
From: Mathieu Blondel <mathieu@mb...>  2012-05-16 04:11:48

On Wed, May 16, 2012 at 5:31 AM, Andreas Mueller <amueller@...> wrote:
> The SequentialDataset was made for vector x vector operations. Depending
> on whether we do minibatch or online learning, the MLP needs vector x
> matrix or matrix x matrix operations. In particular matrix x matrix is
> probably not feasible with the SequentialDataset, though I think even
> vector x matrix might be ugly and possibly slow, though I'm not sure there.
>
> What do you think Mathieu (and the others)?

I think that it is worth investigating the separation between the core algorithm logic and the data-representation-dependent parts. SGD used to be implemented separately for dense and sparse inputs but the rewrite based on SequentialDataset significantly simplified the source code (but Peter is the best person to comment on this). David could start by getting the numpy-array-based implementation right, then, before implementing the sparse version, investigate how to abstract away the data-representation-dependent parts, either by using/extending SequentialDataset/WeightVector or by creating his own utility classes.

Mathieu

PS: When it makes sense, it would be nice if we could strive to add sparse matrix support whenever we add a new estimator.
From: Peter Prettenhofer <peter.prettenhofer@gm...>  2012-05-16 07:23:37

2012/5/16 Mathieu Blondel <mathieu@...>:
> David could start by getting the numpy array based implementation right,
> then before implementing the sparse version, investigate how to abstract
> away the data representation dependent parts either by using/extending
> SequentialDataset/WeightVector or by creating his own utility classes.
>
> PS: When it makes sense, it would be nice if we could strive to add sparse
> matrix support whenever we add a new estimator.

I totally agree.

Peter Prettenhofer
From: Andreas Mueller <amueller@ai...>  2012-05-16 10:31:39

On 16.05.2012 12:29, David Marek wrote:
> Yes, I did. I am using gmail so I just quote one mail, didn't want to
> answer each mail separately when they are so similar. Sorry, I will
> try to be more specific in quoting.

Never mind, probably my mail program just acted up. Btw, I am not sure Theano is the best way to compute derivatives ;)
From: David Warde-Farley <wardefar@ir...>  2012-05-16 18:27:54

On 2012-05-16, at 6:31 AM, Andreas Mueller <amueller@...> wrote:
> Btw, I am not sure Theano is the best way to compute derivatives ;)

No? I would agree in the general case. However, in the case of MLPs and backprop, it's a use case for which Theano has been designed and heavily optimized. With it, it's very easy and quick to produce a correct MLP implementation (the deep learning tutorials contain one). It's *not* the best way to obtain a readable mathematical expression for the gradients, but it'll allow you to compute them easily/correctly, which makes it a useful thing to verify against. I've done this a fair bit myself. I've never had so much success with symbolic tools like Wolfram Alpha in situations involving lots of sums over indexed scalar quantities and whatnot, but perhaps I didn't try hard enough.

Once the initial version is working, Theano will serve another purpose: as a speed benchmark to try and beat (or at least not be too far behind). :)

David
From: Andreas Mueller <amueller@ai...>  2012-05-31 19:02:30

Hey David. How is it going? I haven't heard from you in a while. Did you blog anything about your progress?

Cheers,
Andy

On 16.05.2012 12:15, David Marek wrote:
> Thank you David and Andreas for answering my questions. I will look at Theano.
>
> David
From: David Marek <h4wk.cz@gm...>  2012-06-01 00:49:45

Hi, I don't have much time these days because I have got exams in school. I am sorry I haven't informed you. I have implemented a multi class cross entropy and soft max function and turned off some of the cython checks, the result is that the cython implementation is only slightly better, I guess that's because I am using objects as an output functions, I will have to benchmark them to know more. The next step is to test that the gradient descent is working correctly. I am a little unsure how to approach this. One thing I will do is to compute one step of backpropagation by hand and check that the implementation is doing the same. Another thing I will try to do is to compute the gradients numerically, I am not exactly sure if its enough to use the derivative from scipy and apply it on the forward step. David Dne 31.5.2012 21:02 "Andreas Mueller" <amueller@...> napsal(a): > Hey David. > How is it going? > I haven't heard from you in a while. > Did you blog anything about your progress? > > Cheers, > Andy > > Am 16.05.2012 12:15, schrieb David Marek: > > On Tue, May 15, 2012 at 4:59 PM, David WardeFarley > > <wardefar@...> wrote: > >> On Tue, May 15, 2012 at 12:12:34AM +0200, David Marek wrote: > >>> Hi, > >>> > >>> I have worked on multilayer perceptron and I've got a basic > >>> implementation working. You can see it at > >>> https://github.com/davidmarek/scikitlearn/tree/gsoc_mlp The most > >>> important part is the sgd implementation, which can be found here > >>> > https://github.com/davidmarek/scikitlearn/blob/gsoc_mlp/sklearn/mlp/mlp_fast.pyx > >>> > >>> I have encountered a few problems and I would like to know your > opinion. > >>> > >>> 1) There are classes like SequentialDataset and WeightVector which are > >>> used in sgd for linear_model, but I am not sure if I should use them > >>> here as well. I have to do more with samples and weights than just > >>> multiply and add them together. 
I wouldn't be able to use numpy > >>> functions like tanh and do batch updates, would I? What do you think? > >> I haven't had a look at these classes myself but I think working with > raw > >> NumPy arrays is a better idea in terms of efficiency. > >> > >>> Am I missing something that would help me do everything I need with > >>> SequentialDataset? I implemented my own LossFunction because I need a > >>> vectorized version, I think that is the same problem. > >>> > >>> 2) I used Andreas' implementation as an inspiration and I am not sure > >>> I understand some parts of it: > >>> * Shouldn't the bias vector be initialized with ones instead of > >>> zeros? I guess there is no difference. > >> If the training set is meancentered, then absolutely, yes. > >> > >> Otherwise the biases should in the hidden layer should be initialized to > >> the mean over the training set of Wx, where W are the initial weights. > >> This ensures that the activation function is near its linear regime. > > Ok, the rule of thumb is that the bias should be initialized so the > > activation function starts in linear regime. > > > >>> * I am not sure why is the bias updated with: > >>> bias_output += lr * np.mean(delta_o, axis=0) > >>> shouldn't it be: > >>> bias_output += lr / batch_size * np.mean(delta_o, axis=0)? > >> As Andy said, the former allows you to set the learning rate without > taking > >> into account the batch size, which makes things a little simpler. > > I see, it's pretty obvious when I look at it now. > > > >>> * Shouldn't the backward step for computing delta_h be: > >>> delta_h[:] = np.dot(delta_o, weights_output.T) * > hidden.doutput(x_hidden) > >>> where hidden.doutput is a derivation of the activation function for > >>> hidden layer? > >> Offhand that sounds right. You can use Theano as a sanity check for your > >> implementation. > > Thank you David and Andreas for answering my questions. I will look at > Theano. 
> > David
> >
> > _______________________________________________
> > Scikit-learn-general mailing list
> > Scikitlearngeneral@...
> > https://lists.sourceforge.net/lists/listinfo/scikitlearngeneral
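The finite-difference check David proposes above can be sketched as follows, for a toy tanh-hidden-layer net with squared-error loss. Every name here is hypothetical (this is not code from the GSoC branch): each analytic gradient from backpropagation is compared entry-by-entry against a central difference of the loss.

```python
import numpy as np

def forward(X, y, W_h, b_h, W_o, b_o):
    """Tanh hidden layer, linear output, mean squared error loss."""
    x_hidden = np.tanh(np.dot(X, W_h) + b_h)
    output = np.dot(x_hidden, W_o) + b_o
    loss = 0.5 * np.mean(np.sum((output - y) ** 2, axis=1))
    return loss, x_hidden, output

def backprop(X, y, W_h, b_h, W_o, b_o):
    """Analytic gradients of the loss above via backpropagation."""
    n = X.shape[0]
    _, x_hidden, output = forward(X, y, W_h, b_h, W_o, b_o)
    delta_o = (output - y) / n  # dL/d(output)
    # Backward step discussed in the thread: chain through the output
    # weights, then multiply by the derivative of tanh (1 - tanh^2).
    delta_h = np.dot(delta_o, W_o.T) * (1.0 - x_hidden ** 2)
    return (np.dot(X.T, delta_h), delta_h.sum(axis=0),
            np.dot(x_hidden.T, delta_o), delta_o.sum(axis=0))

def numerical_grad(loss_fn, param, eps=1e-6):
    """Central finite difference (f(w+eps) - f(w-eps)) / (2 eps), entry-wise."""
    grad = np.zeros_like(param)
    it = np.nditer(param, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        orig = param[i]
        param[i] = orig + eps
        f_plus = loss_fn()
        param[i] = orig - eps
        f_minus = loss_fn()
        param[i] = orig
        grad[i] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad

rng = np.random.RandomState(0)
X, y = rng.randn(5, 3), rng.randn(5, 2)
W_h, b_h = 0.1 * rng.randn(3, 4), np.zeros(4)
W_o, b_o = 0.1 * rng.randn(4, 2), np.zeros(2)

loss_fn = lambda: forward(X, y, W_h, b_h, W_o, b_o)[0]
analytic = backprop(X, y, W_h, b_h, W_o, b_o)
for exact, param in zip(analytic, (W_h, b_h, W_o, b_o)):
    assert np.max(np.abs(exact - numerical_grad(loss_fn, param))) < 1e-7
```

Central differences are preferable to one-sided ones here: the truncation error is O(eps^2), so disagreement with backprop beyond roundoff noise almost always means a real bug.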
From: Gael Varoquaux <gael.varoquaux@no...>  2012-06-01 05:27:41

On Fri, Jun 01, 2012 at 02:49:35AM +0200, David Marek wrote:
> I don't have much time these days because I have exams in school.

Good luck!

> I have implemented multi-class cross-entropy and a softmax function and
> turned off some of the cython checks; the result is that the cython
> implementation is only slightly better. I guess that's because I am using
> objects as output functions; I will have to benchmark them to know more.

Do you have any code on github that you can show us? I am not trying to
micromanage you, but rather to see if we can help by giving ideas on seeing
the code.

G
From: Andreas Mueller <amueller@ai...>  2012-06-18 06:44:34

Hey David.

Olivier dug up this paper by LeCun's group:
http://users.ics.aalto.fi/kcho/papers/icml11.pdf
I think this might be quite interesting for the MLP. It is probably also
interesting for the linear SGD. I'm surprised that they didn't compare
against diagonal stochastic Levenberg-Marquardt with a constant learning
rate...

Cheers,
Andy
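The bias update and backward step debated earlier in the thread can be put together in one minibatch SGD step. This is a minimal sketch under assumed choices (tanh hidden layer, linear output, squared error), not the code from Andreas' or David's implementations; averaging the deltas over the batch is exactly what makes an extra `1 / batch_size` factor next to `lr` unnecessary.

```python
import numpy as np

def minibatch_step(X, y, weights_hidden, bias_hidden,
                   weights_output, bias_output, lr=0.1):
    """One SGD step on a tanh-hidden / linear-output net with squared error.

    All updates average per-sample contributions over the batch, so the
    effective step size does not depend on the batch size.
    """
    batch_size = X.shape[0]
    x_hidden = np.tanh(np.dot(X, weights_hidden) + bias_hidden)
    output = np.dot(x_hidden, weights_output) + bias_output

    delta_o = y - output  # negative loss gradient w.r.t. the output
    # Backward step from the thread: propagate through the output weights,
    # then multiply by the derivative of tanh (1 - tanh^2).
    delta_h = np.dot(delta_o, weights_output.T) * (1.0 - x_hidden ** 2)

    weights_output += lr * np.dot(x_hidden.T, delta_o) / batch_size
    bias_output += lr * np.mean(delta_o, axis=0)
    weights_hidden += lr * np.dot(X.T, delta_h) / batch_size
    bias_hidden += lr * np.mean(delta_h, axis=0)

# Toy usage: a few full-batch steps on random data (in-place updates).
rng = np.random.RandomState(0)
X, y = rng.randn(20, 3), rng.randn(20, 2)
W_h, b_h = 0.1 * rng.randn(3, 5), np.zeros(5)
W_o, b_o = 0.1 * rng.randn(5, 2), np.zeros(2)
for _ in range(100):
    minibatch_step(X, y, W_h, b_h, W_o, b_o, lr=0.1)
```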
From: Olivier Grisel <olivier.grisel@en...>  2012-06-18 09:44:23

2012/6/18 Andreas Mueller <amueller@...>:
> Hey David.
> Olivier dug up this paper by LeCun's group:
> http://users.ics.aalto.fi/kcho/papers/icml11.pdf
> I think this might be quite interesting for the MLP.

Err, no. The paper I mentioned is even newer:

http://arxiv.org/abs/1206.1106

Just to make the paper content more explicit: the title is "No More Pesky
Learning Rates", and it is a method for estimating the optimal learning
rate schedule online from the data, while learning a model using SGD with
a smooth loss (convex or not). It's a preprint for NIPS 2012. It looks
very promising. It would be great to try to reproduce some of their
empirical results.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
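For what it's worth, a rough sketch of the paper's per-parameter update as I read it (a paraphrase of arXiv:1206.1106, not the authors' code, and all names are hypothetical): keep running averages of the gradient, its square, and the absolute diagonal Hessian, and set the rate to `g_bar**2 / (h_bar * v_bar)`.

```python
import numpy as np

def vsgd_step(theta, grad, hess_diag, state, eps=1e-12):
    """One adaptive-rate SGD step in the spirit of "No More Pesky Learning
    Rates" (my paraphrase).

    state = (g_bar, v_bar, h_bar, tau): moving averages of the gradient,
    the squared gradient, the |diagonal Hessian|, and a per-parameter
    memory size tau controlling the averaging horizon.
    """
    g_bar, v_bar, h_bar, tau = state
    g_bar = (1 - 1 / tau) * g_bar + grad / tau
    v_bar = (1 - 1 / tau) * v_bar + grad ** 2 / tau
    h_bar = (1 - 1 / tau) * h_bar + np.abs(hess_diag) / tau
    # Estimated optimal per-parameter rate: squared expected gradient over
    # (curvature times expected squared gradient).
    eta = g_bar ** 2 / (h_bar * v_bar + eps)
    theta = theta - eta * grad
    # Shrink the horizon where gradients are consistent, grow it where noisy.
    tau = (1 - g_bar ** 2 / (v_bar + eps)) * tau + 1
    return theta, (g_bar, v_bar, h_bar, tau)
```

On a noise-free quadratic the rate collapses to the Newton step 1/h, which is one way to sanity-check the formula before trying it inside an MLP loop.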
From: Andreas Mueller <amueller@ai...>  2012-06-18 09:46:03

On 18.06.2012 11:43, Olivier Grisel wrote:
> Err, no. The paper I mentioned is even newer:
>
> http://arxiv.org/abs/1206.1106

Sorry, copy & paste error :/
From: David Marek <h4wk.cz@gm...>  2012-06-19 09:10:10

Hi

On Mon, Jun 18, 2012 at 11:43 AM, Olivier Grisel
<olivier.grisel@...> wrote:
> Err, no. The paper I mentioned is even newer:
>
> http://arxiv.org/abs/1206.1106

Thanks, I will look at it.

David