scikit-learn-general

[Scikit-learn-general] multilayer perceptron questions From: David Marek - 2012-05-14 22:13:01
```
Hi,

I have worked on a multilayer perceptron and I've got a basic implementation working. You can see it at https://github.com/davidmarek/scikit-learn/tree/gsoc_mlp The most important part is the SGD implementation, which can be found here: https://github.com/davidmarek/scikit-learn/blob/gsoc_mlp/sklearn/mlp/mlp_fast.pyx

I have encountered a few problems and I would like to know your opinion.

1) There are classes like SequentialDataset and WeightVector which are used in the SGD code for linear_model, but I am not sure if I should use them here as well. I have to do more with samples and weights than just multiply and add them together. I wouldn't be able to use numpy functions like tanh and do batch updates, would I? What do you think? Am I missing something that would help me do everything I need with SequentialDataset? I implemented my own LossFunction because I need a vectorized version; I think that is the same problem.

2) I used Andreas' implementation as an inspiration and I am not sure I understand some parts of it:
 * Shouldn't the bias vector be initialized with ones instead of zeros? I guess there is no difference.
 * I am not sure why the bias is updated with:
     bias_output += lr * np.mean(delta_o, axis=0)
   Shouldn't it be:
     bias_output += lr / batch_size * np.mean(delta_o, axis=0)?
 * Shouldn't the backward step for computing delta_h be:
     delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden)
   where hidden.doutput is the derivative of the activation function for the hidden layer?

I hope my questions are not too stupid. Thank you.

David
```
Re: [Scikit-learn-general] multilayer perceptron questions From: Andreas Mueller - 2012-05-15 07:37:48
```
Hi David.
I'll have a look at your code later today. Let me first answer your questions about my code.

On 05/15/2012 12:12 AM, David Marek wrote:
> 2) I used Andreas' implementation as an inspiration and I am not sure
> I understand some parts of it:
> * Shouldn't the bias vector be initialized with ones instead of
> zeros? I guess there is no difference.

I always initialize it with zeros. If you initialize it with ones, you might get out of the linear part of the nonlinearity. At the beginning, you definitely want to stay close to the linear part to have meaningful derivatives. What would be the reason to initialize with ones? Btw, there is a paper by Bengio's group on how to initialize the weights in a "good" way. You should have a look at that, but I don't have the reference at the moment.

> * I am not sure why the bias is updated with:
>   bias_output += lr * np.mean(delta_o, axis=0)
>   shouldn't it be:
>   bias_output += lr / batch_size * np.mean(delta_o, axis=0)?

By taking the mean, the batch_size doesn't have an influence on the size of the gradient, if I'm not mistaken.

> * Shouldn't the backward step for computing delta_h be:
>   delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden)
>   where hidden.doutput is the derivative of the activation function for
>   the hidden layer?

Yes, it should be. For softmax and maximum-entropy loss, loads of stuff gets canceled and the derivative w.r.t. the output is linear. Try Wolfram Alpha if you don't believe me ;) I haven't really found a place with a good derivation for this. It is not very obvious to me.

> I hope my questions are not too stupid. Thank you.

Not at all.

Cheers,
Andy
```
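To make Andy's point about the mean-based update concrete, here is a small NumPy check. The variable names mirror the snippet under discussion; the shapes and values are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
lr = 0.1

# delta_o holds per-sample output deltas for a mini-batch
# (hypothetical shape: batch_size x n_outputs).
delta_small = rng.normal(size=(10, 3))
delta_large = np.tile(delta_small, (5, 1))   # same samples, 5x the batch size

# Mean-based update: the gradient magnitude does not scale with batch size,
# so the learning rate can be chosen independently of it.
upd_small = lr * np.mean(delta_small, axis=0)
upd_large = lr * np.mean(delta_large, axis=0)
print(np.allclose(upd_small, upd_large))     # True: the batch size cancels out
```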
Re: [Scikit-learn-general] multilayer perceptron questions From: David Warde-Farley - 2012-05-15 14:58:58
```
On Tue, May 15, 2012 at 12:12:34AM +0200, David Marek wrote:
> 1) There are classes like SequentialDataset and WeightVector which are
> used in sgd for linear_model, but I am not sure if I should use them
> here as well. I have to do more with samples and weights than just
> multiply and add them together. I wouldn't be able to use numpy
> functions like tanh and do batch updates, would I? What do you think?

I haven't had a look at these classes myself, but I think working with raw NumPy arrays is a better idea in terms of efficiency.

> 2) I used Andreas' implementation as an inspiration and I am not sure
> I understand some parts of it:
> * Shouldn't the bias vector be initialized with ones instead of
> zeros? I guess there is no difference.

If the training set is mean-centered, then absolutely, yes.

Otherwise, the biases in the hidden layer should be initialized to the mean over the training set of -Wx, where W are the initial weights. This ensures that the activation function is near its linear regime.

> * I am not sure why the bias is updated with:
>   bias_output += lr * np.mean(delta_o, axis=0)
>   shouldn't it be:
>   bias_output += lr / batch_size * np.mean(delta_o, axis=0)?

As Andy said, the former allows you to set the learning rate without taking the batch size into account, which makes things a little simpler.

> * Shouldn't the backward step for computing delta_h be:
>   delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden)
>   where hidden.doutput is the derivative of the activation function for
>   the hidden layer?

Offhand that sounds right. You can use Theano as a sanity check for your implementation.

David
```
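The backward step being debated can be sketched in plain NumPy, assuming tanh hidden units and that `x_hidden` holds the hidden layer's pre-activations. All names and shapes are illustrative, not taken from David's code:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_hidden, n_out = 4, 5, 3

x_hidden = rng.normal(size=(batch, n_hidden))        # hidden pre-activations
weights_output = rng.normal(size=(n_hidden, n_out))  # hidden-to-output weights
delta_o = rng.normal(size=(batch, n_out))            # output-layer deltas

def dtanh(x):
    """Derivative of tanh, evaluated at the pre-activation x."""
    return 1.0 - np.tanh(x) ** 2

# Backward step as proposed in the thread: propagate the output deltas
# through the output weights, then gate by the activation derivative.
delta_h = np.dot(delta_o, weights_output.T) * dtanh(x_hidden)
print(delta_h.shape)  # (4, 5)
```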
Re: [Scikit-learn-general] multilayer perceptron questions From: David Marek - 2012-05-16 10:16:27
```
On Tue, May 15, 2012 at 4:59 PM, David Warde-Farley wrote:
>> * Shouldn't the bias vector be initialized with ones instead of
>> zeros? I guess there is no difference.
>
> If the training set is mean-centered, then absolutely, yes.
>
> Otherwise, the biases in the hidden layer should be initialized to
> the mean over the training set of -Wx, where W are the initial weights.
> This ensures that the activation function is near its linear regime.

Ok, so the rule of thumb is that the bias should be initialized so that the activation function starts in its linear regime.

>> * I am not sure why the bias is updated with:
>>   bias_output += lr * np.mean(delta_o, axis=0)
>>   shouldn't it be:
>>   bias_output += lr / batch_size * np.mean(delta_o, axis=0)?
>
> As Andy said, the former allows you to set the learning rate without taking
> the batch size into account, which makes things a little simpler.

I see, it's pretty obvious when I look at it now.

>> * Shouldn't the backward step for computing delta_h be:
>>   delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden)
>>   where hidden.doutput is the derivative of the activation function for
>>   the hidden layer?
>
> Offhand that sounds right. You can use Theano as a sanity check for your
> implementation.

Thank you David and Andreas for answering my questions. I will look at Theano.

David
```
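David Warde-Farley's initialization rule (hidden biases set to the training-set mean of -Wx) can be sketched as follows; the data and weight shapes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, size=(100, 8))         # deliberately NOT mean-centered
W = rng.normal(scale=0.1, size=(8, 5))         # initial input-to-hidden weights

# Initialize the hidden biases to the mean over the training set of -Wx,
# so the average pre-activation is zero and tanh starts in its linear regime.
b = -np.mean(X @ W, axis=0)

pre_act = X @ W + b
print(np.allclose(pre_act.mean(axis=0), 0.0))  # True: centered pre-activations
```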
 Re: [Scikit-learn-general] multilayer perceptron questions From: Andreas Mueller - 2012-05-16 10:22:29 ```Hi David. Did you also see this mail: http://permalink.gmane.org/gmane.comp.python.scikit-learn/3071 For some reason it doesn't show up in my inbox and you didn't quote it. So just making sure. Cheers, Andy > Thank you David and Andreas for answering my questions. I will look at Theano. ```
Re: [Scikit-learn-general] multilayer perceptron questions From: David Marek - 2012-05-16 10:29:41
```
Hi

Yes, I did. I am using gmail, so I just quote one mail; I didn't want to answer each mail separately when they are so similar. Sorry, I will try to be more specific in quoting.

David

On 16. 5. 2012 at 12:22, Andreas Mueller wrote:
> Hi David.
> Did you also see this mail:
> http://permalink.gmane.org/gmane.comp.python.scikit-learn/3071
> For some reason it doesn't show up in my inbox and you didn't quote it.
> So just making sure.
>
> Cheers,
> Andy
>> Thank you David and Andreas for answering my questions. I will look at Theano.
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@...
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
```
Re: [Scikit-learn-general] multilayer perceptron questions From: Justin Bayer - 2012-05-17 08:56:56
```
>>> * Shouldn't the backward step for computing delta_h be:
>>>   delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden)
>>>   where hidden.doutput is the derivative of the activation function for
>>>   the hidden layer?
>>
>> Offhand that sounds right. You can use Theano as a sanity check for your
>> implementation.
>
> Thank you David and Andreas for answering my questions. I will look at Theano.

Alternatively, you can just check it numerically. SciPy already comes with an implementation [1] for scalar-to-scalar mappings, which you can use with a double for loop for vector-to-vector functions. It is much more straightforward to add this to unit tests than Theano (obviously, because there is no additional dependency) and less hassle than writing out the derivatives by hand.

[1] http://docs.scipy.org/doc/scipy/reference/generated/scipy.misc.derivative.html

--
Dipl. Inf. Justin Bayer
Lehrstuhl für Robotik und Echtzeitsysteme, Technische Universität München
http://www6.in.tum.de/Main/Bayerj
```
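The numerical check Justin describes amounts to a central-difference gradient, which needs no extra dependency at all. A dependency-free sketch (the helper name `num_grad` and the quadratic test function are hypothetical examples, not from any of the implementations discussed):

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar-valued f at the vector x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# Sanity-check an analytic gradient, e.g. for f(w) = 0.5 * ||w||^2,
# whose gradient is simply w.
f = lambda w: 0.5 * np.dot(w, w)
w = np.array([1.0, -2.0, 3.0])
analytic = w
print(np.allclose(num_grad(f, w), analytic))  # True
```

The same loop, applied to each weight of an MLP with the loss as `f`, is the standard way to unit-test a backprop implementation.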
 Re: [Scikit-learn-general] multilayer perceptron questions From: Mathieu Blondel - 2012-05-15 15:16:28 Attachments: Message as HTML ```On Tue, May 15, 2012 at 11:59 PM, David Warde-Farley < wardefar@...> wrote: > > I haven't had a look at these classes myself but I think working with raw > NumPy arrays is a better idea in terms of efficiency. > Since it abstracts away the data representation, SequentialDataset is useful if you want to support both dense and sparse representations in your MLP implementation. Mathieu ```
 Re: [Scikit-learn-general] multilayer perceptron questions From: David Warde-Farley - 2012-05-15 15:44:22 ```On Wed, May 16, 2012 at 12:16:21AM +0900, Mathieu Blondel wrote: > On Tue, May 15, 2012 at 11:59 PM, David Warde-Farley < > wardefar@...> wrote: > > > > > I haven't had a look at these classes myself but I think working with raw > > NumPy arrays is a better idea in terms of efficiency. > > > > Since it abstracts away the data representation, SequentialDataset is > useful if you want to support both dense and sparse representations in your > MLP implementation. Ah, ok. As long as there are sufficient ways to avoid lots of large temporaries being allocated, that seems like a good idea. David ```
Re: [Scikit-learn-general] multilayer perceptron questions From: Peter Prettenhofer - 2012-05-16 07:14:51
```
2012/5/15 Mathieu Blondel:
> Since it abstracts away the data representation, SequentialDataset is useful
> if you want to support both dense and sparse representations in your MLP
> implementation.

Hi everybody,

sorry for my late reply - Mathieu is correct: the only purpose of SequentialDataset is to create a common interface to both dense and sparse representations. It is pretty much tailored to the needs of the SGD module (as Andy already pointed out). I think if you want to support both dense and sparse data, you'll have to think about such an abstraction eventually. Maybe it's a good idea to start with a dense implementation, and then we could try to refactor it to support both dense and sparse inputs using a suitable abstraction.

best,
Peter

--
Peter Prettenhofer
```
Re: [Scikit-learn-general] multilayer perceptron questions From: Andreas Mueller - 2012-05-15 19:24:17 Attachments: Message as HTML
```
On 05/15/2012 05:16 PM, Mathieu Blondel wrote:
> Since it abstracts away the data representation, SequentialDataset is
> useful if you want to support both dense and sparse representations in
> your MLP implementation.

I am not sure if we want to support sparse data. I have no experience with using MLPs on sparse data. Could this be done efficiently? The weight vector would need to be represented explicitly and densely, I guess.

Any ideas?
```
 Re: [Scikit-learn-general] multilayer perceptron questions From: David Warde-Farley - 2012-05-15 20:06:13 ```On 2012-05-15, at 3:23 PM, Andreas Mueller wrote: > I am not sure if we want to support sparse data. I have no experience with using MLPs on sparse data. > Could this be done efficiently? The weight vector would need to be represented explicitly and densely, I guess. > > Any ideas? People can and do use neural nets with sparse inputs, dense-sparse products aren't usually too bad in my experience. Careful regularization and/or lots of data (a decent number of examples where each feature is non-zero) will be necessary to get good results, but this goes for basically any parametric model operating on sparse inputs. Aside: there was interesting work on autoencoder-based pre-training of MLPs with sparse (binary, I think) inputs done by my colleagues here in Montreal. They showed that in the reconstruction step, you can get away with reconstructing the non-zero entries in the original input and a small random sample of the zero entries, and it works just as well as doing the (much more expensive, when the input is high-dimensional) exhaustive reconstruction. Neat stuff. David ```
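A quick sketch of the dense-sparse product David mentions: a sparse input batch times a dense first-layer weight matrix yields a dense activation matrix, so only the first layer needs to care about sparsity. The shapes and density below are illustrative:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# Sparse input batch (CSR) with ~1% non-zero features.
X = sp.random(32, 1000, density=0.01, format="csr", random_state=0)
W = rng.normal(scale=0.01, size=(1000, 50))  # dense first-layer weights

# Sparse @ dense gives a dense result; everything downstream of the
# first layer can then proceed exactly as in the dense implementation.
hidden = np.tanh(X @ W)
print(hidden.shape)  # (32, 50)
```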
Re: [Scikit-learn-general] multilayer perceptron questions From: Andreas Mueller - 2012-05-15 20:31:29
```
On 05/15/2012 10:06 PM, David Warde-Farley wrote:
> People can and do use neural nets with sparse inputs, dense-sparse products
> aren't usually too bad in my experience. Careful regularization and/or lots
> of data (a decent number of examples where each feature is non-zero) will be
> necessary to get good results, but this goes for basically any parametric
> model operating on sparse inputs.

Looking at the SequentialDataset implementation and the algorithms again, I tend to agree with David (M.) in that using numpy arrays might be better. If we want to support a sparse version, we'd need another implementation (of the low-level functions).

The SequentialDataset was made for vector x vector operations. Depending on whether we do mini-batch or online learning, the MLP needs vector x matrix or matrix x matrix operations. In particular, matrix x matrix is probably not feasible with the SequentialDataset, and I think even vector x matrix might be ugly and possibly slow, though I'm not sure there.

What do you think, Mathieu (and the others)?

On the same topic: I'm not sure if we decided whether we want minibatch, batch and online learning. I have the feeling that it might be possible to do particular optimizations for online learning, and this is the algorithm that I favor the most.

Comments?

David M., what do you think?

Btw, two comments on your current code: I think this looks pretty good already. At the moment, the tests are failing, though. Also, I feel like using squared error for classification is a very bad habit that for some reason has survived the last 20 years in some dark corner.

Did you compare timings and results against my implementation? Once you are pretty sure that the code is correct, you should disable the boundscheck in Cython, as this can improve speed a lot :)

Cheers,
Andy
```
Re: [Scikit-learn-general] multilayer perceptron questions From: David Marek - 2012-05-16 11:21:12
```
On Tue, May 15, 2012 at 10:31 PM, Andreas Mueller wrote:
> On the same topic: I'm not sure if we decided whether we want minibatch,
> batch and online learning. I have the feeling that it might be possible
> to do particular optimizations for online learning, and this is the
> algorithm that I favor the most.
>
> Comments?
>
> David M., what do you think?

Well, I am not sure yet what optimizations could be done for online learning. At first I thought it would be possible to use SequentialDataset for online learning, but now I don't think it's a good idea to reimplement matrix operations that will be needed, when we have numpy. If we find optimizations that would make online learning faster than the other options, then I'd vote for it. But so far I think the batch_size argument is ok.

> Btw, two comments on your current code:
> I think this looks pretty good already. Atm, the tests are failing, though.
> Also, I feel like using squared error for classification is a very bad habit
> that for some reason survived the last 20 years in some dark corner.

Well, the first test should not fail, it's just XOR; the second one is recognizing hand-written numbers and I don't expect it to be 100% successful, I am just using it as a simple benchmark. Thank you for confirming what I thought about my Neural Networks course at university; they teach 20-year-old things :-D

> Did you compare timings and results against my implementation?
> Once you are pretty sure that the code is correct, you should disable
> the boundscheck in cython, as this can improve speed a lot :)

I haven't yet, will look at it. I have seen boundscheck and other options used in sgd_fast, will have to try them.

Thanks

David
```
 Re: [Scikit-learn-general] multilayer perceptron questions From: Mathieu Blondel - 2012-05-16 04:11:48 Attachments: Message as HTML ```On Wed, May 16, 2012 at 5:31 AM, Andreas Mueller wrote: > > The SequentialDataset was made for vector x vector operations. Depending > on whether we > do mini-batch or online learning, the MLP needs vector x matrix or > matrix x matrix operations. > In particular matrix x matrix is probably not feasible with the > SequentialDataset, though I think > even vector x matrix might be ugly and possibly slow, though I'm not > sure there. > > What do you think Mathieu (and the others)? > I think that it is worth investigating the separation between the core algorithm logic and the data representation dependent parts. SGD used to be implemented separately for dense and sparse inputs but the rewrite based on SequentialDataset significantly simplified the source code (but Peter is the best person to comment on this). David could start by getting the numpy array based implementation right, then before implementing the sparse version, investigate how to abstract away the data representation dependent parts either by using/extending SequentialDataset/WeightVector or by creating his own utility classes. Mathieu PS: When it makes sense, it would be nice if we could strive to add sparse matrix support whenever we add a new estimator. ```
Re: [Scikit-learn-general] multilayer perceptron questions From: Peter Prettenhofer - 2012-05-16 07:23:37
```
2012/5/16 Mathieu Blondel:
> I think that it is worth investigating the separation between the core
> algorithm logic and the data representation dependent parts. [...]
>
> PS: When it makes sense, it would be nice if we could strive to add sparse
> matrix support whenever we add a new estimator.

I totally agree

--
Peter Prettenhofer
```
Re: [Scikit-learn-general] multilayer perceptron questions From: Andreas Mueller - 2012-05-16 10:31:39
```
On 16.05.2012 12:29, David Marek wrote:
> Hi
>
> Yes, I did. I am using gmail so I just quote one mail, didn't want to
> answer each mail separately when they are so similar. Sorry, I will
> try to be more specific in quoting.

Never mind, probably my mail program just acted up.
Btw, I am not sure theano is the best way to compute derivatives ;)
```
 Re: [Scikit-learn-general] multilayer perceptron questions From: David Warde-Farley - 2012-05-16 18:27:54 ```On 2012-05-16, at 6:31 AM, Andreas Mueller wrote: > Btw, I am not sure theano is the best way to compute derivatives ;) No? I would agree in the general case. However, in the case of MLPs and backprop, it's a use case for which Theano has been designed and heavily optimized. With it, it's very easy and quick to produce a correct MLP implementation (the deep learning tutorials contain one). It's *not* the best way to obtain a readable mathematical expression for the gradients, but it'll allow you to compute them easily/correctly, which makes it a useful thing to verify against. I've done this a fair bit myself. I've never had so much success with symbolic tools like Wolfram Alpha in situations involving lots of sums over indexed scalar quantities and whatnot, but perhaps I didn't try hard enough. Once the initial version is working, Theano will serve another purpose: as a speed benchmark to try and beat (or at least not be too far behind). :) David ```
Re: [Scikit-learn-general] multilayer perceptron questions From: Andreas Mueller - 2012-05-31 19:02:30
```
Hey David.
How is it going? I haven't heard from you in a while.
Did you blog anything about your progress?

Cheers,
Andy

On 16.05.2012 12:15, David Marek wrote:
> Thank you David and Andreas for answering my questions. I will look at Theano.
>
> David
```
 Re: [Scikit-learn-general] multilayer perceptron questions From: David Marek - 2012-06-01 00:49:45 Attachments: Message as HTML ```Hi, I don't have much time these days because I have got exams in school. I am sorry I haven't informed you. I have implemented a multi class cross entropy and soft max function and turned off some of the cython checks, the result is that the cython implementation is only slightly better, I guess that's because I am using objects as an output functions, I will have to benchmark them to know more. The next step is to test that the gradient descent is working correctly. I am a little unsure how to approach this. One thing I will do is to compute one step of backpropagation by hand and check that the implementation is doing the same. Another thing I will try to do is to compute the gradients numerically, I am not exactly sure if its enough to use the derivative from scipy and apply it on the forward step. David Dne 31.5.2012 21:02 "Andreas Mueller" napsal(a): > Hey David. > How is it going? > I haven't heard from you in a while. > Did you blog anything about your progress? > > Cheers, > Andy > > Am 16.05.2012 12:15, schrieb David Marek: > > On Tue, May 15, 2012 at 4:59 PM, David Warde-Farley > > wrote: > >> On Tue, May 15, 2012 at 12:12:34AM +0200, David Marek wrote: > >>> Hi, > >>> > >>> I have worked on multilayer perceptron and I've got a basic > >>> implementation working. You can see it at > >>> https://github.com/davidmarek/scikit-learn/tree/gsoc_mlp The most > >>> important part is the sgd implementation, which can be found here > >>> > https://github.com/davidmarek/scikit-learn/blob/gsoc_mlp/sklearn/mlp/mlp_fast.pyx > >>> > >>> I have encountered a few problems and I would like to know your > opinion. > >>> > >>> 1) There are classes like SequentialDataset and WeightVector which are > >>> used in sgd for linear_model, but I am not sure if I should use them > >>> here as well. 
I have to do more with samples and weights than just > >>> multiply and add them together. I wouldn't be able to use numpy > >>> functions like tanh and do batch updates, would I? What do you think? > >> I haven't had a look at these classes myself but I think working with > raw > >> NumPy arrays is a better idea in terms of efficiency. > >> > >>> Am I missing something that would help me do everything I need with > >>> SequentialDataset? I implemented my own LossFunction because I need a > >>> vectorized version, I think that is the same problem. > >>> > >>> 2) I used Andreas' implementation as an inspiration and I am not sure > >>> I understand some parts of it: > >>> * Shouldn't the bias vector be initialized with ones instead of > >>> zeros? I guess there is no difference. > >> If the training set is mean-centered, then absolutely, yes. > >> > >> Otherwise the biases should in the hidden layer should be initialized to > >> the mean over the training set of -Wx, where W are the initial weights. > >> This ensures that the activation function is near its linear regime. > > Ok, the rule of thumb is that the bias should be initialized so the > > activation function starts in linear regime. > > > >>> * I am not sure why is the bias updated with: > >>> bias_output += lr * np.mean(delta_o, axis=0) > >>> shouldn't it be: > >>> bias_output += lr / batch_size * np.mean(delta_o, axis=0)? > >> As Andy said, the former allows you to set the learning rate without > taking > >> into account the batch size, which makes things a little simpler. > > I see, it's pretty obvious when I look at it now. > > > >>> * Shouldn't the backward step for computing delta_h be: > >>> delta_h[:] = np.dot(delta_o, weights_output.T) * > hidden.doutput(x_hidden) > >>> where hidden.doutput is a derivation of the activation function for > >>> hidden layer? > >> Offhand that sounds right. You can use Theano as a sanity check for your > >> implementation. 
> > Thank you David and Andreas for answering my questions. I will look at Theano. > > > > David > > > ------------------------------------------------------------------------------ > > Live Security Virtual Conference > > Exclusive live event will cover all the ways today's security and > > threat landscape has changed and how IT managers can respond. Discussions > > will include endpoint security, mobile security and the latest in malware > > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > > _______________________________________________ > > Scikit-learn-general mailing list > > Scikit-learn-general@... > > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general > ```
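David's plan to verify the gradients numerically is usually done with a central finite-difference comparison against the backpropagated gradients. The sketch below is not his mlp_fast.pyx code; it is a minimal stand-in network (the parameter names `W1`, `b1`, `W2`, `b2`, the layer sizes, and the tanh/softmax choices are all assumptions made for illustration) showing the kind of check he describes:

```python
import numpy as np

rng = np.random.RandomState(0)

# Tiny synthetic batch: 5 samples, 3 features, 2 classes (one-hot targets).
X = rng.randn(5, 3)
Y = np.eye(2)[rng.randint(0, 2, size=5)]

# Small random weights keep the tanh units near their linear regime at init.
W1 = rng.randn(3, 4) * 0.1
b1 = np.zeros(4)
W2 = rng.randn(4, 2) * 0.1
b2 = np.zeros(2)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss(W1, b1, W2, b2):
    """Mean cross-entropy of the forward pass."""
    H = np.tanh(np.dot(X, W1) + b1)
    P = softmax(np.dot(H, W2) + b2)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

def backprop_grads(W1, b1, W2, b2):
    """Analytic gradients of the mean cross-entropy."""
    H = np.tanh(np.dot(X, W1) + b1)
    P = softmax(np.dot(H, W2) + b2)
    n = X.shape[0]
    delta_o = (P - Y) / n                          # softmax/cross-entropy cancellation
    delta_h = np.dot(delta_o, W2.T) * (1 - H ** 2)  # tanh'(a) = 1 - tanh(a)**2
    return (np.dot(X.T, delta_h), delta_h.sum(axis=0),
            np.dot(H.T, delta_o), delta_o.sum(axis=0))

def numeric_grad(f, theta, eps=1e-6):
    """Central finite differences, one coordinate at a time (in-place perturbation)."""
    g = np.zeros_like(theta)
    it = np.nditer(theta, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = theta[i]
        theta[i] = old + eps; fp = f()
        theta[i] = old - eps; fm = f()
        theta[i] = old
        g[i] = (fp - fm) / (2 * eps)
        it.iternext()
    return g

analytic = backprop_grads(W1, b1, W2, b2)
for param, g_a in zip([W1, b1, W2, b2], analytic):
    g_n = numeric_grad(lambda: loss(W1, b1, W2, b2), param)
    assert np.allclose(g_a, g_n, atol=1e-7)
print("gradient check passed")
```

If analytic and numeric gradients agree to roughly 1e-7 with eps around 1e-6, the backward pass is almost certainly right; a mismatch in a single parameter block points directly at the corresponding update rule.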
 Re: [Scikit-learn-general] multilayer perceptron questions From: Gael Varoquaux - 2012-06-01 05:27:41 ```On Fri, Jun 01, 2012 at 02:49:35AM +0200, David Marek wrote: > I don't have much time these days because I have got exams in school. Good luck! > I have implemented a multi class cross entropy and soft max function and > turned off some of the cython checks, the result is that the cython > implementation is only slightly better, I guess that's because I am using > objects as output functions, I will have to benchmark them to know > more. Do you have any code on github that you can show us? I am not trying to micro-manage you, but rather to see whether we can help by offering ideas once we see the code. G ```
 Re: [Scikit-learn-general] multilayer perceptron questions From: Andreas Mueller - 2012-06-18 06:44:34 ```Hey David. Olivier dug up this paper by LeCun's group: http://users.ics.aalto.fi/kcho/papers/icml11.pdf I think this might be quite interesting for the MLP. It is probably also interesting for the linear SGD. I'm surprised that they didn't compare against diagonal stochastic Levenberg-Marquardt with constant learning rate... Cheers, Andy On 05/15/2012 12:12 AM, David Marek wrote: > Hi, > > I have worked on multilayer perceptron and I've got a basic > implementation working. You can see it at > https://github.com/davidmarek/scikit-learn/tree/gsoc_mlp The most > important part is the sgd implementation, which can be found here > https://github.com/davidmarek/scikit-learn/blob/gsoc_mlp/sklearn/mlp/mlp_fast.pyx > > I have encountered a few problems and I would like to know your opinion. > > 1) There are classes like SequentialDataset and WeightVector which are > used in sgd for linear_model, but I am not sure if I should use them > here as well. I have to do more with samples and weights than just > multiply and add them together. I wouldn't be able to use numpy > functions like tanh and do batch updates, would I? What do you think? > Am I missing something that would help me do everything I need with > SequentialDataset? I implemented my own LossFunction because I need a > vectorized version, I think that is the same problem. > > 2) I used Andreas' implementation as an inspiration and I am not sure > I understand some parts of it: > * Shouldn't the bias vector be initialized with ones instead of > zeros? I guess there is no difference. > * I am not sure why is the bias updated with: > bias_output += lr * np.mean(delta_o, axis=0) > shouldn't it be: > bias_output += lr / batch_size * np.mean(delta_o, axis=0)? 
> * Shouldn't the backward step for computing delta_h be: > delta_h[:] = np.dot(delta_o, weights_output.T) * hidden.doutput(x_hidden) > where hidden.doutput is the derivative of the activation function for the > hidden layer? > > I hope my questions are not too stupid. Thank you. > > David ```
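The cancellation Andy refers to can be made concrete. For a softmax output p = softmax(z) and cross-entropy loss L = -sum_k y_k log p_k, the chain rule gives dL/dz_j = sum_k y_k (p_j - [j==k]) = p_j - y_j, so the output delta is simply p - y with no extra activation-derivative factor; only the hidden layer keeps its tanh' term, as in the delta_h formula quoted above. A quick numerical check of this identity (all numbers below are made up):

```python
import numpy as np

# One sample: logits z and a one-hot target y.
z = np.array([0.2, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def cross_entropy(z):
    return -np.sum(y * np.log(softmax(z)))

# Claimed closed form after the cancellation: dL/dz = softmax(z) - y.
analytic = softmax(z) - y

# Central finite differences on each logit confirm it.
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * e) - cross_entropy(z - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

assert np.allclose(analytic, numeric, atol=1e-8)
```

The same cancellation happens for a sigmoid output with logistic loss, which is why the output delta is linear in both cases.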
 Re: [Scikit-learn-general] multilayer perceptron questions From: Olivier Grisel - 2012-06-18 09:44:23 ```2012/6/18 Andreas Mueller : > Hey David. > Olivier dug up this paper by LeCun's group: > http://users.ics.aalto.fi/kcho/papers/icml11.pdf > I think this might be quite interesting for the MLP. Err, no. The paper I mentioned is even newer: http://arxiv.org/abs/1206.1106 To be more explicit about the paper's content: the title is "No More Pesky Learning Rates" and it's a method for estimating the optimal learning rate schedule online from the data, while learning a model using SGD with a smooth loss (convex or not). It's a pre-print for NIPS 2012. It looks very promising. It would be great to try to reproduce some of their empirical results. -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel ```
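For intuition, the paper's central quantity (a per-parameter rate eta = E[g]^2 / (E[g^2] * h), built from moving averages of the gradient, its square, and a curvature estimate h) can be sketched on a noisy 1-D quadratic. Everything below simplifies the paper heavily and is only an illustration: the averaging constant tau is a fixed made-up choice (the paper adapts the memory size), and the true curvature is plugged in instead of being estimated online:

```python
import numpy as np

rng = np.random.RandomState(0)

# Noisy 1-D quadratic: f(theta) = 0.5 * h * (theta - theta_star)**2,
# with observed gradient g = h * (theta - theta_star) + noise.
h_true, theta_star, sigma = 2.0, 3.0, 0.5
theta = 0.0

# Exponential moving averages of g and g**2.
tau = 0.05
g_bar, v_bar = None, None

for _ in range(2000):
    g = h_true * (theta - theta_star) + sigma * rng.randn()
    if g_bar is None:
        g_bar, v_bar = g, g * g
    else:
        g_bar = (1 - tau) * g_bar + tau * g
        v_bar = (1 - tau) * v_bar + tau * g * g
    # Adaptive rate: large while the averaged gradient is consistent,
    # shrinking automatically once noise dominates near the optimum.
    eta = g_bar ** 2 / (v_bar * h_true)
    theta -= eta * g

print(round(theta, 2))  # should land close to theta_star = 3.0
```

Note that g_bar**2 <= v_bar for these averages, so eta never exceeds 1/h and the update stays stable without any hand-tuned schedule, which is exactly the paper's selling point.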
 Re: [Scikit-learn-general] multilayer perceptron questions From: Andreas Mueller - 2012-06-18 09:46:03 ```On 18.06.2012 11:43, Olivier Grisel wrote: > 2012/6/18 Andreas Mueller: >> Hey David. >> Olivier dug up this paper by LeCun's group: >> http://users.ics.aalto.fi/kcho/papers/icml11.pdf >> I think this might be quite interesting for the MLP. > Err, no. The paper I mentioned is even newer: > > http://arxiv.org/abs/1206.1106 > > Just to make it more explicit about the paper content, the title is: > "No More Pesky Learning Rates" and it's a method for estimating the > optimal learning rate schedule online from the data, while learning a > model using SGD with a smooth loss (convex or not). It's a pre-print > for NIPS 2012. It looks very promising. It would be great to try to > reproduce some of their empirical results. > Sorry, copy&paste error :-/ ```
 Re: [Scikit-learn-general] multilayer perceptron questions From: David Marek - 2012-06-19 09:10:10 ```Hi, On Mon, Jun 18, 2012 at 11:43 AM, Olivier Grisel wrote: > Err, no. The paper I mentioned is even newer: > > http://arxiv.org/abs/1206.1106 > > Just to make it more explicit about the paper content, the title is: > "No More Pesky Learning Rates" and it's a method for estimating the > optimal learning rate schedule online from the data, while learning a > model using SGD with a smooth loss (convex or not). It's a pre-print > for NIPS 2012. It looks very promising. It would be great to try to > reproduce some of their empirical results. > Thanks, I will look at it. David ```