From: Andreas Mueller <amueller@ai...>  2012-12-31 17:20:13

Hi David.

I am not really familiar with R, but you can just combine several forests by averaging their output. Ideally you should weight their output by the number of trees in each forest.

Best,
Andy

On 12/20/2012 04:17 AM, David Broyles wrote:
> Hi,
>
> I'm new to scikit-learn. Curious if there's a way to combine forests
> trained separately, similar to the combine() method in the R
> randomForest package.
>
> Thanks in advance!
>
> David
>
> ------------------------------------------------------------------------------
> Master Visual Studio, SharePoint, SQL, ASP.NET, C# 2012, HTML5, CSS,
> MVC, Windows 8 Apps, JavaScript and much more. Keep your skills current
> with LearnDevNow - 3,200 step-by-step video tutorials by Microsoft
> MVPs and experts. SALE $99.99 this month only - learn more at:
> http://p.sf.net/sfu/learnmore_122412
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@...
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
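A hedged sketch of Andy's suggestion in scikit-learn itself: pool the fitted trees of two forests into a single estimator. Since `predict_proba` averages over `estimators_`, each original forest is implicitly weighted by its number of trees, as suggested above. Mutating `estimators_` / `n_estimators` this way is an informal trick, not a supported API:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Two forests trained separately
rf_a = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
rf_b = RandomForestClassifier(n_estimators=30, random_state=1).fit(X, y)

# Start from a fitted forest so the bookkeeping attributes exist,
# then swap in the pooled list of trees.
combined = RandomForestClassifier(n_estimators=1, random_state=2).fit(X, y)
combined.estimators_ = rf_a.estimators_ + rf_b.estimators_
combined.n_estimators = len(combined.estimators_)

print(len(combined.estimators_))  # 40 trees in the pooled forest
print(combined.score(X, y))
```

Both forests must have been trained on the same feature set and class labels for the averaging to be meaningful.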
From: Paul.C <zodrowski@me...>  2012-12-29 15:57:35

Thanks a lot, Andy, it did the job!

Cheers & Thanks,
Paul

> Hi Paul.
> You didn't set verbosity.
> The script you linked to set verbosity=1 but I think there is some more
> output on higher verbosity levels.
> Hth,
> Andy

This message and any attachment are confidential and may be privileged or otherwise protected from disclosure. If you are not the intended recipient, you must not copy this message or attachment or disclose the contents to any other person. If you have received this transmission in error, please notify the sender immediately and delete the message and any attachment from your system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not accept liability for any omissions or errors in this message which may arise as a result of e-mail transmission or for damages resulting from any unauthorized changes of the content of this message and any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not guarantee that this message is free of viruses and does not accept liability for any damages caused by any virus transmitted therewith. Click http://www.merckgroup.com/disclaimer to access the German, French, Spanish and Portuguese versions of this disclaimer.
From: Andreas Mueller <amueller@ai...>  2012-12-29 10:29:56

Hi Paul.

You didn't set verbosity. The script you linked to set verbosity=1 but I think there is some more output on higher verbosity levels.

Hth,
Andy
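In current scikit-learn the knob Andy is referring to is the `verbose` parameter of `GridSearchCV` itself; a minimal sketch of what "setting verbosity" looks like (the dataset and grid are illustrative, not Paul's):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [1, 10], "kernel": ["linear"]}

# verbose > 0 makes the grid search print progress for each fit,
# which is the per-candidate output Paul was missing.
grid = GridSearchCV(SVC(), param_grid, scoring="recall_macro",
                    cv=3, verbose=2)
grid.fit(X, y)
print(grid.best_params_)
```

Higher verbose levels print more detail per fit, mirroring Andy's remark about "higher verbosity levels".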
From: denis <denisbzgg@t...>  2012-12-28 18:18:32

Mathieu Blondel <mathieu@...> writes:
> ...
> I started a project some time ago to benchmark sparse-sparse dot product
> algorithms: https://github.com/mblondel/dotbench

Mathieu, that's nice -- would like to see more such. Fwiw, dotbench looks plenty fast (mac, 2.5 GHz Intel Core i5, 4 GB 1333 DDR3 memory):

Matrix shape: (1000, 10000)  Sparsity: 0.1
Vector shape: (1, 10000)     Sparsity: 0.7
    1 msec  sparse-dense
   83 msec  sparse-sparse binary search
   18 msec  sparse-sparse hash map
   17 msec  sparse-sparse incremental

Matrix shape: (1000, 20000)  Sparsity: 0.1
Vector shape: (1, 20000)     Sparsity: 0.7
    3 msec  sparse-dense
  175 msec  sparse-sparse binary search
   36 msec  sparse-sparse hash map
   33 msec  sparse-sparse incremental

Matrix shape: (1000, 40000)  Sparsity: 0.1
Vector shape: (1, 40000)     Sparsity: 0.7
    7 msec  sparse-dense
  363 msec  sparse-sparse binary search
   74 msec  sparse-sparse hash map
   67 msec  sparse-sparse incremental

What I do is just a space-time tradeoff:

// fast dot numpy dense / scipy sparse vectors
// dot sparse A . sparse B in time ~ min( nnz A, nnz B ):
//     splat A to a big array of zeros,
//     dot-dense-sparse( zeros with A, B )
//     zeros[A] = 0 again.

How fast this is will depend so strongly on cache / VM that it'll be hard to compare. No, I haven't tried different sizes / sparsities much -- where can we collect numbers from *real* data?

cheers
-- denis
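denis's splat trick can be sketched directly against scipy.sparse. A hedged sketch (the function name `dot_sparse_sparse` is mine; the toy shapes mirror the benchmark sizes above): scatter A's nonzeros into a reusable dense scratch buffer, take a dense-sparse dot over B's nonzeros, then restore the zeros, so each call costs about min(nnz A, nnz B) time at the price of an n-length buffer:

```python
import numpy as np
import scipy.sparse as sp

def dot_sparse_sparse(a, b, scratch):
    """Dot of two 1-row CSR vectors using a reusable dense zeros buffer."""
    scratch[a.indices] = a.data           # splat A into the dense buffer
    result = scratch[b.indices] @ b.data  # dense-sparse dot over B's nonzeros
    scratch[a.indices] = 0.0              # zeros[A] = 0 again for the next call
    return result

n = 10000
a = sp.random(1, n, density=0.1, format="csr", random_state=0)
b = sp.random(1, n, density=0.7, format="csr", random_state=1)
scratch = np.zeros(n)

expected = (a @ b.T).toarray()[0, 0]  # scipy's own sparse-sparse product
print(np.isclose(dot_sparse_sparse(a, b, scratch), expected))
```

As denis notes, whether this beats the hash-map or binary-search variants depends heavily on cache behaviour and the buffer size n.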
From: Paul.C <zodrowski@me...>  2012-12-28 17:15:09

Dear SciKitters,

inspired by this script:
http://scikit-learn.org/stable/auto_examples/grid_search_text_feature_extraction.html

=> Very appealing that the remaining computing time is outputted! However, when adapting the script to my purposes, there is no such output in my case:

scores = [('recall', recall_score)]
clf_svc_grid = svm.SVC(probability=True)
tuned_parameters = [{'C': [1, 10, 100, 1000],
                     'kernel': ['linear'],
                     'class_weight': [{1: 10}, {1: 2}],
                     }]

for score_name, score_func in scores:
    print "# Tuning hyperparameters for %s" % score_name
    print score_name, score_func, tuned_parameters
    clf_svc_grid = GridSearchCV(svm.SVC(probability=True), tuned_parameters,
                                score_func=score_func, n_jobs=30)
    pprint(tuned_parameters)
    t0 = time()
    clf_svc_grid.fit(X_train, y_train_int, cv=5, n_jobs=30)
    print "Best parameters set found on development set:"
    print
    print clf_svc_grid.best_estimator_
    print "Grid scores on development set:"
    for params, mean_score, scores in clf_svc_grid.grid_scores_:
        print "%0.3f (+/-%0.03f) for %r" % (
            mean_score, scores.std() / 2, params)
    print
    print "done in %0.3fs" % (time() - t0)

Is logging not possible for the way I train the models? Or is there another suggestion to help me out?

Cheers & Thanks,
Paul
From: Vlad Niculae <zephyr14@gm...>  2012-12-28 13:43:30

In the matrix-matrix case (as opposed to vector-vector or matrix-vector), I played with Mathieu's dotbench and it didn't beat Scipy's very efficient implementation.

On Fri, Dec 28, 2012 at 7:51 AM, Mathieu Blondel <mathieu@...> wrote:
> I forgot to mention that the multiplication of two sparse matrices in
> scipy results in a sparse matrix. In scikit-learn, we have a few
> applications where a dense output would be more useful (even if the two
> input matrices are sparse).
>
> Mathieu
>
> ------------------------------------------------------------------------------
> Master HTML5, CSS3, ASP.NET, MVC, AJAX, Knockout.js, Web API and
> much more. Get web development skills now with LearnDevNow -
> 350+ hours of step-by-step video tutorials by Microsoft MVPs and experts.
> SALE $99.99 this month only - learn more at:
> http://p.sf.net/sfu/learnmore_122812
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@...
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
From: Olivier Grisel <olivier.grisel@en...>  2012-12-28 09:59:35

2012/12/28 Mathieu Blondel <mathieu@...>:
> I forgot to mention that the multiplication of two sparse matrices in scipy
> results in a sparse matrix. In scikit-learn, we have a few applications
> where a dense output would be more useful (even if the two input matrices
> are sparse).

I forgot one actual use case already implemented in scikit-learn master: sparse random projections. The user can select whether the outcome should be a dense numpy array or left sparse.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
From: Andreas Mueller <amueller@ai...>  2012-12-28 09:53:09

Hi Firoj.

This looks like a BLAS problem. Which BLAS are you linking against? It looks like you built against ATLAS, but now it can't be found.

Best,
Andy

On 12/28/2012 09:43 AM, Firoj Alam wrote:
> Hi Andy,
> Thanks for your reply. However, I see I have other problems. I am not
> sure, but it could be a Mac-specific problem.
>
> *************
> Exception AttributeError: AttributeError("'NoneType' object has no
> attribute 'tell'",) in <bound method memmap.__del__ of memmap(6e-323)>
> ignored
> Exception AttributeError: AttributeError("'NoneType' object has no
> attribute 'tell'",) in <bound method memmap.__del__ of memmap(6e-323)>
> ignored
> Exception AttributeError: AttributeError("'NoneType' object has no
> attribute 'tell'",) in <bound method memmap.__del__ of memmap(6e-323)>
> ignored
> /Users/firojalam/tools/scikit-learn-0.12.1/sklearn/externals/joblib/test/test_numpy_pickle.py:182:
> Warning: file
> "/var/folders/m7/m7Wkv29VGV8fEifJfG0ZtE+++TI/Tmp/tmpd2188g/test.pkl155"
> appears to be a zip, ignoring mmap_mode "r" flag passed
> numpy_pickle.load(this_filename, mmap_mode='r')
> Exception AttributeError: AttributeError("'NoneType' object has no
> attribute 'tell'",) in <bound method memmap.__del__ of memmap(6e-323)>
> ignored
> ...............................................................................................E........E.EE.......................................................E.................................EEE.........................................................................EE............EF......E..........E..E..
> ======================================================================
> ERROR: Failure: ImportError
> (dlopen(/Users/firojalam/tools/scikit-learn-0.12.1/sklearn/cluster/_k_means.so, 2):
> Symbol not found: _ATL_ddot
> Referenced from:
> /Users/firojalam/tools/scikit-learn-0.12.1/sklearn/cluster/_k_means.so
> Expected in: flat namespace
> in /Users/firojalam/tools/scikit-learn-0.12.1/sklearn/cluster/_k_means.so)
>
> .
> .
> .
> FAILED (SKIP=3, errors=17, failures=1)
>
> ******
> Other errors like:
> Symbol not found: _ATL_daxpy
> Symbol not found: _ATL_ddot
>
> Regards
> Firoj
>
> > Date: Fri, 21 Dec 2012 10:59:11 +0100
> > From: Andreas Mueller <amueller@...>
> > Subject: Re: [Scikit-learn-general] installation problem
> >     scikit-learn-0.12.1
> > To: scikit-learn-general@...
> > Message-ID: <50D432EF.2040407@...>
> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> >
> > On 21.12.2012 10:49, Firoj Alam wrote:
> > > My system info:
> > > Mac OSX 10.6 Snow Leopard
> > > Python 2.7.3
> > > I installed scikit-learn-0.12.1 from source but when I test it I am
> > > getting the error as the following:
> >
> > Hi Fiorj.
> > I remember this came up before but I don't know the cause.
> > As far as I know, this is not an error in the actual build, but in the
> > checking of the build.
> >
> > Usually the problem was with Python 3, though:
> > https://github.com/scikit-learn/scikit-learn/issues/1251
> >
> > As a hack, you could try to comment out line 31 in sklearn/__init__.py
> > that imports __check_build
> > and then run nosetests on the sklearn folder.
> >
> > Best,
> > Andy
From: Firoj Alam <firojalam@gm...>  2012-12-28 08:43:46

Hi Andy,

Thanks for your reply. However, I see I have other problems. I am not sure, but it could be a Mac-specific problem.

*************
Exception AttributeError: AttributeError("'NoneType' object has no attribute 'tell'",) in <bound method memmap.__del__ of memmap(6e-323)> ignored
Exception AttributeError: AttributeError("'NoneType' object has no attribute 'tell'",) in <bound method memmap.__del__ of memmap(6e-323)> ignored
Exception AttributeError: AttributeError("'NoneType' object has no attribute 'tell'",) in <bound method memmap.__del__ of memmap(6e-323)> ignored
/Users/firojalam/tools/scikit-learn-0.12.1/sklearn/externals/joblib/test/test_numpy_pickle.py:182:
Warning: file "/var/folders/m7/m7Wkv29VGV8fEifJfG0ZtE+++TI/Tmp/tmpd2188g/test.pkl155" appears to be a zip, ignoring mmap_mode "r" flag passed
  numpy_pickle.load(this_filename, mmap_mode='r')
Exception AttributeError: AttributeError("'NoneType' object has no attribute 'tell'",) in <bound method memmap.__del__ of memmap(6e-323)> ignored
...............................................................................................E........E.EE.......................................................E.................................EEE.........................................................................EE............EF......E..........E..E..
======================================================================
ERROR: Failure: ImportError
(dlopen(/Users/firojalam/tools/scikit-learn-0.12.1/sklearn/cluster/_k_means.so, 2): Symbol not found: _ATL_ddot
 Referenced from: /Users/firojalam/tools/scikit-learn-0.12.1/sklearn/cluster/_k_means.so
 Expected in: flat namespace
 in /Users/firojalam/tools/scikit-learn-0.12.1/sklearn/cluster/_k_means.so)

.
.
.
FAILED (SKIP=3, errors=17, failures=1)
******

Other errors like:
Symbol not found: _ATL_daxpy
Symbol not found: _ATL_ddot

Regards
Firoj

> Date: Fri, 21 Dec 2012 10:59:11 +0100
> From: Andreas Mueller <amueller@...>
> Subject: Re: [Scikit-learn-general] installation problem
>     scikit-learn-0.12.1
> To: scikit-learn-general@...
> Message-ID: <50D432EF.2040407@...>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 21.12.2012 10:49, Firoj Alam wrote:
> > My system info:
> > Mac OSX 10.6 Snow Leopard
> > Python 2.7.3
> > I installed scikit-learn-0.12.1 from source but when I test it I am
> > getting the error as the following:
>
> Hi Fiorj.
> I remember this came up before but I don't know the cause.
> As far as I know, this is not an error in the actual build, but in the
> checking of the build.
>
> Usually the problem was with Python 3, though:
> https://github.com/scikit-learn/scikit-learn/issues/1251
>
> As a hack, you could try to comment out line 31 in sklearn/__init__.py
> that imports __check_build
> and then run nosetests on the sklearn folder.
>
> Best,
> Andy
From: Mathieu Blondel <mathieu@mb...>  2012-12-28 07:51:44

I forgot to mention that the multiplication of two sparse matrices in scipy results in a sparse matrix. In scikit-learn, we have a few applications where a dense output would be more useful (even if the two input matrices are sparse).

Mathieu
From: Mathieu Blondel <mathieu@mb...>  2012-12-28 07:41:57

Kernel SVMs with linear kernel or polynomial kernel require sparse-sparse dot products. Regarding k-means, centers are typically dense unless you apply L1-ball projections, so I think dense-sparse matrix multiplication may be faster.

Can you describe your algorithm and why intuitively it would be faster than the one implemented in scipy? That sounds interesting. Also, have you tried to generate artificial data with different levels of sparsity and compare the algorithms? I started a project some time ago to benchmark sparse-sparse dot product algorithms: https://github.com/mblondel/dotbench

Dot-product algorithms can also be used for matrix multiplication. One disadvantage is that if the matrix is n x m, we need to compute n x m dot products. This contrasts with the algorithm implemented in scipy, which computes the entire matrix multiplication (in two passes). One advantage however is that this is embarrassingly parallel, i.e., all dot products can be computed in parallel. It would be interesting to compare different sparse matrix multiplication algorithms with respect to the number of cores used.

Mathieu
From: Robert Layton <robertlayton@gm...>  2012-12-27 22:17:17

On 28 December 2012 07:20, Lars Buitinck <L.J.Buitinck@...> wrote:
> 2012/12/27 Philipp Singer <killver@...>:
> > I agree. You could do something like all pairs cosine similarity using a
> > large sparse matrix.
>
> Or learn a sparse linear model, replace the coef_ with a CSR version
> of same, then profile prediction.
>
> --
> Lars Buitinck
> Scientific programmer, ILPS
> University of Amsterdam

--
Public key at: http://pgp.mit.edu/ Search for this email address and select the key from "2011-08-19" (key id: 54BA8735)

k-means may be another option. An n_clusters by n_samples distance matrix needs to be calculated every iteration (and there can be hundreds of iterations). While the algorithm is created around Euclidean distance, it will work* for any distance metric.

- Robert

* work meaning: local minima found
From: Lars Buitinck <L.J.Buitinck@uv...>  2012-12-27 20:20:33

2012/12/27 Philipp Singer <killver@...>:
> I agree. You could do something like all pairs cosine similarity using a
> large sparse matrix.

Or learn a sparse linear model, replace the coef_ with a CSR version of same, then profile prediction.

--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
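Lars's experiment can be set up with the linear models' `sparsify()` method, which replaces the dense `coef_` with a CSR matrix in place. A sketch (the penalty and alpha values are illustrative; L1 is used so that many coefficients are actually zero):

```python
import scipy.sparse as sp
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, n_features=100, random_state=0)

# L1 penalty drives many coefficients to exactly zero
clf = SGDClassifier(penalty="l1", alpha=0.001, random_state=0).fit(X, y)

clf.sparsify()  # coef_ is now a scipy.sparse CSR matrix
print(sp.issparse(clf.coef_))

# prediction now runs through the sparse coefficient matrix,
# which is what one would profile per Lars's suggestion
print(clf.score(X, y))
```

From here, timing `clf.predict` before and after `sparsify()` gives the comparison Lars describes.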
From: Philipp Singer <killver@gm...>  2012-12-27 17:46:50

On 27.12.2012 18:32, Olivier Grisel wrote:
> 2012/12/27 denis <denisbzgg@...>:
>> Olivier Grisel <olivier.grisel@...> writes:
>>
>>> 2012/12/27 denis <denisbzgg@...>:
>>>> Folks,
>>>> does any module in scikit-learn do dot( sparse vec, sparse vec ) a lot ?
>>>> I wanted to try out a fast dot_sparse_vec (time ~ nnz, space ~ n)
>>>> but so far I see only safe_sparse_dot( big sparse array, numpy array )
>>>> e.g. for RandomPCA.
>>> The speed of the <sparse matrix> dot <sparse matrix> depends on the
>>> actual implementation of the scipy.sparse matrices.
>> Olivier,
>> sorry, I wasn't clear: I want to try out my fast NEW implementation of
>> dot( sparse vec, sparse vec )
>> and am looking for a testcase in scikit-learn that does a lot of those
>> to measure the speedup
>> cheers
> Alright. AFAIK we don't have a use case in scikit-learn for that kind
> of operation yet.
>
> Computing knn queries using cosine similarity on a pre-normalized
> sparse vector corpus + query might be a valid use case though.

I agree. You could do something like all pairs cosine similarity using a large sparse matrix.

> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
From: Olivier Grisel <olivier.grisel@en...>  2012-12-27 17:32:47

2012/12/27 denis <denisbzgg@...>:
> Olivier Grisel <olivier.grisel@...> writes:
>
>> 2012/12/27 denis <denisbzgg@...>:
>> > Folks,
>> > does any module in scikit-learn do dot( sparse vec, sparse vec ) a lot ?
>> > I wanted to try out a fast dot_sparse_vec (time ~ nnz, space ~ n)
>> > but so far I see only safe_sparse_dot( big sparse array, numpy array )
>> > e.g. for RandomPCA.
>>
>> The speed of the <sparse matrix> dot <sparse matrix> depends on the
>> actual implementation of the scipy.sparse matrices.
>
> Olivier,
> sorry, I wasn't clear: I want to try out my fast NEW implementation of
> dot( sparse vec, sparse vec )
> and am looking for a testcase in scikit-learn that does a lot of those
> to measure the speedup
> cheers

Alright. AFAIK we don't have a use case in scikit-learn for that kind of operation yet.

Computing knn queries using cosine similarity on a pre-normalized sparse vector corpus + query might be a valid use case though.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
From: denis <denisbzgg@t...>  2012-12-27 17:22:48

Olivier Grisel <olivier.grisel@...> writes:
>
> 2012/12/27 denis <denisbzgg@...>:
> > Folks,
> > does any module in scikit-learn do dot( sparse vec, sparse vec ) a lot ?
> > I wanted to try out a fast dot_sparse_vec (time ~ nnz, space ~ n)
> > but so far I see only safe_sparse_dot( big sparse array, numpy array )
> > e.g. for RandomPCA.
>
> The speed of the <sparse matrix> dot <sparse matrix> depends on the
> actual implementation of the scipy.sparse matrices.

Olivier,

sorry, I wasn't clear: I want to try out my fast NEW implementation of dot( sparse vec, sparse vec ) and am looking for a testcase in scikit-learn that does a lot of those to measure the speedup.

cheers
-- denis
From: Olivier Grisel <olivier.grisel@en...>  2012-12-27 16:35:51

2012/12/27 denis <denisbzgg@...>:
> Folks,
> does any module in scikit-learn do dot( sparse vec, sparse vec ) a lot ?
> I wanted to try out a fast dot_sparse_vec (time ~ nnz, space ~ n)
> but so far I see only safe_sparse_dot( big sparse array, numpy array )
> e.g. for RandomPCA.

The speed of the <sparse matrix> dot <sparse matrix> product depends on the actual implementation of the scipy.sparse matrices. Using the CSR and CSC data structures should be ok, while other data structures are not optimized for product computation but, for instance, for insertion of new non-zero elements.

Note that if A and B are two scipy.sparse matrices, then the matrix product can be obtained with `A * B`, as opposed to numpy arrays, where it would be obtained with `np.dot(A, B)` or `A.dot(B)`.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
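Olivier's note in runnable form, with random toy matrices (a minimal sketch; shapes and densities are illustrative):

```python
import numpy as np
import scipy.sparse as sp

# CSR x CSC is a good layout pairing for sparse matrix products
A = sp.random(50, 30, density=0.1, format="csr", random_state=0)
B = sp.random(30, 20, density=0.1, format="csc", random_state=1)

C = A * B  # for scipy.sparse matrices, * is the matrix product
print(sp.issparse(C))  # the result stays sparse, as Mathieu pointed out

# same numbers as the dense product
assert np.allclose(C.toarray(), A.toarray() @ B.toarray())
```

With dense numpy arrays, `*` would instead mean element-wise multiplication, which is exactly the pitfall Olivier is flagging.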
From: denis <denisbzgg@t...>  2012-12-27 15:58:02

Folks,

does any module in scikit-learn do dot( sparse vec, sparse vec ) a lot ? I wanted to try out a fast dot_sparse_vec (time ~ nnz, space ~ n) but so far I see only safe_sparse_dot( big sparse array, numpy array ), e.g. for RandomPCA.

Thanks, cheers / sante / Prost Neujahr
-- denis
From: Alexandre Gramfort <alexandre.gramfort@in...>  2012-12-25 17:41:13

hi,

the CV models in coordinate_descent have the same use case. We use warm restarts to fit efficiently for many values of alpha. The way it is done is via a path function that returns a list of models fitted sequentially. Then there is a CV loop that runs the path for every fold and picks the best alpha.

If you could find a more generic way to do this, it would indeed be neat! Could we extend the path-type approach? A function / something that returns a list of models?

Alex

just throwing ideas...

On Tue, Dec 25, 2012 at 1:48 PM, Andreas Mueller <amueller@...> wrote:
> On 12/25/2012 01:40 PM, Gilles Louppe wrote:
>> Second, what do you exactly mean by "generalized" CV? I am not sure to
>> have the same idea in mind. Do you mean finding the best parameter
>> value without brute force, in a smart way specific to the estimator?
>>
> Basically yes. Something that fits an estimator for several values of a
> parameter in an efficient way.
>> In that case, one could do that on min_samples_split, using a post
>> pruning procedure.
> Yeah, that would be possible and actually pretty cool -- it is also
> trivially possible for depth btw.
>
> Basically fitting one fully grown tree (or say, up to depth 20) we could
> get the perfect estimator in a single run of fit.
>
> My question is: how ;)
> [And I'd prefer a way that can also cope with being in a pipeline]
>
> ------------------------------------------------------------------------------
> LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial
> Remotely access PCs and mobile devices and provide instant support
> Improve your efficiency, and focus on delivering more value-add services
> Discover what IT Professionals Know. Rescue delivers
> http://p.sf.net/sfu/logmein_12329d2d
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@...
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
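A concrete instance of the path approach Alexandre describes, using the coordinate-descent API: `lasso_path` fits one coefficient vector per alpha with warm restarts, and `LassoCV` wraps that path in a CV loop to pick the best alpha (the toy data and parameter values here are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, lasso_path

X, y = make_regression(n_samples=100, n_features=20, noise=1.0,
                       random_state=0)

# The path function: one fit, a whole sequence of models along alpha,
# each warm-started from the previous solution.
alphas, coefs, _ = lasso_path(X, y, n_alphas=5)
print(coefs.shape)  # one coefficient vector per alpha along the path

# The CV wrapper: runs the path on every fold and picks the best alpha.
model = LassoCV(n_alphas=5, cv=3).fit(X, y)
print(model.alpha_)
```

Generalizing this pattern (a fit that returns a whole sequence of models) to trees or boosting is exactly the open question in this thread.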
From: Andreas Mueller <amueller@ai...>  2012-12-25 12:48:40

On 12/25/2012 01:40 PM, Gilles Louppe wrote:
> Second, what do you exactly mean by "generalized" CV? I am not sure to
> have the same idea in mind. Do you mean finding the best parameter
> value without brute force, in a smart way specific to the estimator?

Basically yes. Something that fits an estimator for several values of a parameter in an efficient way.

> In that case, one could do that on min_samples_split, using a post
> pruning procedure.

Yeah, that would be possible and actually pretty cool -- it is also trivially possible for depth btw.

Basically, by fitting one fully grown tree (or say, up to depth 20) we could get the perfect estimator in a single run of fit.

My question is: how ;)
[And I'd prefer a way that can also cope with being in a pipeline]
From: Gilles Louppe <g.louppe@gm...>  2012-12-25 12:40:21

Second, what do you exactly mean by "generalized" CV? I am not sure to have the same idea in mind. Do you mean finding the best parameter value without brute force, in a smart way specific to the estimator?

In that case, one could do that on min_samples_split, using a post-pruning procedure.

Thanks,
Gilles

On Tuesday, 25 December 2012, Gilles Louppe <g.louppe@...> wrote:
> Hi Andreas!
>
> ... and Merry Christmas to all!
>
> Quick and naive question: what is the point in cross-validating the
> number of trees in RandomForest (or in ExtraTrees)? The rule is
> simple: the more, the better.
>
> Gilles
>
> On 25 December 2012 13:07, Andreas Mueller <amueller@...> wrote:
>> Hi everybody and merry Christmas.
>>
>> I wanted to ask what people think about the future of the generalized
>> cross-validation API.
>> Currently, estimators that make some generalized cross-validation
>> possible provide a <Estimator>CV class
>> (RidgeCV, RFECV, LassoLarsCV).
>>
>> I think we should decide whether we want to do it the same way for
>> forests and other estimators.
>> As the AdaBoost PR contains some code in this direction and is about to
>> be merged soon, I think now is a good
>> time to make a decision.
>>
>> What I don't like about the current way is that it leads to nested
>> cross-validation when used together with GridSearchCV.
>> Currently, most CV objects only have one parameter, so this is not so
>> much of an issue. But as soon as one
>> builds a pipeline, using GridSearchCV is necessary.
>>
>> For forests only one of the many parameters (the number of trees) allows
>> generalized CV, so if we would add a RandomForestsCV
>> or an AdaBoostCV, it would always be used together with GridSearchCV,
>> resulting in nested cross-validation.
>>
>> I don't have a solution at the moment but I think ideally GridSearchCV
>> should somehow be able to handle
>> estimators that allow for generalized CV.
>>
>> What do you think?
>>
>> Best,
>> Andy
From: Andreas Mueller <amueller@ai...>  2012-12-25 12:37:10

On 12/25/2012 01:24 PM, Gilles Louppe wrote:
> Hi Andreas!
>
> ... and Merry Christmas to all!
>
> Quick and naive question: what is the point in cross-validating the
> number of trees in RandomForest (or in ExtraTrees)? The rule is
> simple: the more, the better.

Ok, maybe RandomForest was a bad example; for gradient boosting and AdaBoost it is more helpful.

For random forests it is still interesting: by looking at the "path" you know if you can gain by increasing the number of trees / lose by decreasing it. So if you compute it for 200 trees and you have a graph that shows that it stayed constant since 100, you will probably drop the number to 100 to be faster, while if you still see a rise in accuracy, you'll try to increase the number of trees. There is no really easy way to find out what a good number is afaik.
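The "path" Andy describes already exists for boosting via staged predictions: a fitted GradientBoostingClassifier exposes `staged_predict`, which yields predictions after each added tree, so a single fit gives the whole accuracy-vs-n_estimators curve. A hedged sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbc = GradientBoostingClassifier(n_estimators=100,
                                 random_state=0).fit(X_tr, y_tr)

# held-out accuracy after 1, 2, ..., 100 trees, from one single fit
curve = [np.mean(pred == y_te) for pred in gbc.staged_predict(X_te)]
best_n = int(np.argmax(curve)) + 1
print(best_n, max(curve))
```

This is exactly the graph Andy mentions: if the curve is flat past some point, a smaller n_estimators is just as good and faster.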
From: Gilles Louppe <g.louppe@gm...>  2012-12-25 12:25:48

Hi Andreas!

... and Merry Christmas to all!

Quick and naive question: what is the point in cross-validating the number of trees in RandomForest (or in ExtraTrees)? The rule is simple: the more, the better.

Gilles

On 25 December 2012 13:07, Andreas Mueller <amueller@...> wrote:
> Hi everybody and merry Christmas.
>
> I wanted to ask what people think about the future of the generalized
> cross-validation API.
> Currently, estimators that make some generalized cross-validation
> possible provide a <Estimator>CV class
> (RidgeCV, RFECV, LassoLarsCV).
>
> I think we should decide whether we want to do it the same way for
> forests and other estimators.
> As the AdaBoost PR contains some code in this direction and is about to
> be merged soon, I think now is a good
> time to make a decision.
>
> What I don't like about the current way is that it leads to nested
> cross-validation when used together with GridSearchCV.
> Currently, most CV objects only have one parameter, so this is not so
> much of an issue. But as soon as one
> builds a pipeline, using GridSearchCV is necessary.
>
> For forests only one of the many parameters (the number of trees) allows
> generalized CV, so if we would add a RandomForestsCV
> or an AdaBoostCV, it would always be used together with GridSearchCV,
> resulting in nested cross-validation.
>
> I don't have a solution at the moment but I think ideally GridSearchCV
> should somehow be able to handle
> estimators that allow for generalized CV.
>
> What do you think?
>
> Best,
> Andy
From: Andreas Mueller <amueller@ai...>  2012-12-25 12:07:38

Hi everybody and merry Christmas.

I wanted to ask what people think about the future of the generalized cross-validation API. Currently, estimators that make some generalized cross-validation possible provide a <Estimator>CV class (RidgeCV, RFECV, LassoLarsCV).

I think we should decide whether we want to do it the same way for forests and other estimators. As the AdaBoost PR contains some code in this direction and is about to be merged soon, I think now is a good time to make a decision.

What I don't like about the current way is that it leads to nested cross-validation when used together with GridSearchCV. Currently, most CV objects only have one parameter, so this is not so much of an issue. But as soon as one builds a pipeline, using GridSearchCV is necessary.

For forests only one of the many parameters (the number of trees) allows generalized CV, so if we would add a RandomForestsCV or an AdaBoostCV, it would always be used together with GridSearchCV, resulting in nested cross-validation.

I don't have a solution at the moment but I think ideally GridSearchCV should somehow be able to handle estimators that allow for generalized CV.

What do you think?

Best,
Andy
From: Jieyun Fu <jieyunfu@gm...>  2012-12-23 02:32:55

Hi all,

I found this piece of code (from here: http://stackoverflow.com/questions/10098533/implementing-bag-of-words-naive-bayes-classifier-in-nltk), which basically tries to classify movie reviews into positive and negative. Now I need to put in weights for positive and negative reviews (for example, negative reviews have a weight of 0.5 and positive reviews have a weight of 1). Is there a way to do it in the Pipeline class? MultinomialNB.fit() has a sample_weight parameter, but I can't "set" the sample weights anywhere in Pipeline (or can I?)

Sorry if this is a dumb question. I am quite new to this functionality of sklearn.

import numpy as np
from nltk.probability import FreqDist
from nltk.classify import SklearnClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('tfidf', TfidfTransformer()),
                     ('chi2', SelectKBest(chi2, k=1000)),
                     ('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)

from nltk.corpus import movie_reviews
pos = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('pos')]
neg = [FreqDist(movie_reviews.words(i)) for i in movie_reviews.fileids('neg')]
add_label = lambda lst, lab: [(x, lab) for x in lst]
classif.train(add_label(pos[:100], 'pos') + add_label(neg[:100], 'neg'))

l_pos = np.array(classif.batch_classify(pos[100:]))
l_neg = np.array(classif.batch_classify(neg[100:]))
print "Confusion matrix:\n%d\t%d\n%d\t%d" % (
    (l_pos == 'pos').sum(), (l_pos == 'neg').sum(),
    (l_neg == 'pos').sum(), (l_neg == 'neg').sum())
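One answer to the question: `Pipeline.fit` forwards fit parameters to individual steps when they are prefixed with the step's name, so sample weights can reach MultinomialNB as `nb__sample_weight` (targeting the step registered as 'nb'). A hedged sketch on synthetic data rather than the NLTK corpus; the 0.5-vs-1.0 class weighting from the question is encoded per sample:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=200, random_state=0)
X = MinMaxScaler().fit_transform(X)  # MultinomialNB needs non-negative input

pipe = Pipeline([("nb", MultinomialNB())])

# weight "negative" examples (class 0) at 0.5, "positive" (class 1) at 1.0
weights = np.where(y == 1, 1.0, 0.5)

# the nb__ prefix routes sample_weight to the MultinomialNB step's fit()
pipe.fit(X, y, nb__sample_weight=weights)
print(pipe.score(X, y))
```

Note this routes weights through Pipeline directly; whether NLTK's SklearnClassifier wrapper exposes a way to pass these fit parameters through its train() method is a separate question.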