From: Dan Halbert <halbert@ha...> - 2010-05-05 15:59:20

scikits.learn provides nice hooks into libsvm that I do not see in other libsvm wrappers. However, one thing we could use is the set of decision values available when using svm_predict_values(). The decision values are especially useful to us because we are using highly imbalanced classes (a big background model and a single-element or very small target set). It is very difficult to calibrate the models to return the correct class, but the scores returned by svm_predict_values() are different enough that they provide the information we need, and we can fuse the scores with other techniques.

I see you added returning probabilities for 0.3. I looked over the code, and I think this change could be similar. I could work on this to some extent, but if you have a suggestion about structuring it and what the most consistent external API would be, I would be grateful. (E.g., add SVC.predict_values() or incorporate it into SVC.predict() itself?) If you feel motivated to add it for your own purposes, that would be wonderful, of course.

Thanks,
Dan
From: Fabian Pedregosa <fabian.pedregosa@in...> - 2010-05-07 09:16:12

Dan Halbert wrote:
> On 5/6/2010 7:19 AM, Fabian Pedregosa wrote:
>> I'll do that as I'm fairly familiar with this code, but I'd like you to
>> test it afterwards.
>>
> Great! Thanks very much!
>> On the API side, I find predict_values quite ambiguous. What do you
>> think of predict_margin ?
>>
> I do like that better than "values". Since you are really predicting the
> class, and returning the margins, it is really "predict_with_margins".
> But you already have "predict_proba", so to be consistent I guess you'd
> leave out the "_with". Is "proba" just short for "probabilities"?

It should be working if you check out the latest git master from SourceForge, but it's not extensively tested. A simple example that seems to give the right results:

"""
In [1]: from scikits.learn import svm
In [2]: clf = svm.SVC()
In [3]: clf.fit([[0,0], [1, 1]], [0, 1])
Out[3]: <scikits.learn.svm.SVC object at 0x24a9fd0>
In [4]: clf.predict_margin([[1, 1], [2, 2]])
Out[4]: array([[ 0.3495638], [0.3495638]])
"""

As for predict_proba, it should return the probability estimates (it wraps svm_predict_probability), but I'm having a hard time interpreting the results (see the message with subject "returning probabilities from SVC.predict" on this mailing list)...

Cheers,
fabian

>
> Dan
>
> _______________________________________________
> Scikitlearngeneral mailing list
> Scikitlearngeneral@...
> https://lists.sourceforge.net/lists/listinfo/scikitlearngeneral
From: Dan Halbert <halbert@ha...> - 2010-05-07 20:30:55

On Friday, May 7, 2010 5:16am, "Fabian Pedregosa" <fabian.pedregosa@...> said:
> Dan Halbert wrote:
> > On 5/6/2010 7:19 AM, Fabian Pedregosa wrote:
> >> I'll do that as I'm fairly familiar with this code, but I'd like you to
> >> test it afterwards.
> ...
> It should be working if you check out the latest git master from
> sourceforge, but it's not extensively tested.
>
> As for predict_proba, it should return the probability estimates (it
> wraps svm_predict_probability), but I'm having a hard time to interpret
> the results (see a message with subject "returning probabilities from
> SVC.predict" in this mailing list)...

Thanks for the quick coding! I have tested predict_margin() against my own standard test case (using a linear kernel), and I get almost matching values ("almost": see 3 below). So it seems to work just fine. I also tested predict_proba() and also got almost matching values.

A few notes:

1. predict_margin() and predict_proba() return only the margins and probabilities, respectively. However, the underlying libsvm routines return the predicted class as well. Having the class returned is useful, I think, so perhaps it could be returned as part of a pair:

    label, margins = classifier.predict_margin(...)
    label, probs = classifier.predict_proba(...)

2. There is no straightforward way to set weights right now. fit(...) has an nr_weight parameter, but not weights and labels. My test uses weights, so I set the underlying attributes appropriately, but that is hacky. If/when you do add full support for weights, it seems to me you don't need to have an nr_weight parameter for fit(), since it's the length of the weights or labels vector, which you know.

3. The weights and probabilities I get when using the command-line libsvm tools are slightly different from yours, starting in about the fifth decimal place. I am testing with libsvm-2.91. Either that makes a difference or perhaps a 64-bit value is being shortened to 32 bits somewhere. However, I don't see such a truncation either in your code or in libsvm. I am building and running on 64-bit Ubuntu.

4. I see what you mean in the previous thread about the probability values. I also get different class results for the same training data. We haven't found the probabilities that useful. The probabilities are always close to 1 or 0, but the margin values indicate less certainty than that. The probability models that are built are slightly different from the non-probability models, as you probably know.

Dan
From: Fabian Pedregosa <fabian.pedregosa@in...> - 2010-05-10 08:40:43

Dan Halbert wrote:
> On Friday, May 7, 2010 5:16am, "Fabian Pedregosa"
> <fabian.pedregosa@...> said:
> > Dan Halbert wrote:
> > > On 5/6/2010 7:19 AM, Fabian Pedregosa wrote:
> > >> I'll do that as I'm fairly familiar with this code, but I'd like you to
> > >> test it afterwards.
> ...
> > It should be working if you check out the latest git master from
> > sourceforge, but it's not extensively tested.
> >
> > As for predict_proba, it should return the probability estimates (it
> > wraps svm_predict_probability), but I'm having a hard time to interpret
> > the results (see a message with subject "returning probabilities from
> > SVC.predict" in this mailing list)...
>
> Thanks for the quick coding! I have tested predict_margin() against my
> own standard test case (using a linear kernel), and I get almost
> matching values ("almost": see 3 below). So it seems to work just fine.
> I also tested predict_proba() and also got almost matching values.
>
> A few notes:
> 1. predict_margin() and predict_proba() return only the margins and
> probabilities, respectively. However, the underlying libsvm routines
> return the predict class as well. Having the class returned is useful, I
> think, so perhaps it could be returned as part of a pair:
>
> label, margins = classifier.predict_margin(...)
> label, probs = classifier.predict_proba(...)

I thought about this, but label would be redundant, as you can extract it from margins and probs (with > 0 and np.argmax respectively). Anyway, I'm not totally convinced about this, and if there's a really good reason for returning label I'll change my mind.

>
> 2. There is no straightforward way to set weights right now. fit(...)
> has an nr_weight parameter, but not weights and labels. My test uses
> weights, so I set the underlying attributes appropriately, but that is
> hacky. If/when you do add full support for weights, it seems to me you
> don't need to have an nr_weight parameter for fit(), since it's the
> length of the weights or labels vector, which you know.

True, that should be done. I've set up an issue [1] so that I don't forget.

>
> 3. The weights and probabilities I get when using the command-line
> libsvm tools are slightly different than yours, starting in about the
> fifth decimal place. I am testing with libsvm-2.91. Either that makes a
> difference or perhaps a 64-bit value is being shortened to 32 bits
> somewhere. However, I don't see such a truncation either in your code or
> libsvm. I am building and running on a 64-bit Ubuntu.

I'll take a look into that. I did get slightly different results on different architectures, which I attributed to the use of clib's random function in libsvm, but this does not explain why we are getting different values from the command line and from the library. In any case, it should be documented. I raised another issue for this [2].

>
> 4. I see what you mean in the previous thread about the probability
> values. I also get different class results for the same training data.
> We haven't found the probabilities that useful. The probabilities are
> always close to 1 or 0, but the margin values indicate less certainty
> than that. The probability models that are built are slightly different
> than the non-probability models, as you probably know.

Thanks for the info.

>
> Dan

[1] https://sourceforge.net/apps/trac/scikitlearn/ticket/55
[2] https://sourceforge.net/apps/trac/scikitlearn/ticket/56
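
The suggestion above (recover the label from the margin with "> 0", or from the probabilities with an argmax) can be sketched in plain Python. This is only an illustration: the function names, the label tuples, and the convention that a positive margin selects the second label are assumptions made here, not the scikits.learn or libsvm API, and the sign convention in a real model depends on how libsvm ordered the classes.

```python
def labels_from_margins(margins, labels=(0, 1)):
    """Binary case: choose labels[1] for a positive margin, labels[0] otherwise.

    Which class a positive margin stands for is an assumption of this sketch.
    """
    return [labels[1] if m > 0 else labels[0] for m in margins]


def labels_from_probs(probs, labels):
    """Choose, for each row, the label with the largest probability (an argmax)."""
    picked = []
    for row in probs:
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest entry
        picked.append(labels[best])
    return picked


# Hypothetical margins and probability rows, not output from a real model.
print(labels_from_margins([-0.35, 0.35]))
print(labels_from_probs([[0.9, 0.1], [0.2, 0.8]], labels=("a", "b")))
```

With numpy arrays the same two steps would be a comparison against zero and np.argmax along the class axis, as mentioned in the reply.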
From: Dan Halbert <halbert@ha...> - 2010-05-10 19:05:52

On Monday, May 10, 2010 4:40am, "Fabian Pedregosa" <fabian.pedregosa@...> said:
> > label, margins = classifier.predict_margin(...)
> > label, probs = classifier.predict_proba(...)
> I thought about this, but label would be redundant as you can extract
> them from margins and probs (with > 0 and np.argmax respectively).
> Anyway, I'm not totally convinced about this, and if there's a really
> good reason for returning label I'll change my mind.

I agree it is redundant, if convenient. I don't mind finding the label as you suggest.

> > 3. The weights and probabilities I get when using the command-line
> > libsvm tools are slightly different than yours, ...
> I'll take a look into that. I did get slightly different results on
> different architectures, which I attributed to the use of clib's random
> function in libsvm, but this does not explain why are we having
> different values on command line and on library.

I have been debugging this today. On a very simple dataset, I get these differences for a margin value:

    scikits.learn:                                         1.04081633          #1
    libsvm-2.91 python interface (new ctypes interface):   1.0408163265306118  #2
    libsvm-2.91 command line:                              1.040814693878      #3

The scikits.learn value (#1) is a rounded version of the libsvm python value (#2), so I think scikits.learn is shortening to 32 bits somewhere. The command-line value (#3) is further off.

Using libsvm's python interface, I have narrowed down the #2 vs #3 difference to the model changing when it is saved to a file and read back in. In other words, passing the model through libsvm's routines svm_save_model() and svm_load_model() produces result #3. If the model stays in memory, #2 is the result.

So #1 vs #2 may be a scikits.learn issue. The other is a libsvm issue, and I will investigate further and report any issues to the libsvm authors.

Dan
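
One way to tell real 32-bit truncation apart from a value that is merely printed with fewer digits is to round value #2 to a 32-bit float and compare the printed forms. The sketch below uses only the standard library; to_float32 is a helper name made up for this example, and the eight-decimal format is an assumption about how the shorter value was displayed.

```python
import struct


def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


full = 1.0408163265306118   # value #2 quoted in the message
as_f32 = to_float32(full)   # what the value would become if stored in 32 bits

print("%.8f" % full)     # the 64-bit value shown with eight decimals
print("%.8f" % as_f32)   # a truly truncated value, same display format
print(repr(full))        # every digit the 64-bit value carries
```

If the eight-decimal form of the 64-bit value already reproduces #1 while the 32-bit form does not, the difference between #1 and #2 is a display effect and not a loss of precision.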
From: Dan Halbert <halbert@ha...> - 2010-05-10 20:39:23

On Monday, May 10, 2010 3:05pm, "Dan Halbert" <halbert@...> said:
> Using libsvm's python interface, I have narrowed down the #2 vs #3 difference to
> the model changing when it is saved to a file and read back in. In other words,
> passing the model through libsvm's routines svm_save_model() and svm_load_model()
> produces result #3. If the model stays in memory, #2 is the result.

What I wrote above is not quite correct. The problem is that the model produced by the command-line executable "svm-train" is slightly different from the model produced by doing svm_train() and then doing svm_save_model(). The difference is in the indices of the model values.

For example, I generated a linear-kernel model with this training data:

    1 1:1 2:2 3:3
    1 1:4 2:5 3:6

The file produced by "svm-train" is:

    svm_type c_svc
    kernel_type linear
    nr_class 2
    total_sv 2
    rho 2.33333
    label 1 1
    nr_sv 1 1
    SV
    0.07407407407407407 1:1 2:2 3:3
    0.07407407407407407 1:4 2:5 3:6

The file produced by doing svm_train() on the data above and then doing svm_save_model() is:

    svm_type c_svc
    kernel_type linear
    nr_class 2
    total_sv 2
    rho 2.33333
    label 1 1
    nr_sv 1 1
    SV
    0.07407407407407407 0:1 1:2 2:3
    0.07407407407407407 0:4 1:5 2:6

Notice the 1-based indexing in the first model file, and the 0-based indexing in the second one. This is enough to cause the decision values to be different when doing svm_predict.

I will stop cluttering up the list with this stuff now and take it to a more general libsvm forum.

Dan
From: Fabian Pedregosa <fabian.pedregosa@in...> - 2010-05-11 16:42:40

Dan Halbert wrote:
> On Monday, May 10, 2010 3:05pm, "Dan Halbert" <halbert@...> said:
> > Using libsvm's python interface, I have narrowed down the #2 vs #3
> > difference to the model changing when it is saved to a file and read
> > back in. In other words, passing the model through libsvm's routines
> > svm_save_model() and svm_load_model() produces result #3. If the model
> > stays in memory, #2 is the result.
>
> What I wrote above is not quite correct. The problem is that the model
> produced by the command-line executable "svm-train" is slightly
> different than the model produced by doing svm_train() and then doing
> svm_save_model(). The difference is in the indices of the model values.
>
> For example, I generated a linear-kernel model with this training data:
> 1 1:1 2:2 3:3
> 1 1:4 2:5 3:6
>
> The file produced by "svm-train" is:
>
> svm_type c_svc
> kernel_type linear
> nr_class 2
> total_sv 2
> rho 2.33333
> label 1 1
> nr_sv 1 1
> SV
> 0.07407407407407407 1:1 2:2 3:3
> 0.07407407407407407 1:4 2:5 3:6
>
> The file produced by doing svm_train() on the data above and then doing
> svm_save_model() is:
>
> svm_type c_svc
> kernel_type linear
> nr_class 2
> total_sv 2
> rho 2.33333
> label 1 1
> nr_sv 1 1
> SV
> 0.07407407407407407 0:1 1:2 2:3
> 0.07407407407407407 0:4 1:5 2:6
>
> Notice the 1-based indexing in the first model file, and the 0-based
> indexing in the second one. This is enough to cause the decision values
> to be different when doing svm_predict.
>

Thanks for your investigations. For the result from scikits.learn, be aware that numpy shows fewer decimal places than it actually stores, so it might be showing the exact same result ...

As for this, I've been looking at the source code of libsvm's wrappers, and I think that maybe if you change line 127 of python/svm.py from

    svmc.svm_node_array_set(data,j,k,x[k])

to

    svmc.svm_node_array_set(data,j,k+1,x[k])

it will correctly produce 1-based indices, which is how it should be (from the README). Can you confirm this?

Cheers,

> I will stop cluttering up the list with this stuff now and take it to a
> more general libsvm forum.
>
> Dan
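
To make the index-base question concrete, here is a small standard-library sketch of converting a dense vector to (index, value) pairs with a selectable base, plus a sparse dot product as used by a linear kernel. The names to_sparse_pairs and sparse_dot are hypothetical helpers written for this illustration; they are not functions from libsvm or scikits.learn, and the vectors simply reuse the numbers from the training data quoted above.

```python
def to_sparse_pairs(x, base=1):
    """Return (index, value) pairs for the non-zero entries of a dense vector.

    base=1 numbers the first feature 1, as in libsvm's data-file examples;
    base=0 numbers it 0.
    """
    return [(k + base, v) for k, v in enumerate(x) if v != 0]


def sparse_dot(a, b):
    """Dot product of two sparse vectors given as (index, value) pairs."""
    b_map = dict(b)
    return sum(v * b_map.get(i, 0) for i, v in a)


sv = [1, 2, 3]   # a support vector
x = [4, 5, 6]    # a test vector

# Same base on both sides: the indices line up whichever base is used.
print(sparse_dot(to_sparse_pairs(sv, 1), to_sparse_pairs(x, 1)))
print(sparse_dot(to_sparse_pairs(sv, 0), to_sparse_pairs(x, 0)))

# Mixed bases: features are paired with the wrong partners.
print(sparse_dot(to_sparse_pairs(sv, 1), to_sparse_pairs(x, 0)))
```

The point of the sketch is only that the base has to agree between the model and the data being scored; it says nothing about which base a particular libsvm interface uses.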
From: Dan Halbert <halbert@ha...> - 2010-05-11 21:15:14

On Tuesday, May 11, 2010 12:41pm, "Fabian Pedregosa" <fabian.pedregosa@...> said:
> for the result from scikits.learn, be aware that numpy shows less
> decimal places that it actually stores, so it might be showing the exact
> same reusult ...

Aha, you are right. If I extract the value from the numpy array and ask for its repr(), it shows the full 64-bit value, which matches the other value.

> [0-based vs 1-based indexing; your suggested change to libsvm python interface]
> svmc.svm_node_array_set(data,j,k+1,x[k])
>
> it will correctly make 1-start index, which is how it should be (from
> the README). Can you confirm me this ?

I have figured this out. The 0-based vs. 1-based question is a red herring. The differences I saw in decision values are actually due to the model being saved slightly imprecisely to a file and then read back in. The support vector values are printed out in full precision, but "rho" and other values are printed with %g format, which gives only six digits of precision. So when the model is read back in, those values will be slightly different from the original in-memory values. Command-line "svm-train" followed by "svm-predict" exercises this problem. Ideally, svm_save_model() should print out all the model values in full precision.

0-based vs 1-based doesn't matter, as long as the model and the test data match. It is true that libsvm uses 1-based indexing for all its examples, but internally 0-based works fine as well. (I see how you consistently make things 1-based in dense_to_sparse(...)).

The libsvm python interface will use 0-based indices if you use dense data. That should probably be documented better in the libsvm README, as there's an example which misleadingly shows a sparse 1-based dataset and the equivalent 0-based dense dataset.

Side note: Have you noticed libsvm-dense, available from the libsvm folks as well? It is the libsvm code with a few added #ifdef's to store the problem and model in a simpler vector format. The README says it can be 1.52 times faster for dense data.

Dan
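
The %g effect described in this message is easy to reproduce without libsvm. The sketch below writes a float as text and parses it back, once with %g and once with 17 significant digits. The roundtrip helper and the value 7/3 are illustrative assumptions; the real rho of the model is only known in its six-digit printed form.

```python
def roundtrip(value, fmt):
    """Format a float with a printf-style format string, then parse it back."""
    return float(fmt % value)


rho = 7.0 / 3.0   # illustrative value only, not taken from a real model

lossy = roundtrip(rho, "%g")       # %g keeps six significant digits by default
exact = roundtrip(rho, "%.17g")    # 17 significant digits round-trip a 64-bit float

print("%g" % rho)       # 2.33333
print(lossy == rho)     # False
print(exact == rho)     # True
print(abs(lossy - rho))
```

A value written with %g and read back therefore differs from the in-memory value from about the sixth significant digit onward, which is the same order of magnitude as the discrepancy reported between the in-memory and command-line results.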
From: Fabian Pedregosa <fabian.pedregosa@in...> - 2010-05-12 08:22:40

Dan Halbert wrote:
> On Tuesday, May 11, 2010 12:41pm, "Fabian Pedregosa"
> <fabian.pedregosa@...> said:
> > for the result from scikits.learn, be aware that numpy shows less
> > decimal places that it actually stores, so it might be showing the exact
> > same reusult ...
>
> Aha, you are right. If I extract the value from the numpy array and ask
> for its repr(), it shows the full 64-bit value, which matches the other
> value.
>
> > [0-based vs 1-based indexing; your suggested change to libsvm python
> > interface]
> > svmc.svm_node_array_set(data,j,k+1,x[k])
> >
> > it will correctly make 1-start index, which is how it should be (from
> > the README). Can you confirm me this ?
>
> I have figured this out. The 0-based vs. 1-based is a red herring. The
> differences I saw in decision values are actually due to the model being
> saved slightly imprecisely to a file and then read back in. The support
> vector values are printed out in full precision, but "rho" and other
> values are printed with %g format, which gives only six digits of
> precision. So when the model is read back in, those values will be
> slightly different than the original in-memory values. Command-line
> "svm-train" followed by "svm-predict" exercises this problem. Ideally,
> svm_save_model() should print out all the model values in full precision.
>
> 0-based vs 1-based doesn't matter, as long as the model and the test
> data match. It is true that libsvm uses 1-based indexing for all its
> examples, but internally 0-based works fine as well. (I see how you
> consistently make things 1-based in dense_to_sparse(...)).
>
> The libsvm python interface will use 0-based if you use dense data. That
> should probably be documented better in the libsvm README, as there's an
> example which misleadingly shows a sparse 1-based dataset and the
> equivalent 0-based dense dataset.
>
> Side note: Have you noticed libsvm-dense, available from the libsvm
> folks as well? It is the libsvm code with a few added #ifdef's to store
> the problem and model in a simpler vector format. The README says it can
> be 1.52 times faster for dense data.

Thanks, I did not know of this project! It looks really interesting, since right now we do not support sparse arrays as input and we lose a lot of time converting from numpy (dense) arrays to libsvm sparse arrays.

In the long term, we should be able to convert scipy sparse matrices to the libsvm sparse structure, and use libsvm-dense when the input is a plain numpy array. I've set up a ticket for this [1].

fabian

[1] https://sourceforge.net/apps/trac/scikitlearn/ticket/57

>
> Dan