PyML - a Python Machine Learning package
PyML is a flexible Python framework for using various classification
methods including Support Vector Machines (SVM).
Only Unix/Linux/OS-X is supported.
The numpy package http://sourceforge.net/projects/numpy
A setup.py script is provided so installation follows the standard python idiom:
>>> python setup.py build
>>> python setup.py install
See the Installation section in the tutorial for a trouble-shooting guide.
Tutorial - see doc/tutorial.pdf
Module documentation - see doc/autodoc/public/index.html
Usage examples - see PyML/demo/pyml_test.py
- Fixed aggregate.Aggregate (eliminated argument that is not being
used from the copy constructor)
- Fixed evaluators.roc (thanks to Mark Rogers for the fix)
- Fixed ridgeRegression.RidgeRegression - the regression part wasn't
- Added ; as a possible delimiter for parsing csv files.
- Added mean absolute deviation (MAD) to regression results statistics
- Updated the PyVectorDataSet container so that the Standardizer
object works correctly
- When constructing PyVectorDataSet from a list/array, it
verifies/converts the input to a numpy array
- fixed roc.plotROCs that plots multiple ROC curves given a
list/dictionary of results objects
- The confusion matrix has been transposed, where rows correspond to
the true labels, and columns to predicted lables. This change is for
consistency with how I have seen it in other places (including the
textbook I am using in my ML course).
- Containers can now be indexed with either integers or lists/numpy
arrays: data[i] will return example i, and data[I], where I is a
list or array will create a copy of the data that includes those
- ridgeRegression.RidgeRegression can now be used for classification and
for regression (previously only classification). The training has
been speeded up as well.
- Added vectorDatasets.PyVectorDataSet which is a container that holds
its data in a Numpy array. This is useful for writing classifiers in
- Fixed an issue with computing AUC50 style accuracy measures.
- Fixed the PairData container (thanks Fayyaz for the bug report)
- Fixed the spectrum_data generator - it produced k-mers of a size up
to 1 larger than what you asked for.
- Fixed issue in saving results objects produced using model selector
(thanks Fayyaz for pointing out the problem)
- Speed up of the RidgeRegression classifier
- RFE now works with the liblinear solver
- Fixed compilation issue under gcc 4.6.1 (tracker ID: 3361193)
- Fixed issue with VectorDataSet (tracker ID: 3364921)
- Updated tutorial
- Added wrapper for liblinear linear SVMs. If you only need a linear
SVM, these solvers offer a very significant speedup for large
SVM(optimizer = 'liblinear', loss = 'l2') for l2 loss SVM
SVM(optimizer = 'liblinear', loss = 'l1') for l1 loss SVM
- Added containers.setData.SetData - a dataset container where each
example is a set of objects.
- Chris Hiszpanski reported an issue and fix for ROC calculation that
would fail for a corner case.
- When creating a dataset with numeric labels, the 'numericLabels'
keyword argument was not passed on to the Labels constructor when
creating such a dataset from arrays/lists
- the "stratifiedCV" method of SVR was being called by model
selection. That was addressed by defining it to be regular
cross-validation (stratified CV doesn't make sense for regression).
Better solution would be to have a separate base class for regression.
- Can create an empty SparseDataSet or VectorDataSet, and then
populate it on the fly with features (using its addFeatures method).
- Fixed the classify method of the KNN classifier
(cross-validation worked fine).
- Fixed issue with the __repr__ of the Results objects that was giving
an error when results of an unlabeled dataset were being displayed.
- Added positional kmer dataset creation (an implementation of
Sonenburg et al's weighted degree kernel) that represents the
- Reworked the SequenceData container
- Fixed a bug in Labels.oneAgainstRest (thanks to Marcel Luethi for
- Fixed a bug in feature selection - feature IDs are now correctly retained
- Fixed bug in SparseDataSet's addFeatures method
- Removed dependency on scipy
- Modified the way svm models are saved/loaded. Now you can save
any SVM model. Note that the interface for loading/saving has
OneAgainstRest classifiers also support save/load.
- Fixed the import error from 0.7.3
- Fixed bug 2971509 reported by Vebjorn Ljosa ( ljosa )
- Updated the requirements for PyML (python 2.5 and up).
- Increased the flexibility of the Aggregate container
- ROC curve plotting: when doing cross-validation, PyML
now plots an ROC curve produced by averaging ROC curves
of individual folds.
- AUC calculation gives the correct result when the discriminant
scores are concentrated on a few values.
- the assess module has been split into several modules under
the evaluators directory
- Fixed bug in Aggregate container (in the case of a weighted
combination of datasets) Bug report by Eithon Cadag
- Stephen Picolo found issue when using feature selection with the VectorDataSet
container. Use SparseDataSet instead.
- added k-means clustering (PyML.clusterers.kmeans)
- corrected handling of nan's
- fixed an import statement in classifiers/modelSelect.py
- Compilation error in Kernel.h fixed (shows in gcc 4.3)
- preproc.pca - updated code to new version of numpy
- demo2d.scatter - improved interface to make it more useful
- a restructuring of the module structure. see tutorial for details
- small changes in demo2d
- linear kernel wasn't normalizing properly
- myio.myopen now handles bz2 files as well
- myio.myopen does not open in universal newline support mode
by default - it throws the pyml parser off
- improved the way SequenceData reads fasta files -- it can now
extract labels using a user-supplied function that extracts
the id and label out of the fasta header.
- added a method for generating spectrum kernels
- the svm cacheSize for libsvm wasn't being set up correctly.
thanks to David Doukhan for the bugfix.
- added an 'addFeatures' method to the SparseDataSet and VectorDataSet
- cleaned up the VectorDataSet container
- a dataset's attachLabels now takes either a labels object or a file name.
- when saving a dataset, the format (csv or sparse) is chosen according to
the type of the container (sparse containers save to sparse format and
non-sparse containers save to csv).
- improved the discussion of model selection in the tutorial
- more options in demo2d and a change of format
- Fixed bug in KNN.cpp where in some cases the decision function value
is sometimes wrong (thanks to Martial Hue the bug report and a fix).
- Fixed a bug in loading non-linear models (thanks to shiela reynolds
for the bug report).
- construction from an array is now in the form
datafunc.VectorDataset(x) instead of datafunc.VectorDataset(X=x)
- small changes in demo2d (with the help of Adam Labadorf)
- Updates to SequenceData, including the addition of the positional-kmer-kernel
and a save method.
- Datasets now maintain the norms of the examples, so that computation of
cosine and Gaussian kernels is much faster.
- a 'sample' module has been added with function for creating samples from a
dataset (with and without replacement).
- fixed a segfault in the Aggregate container
- streamlined SWIG wrapper files
- added __init__.py in the ext directory
- PyML now distributed under LGPL license.
- The list of labels input argument for the Labels constructor can now be
any list-like object.
- Fixed bug in loading a saved ResultsList.
- removed usage of numpy.nonzero in assess.py
- debugged svm.save/loadSVM
- debugged demo2d
- wrappers now use swig 1.3.31
- added release number accessible as PyML.__version__
- fixed a bug in datafunc.save
- cleaned up some of the functions in the datafunc module
- when performing CV it could happen that a class would not be represented
in one of the folds, creating a cryptic error/segfault. now this is
- Corrected compilation bug that only showed up in new gcc compilers (version 4.1.1).
(Bug reported by Michael Hamilton).
- Compatibility with latest numpy 1.0 (array codes, numpy.nonzero)
- Rewrote the constructor of 'datafunc.Labels'.
- The 'translate' method of VectorDataSet went missing, and has been restored.
- Cleaned up the libsvm wrapper.
- Dealt with bug in the C++ dataset containers introduced in 0.6.9
- added VectorDataSet -- a C++ non-sparse vector container.
this resulted in some renaming of the dataset containers.
- updated the libsvm wrapper -- now wrapping version 2.82.
to simplify the update of the wrapper, the pure python datasets are
no longer supported for training svm classifiers (it's easy to convert!)
- removed references to Numeric from myio.py
- better handling of nan's as a result of prediction
- a few more tweaks of the RegressionResults class
- reworked demo2d to work with version 0.87 of matplotlib
- the check method of the sparse parser didn't handle correctly files
with only a single line (thanks to Jian-Qiu for the pointer and fix).
- reworked the Results containers, including improved support for regression
- corrected bug in reading in data with numeric labels (the numericLabels flag wasn't working)
- corrected compilation issue in KNN.h (showed up under fedora core 5)
- changed the directory structure of the package; setup.py was updated accordingly.
- corrected bug in RFE that affected cases where the number of desired features
was greater than the number of features in the dataset.
- corrected a couple of import issues related to the transition from Numeric to numpy
- numpy is now used instead of Numeric
- featureCount/featureCounts for the python containers
- rewrote RegressionResults (thanks to Jian Qiu for his contribution).
still some work to be done to make it really useful
- datasets can be read from gzipped files.
- took care of some compilation warnings.
- assess.significance now also reports the medians of the statistics.
- KNN now accepts a generic dataset container.
- there is no longer a need to explicitly set an optimizer for non-vector
- cv and stratified CV can now do partial CV.
- cleaned up copy constructor of KernelData, and unified it with the standard
- fixed bug in using libsvm solver with 'Cosine' kernel.
- assess.loadResults2 / assess.saveResultsObjects now handle lists of lists of
objects, or dictionaries of lists of results objects
- fixed bug in getKernelMatrix; the last element of the diagonal was missing.
- fixed bug in assess.loo
- ROC score is not shown for multi class problems.
- Platt object now tries a couple of times before it raises an exception
- ker.sortKernel2 -- sorts a kernel matrix according to a given list of patterns
- when non-normalized roc curve is computed the normalized area is returned.
- added multi.allOneAgainstRest for performing all one-against-rest classification
- made sure that when making a dataset from an array, that it's a Numeric array
- revamping of the ridge regression classifier (it's still pure python)
- numFolds in cv and stratifiedCV can also be set using a keyword parameter
- added a baseClassifiers.IteratorClassifier from which modelSelection.Param and
- added FeatureSelectAll -- for a backward/forward selection method, this looks at
accuracy as a function of the number of features.
- corrected bug in setting of k in KNN
- corrected bug in Gaussian kernel for the pure python datasets
- demo2d.py -- a module for playing with data in two dimensions
- more options in Platt classifiers (see documentation for details)
- added a 'split' method to the datafunc module; it randomly splits a dataset into two parts
- corrected bug in nCV (setting of number of iterations; this is now a keyword argument)
- Better parsing of pairData input files
- Corrected refcount initialization in KernelData
- Corrected bug in CSVparser.readLabels
- Corrected bug in Results object in a case of one fold
- Merged the KNN and KNNC classes so that the user doesn't
need to worry about that.
- Corrected bugs in CSV parser
autodetect of a header row has proven problematic.
if your CSV file has a header row you now need to specify that
as a keyword argument (headerRow = True). the header row allows you
to specify feature IDs (think of it as an excel spreadsheet).
- corrected bug in construction of a dataset from a Numeric array
- More cleaning up of the Results object
- replaced matlab with pylab
- added a getKernel method to the DataSet interface
- added a ker.showKernel function that displays a kernel matrix
- corrected bug in saving results object (not enough attributes were saved)
- Added save method for Platt classifier
- PyML no longer works with python 2.2 (sub-classing of the 'list' object
has a problem in that version of python).
- fixed problem in Aggregate -- it broke down when the datasets from which
it was constructed went out of scope. other improvements to this class.
- fixed bug for reconstructing saved nonlinear SVM.
- Saved SVM had a problem with unlabeled data since it didn't reload its
classLabels attribute -- bug fixed.
- Results object has been re-designed.
The Results object IS a list of Results_
objects that each provide the results on a different part of the
data as is the case when doing CV.
ROC scores computed using CV were to computed in previous versions by
pooling the decision function values and computing an pooled ROC curve.
This has been changed to an average ROC score that is the average
of the ROC areas from the different folds.
- Results object has a 'toFile' method for saving to an ascii file
rather than pickling the object
- Added a ResultsList object (used for combining CV results on the
same dataset). nCV is now a classifier method and returns a
ResultsList object rather than a list of Results objects
- Gnuplot related functions have been removed. If you use them, just
get them from from an older version.
- the 'xml' option in saving a results object has been discontinued.
- confusionMatrix attribute in a Results object is now a list rather than
a Numeric array for safety in pickling
- the sparse parser has a 'sparsify' keyword argument that is "False"
by default. "sparsify" means not add features whose value in the file
- improved the way ModelSelector recognizes selection via roc scores
- composite.Platt2 - a classifier for rescaling discriminant values into
probability estimates based on Platt's method, but with better numerical
stability (contributed by Tobias Mann).
- addFeature method for the sparse dataset containers
- registered attributes: the user can register attributes to a container;
those attributes are then copied on copy construction
- eliminated the "aux" attribute from a dataset -- no need for that now
that registeredAttributes are available
- the interface for "testing" of a dataset has changed: the input for
the test function includes the training data. The "test" function
of the classifier calls the "test" function of the dataset; therefore
the training data is now saved in the classifier
- corrected error in setup.py in definition of pairdataset extension
- "save" method of SVM objects now working. Thanks to Olav Zimmerman
for his input.
- timing information for training and testing a classfier are saved in the
Result object's log or cvLog attributes
- Corrected bug in parsing space delimited files
- reading files in gist format (including label files)
- added datafunc.sample for sampling a dataset
- CV/stratifiedCV can save intermediate results
- the Labels constructor and the dataset constructors take a keyword
argument 'positiveClass' that tells the Labels constructor which of
the class labels is the positive class. If the class labels are
+1 and -1 or 1 and -1, the positive class is detected automatically.
- assess.resultsByFold -- Given a Results object obtained by CV, divide
it into a list of Results objects for each fold.
- PyML is now supposed to work with 64-bit machines:
the problem was that the hash values generated by python are 64-bit
numbers. These hash values are used as feature IDs in the SparseCDataSet,
so those had to be changed into "long" instead of "int". Using a "myint"
proved problematic since arbitrary templates for vectors are not yet
supported in swig.
- Plugged a couple of memory leaks
- corrected bug in roc50 calculation
- corrected bugs related to SVR
- gist and gradient descent solvers for svm formulation without a bias
- Results object has an attribute foldStart so that the patterns between
in the range [foldStart[i]:foldStart[i+1]] were in cross-validation fold
number i where i is in range(0,numFolds)
- Construction of a dataset from a Numeric array is now possible
- Improvements in assess.plotROC and assess.plotROCs functions
- Param and ParamGrid are iterators:
Param.cv, Param.stratifiedCV, Param.loo etc return an iterator:
p = Parm(...)
for result in p.cv(...) :
- datafunc.BaseDataSet renamed datafunc.VectorDataSet, which is what it
is. datafunc.BaseDataSet is now a more generic base class
- a dataset container has a train method that calls the dataset's
'trainingFunc' method with the keyword arguments it receives.
a classifier first trains the data before training itself.
(BaseClassifier calls the data's train method).
naturally, there is also a "test" method.
assess.test calls a dataset's test method with the keyword arguments it
received. if you redefine a classifier's "test" methos be sure to call
the dataset's test method.
- a dataset's train/test/cv methods are called as: self.method(self, **args)
rather than self.method(self, *options, **args)
- assess.cvFromFolds -- perform cv when training and testing examples for
each fold are given
- assess.stratifiedCV now can receive a list of patterns that are to serve
as training examples in all CV folds
- corrected bug in cosine kernel
- corrected bug in "normalize" function in the case of zero vectors.
- ker.combineKernels -- add or multiply kernels (instead of ker.addKernels)
- ker.sortKernel plus a few other more esoteric functions for dealing with
- patternID created by default for a dataset
- modelSelector object now also supports selection by area under the roc/rocN
- aggregate data object -- combines several c++ data objects into a single
dataset; its dot product is the sum of the dot products of the individual
- corrected bug in assess.scatter
- corrected bugs reported by users in ParamGrid and csv parser
- assess.plotROCs -- plotting of multiple ROC curves for easy comparison
- corrected bugs in PairData
- better save method for vector-based containers
- corrected bug in getPattern method of SparseCDataSet
- datafunc -- labels object remembers what classes it was made from
in the case of copy construction.
C++ part received a major upgrade:
- new level of abstraction for dataset objects -- generic interface for dataset
and kernel objects allows "plug and play" data and kernel objects.
These work with the PyML SMO object for SVM training (code based on libsvm).
- Cleaner wrapping of the C++ dataset container
- KernelData C++ container for an in-memory kernel matrix.
The dotProduct function returns the appropriate entry of the kernel matrix.
Like all containers, a kernel function can be defined on top of the dotProduct.
- Results object can be constructed from a Container object
- data object supports an "aux" attribute which is a list which stores auxiliary
information on each pattern.
uses -- a value of "C" for svm training.
- write kernel to a file, now directly in C++ (kernelWrap.py)
- adding kernels that are stored in a file (kernelWrap.py)
- Data container for data whose patterns are composed of pairs of objects
- removed some "warts" from the code with the help of pychecker
- updated documentation
PyML 0.43 (unreleased)
- preprocessing object
- save method for svm classifiers (not really working yet)
- improved save method for dataset containers
- corrected bug in ModelSelector object (CV did not work)
- In some cases a decision function may return a 'nan'. Python has
some peculiarities in handling nan. This situation is now handled.
- improved CSV parser; interface modified (indexing starts at 0).
- succ feature score
- Results object computes rocN scores where N is 50 by default
support for One-class SVM and SVM regression
speedup of reading csv files
memory save mode for linear svm, some memory saving in general and plugging
a couple of memory leaks
memory still leaks from the swig interface -- help would be apprecited.
Correction of bugs introduced in 0.4
Addition of a sparse dataset implemented in C++ - PyML runs several times
faster in some cases.
Use the KNNC nearest neighbor classifier with this container.
RFE and Filter methods do automatic selection of the number of features.
Multiplicative update (a.k.a zero-norm) feature selection.
Changes in the interfaces:
All the dataset classes now have a Kernel automatically attached to them, so
a dataset is actually a feature-space.
There is now one SVM class - the object knows what type of SVM (linear,
nonlinear) to train according to the kernel attached to the data object on
which it is trained.
Changes in the classifier interface:
The "test" method of a classifier receives a dataset and a list of patterns
on which to test.
The "classify" and "decisionFunc" members of a classifier receive as input
a dataset and the id of a pattern rather than the pattern itself.
A DataSet now supports kernel/similarity data by:
>>> d = datafunc.DataSet(K=K, L = L)
where K is a Numeric 2-d matrix, and L is a list of labels (optional).
Construction of a dataset from a Numeric array X is now:
>>> d = datafunc.DataSet(X=X, L = L)
This help support usage of precomputed kernels:
If you install William Stafford Noble's Gist package, you can train and
test SVMs on any kernel matrix that was computed in advance.
Changes in the "Result" class; also added better support for saving objects
so that they are recoverable even when changes are made in the code.
The Platt classifier (in composite.py) - is basically a translation of the spider
code; it suffers from some numerical instability that sometimes makes
it break down.