This is MOLL, the Machine Learning Library, version 0.3 beta.
        ----


Features
--------
* multi-core support
* run and analyse everything from Python
* very easy to test different algorithms and parameters
* tested on large data sets (around a gigabyte each)
* currently implemented algorithms: MLP, GA, GP, ESN and RBF


History
-------
I originally created a few Python scripts that let me run ML experiments using multiple ML libraries at once. Then came the need to run them in parallel, so I added some result-analysis tools, and at that point I decided to call it a library. The company I was doing research for, RSJ Invest, was kind enough to allow releasing the code to the public under the Apache license. After all, we were using other libraries, so it is only natural to give something back to the community.


Status
------
This is the first public release. From the user's point of view it has some rough edges, but it is usable. It has actually been used to run quite a few large-scale experiments (gigabytes of training data).
From the developer's point of view, there are parts that deserve a rewrite, especially the dataset pipelines and caching. In the likely case that I use the library for future projects, I will keep updating it to fit my needs. Should anyone else in the community want to add features or rewrite some parts, I will add documentation where needed.


License
-------
Apache License 2.0 - see the LICENSE file


Contents
--------
python/   - the core Python sources and wrappers for various ML models
java/     - Java sources for the ECJ wrapper
examples/ - some examples


Installing
----------
There's no Python egg installer or distribution-specific package for Moll yet, so grab the source and sort out the dependencies you need:

Core dependencies:
	NumPy
	multiprocessing (or Python >= 2.6, which ships it in the standard library)

Specific dependencies:
	ESN:
		Aureservoir (tested with SVN rev. 60), http://aureservoir.sf.net

	Genetic (GP / GA):
		ECJ (tested with v18), http://www.cs.gmu.edu/~eclab/projects/ecj/

	MLP nets:
		ffnet (tested with SVN rev. 272), http://ffnet.sf.net/

	Plotting:
		pylab / pygraphviz
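
A quick sanity check that the core dependencies are importable (just a check from the Python prompt, not part of Moll):

	import numpy
	import multiprocessing  # in the standard library since Python 2.6
	print numpy.__version__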

Moll was tested on Linux amd64 (Debian).


Directory structure
-------------------
 * python/   - the core Python source files
 * java/     - the Java source files
 * examples/ - Moll usage examples
 * analyse/  - result analysis utils
 * scripts/  - shell scripts for easier invocation of multiple experiments, joining the results etc.


Running
-------
Make sure moll and the needed dependencies are on your PYTHONPATH.
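
If you are running straight from the source checkout, one way to do that from inside Python (the path below is a placeholder; point it at the python/ directory of the source tree):

	import sys
	sys.path.insert(0, '/path/to/moll/python')  # placeholder: your checkout's python/ dir
	import moll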

Use moll.data to create a dataset. Probably the easiest is MatrixDataset, which simply wraps an ordinary numpy array, say arr. You can either create and train a model of your choice on that dataset directly:
	dataset = moll.data.MatrixDataset(arr, ins=1)  # ins = number of input columns; 1 matches the net below
	nn = moll.ml.nn.FFNET((1, 10, 1))              # 1 input, 10 hidden, 1 output neuron
	nn.train(dataset, iterations=1000, descent_algo='cg')
	output = nn.run(dataset.inputs)
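
Here arr is any 2-D numpy array. A toy regression set might be built like this (assuming the input columns come first, as the ins parameter above suggests):

	import numpy
	x = numpy.linspace(0, 2 * numpy.pi, 200)
	arr = numpy.column_stack((x, numpy.sin(x)))  # column 0: input, column 1: target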

.. or, have Hustler deal with the dirty work. In that case, the main difference is that you define the jobs not by creating datasets and models yourself, but by supplying the appropriate parameters:
	hustler = Hustler()
	kfold = CrossValidator(5, hustler, TrainJob)  # 5-fold cross-validation
	kfold.add_job(MatrixDataset, {'_arr': arr, 'ins': 1}, FFNET, {'topology': (1, 10, 1)}, {'iterations': 5000, 'descent_algo': 'tnc'})
	kfold.add_job(MatrixDataset, {'_arr': arr, 'ins': 1}, RBF, {'nodes': 15}, {})
	hustler.go()

As you might have noticed, we have employed the 5-fold cross-validator as well, so each job is actually run 5 times, once per fold of the dataset. By default, Hustler uses all available cores to run the jobs. The resulting models and errors end up in hustler.jobs. Voila!
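
A minimal sketch of looking at the results (the job attribute names below are assumptions, not a documented API; check the sources under python/ for the real ones):

	for job in hustler.jobs:
		print job.model, job.error  # assumed attribute names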


Todos
-----
See the TODO file.