Hi Nina,

What I am wondering is: the output of Java serializers is not compatible across versions of the object (e.g. any tiny change in the CDK may result in non-readable database content). Probably it is not of concern with your projects?

Java serialisation is compatible if you add/remove fields and recalculate the version ID, but it does fail if you rename things. This method is not susceptible to name changes: if, say, we rename the 'setLabel' method on IPseudoAtom to 'setAlias', you just need to change the reader to invoke 'setAlias' instead of 'setLabel'. The actual core data in the stream is the same and can still be read.

It is possible to make this fully version safe by tagging each marshal:

@Version("1.4.0")
public class Point2DMarshal_1_4_0 { }

@Version("1.4.1")
public class Point2DMarshal_1_4_1 { }

I've done this on other projects but this prototype was really for super quick storage where we still have the original files (e.g. you can always reload files into a database with a new version).
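To make the tagging idea concrete, here is a minimal sketch of how a reader could pick a marshal by the version stored ahead of the data. Everything in it is hypothetical: the @Version annotation, the PointReader interface, the registry, and the assumption that 1.4.0 stored floats while 1.4.1 stores doubles are made up for illustration and are not taken from the prototype branch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.HashMap;
import java.util.Map;

class VersionedMarshalDemo {

    // Hypothetical annotation; RUNTIME retention so it can be read reflectively.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Version { String value(); }

    // Hypothetical reader interface for a 2D point.
    interface PointReader { double[] read(DataInput in) throws IOException; }

    // Assumption for illustration: 1.4.0 stored coordinates as two floats.
    @Version("1.4.0")
    static class PointReader_1_4_0 implements PointReader {
        public double[] read(DataInput in) throws IOException {
            return new double[]{in.readFloat(), in.readFloat()};
        }
    }

    // Assumption for illustration: 1.4.1 stores coordinates as two doubles.
    @Version("1.4.1")
    static class PointReader_1_4_1 implements PointReader {
        public double[] read(DataInput in) throws IOException {
            return new double[]{in.readDouble(), in.readDouble()};
        }
    }

    static final Map<String, PointReader> READERS = new HashMap<>();

    // Registers a reader under the version named in its annotation.
    static void register(PointReader reader) {
        Version v = reader.getClass().getAnnotation(Version.class);
        READERS.put(v.value(), reader);
    }

    static {
        register(new PointReader_1_4_0());
        register(new PointReader_1_4_1());
    }

    // Reads the version tag first, then delegates to the matching reader.
    static double[] readPoint(DataInput in) throws IOException {
        String version = in.readUTF();
        PointReader reader = READERS.get(version);
        if (reader == null) {
            throw new IOException("No marshal registered for version " + version);
        }
        return reader.read(in);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF("1.4.0");   // version tag written ahead of the payload
        out.writeFloat(1.5f);
        out.writeFloat(2.5f);
        double[] p = readPoint(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(p[0] + ", " + p[1]);
    }
}
```

The tag only needs to be written once per stream rather than once per object, so the overhead should stay small.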

Cheers,
J

On 15 Jun 2012, at 15:49, Nina Jeliazkova wrote:


Interesting !

On 15 June 2012 17:15, John May <johnmay@ebi.ac.uk> wrote:
Hi all,

I've been thinking about this for a while and finally got down to writing a prototype this Wednesday. What I wanted to do was speed up the access of molecules (IAtomContainers) when accessing data that will only be used by the CDK. This could for example be retrieving structures from a MySQL/NoSQL database or transmitting between components as required by Knime. The point is not to provide a future-proof versioned format that other applications can read but to provide a fast, flexible and efficient (in size) way of storing and accessing molecules. Commonly we've been storing structures as MDL (not very good) or CML text in a database. An alternative is to store the bytes from Java's default 'ObjectOutputStream' or use a custom serialisation library such as Kryo. Kryo is very nice but I didn't really want to add a dependency for a relatively trivial task. You also have the problem that a format like CML/MDL may not store everything you want and the serialisation may store too much (do you really want to transmit ChemObjectListeners?).

Listeners (and other variables which it doesn't make sense to serialize) can be declared as transient; they will then not be serialized.

http://docs.oracle.com/javaee/5/api/javax/persistence/Transient.html
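A minimal sketch of the effect, using a made-up Molecule class rather than any CDK type. Note it relies on the Java language keyword 'transient', which is what ObjectOutputStream honours; as far as I know the javax.persistence.Transient annotation linked above is a JPA mapping annotation and does not by itself change default Java serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class TransientDemo {

    // Hypothetical stand-in for a chem object; not a CDK class.
    static class Molecule implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        // transient: skipped when writing, left as null when read back
        transient List<Runnable> listeners = new ArrayList<>();

        Molecule(String name) {
            this.name = name;
        }
    }

    // Serializes to memory and reads the object back.
    static Molecule roundTrip(Molecule m) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(m);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Molecule) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Molecule m = new Molecule("ethanol");
        m.listeners.add(() -> { });   // a listener that is not itself Serializable
        Molecule copy = roundTrip(m);
        System.out.println(copy.name + " listeners=" + copy.listeners);
        // prints "ethanol listeners=null"
    }
}
```

One consequence worth keeping in mind: field initializers do not run on deserialization, so a transient listener list comes back as null and has to be re-created lazily or in a readObject method.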

What I am wondering is: the output of Java serializers is not compatible across versions of the object (e.g. any tiny change in the CDK may result in non-readable database content). Probably it is not of concern with your projects?

Best regards,
Nina


 

The following prototype code is on branch in the package org.openscience.cdk.io.stream which can be found here: https://github.com/johnmay/cdk/tree/high-performance-serializer/src/main/org/openscience/cdk/io/stream

I defined two new interfaces, IChemObjectInput<C extends IChemObject> and IChemObjectOutput<C extends IChemObject>, that indicate binary IO. Java uses the Reader/Writer suffix to indicate handling of text and automatic handling of different character encodings, and I wanted to make it clear that these are binary input/output. The idea was originally to have an input/output for Containers, Atoms, Bonds etc. and just assemble them, but unfortunately I had to define separate ones for Atoms and Bonds as some atom/bond properties require the container to calculate their value. I would much appreciate it if anyone can see a way of simplifying the API so I could use a common interface; at the moment there's been a lot of copy-paste :-).

Anyways to the syntax..

You can create an output and provide selected marshals that will write certain values. This example will write the atomic number/symbol, pseudo label (IPseudoAtoms), 2D coordinates, valency and formal charge for atoms, and the atom connections, order and stereo for bonds.

// 'stream' stands for the underlying binary destination; the original snippet
// called it 'out', which clashed with the variable being declared
IChemObjectOutput<IAtomContainer> out = new AtomContainerOutput(stream,
                                                                new AggregatedAtomOutput(new AtomicNumberOutput(),
                                                                                         new PseudoLabelOutput(),
                                                                                         new Point2DOutput(),
                                                                                         new ValencyOutput(),
                                                                                         new FormalChargeOutput()),
                                                                new AggregatedBondOutput(new SmallBondOutput(),
                                                                                         new LargeBondOutput(),
                                                                                         new BondOrderOutput(),
                                                                                         new BondStereoOutput()));

Each of the marshals (e.g. AtomicNumberOutput) is stateless and can easily be defined anonymously. Let's add one for atom ID output (lambdas will make this even easier in future):

new AtomOutputMarshal() {
    @Override
    public void write(DataOutput out, IAtomContainer container, IAtom atom) throws IOException {
        out.writeUTF(atom.getID());
    }
}

But what if getID returns null? Well, this is where some efficiency savings come in. We can check a field for a default value, which means it won't get written to the output (we can add any default value back in when we read).

new AtomOutputMarshal() {
    @Override
    public void write(DataOutput out, IAtomContainer container, IAtom atom) throws IOException {
        out.writeUTF(atom.getID());
    }

    @Override
    public boolean isDefault(IAtom atom) {
        return atom.getID() == null;
    }
}

Many output formats will tag each field with its type and then write a value to indicate null. This method defines a 'format' at the start of the atom stream. It's probably a bit long to go into here, but think of it like a 'fingerprint' that indicates which fields should be written/read for each atom object. If you want to see it, have a look at 'createAtomInput()' in this file: https://github.com/johnmay/cdk/blob/high-performance-serializer/src/main/org/openscience/cdk/io/stream/atom/AggregatedAtomInput.java
Each fingerprint is a single byte but you can nest AggregatedAtomOutput if you require more variable fields. We also achieve some major efficiency savings by writing integer (32 bits) values within range as bytes (8 bits) and shorts (16 bits).
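As a rough illustration of the fingerprint idea (this is my own simplified sketch, not the code in AggregatedAtomInput), the example below writes one mask byte per atom whose bits say which optional fields follow, and stores small integers as single bytes. The Atom class, the bit assignments and the field choice are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

class FieldMaskDemo {

    static final int HAS_CHARGE = 1;      // bit 0: formal charge is non-default
    static final int HAS_ID     = 1 << 1; // bit 1: id is non-default

    // Hypothetical minimal atom record; not the CDK IAtom.
    static final class Atom {
        final int atomicNumber;
        final int charge;   // default 0
        final String id;    // default null

        Atom(int atomicNumber, int charge, String id) {
            this.atomicNumber = atomicNumber;
            this.charge = charge;
            this.id = id;
        }
    }

    static void write(DataOutput out, Atom atom) throws IOException {
        int mask = 0;
        if (atom.charge != 0) mask |= HAS_CHARGE;
        if (atom.id != null)  mask |= HAS_ID;
        out.writeByte(mask);                 // the one-byte 'fingerprint'
        out.writeByte(atom.atomicNumber);    // fits in 8 bits, so no 32-bit int
        if ((mask & HAS_CHARGE) != 0) out.writeByte(atom.charge);
        if ((mask & HAS_ID) != 0)     out.writeUTF(atom.id);
    }

    static Atom read(DataInput in) throws IOException {
        int mask = in.readUnsignedByte();
        int atomicNumber = in.readUnsignedByte();
        // fields absent from the mask fall back to their defaults
        int charge = (mask & HAS_CHARGE) != 0 ? in.readByte() : 0;
        String id  = (mask & HAS_ID) != 0 ? in.readUTF() : null;
        return new Atom(atomicNumber, charge, id);
    }

    static byte[] toBytes(Atom atom) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        write(new DataOutputStream(bytes), atom);
        return bytes.toByteArray();
    }

    static Atom fromBytes(byte[] data) throws IOException {
        return read(new DataInputStream(new ByteArrayInputStream(data)));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(toBytes(new Atom(6, 0, null)).length); // prints 2
        System.out.println(toBytes(new Atom(7, 1, "a1")).length); // prints 7
    }
}
```

A plain carbon with all defaults costs two bytes here; no null markers or type tags are written for the fields that are skipped.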

Performance:
It's a bit hard to benchmark this as different formats choose to save different values. The configuration above saves most of what you will find in the MDL/CML file spec, and of course you can just add extra marshals if you need to. Anyway, here are some results using ChEBI. I tested IteratingSDFReader, CMLReader and default Java serialisation; all of these were written in memory to a ByteArrayOutputStream. Note that CML may be slightly faster, but I kept getting errors if I wrote all molecules to a single output so I had to write them to separate outputs and take the totals. I did some profiling and the bottleneck with streaming is the object creation (we actually clone to preserve default values), and I don't think it's possible to get much faster.

<time.png>

We also save on storage size. Most of the storage in the stream is actually taken up by the 2D coordinates, and it should be possible to write a marshal that will only write coordinates to a certain precision.
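One way such a reduced-precision marshal might look, as a sketch only: store each coordinate as a scaled 32-bit integer (fixed point) instead of a 64-bit double. The class name and the choice of three decimal places are assumptions for illustration, and the approach only works for coordinates small enough that the scaled value fits in an int.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

class FixedPointCoords {

    // Assumed precision: three decimal places.
    static final double SCALE = 1000.0;

    // Writes a 2D point as two scaled ints: 8 bytes instead of 16.
    static void write(DataOutput out, double x, double y) throws IOException {
        out.writeInt((int) Math.round(x * SCALE));
        out.writeInt((int) Math.round(y * SCALE));
    }

    // Reads the point back, rounded to the stored precision.
    static double[] read(DataInput in) throws IOException {
        return new double[]{in.readInt() / SCALE, in.readInt() / SCALE};
    }

    static byte[] toBytes(double x, double y) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        write(new DataOutputStream(bytes), x, y);
        return bytes.toByteArray();
    }

    static double[] fromBytes(byte[] data) throws IOException {
        return read(new DataInputStream(new ByteArrayInputStream(data)));
    }

    public static void main(String[] args) throws IOException {
        byte[] data = toBytes(1.23456, -7.5);
        double[] p = fromBytes(data);
        System.out.println(data.length + " bytes: " + p[0] + ", " + p[1]);
    }
}
```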
You can also wrap the output in a 'DeflaterOutputStream', which stores ChEBI using the stream output in ~3 MB at only a small cost to speed.
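For reference, wrapping is just a matter of layering the standard java.util.zip streams around the byte stream. The sketch below writes a plain array of doubles rather than molecules, purely to keep it self-contained; the class and method names are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

class DeflateDemo {

    // Writes the values through a DeflaterOutputStream into memory.
    static byte[] writeCompressed(double[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // closing the outermost stream finishes the deflater and flushes everything
        try (DataOutputStream out = new DataOutputStream(new DeflaterOutputStream(bytes))) {
            out.writeInt(values.length);
            for (double v : values) {
                out.writeDouble(v);
            }
        }
        return bytes.toByteArray();
    }

    // Reads the values back through an InflaterInputStream.
    static double[] readCompressed(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new InflaterInputStream(new ByteArrayInputStream(data)))) {
            double[] values = new double[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readDouble();
            }
            return values;
        }
    }

    public static void main(String[] args) throws IOException {
        double[] values = new double[1000];   // highly repetitive, compresses well
        byte[] data = writeCompressed(values);
        System.out.println("raw=" + (4 + 8 * values.length) + " compressed=" + data.length);
    }
}
```

The reading side mirrors the writing side, so any of the inputs/outputs described above should be usable unchanged on top of the inflating/deflating streams.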
<size.png>
This next one shows the same information as the first but is intended to allow some rough calculations on how fast you could read your data set. As a note, I did some tests storing the bare minimum (connections and atomic number) and managed to hit 96,000 molecules per second.
<flux.png>

If you're interested in the values:

method             time (ms)  bytes     megabytes    flux (molecules/s)
Stream             346        11716899  11.17410564  63739
SDF                4159       54285753  51.77092838  5302
CML                18605      95958030  91.51270866  1185
ObjectInputStream  5552       98481548  93.91932297  3972


As I said, this was a prototype to see if it was actually worthwhile implementing, and I would greatly appreciate feedback and thoughts.

Cheers,
J


------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and
threat landscape has changed and how IT managers can respond. Discussions
will include endpoint security, mobile security and the latest in malware
threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
Cdk-devel mailing list
Cdk-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-devel

