Re: [Cdk-devel] Fwd: High Performance Structure IO

Hi Nina,

> Yet, this could be done for parts of the code, e.g. the CDK code.  What if any of the dependencies had changed ?  Probably feasible to rewrite the streams, but lot of work.

Not sure I understand what you mean here... could you expand a bit?

> Yes, for a prototype or a project where everything is under your control this is fine.  Not really user friendly, if the software/database is used elsewhere  ...

Yep this isn't for distributing data.

Cheers,
J

On 15 Jun 2012, at 16:19, Nina Jeliazkova wrote:

> Hi John,
> 
> On 15 June 2012 18:07, John May <joh...@gm...> wrote:
> Hi Nina,
> 
>> What I am wondering is - the output of Java serializers is not compatible across versions of the object (e.g. any tiny change in the cdk may result in non-readable database content ) .  Probably it is not of concern with your projects?
> 
> 
> Java serialisation is compatible if you add/remove fields and recalculate the version id; it does fail if you rename stuff though.
> 
> Indeed.  (I thought removing fields also fails, but maybe that has changed since earlier Java versions.)
>  
> This method is not susceptible to name changes. Say, for example, we change the 'setLabel' method on IPseudoAtom to 'setAlias': you just need to change the reader to invoke 'setAlias' instead of 'setLabel'. The actual core data in the stream is the same and can still be read.
> 
> 
> Thanks, I was not aware of this.  
>  
> It is possible to make this fully version safe by tagging each marshal:
> 
>> @Version("1.4.0")
>> public class Point2DMarshal_1_4_0 { }
> 
>> @Version("1.4.1")
>> public class Point2DMarshal_1_4_1 { }
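
A minimal sketch of how such tagging could be looked up at read time. The Version annotation, the registry list and forVersion are all hypothetical names for illustration; none of this is taken from the prototype branch. The only firm requirement is that the annotation is retained at runtime, otherwise reflection cannot see it.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;
import java.util.List;

final class VersionedMarshals {

    // Hypothetical annotation; RUNTIME retention is needed for reflective lookup.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Version {
        String value();
    }

    @Version("1.4.0")
    static final class Point2DMarshal_1_4_0 { }

    @Version("1.4.1")
    static final class Point2DMarshal_1_4_1 { }

    // Hypothetical registry of every marshal version that can still be read.
    static final List<Class<?>> MARSHALS =
            Arrays.asList(Point2DMarshal_1_4_0.class, Point2DMarshal_1_4_1.class);

    // Pick the marshal whose tag matches the version recorded with the data.
    static Class<?> forVersion(String version) {
        for (Class<?> marshal : MARSHALS) {
            Version tag = marshal.getAnnotation(Version.class);
            if (tag != null && tag.value().equals(version)) {
                return marshal;
            }
        }
        throw new IllegalArgumentException("no marshal for version " + version);
    }

    public static void main(String[] args) {
        System.out.println(forVersion("1.4.1").getSimpleName());
    }
}
```

The stream would then only need to carry the version string once, and old data stays readable as long as the old marshal class is kept around.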
> 
> 
> 
> Yet, this could be done for parts of the code, e.g. the CDK code.  What if any of the dependencies had changed ?  Probably feasible to rewrite the streams, but lot of work.
>  
> I've done this on other projects but this prototype was really for super quick storage where we still have the original files (e.g. you can always reload files into a database with a new version).
> 
> Yes, for a prototype or a project where everything is under your control this is fine.  Not really user friendly, if the software/database is used elsewhere  ...
> 
> On the positive side, your stats look great, and the approach is definitely useful for certain use cases (caching, perhaps workflows). 
> 
> Nina
> 
> P.S. I sent the wrong link for the transient keyword in the previous email. This one should be clearer:
> http://en.wikibooks.org/wiki/Java_Programming/Keywords/transient
> 
> And perhaps we do indeed need to think about which class members could be declared transient.
> 
>  
> 
> Cheers,
> J
> 
> On 15 Jun 2012, at 15:49, Nina Jeliazkova wrote:
> 
>> 
>> Interesting !
>> 
>> On 15 June 2012 17:15, John May <jo...@eb...> wrote:
>> Hi all,
>> 
>> I've been thinking about this for a while and finally got down to writing a prototype this Wednesday. What I wanted to do was speed up the access of molecules (IAtomContainers) when accessing data that will only be used by the CDK. This could for example be retrieving structures from a MySQL/NoSQL database or transmitting between components as required by Knime. The point is not to provide a future-proof versioned format that other applications can read but to provide a fast, flexible and efficient (in size) way of storing and accessing molecules. Commonly we've been storing structures as MDL (not very good) or CML text in a database. An alternative is to store the bytes from Java's default 'ObjectOutputStream' or use a custom serialisation library such as Kryo. Kryo is very nice but I didn't really want to add a dependency for a relatively trivial task. You also have the problem that a format like CML/MDL may not store everything you want and the serialisation may store too much (do you really want to transmit ChemObjectListeners?).
>> 
>> Listeners (and other variables which it doesn't make sense to serialize) might be declared as transient; then they will not be serialized.
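
As a small illustration of the transient keyword with standard Java serialisation: DemoAtom below is a hypothetical stand-in, not a CDK class, and the listener list only mimics the idea of ChemObjectListeners. The transient field is skipped on write and comes back as null on read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class TransientDemo {

    // Hypothetical stand-in for a chem object that carries listeners.
    static final class DemoAtom implements Serializable {
        private static final long serialVersionUID = 1L;
        String symbol;
        // transient: ignored by ObjectOutputStream, null after reading back
        transient List<Runnable> listeners = new ArrayList<>();

        DemoAtom(String symbol) {
            this.symbol = symbol;
        }
    }

    // Serialise to a byte array in memory and read the copy back.
    static DemoAtom roundTrip(DemoAtom atom) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(atom);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (DemoAtom) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DemoAtom atom = new DemoAtom("C");
        atom.listeners.add(() -> { });   // not Serializable, but never written
        DemoAtom copy = roundTrip(atom);
        System.out.println(copy.symbol + " listeners=" + copy.listeners);
    }
}
```

Note that the reader has to cope with the null (or re-create the list), since field initialisers are not run when a Serializable object is read back.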
>> 
>> http://docs.oracle.com/javaee/5/api/javax/persistence/Transient.html
>> 
>> What I am wondering is - the output of Java serializers is not compatible across versions of the object (e.g. any tiny change in the cdk may result in non-readable database content ) .  Probably it is not of concern with your projects?
>> 
>> Best regards,
>> Nina
>> 
>> 
>>  
>> 
>> The following prototype code is on branch in the package org.openscience.cdk.io.stream which can be found here: https://github.com/johnmay/cdk/tree/high-performance-serializer/src/main/org/openscience/cdk/io/stream
>> 
>> I defined two new interfaces, IChemObjectInput<C extends IChemObject> and IChemObjectOutput<C extends IChemObject>, that indicate binary IO. Java uses the Reader/Writer suffix to indicate handling of text and automatic handling of different char formats, and I wanted to make it distinct that these were binary input/output. The idea was originally to have an input/output for Containers, Atoms, Bonds etc. and just assemble them, but unfortunately I had to define separate ones for Atoms and Bonds as some atom/bond properties require the container to calculate their value. I would much appreciate it if anyone can see a way of simplifying the API so I could use a common interface; at the moment there's been a lot of copy paste :-).
>> 
>> Anyways to the syntax..
>> 
>> You can create an output and provide selected marshals that will write certain values. This example will write atomic number/symbol, pseudo label (IPseudoAtoms), 2D coordinates, valency and formal charge for atoms, and atom connections, order and stereo for bonds.
>> 
>> // 'out' is the underlying binary stream being written to
>> IChemObjectOutput<IAtomContainer> output =
>>         new AtomContainerOutput(out,
>>                 new AggregatedAtomOutput(new AtomicNumberOutput(),
>>                                          new PseudoLabelOutput(),
>>                                          new Point2DOutput(),
>>                                          new ValencyOutput(),
>>                                          new FormalChargeOutput()),
>>                 new AggregatedBondOutput(new SmallBondOutput(),
>>                                          new LargeBondOutput(),
>>                                          new BondOrderOutput(),
>>                                          new BondStereoOutput()));
>> 
>> Each of the marshals (e.g. AtomicNumberOutput) is stateless and can easily be defined anonymously. Let's add one for atom id output (lambdas will make this even easier in future):
>> 
>> new AtomOutputMarshal() {
>>     @Override
>>     public void write(DataOutput out, IAtomContainer container, IAtom atom) throws IOException {
>>         out.writeUTF(atom.getID());
>>     }
>> }
>> 
>> But what if getID returns a null string? Well, this is where some efficiency savings come in. We can check a field for a default value, which means it won't get written to the output (we can add any default value in when we read).
>> 
>> new AtomOutputMarshal() {
>>     @Override
>>     public void write(DataOutput out, IAtomContainer container, IAtom atom) throws IOException {
>>         out.writeUTF(atom.getID());
>>     }
>> 
>>     @Override
>>     public boolean isDefault(IAtom atom) {
>>         return atom.getID() == null;
>>     }
>> }
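
One way to picture skipping defaults is a flag byte whose bits record which fields actually follow. This is only a sketch under assumptions: the class, the bit assignments and the choice of one flag byte per record are made up for illustration and are not how the prototype lays out its stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class PresenceMaskDemo {
    static final int HAS_ID = 1;      // bit 0: an id string follows
    static final int HAS_CHARGE = 2;  // bit 1: a formal charge follows

    // Hypothetical record; defaults are id == null and charge == 0.
    static final class AtomData {
        final String id;
        final int charge;

        AtomData(String id, int charge) {
            this.id = id;
            this.charge = charge;
        }
    }

    // Write the flag byte, then only the fields that differ from their default.
    static byte[] write(AtomData atom) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            int mask = (atom.id != null ? HAS_ID : 0) | (atom.charge != 0 ? HAS_CHARGE : 0);
            out.writeByte(mask);
            if ((mask & HAS_ID) != 0) out.writeUTF(atom.id);
            // assumes the charge fits in a signed byte
            if ((mask & HAS_CHARGE) != 0) out.writeByte(atom.charge);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the flag byte and fill in defaults for anything that was skipped.
    static AtomData read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int mask = in.readUnsignedByte();
            String id = (mask & HAS_ID) != 0 ? in.readUTF() : null;
            int charge = (mask & HAS_CHARGE) != 0 ? in.readByte() : 0;
            return new AtomData(id, charge);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(write(new AtomData(null, 0)).length);   // 1 byte: flags only
        System.out.println(write(new AtomData("a1", -1)).length);  // 6 bytes: flags + UTF + charge
    }
}
```

An all-default record costs a single byte here, and a null id never reaches writeUTF.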
>> 
>> Many output formats will tag each field's type and then write a value to indicate null. This method defines a 'format' at the start of the atom stream. It's probably a bit long to go into here, but think of it like a 'fingerprint' that indicates which fields should be written/read for each atom object. If you want to see it, have a look at 'createAtomInput()' in this file: https://github.com/johnmay/cdk/blob/high-performance-serializer/src/main/org/openscience/cdk/io/stream/atom/AggregatedAtomInput.java
>> Each fingerprint is a single byte, but you can nest AggregatedAtomOutput if you require more variable fields. We also achieve some major efficiency savings by writing integer (32 bits) values within range as bytes (8 bits) and shorts (16 bits).
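
A rough sketch of the narrow-integer idea using only java.io. Here the width is recorded in a leading tag byte per value, which is purely an assumption of this example (the names are hypothetical too); a real format could just as well record the width once in a header.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class CompactInt {

    // Write a width tag (1, 2 or 4) followed by the value in that many bytes.
    static byte[] write(int value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            if (value >= Byte.MIN_VALUE && value <= Byte.MAX_VALUE) {
                out.writeByte(1);
                out.writeByte(value);
            } else if (value >= Short.MIN_VALUE && value <= Short.MAX_VALUE) {
                out.writeByte(2);
                out.writeShort(value);
            } else {
                out.writeByte(4);
                out.writeInt(value);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the width tag, then the value; readByte/readShort sign-extend to int.
    static int read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int width = in.readUnsignedByte();
            switch (width) {
                case 1: return in.readByte();
                case 2: return in.readShort();
                case 4: return in.readInt();
                default: throw new IllegalStateException("unknown width " + width);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(write(6).length);       // 2 bytes: tag + byte
        System.out.println(write(300).length);     // 3 bytes: tag + short
        System.out.println(write(100000).length);  // 5 bytes: tag + int
    }
}
```

Values such as atomic numbers or small atom indices then take one byte of payload instead of four.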
>> 
>> Performance:
>> It's a bit hard to benchmark this as different formats choose to save different values. The configuration above saves most of what you will find in the MDL/CML file spec and of course you can just add extra marshals if you need. Anyway, here are some results using ChEBI. I tested IteratingSDFReader, CMLReader and default Java serialisation; all of these were written in memory to a ByteArrayOutputStream. Note that CML may be slightly faster, but I kept getting errors if I wrote all molecules to a single output so I had to write them to separate outputs and take the totals. I did some profiling and the bottleneck with streaming is the object creation (we actually clone to preserve default values), so I don't think it's possible to get much faster.
>> 
>> <time.png>
>> 
>> We also save on storage size. Most of the storage in the stream is actually taken up by the 2D coordinates and it should be possible to write a marshal that will only write coordinates to a certain precision.
>> You can also wrap the output in a 'DeflaterOutputStream', which will store ChEBI using the stream output in ~3 MB at only a small cost to speed.
>> <size.png>
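
For reference, wrapping a stream in a DeflaterOutputStream only needs the standard java.util.zip classes. The helper names below are hypothetical and the byte arrays stand in for whatever the binary output produced; this is a sketch, not the benchmark code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class DeflateDemo {

    // Compress bytes in memory; closing the stream finishes the deflate data.
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bytes)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Inflate the compressed bytes back into the original content.
    static byte[] decompress(byte[] packed) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (InflaterInputStream in = new InflaterInputStream(new ByteArrayInputStream(packed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                bytes.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = new byte[10000];   // highly repetitive stand-in data
        byte[] packed = compress(raw);
        System.out.println(raw.length + " -> " + packed.length + " bytes");
    }
}
```

How much is saved depends entirely on the data; repetitive fields compress well, while coordinate values tend to compress less.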
>> This next one shows the same information as the first but is intended to allow some rough calculations on how fast you could read your data set. As a note, I did some tests storing the bare minimum (connections and atomic number) and managed to hit 96,000 molecules per second.
>> <flux.png>
>> 
>> If you're interested in the values:
>> 
>> method             time (ms)  bytes     megabytes    flux (molecules/s)
>> Stream             346        11716899  11.17410564  63739
>> SDF                4159       54285753  51.77092838  5302
>> CML                18605      95958030  91.51270866  1185
>> ObjectInputStream  5552       98481548  93.91932297  3972
>> 
>> 
>> As I said this was a prototype to see if it was actually worth while implementing and would greatly appreciate feedback and thoughts.
>> 
>> Cheers,
>> J
>> 
>> 
>> ------------------------------------------------------------------------------
>> Live Security Virtual Conference
>> Exclusive live event will cover all the ways today's security and
>> threat landscape has changed and how IT managers can respond. Discussions
>> will include endpoint security, mobile security and the latest in malware
>> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>> _______________________________________________
>> Cdk-devel mailing list
>> Cdk...@li...
>> https://lists.sourceforge.net/lists/listinfo/cdk-devel