Hi Till,

Currently, I have only been testing things out, and it is not used in any release version. However, I am looking into integrating the speed boost. Is there any plan for when the 1.6 stable will be released? Does it make sense to use a 1.5.x version, or are the interfaces too unstable at the moment?

The original plan was the end of this summer :(.

Here are the main things I would like to get done before 1.6. Maven is the big one, and ideally I don’t want to switch to Maven and then release straight away, so that we have time to find problems. In my mind the release will be 1 to 1.5 months after the Maven switch.

  1. correct perception of tetrahedral and double bond stereochemistry from 2D/3D coordinates (in progress)
  2. convert build system to Maven (depends on the go-ahead from Egon and having existing patches applied)
  3. SMILES utility including universal SMILES - http://www.jcheminf.com/content/4/1/22 (depends on Maven)

Here’s what would be nice to have:
  1. much faster fingerprint generation (will probably add this in 1.6 as a separate module, like nio in the JDK: ‘nfp = new fingerprint’)
  2. depict stereochemistry of double bonds and wedge/hatch placement for tetrahedral centres. If anyone knows how or wants to do this, that would be a great help.
Does it make sense to use a 1.5.x version, or are the interfaces too unstable at the moment?

So I *might* change the stereochemistry interfaces, but that only affects you if you are doing manual manipulation of the IStereoElements. The SMILES changes will be in a new class, and I intend to keep the parser/generator using the same API (the default functionality might change though). I did want to deprecate IAtomContainer but I don’t think it’s feasible for this version.

We need canonical SMILES. Do I understand correctly that the current git version's isomeric SMILES are canonical?

Yes, they are at the moment. But the CDK canonical labeller does not consider stereochemistry, so generating canonical isomeric SMILES (absolute SMILES) has always been impossible; it has just never been indicated as such. The addition of Universal SMILES (which uses the InChI algorithm) will provide correct canonical isomeric SMILES.

Might I also suggest the ‘cdk-hash’ module. This will generate 64-bit stereo-specific hash codes for indexing structures and quickly finding identical structures. The entry point is http://cdk.github.io/cdk/1.5/docs/api/index.html?org/openscience/cdk/hash/HashGeneratorMaker.html. Oh, and the (sub)graph isomorphism testing is also faster now and stereospecific. This is unreviewed but you can find it in the Maven release I made the other day. Entry points for the substructure matching are detailed here: http://efficientbits.blogspot.co.uk/2013/11/improved-substructure-matching.html
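
A minimal sketch of how such hash codes could be used for indexing, using only the JDK. The class name HashBucketIndex and the string IDs are hypothetical; the long values are assumed to come from a generator built via HashGeneratorMaker. Because any 64-bit hash can collide, the lookup is treated as returning candidates that may still need an exact comparison.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical index that buckets structure IDs by a precomputed 64-bit hash code. */
final class HashBucketIndex {
    private final Map<Long, List<String>> buckets = new HashMap<>();

    /** Registers a structure ID under its hash code (assumed to be computed elsewhere). */
    void add(long hash, String structureId) {
        buckets.computeIfAbsent(hash, k -> new ArrayList<>()).add(structureId);
    }

    /** Returns the IDs sharing this hash: candidates for identity, not proof of it. */
    List<String> candidates(long hash) {
        return buckets.getOrDefault(hash, Collections.<String>emptyList());
    }
}
```

Looking up a query structure then amounts to computing its hash once and calling candidates(hash); an empty list means no stored structure can be identical.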

Will a 1.6 canonical string match a 1.4 canonical string? We use the SMILES to match molecules. 

The same canonicalisation algorithm is used, but the SMILES generated also depends on other rules such as ring numbering, visiting double bonds first, aromaticity etc. With the inclusion of Universal SMILES the output will definitely be different. I think it's safer to presume they are different. It’s unfortunate, but it is better in the long run.
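
One way to act on that presumption is to never compare strings produced by different versions. The following is only a sketch using the JDK; the class VersionedSmilesIndex and its methods are hypothetical and not part of the CDK. Each stored SMILES is keyed together with the toolkit version that generated it, so a lookup with a string from another version simply misses and signals that the stored keys need regenerating.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical index keyed by (toolkit version, SMILES) rather than by SMILES alone. */
final class VersionedSmilesIndex {
    private final Map<String, String> idsByKey = new HashMap<>();

    // A space is a safe separator because SMILES strings contain no whitespace.
    private static String key(String toolkitVersion, String smiles) {
        return toolkitVersion + " " + smiles;
    }

    /** Stores a molecule ID under the SMILES generated by the given toolkit version. */
    void put(String toolkitVersion, String smiles, String moleculeId) {
        idsByKey.put(key(toolkitVersion, smiles), moleculeId);
    }

    /** Returns the molecule ID, or null if no entry exists for this version and SMILES. */
    String find(String toolkitVersion, String smiles) {
        return idsByKey.get(key(toolkitVersion, smiles));
    }
}
```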

Thanks,
J

On 15 Nov 2013, at 12:43, Till Schäfer <till2.schaefer@tu-dortmund.de> wrote:

Hi,
thanks for the detailed explanations. I have a few further questions:

On Friday, 15 November 2013, 09:58:06, John May wrote:
No problem,

So it’s still a work in progress (before 1.6) and there are a couple of caveats and things to be aware of. Here’s some info that might not be obvious if you’re using the developer version. The current parser and generator will be more of a do-it-yourself SMILES, with another utility class that will do it correctly (i.e. ensure correct aromaticity etc).
Currently, I have only been testing things out, and it is not used in any release version. However, I am looking into integrating the speed boost. Is there any plan for when the 1.6 stable will be released? Does it make sense to use a 1.5.x version, or are the interfaces too unstable at the moment?

Generator
- The 6 seconds is still slow, but that is likely due to the canonicalisation. Strictly speaking, isomeric SMILES is non-canonical (the canonical form would be absolute SMILES), and in future canonical generation won’t be on by default.
We need canonical SMILES. Do I understand correctly that the current git version's isomeric SMILES are canonical?
Will a 1.6 canonical string match a 1.4 canonical string? We use the SMILES to match molecules.

- Aromaticity is no longer redone for SMILES generation. The generator outputs whatever you give it. If you’re using the SMILES for indexing structures, they should be aromatised first (see below).
Good to know! Applying the aromaticity does not seem to add a measurable amount of computation time to the SMILES generation.


Greetings
Till


- Tetrahedral and double-bond stereochemistry are now round-tripped between SMILES/InChI (working on MDL and interpreting depictions / 3D coordinates).
- Implicit hydrogen specification on the organic subset is now correct.

Parser
- Molecules read from SMILES have their implicit hydrogen counts all set (depending on what else you use, this means you might not need to atom type your structures).
- SMILES are kekulised automatically on load. If a molecule cannot be kekulised, an exception is thrown. The kekulisation is fast enough (< 10 s on 1 million structures) that it’s a good sanity check. If you find a molecule that throws an exception, check it with Daylight’s DEPICT service. If they accept it then it’s a bug; otherwise the input SMILES is invalid (normally missing Hs on nitrogens).
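
As a rough sketch of using the parser as a sanity check over a batch, the following collects every input that caused an exception so it can be checked by hand afterwards. It uses only the JDK: the Parser interface is a hypothetical stand-in for whatever parse call you use (for example a lambda wrapping the CDK SmilesParser), and the class name ParseSanityCheck is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that records which SMILES inputs a parser rejects. */
final class ParseSanityCheck {

    /** Stand-in for a SMILES parser call that throws on invalid input. */
    interface Parser {
        void parse(String smiles) throws Exception;
    }

    /** Returns the inputs the parser rejected, in their original order. */
    static List<String> rejected(List<String> smilesList, Parser parser) {
        List<String> failures = new ArrayList<>();
        for (String smiles : smilesList) {
            try {
                parser.parse(smiles);
            } catch (Exception e) {
                // Keep the offending input for manual inspection later.
                failures.add(smiles);
            }
        }
        return failures;
    }
}
```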

For the aromaticity, there is a new (faster) class. I still need to go through and replace the existing uses, but here is a summary:

import org.openscience.cdk.aromaticity.Aromaticity;
import org.openscience.cdk.aromaticity.ElectronDonation;
import org.openscience.cdk.graph.Cycles;

Aromaticity aromaticity = new Aromaticity(ElectronDonation.daylight(), // CDK model needs atom types, Daylight model needs hydrogens
                                          Cycles.all());               // will time out on fullerenes but I have a fix on the patch tracker

aromaticity.apply(molecule); // apply the aromaticity model to the container (removing any previous specification)

Cheers,
John

On 14 Nov 2013, at 18:49, Till Schäfer <till2.schaefer@tu-dortmund.de> wrote:

Hi,
the new isomeric SmilesGenerator (today's git) is incredibly fast. For a small data set (110 molecules) with huge molecules, the SMILES creation time went down from 110 seconds (Scaffold Hunter's "optimized" 1.4.19 version) to 6 seconds!

Below is the largest molecule in the data set :-)

[H]OC1([H])C([H])([H])C([H])(OC1([H])C([H])([H])OP(=O)(O[H])OC2([H])C([H])([H])C([H])(OC2([H])C([H])([H])OP(=O)(O[H])OC3([H])C([H])([H])C([H])(OC3([H])C([H])([H])OP(=O)(O[H])OC4([H])C([H])(O[H])C([H])(OC4([H])C([H])([H])OP(=O)(O[H])OC5([H])C([H])(O[H])C([H])(OC5([H])C([H])([H])OP(=O)(O[H])OC6([H])C([H])(O[H])C([H])(OC6([H])C([H])([H])OP(=O)(O[H])OC7([H])C([H])(O[H])C([H])(OC7([H])C([H])([H])OP(=O)(O[H])OC8([H])C([H])(O[H])C([H])(OC8([H])C([H])([H])OP(=O)(O[H])OC9([H])C([H])(O[H])C([H])(OC9([H])C([H])([H])OP(=O)(O[H])OC%10([H])C([H])(O[H])C([H])(OC%10([H])C([H])([H])OP(=O)(O[H])OC%11([H])C([H])(O[H])C([H])(OC%11([H])C([H])([H])OP(=O)(O[H])OC%12([H])C([H])(O[H])C([H])(OC%12([H])C([H])([H])OP(=O)(O[H])OC%13([H])C([H])(O[H])C([H])(OC%13([H])C([H])([H])OP(=O)(O[H])OC%14([H])C([H])(O[H])C([H])(OC%14([H])C([H])([H])OP(=O)(O[H])OC%15([H])C([H])(O[H])C([H])(OC%15([H])C([H])([H])OP(=O)(O[H])OC%16([H])C([H])(O[H])C([H])(OC%16([H])C([H])([H])OP(=O)(O[H])OC%17([H])C([H])(O[H])C([H])(OC%17([H])C([H])([H])OP(=O)(O[H])OC%18([H])C([H])(O[H])C([H])(OC%18([H])C([H])([H])OP(=O)(O[H])OC%19([H])C([H])(O[H])C([H])(OC%19([H])C([H])([H])OP(=O)(O[H])OC%20([H])C([H])(O[H])C([H])(OC%20([H])C([H])([H])OP(=O)(O[H])OC%21([H])C([H])(O[H])C([H])(OC%21([H])C([H])([H])OP(=O)(O[H])OC%22([H])C([H])(O[H])C([H])(OC%22([H])C([H])([H])OP(=O)(O[H])OC%23([H])C([H])(O[H])C([H])(OC%23([H])C([H])([H])OP(=O)(O[H])OC%24([H])C([H])(O[H])C([H])(OC%24([H])C([H])([H])OP(=O)(O[H])OC%25([H])C([H])(O[H])C([H])(OC%25([H])C([H])([H])OP(=O)(O[H])OC%26([H])C([H])([H])C([H])(OC%26([H])C([H])([H])OP(=O)(O[H])OC%27([H])C([H])([H])C([H])(OC%27([H])C([H])([H])OP(=O)(O[H])OC%28([H])C([H])([H])C([H])(OC%28([H])C([H])([H])OP(=O)(O[H])O[H])N%29C([H])=NC=%30C(=O)N([H])C(=NC%30%29)N([H])[H])N%31C([H])=NC=%32C(=O)N([H])C(=NC%32%31)N([H])[H])N%33C(=O)N=C(C([H])=C%33[H])N([H])[H])N%34C([H])=C([H])C(=O)N([H])C%34=O)N%35C([H])=NC=%36C(=O)N([H])C(=NC%36%35)N([H])[H])N%37C([H])=NC=%38C(=O)N([H])C(=NC%38%37)N([H])[H])N%39C([H])=NC=%40C(=O)N([
H])C(=NC%40%39)N([H])[H])N%41C(=O)N=C(C([H])=C%41[H])N([H])[H])N%42C([H])=NC=%43C(=O)N([H])C(=NC%43%42)N([H])[H])N%44C(=O)N=C(C([H])=C%44[H])N([H])[H])N%45C([H])=NC%46=C(N=C([H])N=C%46%45)N([H])[H])N%47C(=O)N=C(C([H])=C%47[H])N([H])[H])N%48C([H])=C([H])C(=O)N([H])C%48=O)N%49C([H])=C([H])C(=O)N([H])C%49=O)N%50C(=O)N=C(C([H])=C%50[H])N([H])[H])N%51C([H])=NC=%52C(=O)N([H])C(=NC%52%51)N([H])[H])N%53C([H])=NC=%54C(=O)N([H])C(=NC%54%53)N([H])[H])N%55C([H])=C([H])C(=O)N([H])C%55=O)N%56C([H])=NC=%57C(=O)N([H])C(=NC%57%56)N([H])[H])N%58C(=O)N=C(C([H])=C%58[H])N([H])[H])N%59C([H])=NC=%60C(=O)N([H])C(=NC%60%59)N([H])[H])N%61C([H])=NC=%62C(=O)N([H])C(=NC%62%61)N([H])[H])N%63C([H])=C([H])C(=O)N([H])C%63=O)N%64C(=O)N=C(C([H])=C%64[H])N([H])[H])N%65C([H])=NC%66=C(N=C([H])N=C%66%65)N([H])[H])N%67C([H])=NC=%68C(=O)N([H])C(=NC%68%67)N([H])[H])N%69C(=O)N=C(C([H])=C%69[H])N([H])[H])N%70C(=O)N=C(C([H])=C%70[H])N([H])[H]


Thanks for the good work
Till Schäfer



--
Dipl.-Inf. Till Schäfer
Technische Universität Dortmund
Chair 11 - Algorithm Engineering
Otto-Hahn-Str. 14 / Raum 237
44227 Dortmund, Germany

e-mail: till.schaefer@cs.tu-dortmund.de
phone: +49(231)755-7706
fax: +49(231)755-7740
web: http://ls11-www.cs.uni-dortmund.de/staff/schaefer
pgp: https://keyserver2.pgp.com/vkd/SubmitSearch.event?&&SearchCriteria=0xD84DED79