Thanks for the explanation - much clearer now!


On Thu, Nov 7, 2013 at 1:05 PM, John May <johnmay@ebi.ac.uk> wrote:
Hi all,

Yep, the SMILES parser changed on master and won’t accept invalid SMILES by default. Notice that Daylight rejects it too. It should now be the case that if the CDK rejects a SMILES, Daylight also rejects it (if not, it’s a bug). The new parser automatically kekulises on load, verifying that bond orders can be assigned to aromatic systems. This is much friendlier for the CDK, as you don’t have molecules with all-single aromatic bonds floating about. When we added this it fixed 2 failing unit tests.
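
To illustrate the kind of check kekulisation performs, here is a minimal Python sketch. It is not the CDK implementation. It assumes a simplified model in which each aromatic atom either needs exactly one double bond or none (for example an [nH] that donates its lone pair), so assigning bond orders reduces to finding a perfect matching over the atoms that need one. The function name has_kekule_structure and the hand-numbered atoms are hypothetical.

```python
def has_kekule_structure(needs_double, bonds):
    """True if every atom in needs_double can get exactly one double bond.

    needs_double: atom indices that must take part in one double bond.
    bonds: (u, v) pairs of aromatic bonds; only bonds whose two ends are
    both in needs_double can become double.
    """
    atoms = set(needs_double)
    adj = {a: set() for a in atoms}
    for u, v in bonds:
        if u in atoms and v in atoms:
            adj[u].add(v)
            adj[v].add(u)

    def match(unmatched):
        # Backtracking search for a perfect matching (fine for small rings).
        if not unmatched:
            return True
        a = min(unmatched)
        for b in adj[a]:
            if b in unmatched and match(unmatched - {a, b}):
                return True
        return False

    return match(frozenset(atoms))


# Five-membered ring: atom 0 is the nitrogen, atoms 1-4 are carbons.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]

# [nH]1cccc1: the nitrogen carries a hydrogen and needs no double bond.
pyrrole_ok = has_kekule_structure({1, 2, 3, 4}, ring)

# Same ring written without the hydrogen (n1cccc1): all five atoms need one.
pyrrole_missing_h = has_kekule_structure({0, 1, 2, 3, 4}, ring)

print(pyrrole_ok, pyrrole_missing_h)
```

With the hydrogen present the four carbons pair up into two double bonds; without it, five atoms cannot be paired, which is the sort of input the parser now rejects.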

In this molecule you're missing a hydrogen on one or more nitrogens; knowing which ones is the problem.

The SMILES should be:

c4ccc2c(cc1=Nc3[nH]cccc3(Cn12))c4


Some toolkits will fix this by default, but that makes several assumptions and is nothing more than a hack for broken SMILES input. To fix this you need to change the formula of the molecule, which is never a good start. You can still parse it with the CDK by turning on ‘preserve aromaticity’ (the option needs renaming), which disables electron checking, but I strongly discourage that. The actual fix involves checking every possible combination of hydrogens on aromatic nitrogens and phosphorus atoms; check out the fixarom code from http://www.daylight.com/download/contrib/.
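
The "check every combination of hydrogens" idea can be sketched in Python as a brute-force search. This is only an illustration under simplifying assumptions, not a port of fixarom and not CDK code: every aromatic atom is assumed to need exactly one double bond unless it is given a hydrogen, and the names can_assign_double_bonds and hydrogen_placements are hypothetical. The example ring (an imidazole skeleton written without its hydrogen, n1ccnc1) is not from this thread; it was chosen because it has two candidate nitrogens.

```python
from itertools import combinations


def can_assign_double_bonds(atoms, bonds):
    """True if each atom in `atoms` can be paired with a bonded neighbour."""
    atoms = set(atoms)
    adj = {a: set() for a in atoms}
    for u, v in bonds:
        if u in atoms and v in atoms:
            adj[u].add(v)
            adj[v].add(u)

    def match(unmatched):
        if not unmatched:
            return True
        a = min(unmatched)
        return any(b in unmatched and match(unmatched - {a, b})
                   for b in adj[a])

    return match(frozenset(atoms))


def hydrogen_placements(aromatic_atoms, bonds, candidates):
    """Try every subset of candidate atoms as hydrogen-bearing.

    An atom given a hydrogen is assumed to need no double bond. Returns the
    subsets (smallest first) that leave the remaining atoms kekulisable.
    """
    fixes = []
    for k in range(len(candidates) + 1):
        for with_h in combinations(candidates, k):
            remaining = set(aromatic_atoms) - set(with_h)
            if can_assign_double_bonds(remaining, bonds):
                fixes.append(with_h)
    return fixes


# n1ccnc1: atoms 0 and 3 are nitrogens, 1, 2 and 4 are carbons.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
fixes = hydrogen_placements(range(5), ring, [0, 3])
print(fixes)
```

Both single-hydrogen placements succeed here, which shows why the repair is ambiguous: the search can tell you that a hydrogen is missing, but not which nitrogen the author meant it to be on.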

Now, where these molecules come from is probably more interesting. Most likely it’s people using aromaticity models on formats which don’t support it. The MDL model, for example, doesn’t allow lone pair contributions. If you have MarvinSketch, try loading ‘[nH]1cccc1’ and then generating an MDL mol file. You’ll notice they have their own non-portable workaround to ensure the hydrogen is kept. Of course everyone knows you should never store aromaticity in the mol file :-).

  Mrv0541 11071317592D          

  5  5  0  0  0  0            999 V2000
    1.2964    0.6723    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.9639    0.1874    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7089   -0.5972    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.8839   -0.5972    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.6290    0.1874    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  4  0  0  0  0
  2  3  4  0  0  0  0
  4  5  4  0  0  0  0
  1  5  4  0  0  0  0
  3  4  4  0  0  0  0
M  STY  1   1 DAT
M  SAL   1  1   1
M  SDT   1 MRV_IMPLICIT_H                                        
M  SDD   1     0.0000    0.0000    DR    ALL  0       0  
M  SED   1 IMPL_H1
M  END


Oh, and some more examples which are now correctly rejected:

C/1.C/C=C/1
C-1.C/C=C=1
ccc
ccccc
p1cccc1 <- generated by older CDK versions!

Cheers,
J

On 7 Nov 2013, at 16:33, Nina Jeliazkova <jeliazkova.nina@gmail.com> wrote:




On 7 November 2013 18:26, Nina Jeliazkova <jeliazkova.nina@gmail.com> wrote:



On 7 November 2013 18:18, Rajarshi Guha <rajarshi.guha@gmail.com> wrote:
It seems 

c4ccc2c(cc1=Nc3ncccc3(Cn12))c4

does not parse using the latest CDK master, but does parse fine using http://apps.ideaconsult.net:8080/ambit2/depict?search=c4ccc2c%28cc1%3DNc3ncccc3%28Cn12%29%29c4&smarts=

I'm not sure what version ambit is using


cdk 1.4.11

There is also a test version using cdk 1.5.3 (Sep 2013), and it seems to parse fine

Nina

Regards,
Nina 

but could somebody confirm this issue with the latest master?


--
Rajarshi Guha | http://blog.rguha.net
NIH Center for Advancing Translational Science

------------------------------------------------------------------------------
November Webinars for C, C++, Fortran Developers
Accelerate application performance with scalable programming models. Explore
techniques for threading, error checking, porting, and tuning. Get the most
from the latest Intel processors and coprocessors. See abstracts and register
http://pubads.g.doubleclick.net/gampad/clk?id=60136231&iu=/4140/ostg.clktrk
_______________________________________________
Cdk-user mailing list
Cdk-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-user



--
Rajarshi Guha | http://blog.rguha.net
NIH Center for Advancing Translational Science