#1316 Error with SMILES parsing with lots of rings

Milestone: cdk-1.6.x
Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-02-15
Created: 2013-11-09
Creator: Duece99
Private: No

Hello,

Observe the four SMILES strings below (noting that the 1st and the 4th are supposed to represent the same molecule)...

import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.exception.InvalidSmilesException;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.smiles.SmilesGenerator;
import org.openscience.cdk.smiles.SmilesParser;

public class SMILESGeneratorBug {

    public static void main( String[] argv ) {

        String s1 = "CC(=O)NC1C(O)C(O)C(CO)OC1OC2C3NC(=O)C(NC(=O)C4NC(=O)C5NC(=O)C(Cc6ccc(Oc7cc4cc(Oc8ccc2cc8Cl)c7O)cc6)NC(=O)C(N)c9ccc(O)c(Oc%10cc(O)cc5c%10)c9)c%11ccc(O)c(c%11)-c%12c(O)cc(O)cc%12C(NC3=O)C(=O)O";
        String s2 = "O=C(CCCC1(c2ccccc2)C34c5c6-c7c4c8c9c%10c%11c%12c(c%13c5c%14c%15c6c%16c%17c7c%18c8c%19c9c%20c%11c%21c%22c%12c%23c%13c%14c%24c%25c%15c%16c%26c%27c%17c%18c%28c%19c%29c%20c%21c%30c%31c%22c%23c%24c%32c%25c%26c%33c%27c%28c%29c%30c%33c%31%32)C%1013)NCc%34c[nH]cn%34";
        String s3 = "CN(C)C(=N)NC(=O)C1(C(=O)NC(=N)N(C)C)C23c4c5-c6c3c7c8c9c%10c%11c(c%12c4c%13c%14c5c%15c%16c6c%17c7c%18c8c%19c%10c%20c%21c%11c%22c%12c%13c%23c%24c%14c%15c%25c%26c%16c%17c%27c%18c%28c%19c%20c%29c%30c%21c%22c%23c%31c%24c%25c%32c%26c%27c%28c%29c%32c%30%31)C912";
        String s4 = "CC(=O)NC1C(O)C(O)C(CO)OC1OC1C2NC(=O)C(NC(=O)C3NC(=O)C4NC(=O)C(Cc5ccc(Oc6cc3cc(Oc3ccc1cc3Cl)c6O)cc5)NC(=O)C(N)c1ccc(O)c(Oc3cc(O)cc4c3)c1)c1ccc(O)c(c1)-c1c(O)cc(O)cc1C(NC2=O)C(O)=O";
        // s1 and s4 represent the same molecule, though s4 is special as there're no % symbols used for ring notation

        SmilesParser sp = new SmilesParser( DefaultChemObjectBuilder.getInstance() );
        SmilesGenerator smiG = new SmilesGenerator(true);

        IAtomContainer mol1, mol2, mol3, mol4;
        try {
            mol1 = sp.parseSmiles(s1);
            mol2 = sp.parseSmiles(s2);
            mol3 = sp.parseSmiles(s3);
            mol4 = sp.parseSmiles(s4);

            System.out.println( "mol1 - " + mol1.getAtomCount() + " & SMILES IS " + smiG.createSMILES(mol1) );  // no SMILES reported
            System.out.println( "mol2 - " + mol2.getAtomCount() + " & SMILES IS " + smiG.createSMILES(mol2) );  // no SMILES reported
            System.out.println( "mol3 - " + mol3.getAtomCount() + " & SMILES IS " + smiG.createSMILES(mol3) );  // no SMILES reported
            System.out.println( "mol4 - " + mol4.getAtomCount() + " & SMILES IS " + smiG.createSMILES(mol4) );  // SMILES IS reported!
        } catch (InvalidSmilesException e) {
            e.printStackTrace();
        }

    }
}

Running this code yields no SMILES from the SmilesGenerator object for the first 3 molecules (no errors AFAIK), yet a SMILES string is produced for the fourth!

Note that the 4th molecule has its ring notation recycled - there are no "%" symbols in its SMILES string. Unsure if that's the cause of the problem, but I assume it's a bug or a missing feature in the SmilesParser class.

Any input?

Ed.
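
For anyone wanting to check which ring-closure labels a SMILES string actually uses (and whether any of them need the "%" form), here is a minimal sketch in plain Java with no CDK calls. The names `RingLabelScan` and `ringLabels` are made up for illustration. It assumes standard SMILES, where a bare digit outside brackets is a one-digit label and "%" is followed by exactly two digits; anything inside `[...]` is skipped because digits there are not ring closures.

```java
import java.util.ArrayList;
import java.util.List;

public class RingLabelScan {

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Returns the ring-closure labels of a SMILES string in order of appearance. */
    static List<Integer> ringLabels(String smiles) {
        List<Integer> labels = new ArrayList<>();
        boolean inBracket = false;
        for (int i = 0; i < smiles.length(); i++) {
            char c = smiles.charAt(i);
            if (c == '[') {
                inBracket = true;
            } else if (c == ']') {
                inBracket = false;
            } else if (inBracket) {
                continue; // isotope, H count, charge: not a ring closure
            } else if (c == '%') {
                // two-digit label, e.g. %12
                if (i + 2 >= smiles.length()
                        || !isAsciiDigit(smiles.charAt(i + 1))
                        || !isAsciiDigit(smiles.charAt(i + 2))) {
                    throw new IllegalArgumentException(
                            "'%' must be followed by two digits at index " + i);
                }
                labels.add(Integer.parseInt(smiles.substring(i + 1, i + 3)));
                i += 2; // skip the two digits just consumed
            } else if (isAsciiDigit(c)) {
                labels.add(c - '0'); // one-digit label
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // The same ring written with a two-digit label and with a recycled one-digit label.
        System.out.println(ringLabels("c%12c(O)cc(O)cc%12"));
        System.out.println(ringLabels("c1c(O)cc(O)cc1"));
    }
}
```

Any label of 10 or above in the returned list means the string relies on the "%" notation, which is the difference between s1 and s4 above.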


Discussion

  • John May
    2013-11-09

    Hi Ed,

    I've rewritten the parser and generator. On master I get:

    mol1 - 98 & SMILES IS O=C(O)C1NC(=O)C2NC(=O)C(NC(=O)C3NC(=O)C4NC(=O)C(NC(=O)C(N)c5ccc(O)c(Oc6cc(O)cc4c6)c5)Cc7ccc(Oc8cc3cc(Oc9ccc(cc9Cl)C2OC%10OC(CO)C(O)C(O)C%10NC(=O)C)c8O)cc7)c%11ccc(O)c(c%11)-c%12c(O)cc(O)cc%121
    mol2 - 79 & SMILES IS O=C(NCc1nc[nH]c1)CCCC2(c3ccccc3)C45c6c-7c8c9c%10c%11c%12c%13c%14c%15c%16c%12c%17c%18c%16c%19c%20c%15c%21c%14c%22c%23c%13c%11c%24c%25c%23c%26c%22c%27c%21c%28c%20c%29c%19c%30c%18c(c8c%17%10)c6c%30c%31c%29c%32c%28c%27c%33c%26c%34c%25c(c%249)c7c5c%34c%33c%32C%3142
    mol3 - 77 & SMILES IS O=C(NC(=N)N(C)C)C1(C(=O)NC(=N)N(C)C)C23c4c-5c6c7c8c9c%10c%11c%12c%13c%14c%10c%15c%16c%14c%17c%18c%13c%19c%12c%20c%21c%11c9c%22c%23c%21c%24c%20c%25c%19c%26c%18c%27c%17c%28c%16c(c6c%158)c4c%28c%29c%27c%30c%26c%25c%31c%24c%32c%23c(c%227)c5c3c%32c%31c%30C%2912
    mol4 - 98 & SMILES IS O=C(O)C1NC(=O)C2NC(=O)C(NC(=O)C3NC(=O)C4NC(=O)C(NC(=O)C(N)c5ccc(O)c(Oc6cc(O)cc4c6)c5)Cc7ccc(Oc8cc3cc(Oc9ccc(cc9Cl)C2OC%10OC(CO)C(O)C(O)C%10NC(=O)C)c8O)cc7)c%11ccc(O)c(c%11)-c%12c(O)cc(O)cc%121

    Is that resolved?

    When generating SMILES, the default scheme uses unique ring numbers for as long as possible and only then reuses them. If it reaches a point where it has run out of ring numbers it will throw an exception (e.g. fullerene C720). I do have a config option which reuses ring numbers as much as possible (i.e. fewer % labels), but that isn't the default as OpenSMILES recommends unique numbers.

    J
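
The numbering scheme described above can be sketched roughly as follows. This is not the CDK implementation: `RingNumberSketch`, `Allocator` and the `reuseEagerly` flag are hypothetical names, and the limit of 99 assumes only single-digit and `%nn` labels are available. It only illustrates "fresh numbers first, then reuse" against "always reuse the lowest free number".

```java
import java.util.BitSet;

public class RingNumberSketch {

    /** Hands out ring-closure numbers 1..99. */
    static final class Allocator {
        private final boolean reuseEagerly;
        private final BitSet inUse = new BitSet(100);
        private int nextFresh = 1;

        Allocator(boolean reuseEagerly) {
            this.reuseEagerly = reuseEagerly;
        }

        int open() {
            int n;
            if (!reuseEagerly && nextFresh <= 99) {
                n = nextFresh++;            // prefer a number never used before
            } else {
                n = inUse.nextClearBit(1);  // lowest number not currently open
                if (n > 99) {
                    throw new IllegalStateException("more than 99 rings open at once");
                }
            }
            inUse.set(n);
            return n;
        }

        void close(int n) {
            inUse.clear(n);
        }
    }

    /** Formats a ring number the way it appears in SMILES. */
    static String label(int n) {
        return n < 10 ? Integer.toString(n) : "%" + n;
    }

    /** Opens and immediately closes 'rings' ring bonds in turn, returning the labels used. */
    static String sequentialRings(int rings, boolean reuseEagerly) {
        Allocator a = new Allocator(reuseEagerly);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rings; i++) {
            int n = a.open();
            sb.append(label(n)).append(' ');
            a.close(n);
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(sequentialRings(12, false)); // unique numbers first
        System.out.println(sequentialRings(12, true));  // eager reuse
    }
}
```

With eager reuse, twelve rings that never overlap all get the label 1, which is why a string like s4 in the report contains no "%" at all; with unique-first numbering the same rings run up into the "%" range.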


  • Duece99
    2013-11-12

    Hi John, only just seen this :/

    The UniversalIsomorphismTester for the first 3 molecules says that the SMILES you've generated (compared to the original SMILES) are isomorphic, which is good :) Thank you.

    Now, how do I get the master build :S

     
  • John May
    2013-11-12

    I do want to do a release but have a couple of patches waiting and another which I need to finish. Egon and I are really busy though so not sure when that will be.

    Also, this might be relevant to your project: I’ve now written Ullmann/VF substructure matching (+stereochemistry) specifically for CDK - http://efficientbits.blogspot.co.uk/2013/11/improved-substructure-matching.html. I know you need MCS but it might be of use.

    Cheers,
    J


  • Duece99
    2013-11-12

    Hi,

    Ah, those specs look cool! I'm assuming this is the Gouda & Hassaan adjustment to the Ullmann algorithm rather than vanilla Ullmann? I see a different SMILES class in there too :)

    I await that release. In the meantime I guess I'll have to try using SDF in order to circumvent this problem, though at the cost of computational time :(

     
  • John May
    2013-11-12

    Nope, I haven’t added the Gouda adjustment yet. Ullmann is actually pretty good.

    J


  • John May
    2013-11-16

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,4 +1,3 @@
    -
     Hello, 
    
     Observe the four SMILES strings below (noting that the 1st and the 4th are supposed to represent the same molecule)...
    
    • status: open --> closed