#183 Timeout for fingerprints

None
open
nobody
9
2014-08-15
2012-07-20
Nina Jeliazkova
No

It turns out there are few cases, when fingerprints take very long time and lot of memory (examples below)
This applies at least for FingerPrinter and ExtendedFingerprinter. Some of the other fingerprints also fail, because of AllRingsFinder timeout.
In practice with all the default settings it is hard to generate any kind of fingerprints for the example structures.

One easy workaround is to be able to set a timeout for fingerprints, similar to AllRingsFinder.
Yet another feature request would be to allow external AllRingsFinder for fingerprints, which depend on it. Some use an own instance in the constructor.

@Test
public void parseBad() throws Exception {
int n = 3; //don't run with n > 6, unless on a supercomputer
String[] problemStructures = {
"InChI=1S/C23H30/c24-15(25-24)4-2-1-3(2)6(1,2,4,15)5(2,4,15,36(15)32(15)45(4,5,15,36)43(4,5,15)28(4)15)7(1,2,3,4,6,15,27(2)41(1,2,7)35(1,2)7)12(1,2,3,6)10-11(12)13(1,3,10,12,30-31(13)46(3,12,13,30)42(1,3,12)13)19(3,10,11,12)9(6)8(4,5,6)14(4,5,6,9,15)17(8,9,19)16(9,10,11,19)18(8,9,14,17,19,47(8,9,14,17)48(8,9,14)37(8,14)38(8,14)48)20(3,9,10,11,12,13,16,17,19)21(10,11,13,19)22(10,11,19,20,26-39(10,21)22,33(21)34(21)22,40(21)44(11,21,22)49(10,11,21,22)40)23(9,10,11,16,17,18,19,20,21)29(16)52(16,17,18,23)50(16,17,18,23)51(16,17,18,23)52/h28H",
"InChI=1S/C4Cl12/c5-1-2(5)3(1)4(1,2)7(3)8(3,4)12(1,2,3,4)10(1,2,3,4)6(1,2,5,14(1,2,3,4,5,10,12)16(1,2,3,4,7,8,10)12)13(1,2,3,4,5)9(1,2,3,4,5)11(1,2,3,4,7,13)15(1,2,3,4,7,8,9)13",
"InChI=1S/C30H28N6O6S4/c37-27-23(54(27)55(23)27)25(27,60(23,27)61(23,27)37)11-12(25)24(11,43-44-25)14-19(9-6-2-3(6,48-2,51(2)6)8(2,6,9)17(6,9,19,52(8)9)28(9,14,19,24,63(8,9,17)19)34(14,17,19,24,57(14,19)28)36(11,12,14,24,28,66(14,24,28)34)40(11,12,24)33(11,12,23,25,27,59(23,25)27)41(11,12,25,36)40)21-15-4-1(47-4)5(4,15,50(1)4)16(4,15,21)29(15,21)22(21)26(29)10-7-18(10)13-20(18,42(13,49-13)58(13)64(13,20,42)65(13,20,42)58)31(7,10,18,38(7)18,53(18)20)46(18,20)45(10,26)39(7,10,26)32(7,10,22,26)35(21,22,26,29,56(22)32)30(15,16,21,22,26,29,62(5,15,16)29,67(15,16,26)29)68(21,22,29)35/h1-2,14,20,22,37H",
"InChI=1S/C76H68/c77-10(5-15-17(5)27(15,107(15,17)81-17)21-22-25(15,21,27,99(21)27)26(21,22)16(5)18(5)28(16,22,26,100(22)26)108(16,18)82-18)19(77)20(10,78-10)29(19)30(19,20)41(29)31-32(41)46-40-48(32,46,86-106(46)48)58(29,30,31,32,41,102(32)48,110(29,30)41)57(29,30,31,32,41,109(29,30)41)47(31,101(31)57)39-45(31,47,105(47)85-47)53(39)65(39,45,125(39,53,89-53)123(39,45,47)53)66(40,46)54(40,46,90-126(40,54,66)124(40,46,48)54)70(65,66)61-35-36(61,62(35,61,70)69(53,61,65,66,70)72(61,62,65,66,70,129(61,62,69,93-61)133(61,65,69,72)121(53,65)69,130(61,62,70,94-62)134(62,66,70,72)122(54,66)70)76(39,40,45,46,53,54,65,66,69,70)73(31,32,39,40,41,45,46,47,48,57)58)42(35)49-33-37-43(33,49)51(33,49,87-117(33,51)113(33,43)51)59(35,36,42,49,103(49)51,111(35,36)42)50(42,49)34-38-44(34,50)52(34,50,88-118(34,52)114(34,44)52)60(35,36,42,49,50,59,104(50)52,112(35,36)42)75(33,34,42,43,44,49,50,51,52,59)74(33,34,37,38,43,44)63(37,43,115(37,43)83-37)64(38,44,74,116(38,44)84-38)67(37,63,74)55-2-1(4-6-8(11(4,6)79-97(6)11)13-14-9-7(4,12(4,9)80-98(7)12)24(9,13,14,96(9)14)23(6,8,13,14)95(8)13)3(2,55)56(2,55,67)68(38,55,63,64,67,74)71(55,56,63,64,67,74,127(55,56,67,91-55)131(55,63,67,71)119(37,63)67)128(55,56,68,92-56)132(56,64,68,71)120(38,64)68/h8-9,13-14,19-22,27-28H",
"InChI=1S/C12H37O3/c16-5-1-3-6(1,5)9(1,3,5,16,21-39(9)22-9,10(1,3,5,6,17-5,29(1)30(1)10,41(1,5)18-5,26-48(1,3,6,9)10)43(6)37(6)44(6,10)43)13(3)4-2-7(33(5)34(5)7)8(2,4,13)11(2,4,7,13,23-40(11)24-11,35(7)36(7)11,14(3,4,6,9,13,27(3)6)15(3,4,8,9,11,13)28(4)8)12(2,4,7,8,19-7,31(2)32(2)12,42(2,7)20-7,25-47(2,4,8)12)45(8)38(8)46(8,12)45/h1-4H",
"InChI=1S/C18H44/c19-45-10-6-7(10,33(6)34(6)7,35(6)36(6)7)11(6,10)14(10,45)3-5(14,31(3)32(3)5)8(3,10,14)15(3,5,14,22-5,44(3,5,8)14,17(6,7,8,10,11,14,45)18(6,7,8,10,11,14,23-8,24-8,38(10)17,48(6,7,10,11,17)49(6,7,10,11,17)18)39(11)50(8,10,11,17,18)52(8,10,11,14,17,18)42(11,17)25-11)4(14,41(14,15)45)9(3,5,15)1-2(9,27(1)28(1)2,29(1)30(1)2)12(1,3,5,9,15)13(1,2,4,9,15,20-4,21-4)16(1,2,3,4,5,9,12,14,15,37(9)13,46(1,2,9,12)13)43(12,26-12)51(4,9,12,13,15,16)47(4,9,12,13,16)40(12)13/h1-8,14,19H",
"InChI=1S/C32.C18H13PS2.C10H15.C2Cl2.2CF3O3S.2Ru/c1-6-9(1)18(1,6)2-7-8(2)15(7)3-5-4-16-10-11(16)12(10)19-13-14(19)17(6,13,19)21(13,14,19)20(9,18)22(15)23(16)24(3,4,5,22)27(3,5,15,22,23)26(2,7,8,18,20,22)25(1,6,9,18,20,21,32(7,8,15,20,22,24,26)27)31(13,14,17,19,21,23)29(10,11,12,16,23,24)28(4,5,16,22,23,24,27)30(10,11,12,19,21,23,29)31;20-17-12-6-4-10-15(17)19(14-8-2-1-3-9-14)16-11-5-7-13-18(16)21;1-6-7(2)9(4)10(5)8(6)3;3-1-2(3)4-1;2*2-1(3,4)8(5,6)7;;/h;1-13H;1-5H3;;;;;",
"InChI=1S/C8H16Cl16/c25-1-3-5(1)7(1,3,25,29(3)5)11(1)9(1,19(1,3,5,7,11)17(3,5,7)21(1,3,5,7,19,33(3,5)17,37(5,17)39(3,5,17)21)23(1,3,5,7,9,11,17,19,27(1)19)31(1,7)25)15(3,5,7,11)13(3,5,7,35(3,7)15)14-4-2-6(4,14)8(2,4,14,30(4)6)12(2)10(2,16(4,6,8,12,14)36(4,8)14)20(2,4,6,8,12)18(4,6,8)22(2,4,6,8,20,34(4,6)18,38(6,18)40(4,6,18)22)24(2,4,6,8,10,12,18,20,28(2)20)32(2,8)26(2)8"
};

    for (String inchi : problemStructures) {

        InChIGeneratorFactory f = InChIGeneratorFactory.getInstance();
        InChIToStructure c =f.getInChIToStructure(inchi, SilentChemObjectBuilder.getInstance());
        Assert.assertEquals(INCHI_RET.OKAY, c.getReturnStatus());
        IAtomContainer mol = c.getAtomContainer();

        //takes forever and eats all the memory it could find, starting with depth 6
        for (int i=1; i < n; i++) {
            Fingerprinter fp = new Fingerprinter(1024,i);

// here add code to configure atom types - and make sure to set atomic numbers as well,
//because perceiveAtomType() method doesn't set them, and they are null after JNIInchi

            long now = System.currentTimeMillis();
            BitSet bitset = fp.getFingerprint(mol);
            System.out.println(String.format("Atoms\t%d\tBonds\t%d\tFP Depth\t%d\tElapsed time\t%d\tBitset\t%s",
                    mol.getAtomCount(),mol.getBondCount(),
                    i,(System.currentTimeMillis()-now),bitset));
            /** 
             * depth 1 : 188 ms
             * depth 2 : 291 ms
             * depth 3 : 525 ms
             * depth 4 : 1474 ms
             * depth 5 : 9147 ms
             * depth 6 : 71634 ms
             */
        }
        System.out.println();
    }
}

Discussion

  • Not saying that this should not be added at all, but this is exactly why I wrote they hybridization fingerprinter... it does not use any aromaticity, and thus does not use any ring finder.

    Actually, how about just removing the original fingerprinter? We already have enough people shooting themselves in the foot...

     
  • The problem is likely not the rings finder (it does time out if involved). None of the fingerprinter classes complete with these compounds for one or another reason.

     
  • OK, if it applies to HybridizationFingerprinter, then indeed it is not the time out, but something more annoying. This needs exploring...

    But then we should move this to the bug tracker... or maybe file a separate bug report there?

     
  • John May
    John May
    2012-07-21

    Can't look at all the fingerprint classes ATM but for the basic Fingerprinter I'm hedging my bets on 'getPathsOfLengthUpto'. From the few molecules I could load they looked very interconnected and I would imagine would take a long time to enumerate all paths of length 8 for example. I'm not sure if the other fingerprinters use the path finder but doing some profiling could show where the problem is.

    Just checking but these structures are what you expect..? One I opened had a carbon with 14 bonds and some hydrogens with 4 - very Toxic i guess :-).

     
  • John, I am guessing more or less the same, path enumeration cost for some reason increases exponentially in the Fingerprinter.

    The structures are user queries, and for the record, the InChI library is parsing them just fine.

     
  • Asad
    Asad
    2012-07-23

    Dear Nina and Developers,

    The problem you have highlighted is a genuine one and I came across this bottleneck last year too!
    In order to resolve this issue I have made a minor change in the concept of the path based fingerprints.

    I have replaced the PathTool.getPathOfLength this by shortest path between atoms and its works very well.

    The reasoning is very simple as the number of bond and atoms will increase the complexity will give rise to exponential runtime.
    The shortest path will overcome this bottle neck without loosing the connectivity information.

    I have also tweaked few aspects of the present Fingerprinter in the CDK to take care of rings etc.

    Here is the code https://github.com/asad/ShortestPathMoleculeFingerprinter

    The fingerprints from Nina's query are also attached.

    git://gist.github.com/3161698.git

    Best,

    Asad

     
  • Egon, not sure why we should remove the original fingerprinter if we have not the slighted idea about what the bug is!
    There is no clear indication what the bug is and looking at the implementation, I cannot see why there should be a combinatorial explosion. I have not been able so far to depict the molecules. As John points out, they are "special cases" and I would love to see table with those molecules before making any judgement.

     
  • Structures and fingerprints upt to depth 6

     
  • Asad, thanks for confirming the problem is a genuine one. Pity I was not aware of your experience, would have saved some time!

    Christoph, I agree these structures are special cases, there are indeed atoms with unusual connectivity. Generating structure diagrams unfortunately fails, I haven't investigated the reason. The attached structures_and_fp_depth6.zip file contains SDF file and Java code building the structures (by CDKSourceCodeWriter ), as well as statistics for Fingerprinter, ExtendedFingerprinter and HybridizationFingerprinter up to depth 6 for all example structures.

    However, just by Google search by InChI, these structures can be found in the Chempound repository. Links and statistics at https://gist.github.com/3192020

    Regardless of the structure correctness, once these InChIs are in the wild, they may come as queries or input to any running cheminformatics system. These are not the only problem structures in our logs, just most difficult ones. They are successfully parsed by the InChI library, and while CDK atom typing rightly fails for most of the atoms, it is not a reason to automatically discard the structures. I've suggested a timeout, because it is the easiest way to protect a system from hanging and consuming resources indefinitely. A better approach by checking chemical correctness would be most useful.

     
  • Christoph,

    Egon, not sure why we should remove the original fingerprinter if we have
    not the slighted idea about what the bug is!

    The time out is not about a bug, I guess. It is about the algorithm just taking long. It could be a bug, but I have no seen any argument for that.

    There is no clear indication what the bug is and looking at the
    implementation, I cannot see why there should be a combinatorial explosion.

    The original Fingerprinter uses 'aromaticity'... that involves ring finding. Now, the newer CDK aromaticity perception code does not try all ring combinations, but as you know ring finding has been biting us many, many times. That is on top of the situation that longer paths just give more possibilities... I think that goes combinatorial.

    The Fingerprinter is reincarnated in the HybridizationFingerprinter; it is the same code, except that it substitutes 'aromaticity' by 'delocalized' (or sp2, to be precise)... So, removing the Fingerprinter class would not be the same as removing the original fingerprinter code.

     
  • John May
    John May
    2012-10-23

    • Request for close -

    Reason: Asad has implemented the shortest path fingerprinter to fix the provided issues

     
  • John May
    John May
    2013-10-22

    I will double check these molecules but the PathTools now checks if too many paths are being generated.

     
  • John May
    John May
    2013-10-22

    • Group: -->
     
  • John May
    John May
    2013-12-29

     
  • John May
    John May
    2013-12-29

    Should be fixed - need to confirm.