From: SourceForge.net <no...@so...> - 2012-07-28 08:52:31
|
Feature Requests item #3546357, was opened at 2012-07-20 08:02 Message generated for change (Comment added) made by egonw You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=370024&aid=3546357&group_id=20024 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: cdk.fingerprint Group: None Status: Open Priority: 9 Private: No Submitted By: Nina Nikolova - Jeliazkova (vedina) Assigned to: Nobody/Anonymous (nobody) Summary: Timeout for fingerprints Initial Comment: It turns out there are few cases, when fingerprints take very long time and lot of memory (examples below) This applies at least for FingerPrinter and ExtendedFingerprinter. Some of the other fingerprints also fail, because of AllRingsFinder timeout. In practice with all the default settings it is hard to generate any kind of fingerprints for the example structures. One easy workaround is to be able to set a timeout for fingerprints, similar to AllRingsFinder. Yet another feature request would be to allow external AllRingsFinder for fingerprints, which depend on it. Some use an own instance in the constructor. @Test public void parseBad() throws Exception { int n = 3; //don't run with n > 6, unless on a supercomputer String[] problemStructures = { "InChI=1S/C23H30/c24-15(25-24)4-2-1-3(2)6(1,2,4,15)5(2,4,15,36(15)32(15)45(4,5,15,36)43(4,5,15)28(4)15)7(1,2,3,4,6,15,27(2)41(1,2,7)35(1,2)7)12(1,2,3,6)10-11(12)13(1,3,10,12,30-31(13)46(3,12,13,30)42(1,3,12)13)19(3,10,11,12)9(6)8(4,5,6)14(4,5,6,9,15)17(8,9,19)16(9,10,11,19)18(8,9,14,17,19,47(8,9,14,17)48(8,9,14)37(8,14)38(8,14)48)20(3,9,10,11,12,13,16,17,19)21(10,11,13,19)22(10,11,19,20,26-39(10,21)22,33(21)34(21)22,40(21)44(11,21,22)49(10,11,21,22)40)23(9,10,11,16,17,18,19,20,21)29(16)52(16,17,18,23)50(16,17,18,23)51(16,17,18,23)52/h28H", "InChI=1S/C4Cl12/c5-1-2(5)3(1)4(1,2)7(3)8(3,4)12(1,2,3,4)10(1,2,3,4)6(1,2,5,14(1,2,3,4,5,10,12)16(1,2,3,4,7,8,10)12)13(1,2,3,4,5)9(1,2,3,4,5)11(1,2,3,4,7,13)15(1,2,3,4,7,8,9)13", "InChI=1S/C30H28N6O6S4/c37-27-23(54(27)55(23)27)25(27,60(23,27)61(23,27)37)11-12(25)24(11,43-44-25)14-19(9-6-2-3(6,48-2,51(2)6)8(2,6,9)17(6,9,19,52(8)9)28(9,14,19,24,63(8,9,17)19)34(14,17,19,24,57(14,19)28)36(11,12,14,24,28,66(14,24,28)34)40(11,12,24)33(11,12,23,25,27,59(23,25)27)41(11,12,25,36)40)21-15-4-1(47-4)5(4,15,50(1)4)16(4,15,21)29(15,21)22(21)26(29)10-7-18(10)13-20(18,42(13,49-13)58(13)64(13,20,42)65(13,20,42)58)31(7,10,18,38(7)18,53(18)20)46(18,20)45(10,26)39(7,10,26)32(7,10,22,26)35(21,22,26,29,56(22)32)30(15,16,21,22,26,29,62(5,15,16)29,67(15,16,26)29)68(21,22,29)35/h1-2,14,20,22,37H", "InChI=1S/C76H68/c77-10(5-15-17(5)27(15,107(15,17)81-17)21-22-25(15,21,27,99(21)27)26(21,22)16(5)18(5)28(16,22,26,100(22)26)108(16,18)82-18)19(77)20(10,78-10)29(19)30(19,20)41(29)31-32(41)46-40-48(32,46,86-106(46)48)58(29,30,31,32,41,102(32)48,110(29,30)41)57(29,30,31,32,41,109(29,30)41)47(31,101(31)57)39-45(31,47,105(47)85-47)53(39)65(39,45,125(39,53,89-53)123(39,45,47)53)66(40,46)54(40,46,90-126(40,54,66)124(40,46,48)54)70(65,66)61-35-36(61,62(35,61,70)69(53,61,65,66,70)72(61,62,65,66,70,129(61,62,69,93-61)133(61,65,69,72)121(53,65)69,130(61,62,70,94-62)134(62,66,70,72)122(54,66)70)76(39,40,45,46,53,54,65,66,69,70)73(31,32,39,40,41,45,46,47,48,57)58)42(35)49-33-37-43(33,49)51(33,49,87-117(33,51)113(33,43)51)59(35,36,42,49,103(49)51,111(35,36)42)50(42,49)34-38-44(34,50)52(34,50,88-118(34,52)114(34,44)52)60(35,36,42,49,50,59,104(50)52,112(35,36)42)75(33,34,42,43,44,49,50,51,52,59)74(33,34,37,38,43,44)63(37,43,115(37,43)83-37)64(38,44,74,116(38,44)84-38)67(37,63,74)55-2-1(4-6-8(11(4,6)79-97(6)11)13-14-9-7(4,12(4,9)80-98(7)12)24(9,13,14,96(9)14)23(6,8,13,14)95(8)13)3(2,55)56(2,55,67)68(38,55,63,64,67,74)71(55,56,63,64,67,74,127(55,56,67,91-55)131(55,63,67,71)119(37,63)67)128(55,56,68,92-56)132(56,64,68,71)120(38,64)68/h8-9,13-14,19-22,27-28H", "InChI=1S/C12H37O3/c16-5-1-3-6(1,5)9(1,3,5,16,21-39(9)22-9,10(1,3,5,6,17-5,29(1)30(1)10,41(1,5)18-5,26-48(1,3,6,9)10)43(6)37(6)44(6,10)43)13(3)4-2-7(33(5)34(5)7)8(2,4,13)11(2,4,7,13,23-40(11)24-11,35(7)36(7)11,14(3,4,6,9,13,27(3)6)15(3,4,8,9,11,13)28(4)8)12(2,4,7,8,19-7,31(2)32(2)12,42(2,7)20-7,25-47(2,4,8)12)45(8)38(8)46(8,12)45/h1-4H", "InChI=1S/C18H44/c19-45-10-6-7(10,33(6)34(6)7,35(6)36(6)7)11(6,10)14(10,45)3-5(14,31(3)32(3)5)8(3,10,14)15(3,5,14,22-5,44(3,5,8)14,17(6,7,8,10,11,14,45)18(6,7,8,10,11,14,23-8,24-8,38(10)17,48(6,7,10,11,17)49(6,7,10,11,17)18)39(11)50(8,10,11,17,18)52(8,10,11,14,17,18)42(11,17)25-11)4(14,41(14,15)45)9(3,5,15)1-2(9,27(1)28(1)2,29(1)30(1)2)12(1,3,5,9,15)13(1,2,4,9,15,20-4,21-4)16(1,2,3,4,5,9,12,14,15,37(9)13,46(1,2,9,12)13)43(12,26-12)51(4,9,12,13,15,16)47(4,9,12,13,16)40(12)13/h1-8,14,19H", "InChI=1S/C32.C18H13PS2.C10H15.C2Cl2.2CF3O3S.2Ru/c1-6-9(1)18(1,6)2-7-8(2)15(7)3-5-4-16-10-11(16)12(10)19-13-14(19)17(6,13,19)21(13,14,19)20(9,18)22(15)23(16)24(3,4,5,22)27(3,5,15,22,23)26(2,7,8,18,20,22)25(1,6,9,18,20,21,32(7,8,15,20,22,24,26)27)31(13,14,17,19,21,23)29(10,11,12,16,23,24)28(4,5,16,22,23,24,27)30(10,11,12,19,21,23,29)31;20-17-12-6-4-10-15(17)19(14-8-2-1-3-9-14)16-11-5-7-13-18(16)21;1-6-7(2)9(4)10(5)8(6)3;3-1-2(3)4-1;2*2-1(3,4)8(5,6)7;;/h;1-13H;1-5H3;;;;;", "InChI=1S/C8H16Cl16/c25-1-3-5(1)7(1,3,25,29(3)5)11(1)9(1,19(1,3,5,7,11)17(3,5,7)21(1,3,5,7,19,33(3,5)17,37(5,17)39(3,5,17)21)23(1,3,5,7,9,11,17,19,27(1)19)31(1,7)25)15(3,5,7,11)13(3,5,7,35(3,7)15)14-4-2-6(4,14)8(2,4,14,30(4)6)12(2)10(2,16(4,6,8,12,14)36(4,8)14)20(2,4,6,8,12)18(4,6,8)22(2,4,6,8,20,34(4,6)18,38(6,18)40(4,6,18)22)24(2,4,6,8,10,12,18,20,28(2)20)32(2,8)26(2)8" }; for (String inchi : problemStructures) { InChIGeneratorFactory f = InChIGeneratorFactory.getInstance(); InChIToStructure c =f.getInChIToStructure(inchi, SilentChemObjectBuilder.getInstance()); Assert.assertEquals(INCHI_RET.OKAY, c.getReturnStatus()); IAtomContainer mol = c.getAtomContainer(); //takes forever and eats all the memory it could find, starting with depth 6 for (int i=1; i < n; i++) { Fingerprinter fp = new Fingerprinter(1024,i); // here add code to configure atom types - and make sure to set atomic numbers as well, //because perceiveAtomType() method doesn't set them, and they are null after JNIInchi long now = System.currentTimeMillis(); BitSet bitset = fp.getFingerprint(mol); System.out.println(String.format("Atoms\t%d\tBonds\t%d\tFP Depth\t%d\tElapsed time\t%d\tBitset\t%s", mol.getAtomCount(),mol.getBondCount(), i,(System.currentTimeMillis()-now),bitset)); /** * depth 1 : 188 ms * depth 2 : 291 ms * depth 3 : 525 ms * depth 4 : 1474 ms * depth 5 : 9147 ms * depth 6 : 71634 ms */ } System.out.println(); } } ---------------------------------------------------------------------- >Comment By: Egon Willighagen (egonw) Date: 2012-07-28 01:52 Message: Christoph, > Egon, not sure why we should remove the original fingerprinter if we have > not the slighted idea about what the bug is! The time out is not about a bug, I guess. It is about the algorithm just taking long. It could be a bug, but I have no seen any argument for that. > There is no clear indication what the bug is and looking at the > implementation, I cannot see why there should be a combinatorial explosion. The original Fingerprinter uses 'aromaticity'... that involves ring finding. Now, the newer CDK aromaticity perception code does not try all ring combinations, but as you know ring finding has been biting us many, many times. That is on top of the situation that longer paths just give more possibilities... I think that goes combinatorial. The Fingerprinter is reincarnated in the HybridizationFingerprinter; it is the same code, except that it substitutes 'aromaticity' by 'delocalized' (or sp2, to be precise)... So, removing the Fingerprinter class would not be the same as removing the original fingerprinter code. ---------------------------------------------------------------------- Comment By: Nina Nikolova - Jeliazkova (vedina) Date: 2012-07-27 23:42 Message: Asad, thanks for confirming the problem is a genuine one. Pity I was not aware of your experience, would have saved some time! Christoph, I agree these structures are special cases, there are indeed atoms with unusual connectivity. Generating structure diagrams unfortunately fails, I haven't investigated the reason. The attached structures_and_fp_depth6.zip file contains SDF file and Java code building the structures (by CDKSourceCodeWriter ), as well as statistics for Fingerprinter, ExtendedFingerprinter and HybridizationFingerprinter up to depth 6 for all example structures. However, just by Google search by InChI, these structures can be found in the Chempound repository. Links and statistics at https://gist.github.com/3192020 Regardless of the structure correctness, once these InChIs are in the wild, they may come as queries or input to any running cheminformatics system. These are not the only problem structures in our logs, just most difficult ones. They are successfully parsed by the InChI library, and while CDK atom typing rightly fails for most of the atoms, it is not a reason to automatically discard the structures. I've suggested a timeout, because it is the easiest way to protect a system from hanging and consuming resources indefinitely. A better approach by checking chemical correctness would be most useful. ---------------------------------------------------------------------- Comment By: Christoph Steinbeck (steinbeck) Date: 2012-07-24 06:05 Message: Egon, not sure why we should remove the original fingerprinter if we have not the slighted idea about what the bug is! There is no clear indication what the bug is and looking at the implementation, I cannot see why there should be a combinatorial explosion. I have not been able so far to depict the molecules. As John points out, they are "special cases" and I would love to see table with those molecules before making any judgement. ---------------------------------------------------------------------- Comment By: Asad (asadrahman) Date: 2012-07-22 19:22 Message: Dear Nina and Developers, The problem you have highlighted is a genuine one and I came across this bottleneck last year too! In order to resolve this issue I have made a minor change in the concept of the path based fingerprints. I have replaced the PathTool.getPathOfLength this by shortest path between atoms and its works very well. The reasoning is very simple as the number of bond and atoms will increase the complexity will give rise to exponential runtime. The shortest path will overcome this bottle neck without loosing the connectivity information. I have also tweaked few aspects of the present Fingerprinter in the CDK to take care of rings etc. Here is the code https://github.com/asad/ShortestPathMoleculeFingerprinter The fingerprints from Nina's query are also attached. git://gist.github.com/3161698.git Best, Asad ---------------------------------------------------------------------- Comment By: Nina Nikolova - Jeliazkova (vedina) Date: 2012-07-21 04:42 Message: John, I am guessing more or less the same, path enumeration cost for some reason increases exponentially in the Fingerprinter. The structures are user queries, and for the record, the InChI library is parsing them just fine. ---------------------------------------------------------------------- Comment By: John May (jwmay) Date: 2012-07-21 04:07 Message: Can't look at all the fingerprint classes ATM but for the basic Fingerprinter I'm hedging my bets on 'getPathsOfLengthUpto'. From the few molecules I could load they looked very interconnected and I would imagine would take a long time to enumerate all paths of length 8 for example. I'm not sure if the other fingerprinters use the path finder but doing some profiling could show where the problem is. Just checking but these structures are what you expect..? One I opened had a carbon with 14 bonds and some hydrogens with 4 - very Toxic i guess :-). ---------------------------------------------------------------------- Comment By: Egon Willighagen (egonw) Date: 2012-07-21 00:44 Message: OK, if it applies to HybridizationFingerprinter, then indeed it is not the time out, but something more annoying. This needs exploring... But then we should move this to the bug tracker... or maybe file a separate bug report there? ---------------------------------------------------------------------- Comment By: Nina Nikolova - Jeliazkova (vedina) Date: 2012-07-20 22:59 Message: The problem is likely not the rings finder (it does time out if involved). None of the fingerprinter classes complete with these compounds for one or another reason. ---------------------------------------------------------------------- Comment By: Egon Willighagen (egonw) Date: 2012-07-20 22:44 Message: Not saying that this should not be added at all, but this is exactly why I wrote they hybridization fingerprinter... it does not use any aromaticity, and thus does not use any ring finder. Actually, how about just removing the original fingerprinter? We already have enough people shooting themselves in the foot... ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=370024&aid=3546357&group_id=20024 |