Thread: [Rdkit-discuss] Substructure search
From: Evgueni K. <eko...@gm...> - 2009-10-21 13:25:33
Dear Greg,

I do not quite understand how the SubstructMatch function works. What will be in matchVect, which is a vector of <int, int> pairs? Are these the atom matches?

What is the simplest way to find substructures, superstructures, or an exact match of a query structure in a set of molecules?

Regards,
Evgueni
From: <gro...@al...> - 2016-04-23 19:06:25
Hello,

Very nice work on this project!

Sorry if this is a known issue. I looked through the mailing lists and didn't see the same problem listed.

When I perform a substructure search using the postgres cartridge, >99% of the time it works perfectly and is incredibly fast. Sometimes I encounter situations where the system never returns a result, even after many hours on a small dataset. A good example is this:

select count(substance_id) from substance where rdkmol@>'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBr'

(rdkmol is of type mol, with the index in place)

The only way to stop it is by restarting postgres.

Interestingly though, the following returns the count rather quickly:

select count(substance_id) from substance where rdkmol@>'CCCCCCCCCCCCCCBr'

I've encountered other examples where repeated atoms or components, such as the O's in the example below, cause the same problem:

select count(substance_id) from substance where rdkmol@>'O.O.O.O.O.O.O.O.O.O.OS(O)(=O)=O'

I'd like to be able to run this on an internal webserver. When the query hangs, the CPU is at ~100%. Unfortunately, setting the postgres statement_timeout parameter does not help in this case.

Any suggestions on how to improve the query or how to kill it after a certain amount of time without restarting postgres?

Thanks a lot,
Greg
From: Greg L. <gre...@gm...> - 2016-04-24 09:29:24
Hi Greg,

On Sat, Apr 23, 2016 at 8:41 PM, <gro...@al...> wrote:
> Very nice work on this project!

Thanks!

> Sorry if this is a known issue. I looked through the mailing lists and
> didn't see the same problem listed.

Nope, this is new, at least as far as I remember.

There are two things going wrong here:
1) The fingerprint isn't doing a good job of screening results.
2) The long-running queries aren't being properly stopped.

The first one is comparatively easy to explain: the patterns that are used to build the fingerprint that's used for screening are quite small, so they don't directly cover the long chains. The only impact of the long chain is to increase the count of the number of times that a pattern occurs, but this turns out not to be particularly effective.

The second problem I don't have an easy answer to. When I try this on my linux box with the long carbon chains, I am able to terminate a query with ^C or statement_timeout:

chembl_20=# select * from rdk.mols where m@>'CCCCCCCCCCCCCCCCCCCCCCCCCBr' limit 10;
^CCancel request sent
ERROR: canceling statement due to user request
Time: 5236.520 ms
chembl_20=# set statement_timeout=5000;
SET
Time: 0.148 ms
chembl_20=# select * from rdk.mols where m@>'CCCCCCCCCCCCCCCCCCCCCCCCCBr' limit 10;
ERROR: canceling statement due to statement timeout
Time: 5003.716 ms

But this doesn't work for your query with the dot-disconnected oxygens; there I have to kill the query manually. I at first thought it might be due to the index scan, but turning that off doesn't help. It is certainly a function of the size of the query.
Here's a small session run without the index:

chembl_20=# set enable_bitmapscan=false;set enable_indexscan=false;set statement_timeout=5000;
SET
Time: 0.149 ms
SET
Time: 0.027 ms
SET
Time: 0.029 ms
chembl_20=# select * from rdk.mols where m@>'O.O.O.OS(O)(=O)=O' limit 10;
ERROR: canceling statement due to statement timeout
Time: 5164.735 ms
chembl_20=# select * from rdk.mols where m@>'O.O.O.O.O.OS(O)(=O)=O' limit 10;
ERROR: canceling statement due to statement timeout
Time: 5403.237 ms
chembl_20=# select * from rdk.mols where m@>'O.O.O.O.O.O.O.OS(O)(=O)=O' limit 10;
ERROR: canceling statement due to statement timeout
Time: 5930.456 ms
chembl_20=# select * from rdk.mols where m@>'O.O.O.O.O.O.O.O.O.OS(O)(=O)=O' limit 10;
ERROR: canceling statement due to statement timeout
Time: 36374.204 ms

Clearly something happens with that last one.

Here's my guess: The highly redundant query is getting hung up on one large molecule where there are a large number of possible matches. The substructure engine is taking a long time to determine whether or not that particular molecule has a match. PostgreSQL can only interrupt the query when that call returns (the substructure engine itself has no built-in timeout). This one is easy, though time consuming, to track down. I'll see if I can do so.

An aside: the fingerprint is also not going to work particularly well for queries with large numbers of dot-disconnected pieces, particularly if those pieces are single atoms. The fingerprint doesn't set any bits for individual atoms (which is something that should change).

-greg
From: Greg L. <gre...@gm...> - 2016-04-24 09:48:01
On Sun, Apr 24, 2016 at 11:28 AM, Greg Landrum <gre...@gm...> wrote:
> Here's my guess: The highly redundant query is getting hung up on one
> large molecule where there are a large number of possible matches. The
> substructure engine is taking a long time to determine whether or not that
> particular molecule has a match. PostgreSQL can only interrupt the query
> when that call returns (the substructure engine itself has no built-in
> timeout). This one is easy, though time consuming, to track down. I'll see
> if I can do so.

And there it is. Ironically it is the first molecule in my chembl_20 structure table:

chembl_20=# select * from rdk.mols limit 1;
 molregno | m
----------+---------------------------------------------------------------------------------------------------
    23681 | O[C@H]1[C@H](O)[C@@H](O)[C@@H](O)[C@H](O[C@H]2[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)[C@@H]2O)[C@H]1O
(1 row)

chembl_20=# select 'O[C@H]1[C@H](O)[C@@H](O)[C@@H](O)[C@H](O[C@H]2[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)[C@@H]2O)[C@H]1O'::mol@>'O.O.O.O.O.O.O.O.O.OS(O)(=O)=O';
ERROR: canceling statement due to statement timeout
Time: 35996.985 ms

Here's the same thing from Python:

In [3]: m = Chem.MolFromSmiles('O[C@H]1[C@H](O)[C@@H](O)[C@@H](O)[C@H](O[C@H]2[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)[C@@H]2O)[C@H]1O')
In [4]: p = Chem.MolFromSmiles('O.O.O.O.O.O.O.O.O.OS(O)(=O)=O')
In [5]: t1=time.time();m.HasSubstructMatch(p);t2=time.time();print(t2-t1)
36.09873843193054

Here's the github issue: https://github.com/rdkit/rdkit/issues/880

So now my task is to figure out why this substructure query is taking so long (there's clearly something pathological going on here since that molecule doesn't have a single S in it) and to explore adding a timeout to the substructure searching code.

Thanks for reporting this!
-greg
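A rough way to see why each extra 'O' in the query hurts so much is to count how many ways the lone oxygens can be placed on this molecule before the sulfur is ever tested. The sketch below is only a back-of-the-envelope bound, not a description of what the RDKit matcher actually does: it assumes a naive backtracking search that assigns the single-atom fragments first, and the helper names are made up for illustration.

```python
import math

# SMILES of the slow molecule quoted in the message above.
TARGET = ("O[C@H]1[C@H](O)[C@@H](O)[C@@H](O)[C@H]"
          "(O[C@H]2[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)[C@@H]2O)[C@H]1O")


def count_oxygens(smiles):
    """Count oxygens written as a bare 'O'.

    Good enough for this particular SMILES, which has no bracketed
    oxygens and no two-letter element symbols containing 'O'.
    """
    return smiles.count("O")


def partial_assignments(n_target, n_query):
    """Ordered ways to place n_query distinct query atoms on n_target atoms.

    ASSUMPTION: a naive matcher tries all of these before failing on the
    sulfur; a real matcher may order or prune the search differently.
    """
    return math.perm(n_target, n_query)


n_oxygens = count_oxygens(TARGET)
# 3, 5, 7 and 9 are the numbers of lone 'O' pieces in the four timed queries.
for lone in (3, 5, 7, 9):
    print(lone, "lone O:", partial_assignments(n_oxygens, lone), "placements")
```

Under these assumptions the count grows from under a thousand placements for three lone oxygens to roughly twenty million for nine, which is at least consistent with the jump in run time seen for the last query.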
From: <gro...@al...> - 2016-04-25 16:56:38
Hi Greg,

Thank you very much for your quick reply and for taking the time to look into this.

As a crude workaround, if I split the dot-disconnected string into its individual, unique components and include each of them in the where clause, the query returns the result rapidly:

select * from rdk.mols where m@>'O' and m@>'OS(O)(=O)=O' and m@>'O.O.O.O.O.O.O.O.O.OS(O)(=O)=O' limit 10;

I suppose this won't help in every case, but it helps.

Best regards,
Greg
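The workaround above can be generated mechanically. The following is a minimal sketch, assuming the table and column names from the session above (rdk.mols, m) and a database driver that uses %s placeholders (psycopg2-style); the function names are hypothetical. It only builds the statement and its parameters and does not talk to a database.

```python
def split_query(smiles):
    """Return the unique dot-separated components of a SMILES, in first-seen order."""
    seen = []
    for part in smiles.split("."):
        if part and part not in seen:
            seen.append(part)
    return seen


def build_prefiltered_query(smiles, table="rdk.mols", column="m", limit=10):
    """Build (sql, params) with one condition per unique component.

    The full query is appended last, as in the workaround, so the result
    set is unchanged. Table and column names are interpolated directly,
    so they must come from trusted code, never from user input.
    """
    parts = split_query(smiles)
    params = list(parts)
    if parts != [smiles]:
        params.append(smiles)  # keep the original, complete query as well
    where = " and ".join(f"{column}@>%s" for _ in params)
    sql = f"select * from {table} where {where} limit %s"
    params.append(limit)
    return sql, params


sql, params = build_prefiltered_query("O.O.O.O.O.O.O.O.O.OS(O)(=O)=O")
print(sql)
print(params)
```

Two caveats: splitting on '.' assumes no ring-closure digits are shared between components, and the database planner is free to evaluate the conditions in any order, so the speed-up is not guaranteed.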
From: Greg L. <gre...@gm...> - 2009-10-21 18:33:13
Dear Evgueni,

On Wed, Oct 21, 2009 at 3:25 PM, Evgueni Kolossov <eko...@gm...> wrote:
> I do not quite understand how SubstructMatch function works - what will be
> in vector matchVect which is <int, int> vector? is it Atomic matches?

It's a vector with information about the match:
atom id in query : atom id in matched molecule

> What is the simplest way to find substructures/superstructures/exact match
> of query structure in the set of molecules?

substructure: use SubstructMatch(mol,query,...)
superstructure: use SubstructMatch(query,mol,...)
exact match: use canonical SMILES

-greg
From: Evgueni K. <eko...@gm...> - 2009-11-04 14:27:05
Hi Greg,

I found that SubstructMatch does not work if the query is a fragment (with * atoms).
Can you suggest a solution for this problem?

Regards,
Evgueni

--
Dr. Evgueni Kolossov (PhD)
eko...@gm...
From: Greg L. <gre...@gm...> - 2009-11-04 18:55:09
Hi Evgueni,

On Wed, Nov 4, 2009 at 3:26 PM, Evgueni Kolossov <eko...@gm...> wrote:
> I found that SubstructMatch would not work if query is a fragment (with *
> atoms).
> Can you suggest solution for this problem?

That's a bug. Dummy atoms (things with atomic number zero) that do not have an isotope specification should match anything. If you have a sourceforge account, please enter the bug, otherwise let me know and I will enter it.

Thanks for finding the problem.
-greg
From: Greg L. <gre...@gm...> - 2009-11-06 05:36:44
On Wed, Nov 4, 2009 at 7:54 PM, Greg Landrum <gre...@gm...> wrote:
> On Wed, Nov 4, 2009 at 3:26 PM, Evgueni Kolossov <eko...@gm...> wrote:
>> I found that SubstructMatch would not work if query is a fragment (with *
>> atoms).
>> Can you suggest solution for this problem?
>
> That's a bug. Dummy atoms (things with atomic number zero) that do not
> have an isotope specification should match anything. If you have a
> sourceforge account, please enter the bug, otherwise let me know and I
> will enter it.

After going back through the code and thinking about this for a while I'm going to change my original answer: it's not a bug that standard dummy atoms only match other dummy atoms. When I saw the "*" in the original message I started thinking about the QueryAtoms produced by a "*" in SMARTS, which definitely should (and do) match other dummies. The behavior with standard Atoms is useful for things like flagging attachment points of R groups on a scaffold.

Here's an example:

[5] >>> f= Chem.MolFromSmiles('c1cccnc1*')
[6] >>> p = Chem.MolFromSmarts('c1cccnc1*')
[9] >>> m = Chem.MolFromSmiles('c1ccc(C)nc1*')

Matching using f, which has dummy Atoms, gives only one match:
[10] >>> m.GetSubstructMatches(f)
Out[10]: ((0, 1, 2, 3, 5, 6, 7),)

But matching using p, which has a QueryAtom built from "*", matches twice:
[11] >>> m.GetSubstructMatches(p)
Out[11]: ((0, 1, 2, 3, 5, 6, 7), (2, 1, 0, 6, 5, 3, 4))

For your use case, I'd suggest replacing the dummies in your fragments with QueryAtoms that have the appropriate query, something like this (not tested):

//---------------------------------------------
#include <GraphMol/RDKitBase.h>
#include <GraphMol/RDKitQueries.h>
using namespace RDKit;

// Replace every dummy atom (atomic number zero) in frag with a
// QueryAtom that matches any atom.
void replaceDummies(RWMol *frag){
  QueryAtom *qat = new QueryAtom();
  qat->setQuery(makeAtomNullQuery());
  for(unsigned int i=0;i<frag->getNumAtoms();++i){
    if(frag->getAtomWithIdx(i)->getAtomicNum()==0){
      // replaceAtom() copies qat, so one template serves every dummy
      frag->replaceAtom(i,qat);
    }
  }
  delete qat;
}
//---------------------------------------------

I hope this helps,
-greg
From: Evgueni K. <eko...@gm...> - 2009-11-06 06:59:33
Hi Greg,

Yes, this is the solution I have been thinking about as well, but there are two problems:
1. It will slow down the mapping process, which is slow already.
2. What atom should be used for the replacement?

What if I just remove this atom (or these atoms)?

Regards,
Evgueni
From: Evgueni K. <eko...@gm...> - 2009-11-06 08:03:29
Greg,

I think you should distinguish between dummy atoms and connection points: for fragments it is connection points we are talking about. So the matching process is supposed to ignore this atom (but not the bond!). Maybe just add another bool flag to let the user select a different behavior?

Regards,
Evgueni
From: Greg L. <gre...@gm...> - 2009-11-07 09:33:12
Combining two answers into one:

On Fri, Nov 6, 2009 at 7:59 AM, Evgueni Kolossov <eko...@gm...> wrote:
> Hi Greg,
>
> Yes, this is solution I been thinking about as well but there is 2 problems:
> 1. It will slow down mapping process which is slow already
> 2. What atom to use for replacement?

I'm not sure I understand what you mean about slowing down the mapping process. If you replace the dummies in your fragments with query atoms, as I proposed in the sample code in my earlier message, the substructure search should not be substantially slower. The replacement itself also won't take that long, unless you really have a *lot* of fragments.

On Fri, Nov 6, 2009 at 9:03 AM, Evgueni Kolossov <eko...@gm...> wrote:
> I think you should distinguish between dummy atoms and connection points -
> for fragments it is connection points we are talking about.

The code doesn't understand anything about connection points... it just has atoms. Dummy atoms are atoms with atomic number zero. The substructure matching code applied to normal Atoms (i.e. not QueryAtoms) compares two atoms by checking to see if their atomic numbers match, so dummies match dummies. Additionally, when isotopes are specified, it checks that the specified isotopes match. QueryAtoms, on the other hand, allow client code to specify the function that's used for matching. The example I provided showed how to use a function that matches any atom, which I think is what you are looking for.

> So, it suppose
> to ignore this atom (but not bond!) during matching process. May be just add
> another bool flag to allow user select different behavior?

The substructure matching uses atoms and bonds, and returns the results as lists of atom indices; how (and why) would you propose to ignore an atom but not a bond?

-greg
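The difference between the two matching rules described above can be illustrated with a toy backtracking matcher. This is a sketch for illustration only, not RDKit's implementation: atoms are reduced to atomic numbers, bond orders and aromaticity are ignored, and all names (element, any_atom, find_matches) are invented here. The graphs are hand-coded from the 'c1cccnc1*' query and the 'c1ccc(C)nc1*' molecule used earlier in the thread.

```python
def element(atomic_num):
    """Plain-atom rule: match only atoms with the same atomic number."""
    return lambda z: z == atomic_num


def any_atom(z):
    """Query-atom rule: match any atom at all."""
    return True


def _adjacency(n_atoms, bonds):
    adj = [set() for _ in range(n_atoms)]
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def find_matches(q_atoms, q_bonds, m_atoms, m_bonds):
    """Enumerate mappings of the query onto the molecule by backtracking.

    q_atoms is a list of predicates on atomic numbers, m_atoms a list of
    atomic numbers. Entry i of each returned tuple is the molecule atom
    index assigned to query atom i.
    """
    q_adj = _adjacency(len(q_atoms), q_bonds)
    m_adj = _adjacency(len(m_atoms), m_bonds)
    matches = []

    def extend(mapping):
        q_idx = len(mapping)
        if q_idx == len(q_atoms):
            matches.append(tuple(mapping))
            return
        for m_idx, z in enumerate(m_atoms):
            if m_idx in mapping or not q_atoms[q_idx](z):
                continue
            # Bonds to already-placed query neighbours must exist in the molecule.
            if all(mapping[n] in m_adj[m_idx] for n in q_adj[q_idx] if n < q_idx):
                extend(mapping + [m_idx])

    extend([])
    return matches


C, N, DUMMY = 6, 7, 0
# Molecule 'c1ccc(C)nc1*': atoms 0-3 ring C, 4 methyl C, 5 N, 6 ring C, 7 dummy.
mol_atoms = [C, C, C, C, C, N, C, DUMMY]
mol_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (3, 5), (5, 6), (6, 0), (6, 7)]
# Query 'c1cccnc1*': atoms 0-3 C, 4 N, 5 C, 6 is the '*'.
query_bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (5, 6)]
ring = [element(C)] * 4 + [element(N), element(C)]

plain = find_matches(ring + [element(DUMMY)], query_bonds, mol_atoms, mol_bonds)
wild = find_matches(ring + [any_atom], query_bonds, mol_atoms, mol_bonds)
print(len(plain), len(wild))
```

With the plain dummy, the '*' can only land on the molecule's own dummy atom, so there is a single match; with the match-any rule it can also land on the methyl carbon, which adds the second match. The bond from the '*' to its ring neighbour is enforced in both cases, so the atom is matched loosely while the bond is still taken into account.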
From: Evgueni K. <eko...@gm...> - 2009-11-07 11:35:20
Thanks Greg,

I have calculated that using this replacement slows things down by about 30%, which is significant for big datasets.

>The substructure matching uses atoms and bonds, and returns the
>results as lists of atom indices; how (and why) would you propose to
>ignore an atom but not a bond?

I mean: take the bond into account as it is, but use "match any" for the dummy atom.

Regards,
Evgueni
From: Greg L. <gre...@gm...> - 2009-11-07 13:48:59
On Sat, Nov 7, 2009 at 12:35 PM, Evgueni Kolossov <eko...@gm...> wrote:
> I have calculated it will slow down on about 30% using this replacement
> which is significant for big datasets.

Agreed, that's a huge difference. How does it come about? Where is the time being spent?

-greg
From: Evgueni K. <eko...@gm...> - 2009-11-07 14:45:05
Hi Greg,

I have not done full profiling. The figure comes just from the difference between the run time with and without the dummy replacement.

Regards,
Evgueni
From: Greg L. <gre...@gm...> - 2009-11-07 16:34:47
On Sat, Nov 7, 2009 at 3:44 PM, Evgueni Kolossov <eko...@gm...> wrote:
> I have not done full profiling - this came just from the difference between
> time with and without Replace Dummy

Are you doing the replace dummy for each fragment every time before you do a search, or do you do it just once?

I would guess that replacing the dummy atoms shouldn't take very long at all, and then doing the searches should also be reasonably quick. One complication might be that having the query atoms will return a lot more matches than the non-query dummies; this will naturally take longer.

-greg
From: Evgueni K. <eko...@gm...> - 2009-11-07 16:43:34
>are you doing the replace dummy for each fragment every time before
>you do a search or do you do it just once?

I am iterating through all the structures and all the fragments, so:

for each structure do
    for each fragment (and need to replace dummy here)

I can probably do it another way:

for each fragment do
    for each structure

In this case I will need to do it only once for each fragment.

Regards,
Evgueni
From: Greg L. <gre...@gm...> - 2009-11-07 16:47:28
On Sat, Nov 7, 2009 at 5:43 PM, Evgueni Kolossov <eko...@gm...> wrote:
>>are you doing the replace dummy for each fragment every time before
>>you do a search or do you do it just once?
> I am iterating through all the structures and all the fragments:
> so for each structure do
>     for each fragment ( and need to replace dummy here)
>
> probably can do it another way:
> for each fragment do
>     for each structure
>
> In this case will need to do it only once for each fragment

Yes, I imagine that will help a lot.

or:

for each fragment do:
    replace dummy atom
for each structure do
    for each fragment do
        something

-greg
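The effect of moving the preparation step out of the inner loop can be shown by counting calls. The sketch below uses plain strings and stand-in functions (prepare, search) that are invented for illustration; the substring test is not chemistry and merely gives the loops something to do.

```python
def prepare(fragment, stats):
    """Stand-in for the dummy-atom replacement; counts how often it runs."""
    stats["prepare_calls"] += 1
    return fragment.replace("*", "")


def search(structure, query):
    """Stand-in for the substructure search (a plain substring test)."""
    return query in structure


def prepare_inside_loop(structures, fragments):
    """Original ordering: the fragment is prepared again for every structure."""
    stats = {"prepare_calls": 0}
    hits = []
    for s in structures:
        for f in fragments:
            if search(s, prepare(f, stats)):
                hits.append((s, f))
    return hits, stats["prepare_calls"]


def prepare_once(structures, fragments):
    """Suggested ordering: prepare every fragment once, then search."""
    stats = {"prepare_calls": 0}
    prepared = [(f, prepare(f, stats)) for f in fragments]
    hits = []
    for s in structures:
        for f, q in prepared:
            if search(s, q):
                hits.append((s, f))
    return hits, stats["prepare_calls"]


structures = ["c1ccccc1C", "CCO", "CCN"]
fragments = ["CC*", "N*"]
print(prepare_inside_loop(structures, fragments)[1])
print(prepare_once(structures, fragments)[1])
```

Both orderings return the same hits, but the first one prepares each fragment once per structure, while the second prepares each fragment exactly once regardless of how many structures are searched.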