Re: [Rdkit-discuss] Substructure search
Open-Source Cheminformatics and Machine Learning
Brought to you by:
glandrum
From: Greg L. <gre...@gm...> - 2016-04-24 09:29:24
|
Hi Greg, On Sat, Apr 23, 2016 at 8:41 PM, <gro...@al...> wrote: > > Very nice work on this project! > Thanks! > Sorry if this is a known issue. I looked through the mailing lists and > didn't see the same problem listed. > > Nope, this is new, at least as far as I remember. There are two things going wrong here: 1) The fingerprint isn't doing a good job of screening results. 2) The long-running queries aren't being properly stopped. The first one is comparatively easy to explain: the patterns that are used to build the fingerprint that's used for screening are quite small, so they don't directly cover the long chains. The one impact of the long chain is to increase the count of the number of times that a pattern occurs, but this turns out to not be particularly effective. The second problem I don't have an easy answer to. When I try this on my linux box with the long carbon chains, I am able to terminate a query with ^C or statement_timout: chembl_20=# select * from rdk.mols where m@>'CCCCCCCCCCCCCCCCCCCCCCCCCBr' limit 10; ^CCancel request sent ERROR: canceling statement due to user request Time: 5236.520 ms chembl_20=# set statement_timeout=5000; SET Time: 0.148 ms chembl_20=# select * from rdk.mols where m@>'CCCCCCCCCCCCCCCCCCCCCCCCCBr' limit 10; ERROR: canceling statement due to statement timeout Time: 5003.716 ms But this doesn't work for your query with the dot-disconnected oxygens, there I have to kill the query manually. I at first thought it might be due to the index scan, but turning that off doesn't help. It is certainly a function of the size of the query. Here's a small session run without the index: chembl_20=# set enable_bitmapscan=false;set enable_indexscan=false;set statement_timeout=5000; SET Time: 0.149 ms SET Time: 0.027 ms SET Time: 0.029 ms chembl_20=# select * from rdk.mols where m@>'O.O.O.OS(O)(=O)=O' limit 10; ERROR: canceling statement due to statement timeout Time: 5164.735 ms chembl_20=# select * from rdk.mols where m@>'O.O.O.O.O.OS(O)(=O)=O' limit 10; ERROR: canceling statement due to statement timeout Time: 5403.237 ms chembl_20=# select * from rdk.mols where m@>'O.O.O.O.O.O.O.OS(O)(=O)=O' limit 10; ERROR: canceling statement due to statement timeout Time: 5930.456 ms chembl_20=# select * from rdk.mols where m@>'O.O.O.O.O.O.O.O.O.OS(O)(=O)=O' limit 10; ERROR: canceling statement due to statement timeout Time: 36374.204 ms Clearly something happens with that last one. Here's my guess: The highly redundant query is getting hung up on one large molecule where there are a large number of possible matches. The substructure engine is taking a long time to determine whether or not that particular molecule has a match. PostgreSQL can only interrupt the query when that call returns (the substructure engine itself has no built-in timeout). This one is easy, though time consuming, to track down. I'll see if I can do so. An aside: the fingerprint is also not going to work particularly well for queries with large numbers of dot-disconnected pieces, particularly if those pieces are single atoms. The fingerprint doesn't set any bits for individual atoms (which is something that should change). -greg When I perform a substructure search using the postgres cartridge, >99% > of the time it works perfectly and is incredibly fast. Sometimes I > encounter situations where the system never returns a result, even after > many hours on a small dataset. A good example is this: > > select count(substance_id) from substance where > rdkmol@>'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBr' > > (rdkmol is type mol with the index in place) > > The only way to stop is by restarting postgres. > > Interestingly though, the following returns the count rather quickly: > > select count(substance_id) from substance where > rdkmol@>'CCCCCCCCCCCCCCBr' > > I've encountered other examples where repeated atoms or components, such > as the O's in the example below cause the same problem: > > select count(substance_id) from substance where > rdkmol@>'O.O.O.O.O.O.O.O.O.O.OS(O)(=O)=O' > > I'd like to be able to run this on an internal webserver. When the > query hangs, the cpu is at ~100%. Unfortunately, setting the postgres > statement_timeout parameter does not help in this case. > > Any suggestions on how to improve the query or how to kill it after a > certain amount of time without restarting postgres? > > Thanks a lot, > > Greg > > > > > > > > > ------------------------------------------------------------------------------ > Find and fix application performance issues faster with Applications > Manager > Applications Manager provides deep performance insights into multiple > tiers of > your business applications. It resolves application problems quickly and > reduces your MTTR. Get your free trial! > https://ad.doubleclick.net/ddm/clk/302982198;130105516;z > _______________________________________________ > Rdkit-discuss mailing list > Rdk...@li... > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss > |