Re: [Rdkit-discuss] SMARTS queries using FP in Postgres Cartridge
Open-Source Cheminformatics and Machine Learning
Brought to you by:
glandrum
|
From: Greg L. <gre...@gm...> - 2012-02-24 04:44:42
|
Jean-Paul, On Thu, Feb 23, 2012 at 11:49 AM, JP <jea...@in...> wrote: > > > I have a bunch of SMARTS patterns (around a 100) I'd like to apply to an > RDKit enabled postgres database. > > So I have a query like, say: > select * from molecule where > rdkitmol@>'[!#1;!#6;!#7;!#8;!#9;!#16;!Cl;!Br;!I]'::qmol; > > But this is taking literally ages on a few million molecules (I'm on my > third cup of Grande coffee). > > Is there any way to use one of the fingerprints I have created to speed this > up? Here's how I was going to answer: ---------- Assuming that you created an index on the mol column of the table that contains your molecules (that's the "create index molidx on rdk.mols using gist(m);" step on this page: http://code.google.com/p/rdkit/wiki/LoadingChEMBL), then the fingerprints should be being used already. The problem is probably that the fingerprints are not particularly effective when filtering using SMARTS. Here's a demonstration of two searches on the ChEMBL molecule table (a bit over 1 million rows). First use SMILES: chembl_12=# select count(*) from rdk.mols where m@>'O=c1ccc2ccccc2o1'; count ------- 9231 (1 row) Time: 15185.272 ms Then try it with SMARTS: chembl_12=# select count(*) from rdk.mols where m@>'O=c1ccc2ccccc2o1'::qmol; count ------- 9231 (1 row) Time: 60181.999 ms ---------- I was then going to show you how to check that the index is in fact being used; chembl_12=# explain select count(*) from rdk.mols where m@>'O=c1ccc2ccccc2o1'; QUERY PLAN -------------------------------------------------------------------------------- Aggregate (cost=3985.61..3985.62 rows=1 width=0) -> Bitmap Heap Scan on mols (cost=121.71..3982.92 rows=1076 width=0) Recheck Cond: (m @> 'O=c1oc2c(cccc2)cc1'::mol) -> Bitmap Index Scan on molidx (cost=0.00..121.44 rows=1076 width=0) Index Cond: (m @> 'O=c1oc2c(cccc2)cc1'::mol) (5 rows) chembl_12=# explain select count(*) from rdk.mols where m@>'O=c1ccc2ccccc2o1'::qmol; QUERY PLAN ---------------------------------------------------------------------------------- Aggregate (cost=75357.40..75357.41 rows=1 width=0) -> Seq Scan on mols (cost=0.00..75354.71 rows=1076 width=0) Filter: (m @> 'O=c2:,-c:,-c:,-c1:,-c:,-c:,-c:,-c:,-c:,-1:,-o:,-2'::qmol) (3 rows) But then I realized from that last result that the fingerprint index isn't actually being used for queries with qmols; it's always doing a sequential scan. As soon as I get done scratching my head I will put this in the bug tracker. Still, even if the index were being used I would not expect the performance for SMARTS-based queries to be as good as that for SMILES-based queries; the fingerprint just is not going to be as effective. To solve your problem in the short term: you can always apply the brute force approach of running the queries in parallel. So split your query file into, for example, four parts and then execute those four sets of queries simultaneously. Postgres is quite good at making use of multiple CPUs/threads when multiple queries are executing at once. -greg |