Re: [Rdkit-discuss] SMARTS queries using FP in Postgres Cartridge

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Jean-Paul,

On Thu, Feb 23, 2012 at 11:49 AM, JP <jea...@in...> wrote:
>
>
> I have a bunch of SMARTS patterns (around a 100) I'd like to apply to an
> RDKit enabled postgres database.
>
> So I have a query like, say:
> select * from molecule where
> rdkitmol@>'[!#1;!#6;!#7;!#8;!#9;!#16;!Cl;!Br;!I]'::qmol;
>
> But this is taking literally ages on a few million molecules (I'm on my
> third cup of Grande coffee).
>
> Is there any way to use one of the fingerprints I have created to speed this
> up?

Here's how I was going to answer:

----------
Assuming that you created an index on the mol column of the table that
contains your molecules (that's the "create index molidx on rdk.mols
using gist(m);" step on this page:
http://code.google.com/p/rdkit/wiki/LoadingChEMBL), then the
fingerprints should be being used already.

The problem is probably that the fingerprints are not particularly
effective when filtering using SMARTS.

Here's a demonstration of two searches on the ChEMBL molecule table (a
bit over 1 million rows).

First use SMILES:
chembl_12=# select count(*) from rdk.mols where m@>'O=c1ccc2ccccc2o1';
 count
-------
  9231
(1 row)

Time: 15185.272 ms

Then try it with SMARTS:
chembl_12=# select count(*) from rdk.mols where m@>'O=c1ccc2ccccc2o1'::qmol;
 count
-------
  9231
(1 row)

Time: 60181.999 ms

----------

I was then going to show you how to check that the index is in fact being used;

chembl_12=# explain select count(*) from rdk.mols where m@>'O=c1ccc2ccccc2o1';
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=3985.61..3985.62 rows=1 width=0)
   ->  Bitmap Heap Scan on mols  (cost=121.71..3982.92 rows=1076 width=0)
         Recheck Cond: (m @> 'O=c1oc2c(cccc2)cc1'::mol)
         ->  Bitmap Index Scan on molidx  (cost=0.00..121.44 rows=1076 width=0)
               Index Cond: (m @> 'O=c1oc2c(cccc2)cc1'::mol)
(5 rows)

chembl_12=# explain select count(*) from rdk.mols where
m@>'O=c1ccc2ccccc2o1'::qmol;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Aggregate  (cost=75357.40..75357.41 rows=1 width=0)
   ->  Seq Scan on mols  (cost=0.00..75354.71 rows=1076 width=0)
         Filter: (m @>
'O=c2:,-c:,-c:,-c1:,-c:,-c:,-c:,-c:,-c:,-1:,-o:,-2'::qmol)
(3 rows)

But then I realized from that last result that the fingerprint index
isn't actually being used for queries with qmols; it's always doing a
sequential scan. As soon as I get done scratching my head I will put
this in the bug tracker.

Still, even if the index were being used I would not expect the
performance for SMARTS-based queries to be as good as that for
SMILES-based queries; the fingerprint just is not going to be as
effective.

To solve your problem in the short term: you can always apply the
brute force approach of running the queries in parallel. So split your
query file into, for example, four parts and then execute those four
sets of queries simultaneously. Postgres is quite good at making use
of multiple CPUs/threads when multiple queries are executing at once.

-greg

Re: [Rdkit-discuss] SMARTS queries using FP in Postgres Cartridge

Open-Source Cheminformatics and Machine Learning

Re: [Rdkit-discuss] SMARTS queries using FP in Postgres Cartridge