Re: [Rdkit-discuss] SMARTS queries using FP in Postgres Cartridge
      From: Greg L. <gre...@gm...> - 2012-03-04 17:54:44
      
On Fri, Feb 24, 2012 at 5:44 AM, Greg Landrum <gre...@gm...> wrote:
>
> But then I realized from that last result that the fingerprint index
> isn't actually being used for queries with qmols; it's always doing a
> sequential scan. As soon as I get done scratching my head I will put
> this in the bug tracker.
>
> Still, even if the index were being used I would not expect the
> performance for SMARTS-based queries to be as good as that for
> SMILES-based queries; the fingerprint just is not going to be as
> effective.
Ok, I have this resolved in svn in some form.
In order to make it work in a reasonable manner, I had to add a new
function "mol_from_smarts()" that gives you a normal mol from a SMARTS
string.
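To make the difference concrete, here is a minimal sketch of the two ways a SMARTS pattern can be handed to the substructure operator. It reuses the rdk.mols table and m column from the queries in this message. The '...'::qmol cast is an assumption about how a query molecule is written in your cartridge version, so adjust it if your build spells it differently.

```sql
-- Normal mol built from a SMARTS string, using the new function described above.
select count(*) from rdk.mols where
m@>mol_from_smarts('O=c1ccc2ccccc2o1');

-- The same pattern passed as a qmol (assumed cast syntax). Per the quoted
-- message, this form was always falling back to a sequential scan.
select count(*) from rdk.mols where
m@>'O=c1ccc2ccccc2o1'::qmol;
```

Both queries ask the same substructure question; what differs is the type of the right-hand operand, and therefore whether the fingerprint index is eligible.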
Here are some example queries.
Start with SMILES:
chembl_12=# select count(*) from rdk.mols where
m@>'O=c1ccc2ccccc2o1';
 count
-------
  9231
(1 row)
Time: 14156.885 ms
The same query as SMARTS takes the same amount of time:
chembl_12=# select count(*) from rdk.mols where
m@>mol_from_smarts('O=c1ccc2ccccc2o1');
 count
-------
  9231
(1 row)
Time: 13945.446 ms
Adding a query feature slows things down somewhere between dramatically:
chembl_12=# select count(*) from rdk.mols where
m@>mol_from_smarts('[O,N,C]=c1ccc2ccccc2o1');
 count
-------
  9462
(1 row)
Time: 36799.979 ms
and very dramatically:
chembl_12=# select count(*) from rdk.mols where
m@>mol_from_smarts('O=c1ccc2ccccc2[o,n]1');
 count
-------
 12441
(1 row)
Time: 53893.576 ms
Note that the degradation of performance caused by query features is
going to seriously affect your queries as well.
For the example you give in your original post, which is a single
atom, the index doesn't help at all:
chembl_12=# select count(*) from rdk.mols where
m@>mol_from_smarts('[!#1;!#6;!#7;!#8;!#9;!#16;!Cl;!Br;!I]');
 count
-------
 31656
(1 row)
Time: 57572.277 ms
chembl_12=# set enable_indexscan=off;
SET
Time: 0.173 ms
chembl_12=# set enable_bitmapscan=off;
SET
Time: 0.153 ms
chembl_12=# select count(*) from rdk.mols where
m@>mol_from_smarts('[!#1;!#6;!#7;!#8;!#9;!#16;!Cl;!Br;!I]');
 count
-------
 31656
(1 row)
Time: 55173.435 ms
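If you want to check what the planner is doing rather than infer it from timings, something like the following should work. This is only a sketch: explain and reset are standard PostgreSQL, but the plan you get back depends on your data, statistics, and cartridge version, so I am not showing any expected output.

```sql
-- Show the plan chosen for the single-atom query without running it.
explain select count(*) from rdk.mols where
m@>mol_from_smarts('[!#1;!#6;!#7;!#8;!#9;!#16;!Cl;!Br;!I]');

-- Restore the planner settings that were switched off above.
reset enable_indexscan;
reset enable_bitmapscan;
```

Running the explain once before and once after the two "set ...=off" statements shows whether the index scan is being picked at all for a given query.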
Best,
-greg