Re: [Rdkit-discuss] SMARTS queries using FP in Postgres Cartridge
      From: JP <jea...@in...> - 2012-03-06 11:02:02
      
Just for comparison, how large is your test database?
Do you think I should convert '[!#1;!#6;!#7;!#8;!#9;!#16;!Cl;!Br;!I]' to
multiple SMILES queries instead?  Do you figure that would be faster?
-
Jean-Paul Ebejer
Early Stage Researcher
On 4 March 2012 17:54, Greg Landrum <gre...@gm...> wrote:
> On Fri, Feb 24, 2012 at 5:44 AM, Greg Landrum <gre...@gm...>
> wrote:
> >
> > But then I realized from that last result that the fingerprint index
> > isn't actually being used for queries with qmols; it's always doing a
> > sequential scan. As soon as I get done scratching my head I will put
> > this in the bug tracker.
> >
> > Still, even if the index were being used I would not expect the
> > performance for SMARTS-based queries to be as good as that for
> > SMILES-based queries; the fingerprint just is not going to be as
> > effective.
>
> Ok, I have this resolved in svn in some form.
> In order to make it work in a reasonable manner, I had to add a new
> function "mol_from_smarts()" that gives you a normal mol from a SMARTS
> string.
>
> Here are some example queries.
>
> Start with SMILES:
> chembl_12=# select count(*) from rdk.mols where
> m@>'O=c1ccc2ccccc2o1';
>  count
> -------
>  9231
> (1 row)
>
> Time: 14156.885 ms
>
> The same query as SMARTS takes the same amount of time:
> chembl_12=# select count(*) from rdk.mols where
> m@>mol_from_smarts('O=c1ccc2ccccc2o1');
>  count
> -------
>  9231
> (1 row)
>
> Time: 13945.446 ms
>
> Adding a query feature slows things down somewhere between dramatically:
>
> chembl_12=# select count(*) from rdk.mols where
> m@>mol_from_smarts('[O,N,C]=c1ccc2ccccc2o1');
>  count
> -------
>  9462
> (1 row)
>
> Time: 36799.979 ms
>
> and very dramatically:
> chembl_12=# select count(*) from rdk.mols where
> m@>mol_from_smarts('O=c1ccc2ccccc2[o,n]1');
>  count
> -------
>  12441
> (1 row)
>
> Time: 53893.576 ms
>
>
> Note that the degradation of performance caused by query features is
> going to seriously affect your queries as well.
> For the example you give in your original post, which is a single
> atom, the index doesn't help at all:
>
> chembl_12=# select count(*) from rdk.mols where
> m@>mol_from_smarts('[!#1;!#6;!#7;!#8;!#9;!#16;!Cl;!Br;!I]');
>  count
> -------
>  31656
> (1 row)
>
> Time: 57572.277 ms
> chembl_12=# set enable_indexscan=off;
> SET
> Time: 0.173 ms
> chembl_12=# set enable_bitmapscan=off;
> SET
> Time: 0.153 ms
> chembl_12=# select count(*) from rdk.mols where
> m@>mol_from_smarts('[!#1;!#6;!#7;!#8;!#9;!#16;!Cl;!Br;!I]');
>  count
> -------
>  31656
> (1 row)
>
> Time: 55173.435 ms
>
>
> Best,
> -greg
>
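
The point in the quoted message about query features weakening the fingerprint screen can be illustrated with a toy model. This is only a sketch under loud assumptions: the "fingerprint" here is a set of single atoms and adjacent atom pairs on a linear chain, which is not the fingerprint the RDKit cartridge uses, and the names path_features, required_features and screen are hypothetical. The idea it shows is general, though: a screen can only require features that every possible match must contain, so each atom list in a query shrinks the required set.

```python
from itertools import product


def path_features(atoms):
    """Toy fingerprint: single atoms plus unordered adjacent pairs."""
    feats = set(atoms)
    feats.update("-".join(sorted(pair)) for pair in zip(atoms, atoms[1:]))
    return feats


def required_features(query):
    """Features that every match of the query must contain.

    `query` is a list of positions; each position is a tuple of allowed
    atom symbols. Only features shared by all concrete expansions of the
    query can safely be required by the screen.
    """
    expansions = [path_features(list(combo)) for combo in product(*query)]
    return set.intersection(*expansions)


def screen(query, database):
    """Return names of molecules that survive the fingerprint screen."""
    required = required_features(query)
    return [name for name, atoms in database.items()
            if required <= path_features(atoms)]


# Hypothetical four-molecule "database" of linear heavy-atom chains.
db = {
    "ethanol": ["C", "C", "O"],
    "ethylamine": ["C", "C", "N"],
    "propane": ["C", "C", "C"],
    "hydrazine": ["N", "N"],
}

exact = [("C",), ("O",)]          # like a plain SMILES query: C-O
with_list = [("C",), ("O", "N")]  # like a SMARTS atom list: C-[O,N]
single_atom = [("O", "N")]        # a single atom with alternatives: [O,N]

print(screen(exact, db))
print(screen(with_list, db))
print(screen(single_atom, db))
```

In this toy, the exact query requires three features and only one molecule survives the screen; the atom-list query can require only "C", so three survive; the single-atom query with alternatives requires nothing, so every molecule must go on to the full substructure match. That last case mirrors the single-atom query timed above, where disabling index scans made no meaningful difference.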