Thread: [Rdkit-discuss] Need for speed -- postgresql / rdkit use of indices(/indexes)
Open-Source Cheminformatics and Machine Learning
Brought to you by:
glandrum
From: JP <jea...@in...> - 2014-07-14 15:51:54
|
Hi there, TL;DR Some queries using the index are slower than queries not using it. I have been looking (without much success until now) at improving the speed of a postgresql database with rdkit installed. First some version preliminaries: Postgresql: 9.3.4 RDKit postgresql cartridge version: 0.72.0 (via select rdkit_version();) My database has only one simple table "molecule" (described below), and this was created following the instructions in http://www.rdkit.org/docs/Cartridge.html religiously. moldb-# \d molecule Table "public.molecule" Column | Type | Modifiers --------+---------+----------- id | integer | mol | mol | Indexes: "molidx" gist (mol) This tables contains 15,352,756 molecules. Now, when I make a substructure search which returns a good number of rows (e.g. aromatic ring) the query takes "forever" (137 minutes, more than what Germany employed to get the better of Messi & Co. yesterday). moldb=# select id, mol from molecule where mol@>'c1ccccc1'; Time: 8260085.134 ms The query plan uses two queries (nested) as can be seen by: moldb=# EXPLAIN (ANALYZE, BUFFERS) select id, mol from molecule where mol@ >'c1ccccc1'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on molecule (cost=3439.42..57845.93 rows=15355 width=348) (actual time=67358.636..1408828.792 rows=12555041 loops=1) Recheck Cond: (mol @> 'c1ccccc1'::mol) Rows Removed by Index Recheck: 70657 Buffers: shared hit=33551 read=1516077 -> Bitmap Index Scan on molidx (cost=0.00..3435.58 rows=15355 width=0) (actual time=66891.010..66891.010 rows=12625698 loops=1) Index Cond: (mol @> 'c1ccccc1'::mol) Buffers: shared read=825325 >From the postgresql documentation of similar output: "Here the planner has decided to use a two-step plan: the child plan node visits an index to find the locations of rows matching the index condition, and then the upper plan node actually fetches those rows from the table itself. Fetching rows separately is much more expensive than reading them sequentially, but because not all the pages of the table have to be visited, this is still cheaper than a sequential scan. (The reason for using two plan levels is that the upper plan node sorts the row locations identified by the index into physical order before reading them, to minimize the cost of separate fetches. The "bitmap" mentioned in the node names is the mechanism that does the sorting.)" But random access is more expensive than sequential IO. So my next thought was to disable the use of the index in the query. A temp. way to do this is via these commands (note that this should only be done for debugging purposes and not on live systems): moldb=# set enable_hashagg=off; moldb=# set enable_hashjoin=off; moldb=# set enable_indexscan=off; moldb=# set enable_mergejoin=off; moldb=# set enable_nestloop=off; moldb=# set enable_tidscan=off; moldb=# set enable_sort=off; In fact my explain result now looks like (happily bypassing the index): moldb=# explain select id, mol from molecule where mol@>'c1ccccc1'; QUERY PLAN ------------------------------------------------------------------- Seq Scan on molecule (cost=0.00..923708.29 rows=15355 width=348) Filter: (mol @> 'c1ccccc1'::mol) (2 rows) And running this query now takes almost half the time, 81 minutes (still a draw between the two sides). moldb=# select id, mol from molecule where mol@>'c1ccccc1'; Time: 4888271.991 ms Note that a query which returns a few molecules, returns in super quick time (<1s): moldb=# select * from molecule where mol@>'c1cccc2c1nncc2'; -- there are ~6,000 of these Time: 4476.669 ms But substructures which are found more often, return more slowly. moldb=# select count(*) from molecule where mol@>'c1ccccc1C(=O)OC'; count -------- 422439 (1 row) Time: 66134.405 ms I wonder if there is anyone who has looked at this/has similar experience and has some answers to the following questions: - Am I doing anything obviously wrong? / Is there an easy fix to speed these queries up? (without using limit) - Is there a way for RDKit to suggest to Postgresql to NOT use the index in certain cases? - Which RDKit code should I be looking at to try to optimize this? More specifically, even looking for molecules with a carbon atom 'C' triggers a search using the index, even if all my molecules in the database have a carbon atom (so a sequential search would make more sense here). - I have tried to play around with the table statistics - alter table molecule alter mol set statistics 1000; - this should be the thing which suggests to the query planner whether to use the index or not (based on a sample of the table) - but I haven't noticed any difference. Does the RDKit indexing technology support this? Apologies for the long email and happy Monday to everyone! JP - Jean-Paul Ebejer Early Stage Researcher |
From: Greg L. <gre...@gm...> - 2014-07-17 04:38:48
|
Hi JP, I'm between short trips, so this reply is delayed and will be somewhat short. On Mon, Jul 14, 2014 at 5:19 PM, JP <jea...@in...> wrote: > Hi there, > > TL;DR Some queries using the index are slower than queries not using it. > <snip> > I wonder if there is anyone who has looked at this/has similar experience > and has some answers to the following questions: > > - Am I doing anything obviously wrong? / Is there an easy fix to speed > these queries up? (without using limit) > not based on the skim I just did. > - Is there a way for RDKit to suggest to Postgresql to NOT use the index > in certain cases? > You can certainly do it yourself on a query-by-query basis, but I'm not aware of any way to do it automatically. > - Which RDKit code should I be looking at to try to optimize this? More > specifically, even looking for molecules with a carbon atom 'C' triggers a > search using the index, even if all my molecules in the database have a > carbon atom (so a sequential search would make more sense here). > - I have tried to play around with the table statistics - alter table > molecule alter mol set statistics 1000; - this should be the thing which > suggests to the query planner whether to use the index or not (based on a > sample of the table) - but I haven't noticed any difference. Does the > RDKit indexing technology support this? > To restate the problem that you are encountering : the index isn't actually screening much of anything out, so using it ends up being a waste of time. What would be required is some way for the postgresql optimizer to recognize either that the index is not going to be effective for a particular query or that the query is going to return a large percentage of the database so that it can switch strategies. I don't know how to do this in an automated way. One manual approach is to have a small (1K-10K) structure table that is representative of your main structure table and do a "pre-query" on that one. If that pre-query returns a sizable fraction of the rows, then you know that you should disable the index scan for that query. This requires some coding on your part, but should certainly help here. -greg |