|
From: Scott C. <ca...@cs...> - 2003-02-26 18:34:41
|
Hello,
This is going to be fairly long.
I am on a continuing quest to improve query performance for queries used
frequently by gbrowse. The most common time consuming query is to find
all of the features in a given range. To do that, there are two types
of queries that can be used, which I will refer to as Query1 and Query2.
Query1 uses the coordinate system I have advocated the last couple of
weeks, with min and max columns and min < max always. (Those column
names should be changed to avoid conflicts with restricted words.)
Query2 uses a union of two queries that use the other coordinate system
(nbeg and nend) combined with strand. Here are examples:
Query1:
select distinct f.name,fl.min,fl.max,fl.strand,f.type_id,f.feature_id
from feature f, featureloc fl
where
fl.srcfeature_id = 1 and
f.feature_id = fl.feature_id and
fl.max >= 390956 and
fl.min <= 393164
Query2:
select distinct f.name,fl.nbeg,fl.nend,fl.strand,f.type_id,f.feature_id
from feature f, featureloc fl
where
fl.srcfeature_id = 1 and
f.feature_id = fl.feature_id and
fl.nbeg <= 393164 and fl.nend >= 390956 and strand=1
union
select distinct f.name,fl.nbeg,fl.nend,fl.strand,f.type_id,f.feature_id
from feature f, featureloc fl
where
fl.srcfeature_id = 1 and
f.feature_id = fl.feature_id and
fl.nbeg >= 390956 and fl.nend <= 393164 and strand=-1
Here is the odd thing that I can't explain: the performance (read:index
usage) seems to vary by srcfeature_id. Here's a table showing what I
mean (values are msec from explain analyze):
src_id Query1 Query2
1 116433.34 209532.05
2 979.55 1039.81
4 36.76 251.25
5 369.95 901.55
6 108404.04 208322.54
7 798.28 1131.86
Note that srcfeature_id 4 is much faster because there are relatively
few features on it. The biggest difference is srcfeature_ids 1 and 6
taking 3 orders of magnitude longer than the other queries. That is
because the query planner uses seq scans when the srcfeature_id is 1 or
6, but for other srcfeature_ids, it uses the appropriate index (either
featureloc_src_nbeg_nend or featureloc_src_min_max depending on the
query). I am at a total loss as to why postgres would do this; can
anyone with more experience with postgres enlighten me?
Note that this was done with gadfly 3 (filename gadfly_chado_dump2.gz)
if it matters to anyone. I will probably take this to the postgres
mailing list as well to get there input.
Thanks,
Scott
--
------------------------------------------------------------------------
Scott Cain, Ph. D. ca...@cs...
GMOD Coordinator (http://www.gmod.org/) 216-392-3087
Cold Spring Harbor Laboratory
|