[Gmod-schema] Postgresql DBA-type question

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hello,

This is going to be fairly long.  

I am on a continuing quest to improve query performance for queries used
frequently by gbrowse.  The most common time consuming query is to find
all of the features in a given range.  To do that, there are two types
of queries that can be used, which I will refer to as Query1 and Query2.
Query1 uses the coordinate system I have advocated the last couple of
weeks, with min and max columns and min < max always.  (Those column
names should be changed to avoid conflicts with restricted words.)
Query2 uses a union of two queries that use the other coordinate system
(nbeg and nend) combined with strand.  Here are examples:

Query1:
select distinct f.name,fl.min,fl.max,fl.strand,f.type_id,f.feature_id
   from feature f, featureloc fl
     where
     fl.srcfeature_id = 1 and
     f.feature_id  = fl.feature_id and
     fl.max >= 390956 and
     fl.min <= 393164

Query2:
    select distinct f.name,fl.nbeg,fl.nend,fl.strand,f.type_id,f.feature_id
    from feature f, featureloc fl
    where
      fl.srcfeature_id = 1 and
      f.feature_id  = fl.feature_id and
      fl.nbeg <= 393164 and fl.nend >= 390956 and strand=1
   union
    select distinct f.name,fl.nbeg,fl.nend,fl.strand,f.type_id,f.feature_id
    from feature f, featureloc fl
    where
      fl.srcfeature_id = 1 and
      f.feature_id  = fl.feature_id and
      fl.nbeg >= 390956 and fl.nend <= 393164 and strand=-1

Here is the odd thing that I can't explain: the performance (read:index
usage) seems to vary by srcfeature_id.  Here's a table showing what I
mean (values are msec from explain analyze):

src_id      Query1      Query2
1        116433.34   209532.05
2           979.55     1039.81
4            36.76      251.25
5           369.95      901.55
6        108404.04   208322.54
7           798.28     1131.86

Note that srcfeature_id 4 is much faster because there are relatively
few features on it.  The biggest difference is srcfeature_ids 1 and 6
taking 3 orders of magnitude longer than the other queries.  That is
because the query planner uses seq scans when the srcfeature_id is 1 or
6, but for other srcfeature_ids, it uses the appropriate index (either
featureloc_src_nbeg_nend or featureloc_src_min_max depending on the
query).  I am at a total loss as to why postgres would do this; can
anyone with more experience with postgres enlighten me?

Note that this was done with gadfly 3 (filename gadfly_chado_dump2.gz)
if it matters to anyone.  I will probably take this to the postgres
mailing list as well to get there input.

Thanks,
Scott

-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         ca...@cs...
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory