Thread: [SQLObject] PostgreSQL: costly count
SQLObject is a Python ORM.
Brought to you by: ianbicking, phd
From: Frank B. <fb...@fo...> - 2003-12-04 15:10:20
Hallo,

getting at the length of a result with SQLResult's count() function
issues a "select count(*) from X". In my PostgreSQL table, this is
quite expensive, and I am wondering why. Analyzing the query shows
that no index is used:

    => explain SELECT count(*) from normal;
    NOTICE:  QUERY PLAN:

    Aggregate  (cost=3512.10..3512.10 rows=1 width=0)
      ->  Seq Scan on normal  (cost=0.00..3371.28 rows=56328 width=0)

My table looks like this with pg_dump:

    CREATE SEQUENCE "normal_id_seq" start 1 increment 1
        maxvalue 9223372036854775807 minvalue 1 cache 1;

    CREATE TABLE "normal" (
        "id" integer DEFAULT nextval('"normal_id_seq"'::text) NOT NULL,
        "description" text,
        "format" character varying(16),
        "artist" character varying(200),
        "category" character varying(100),
        "title" character varying(200),
        "price" double precision,
        "label" character varying(200),
        "catalog" character varying(20),
        "datum" timestamp with time zone,
        "special" character varying(20),
        Constraint "normal_pkey" Primary Key ("id")
    );

ciao
--
 Frank Barknecht                               _ ______footils.org__
From: Ian B. <ia...@co...> - 2003-12-04 16:35:45
On Dec 4, 2003, at 9:10 AM, Frank Barknecht wrote:
> Hallo,
>
> getting at the length of a result with SQLResult's count() function
> issues a "select count(*) from X". In my PostgreSQL table, this is
> quite expensive, and I am wondering why. Analyzing the query shows
> that no index is used:
>
> => explain SELECT count(*) from normal;
> NOTICE:  QUERY PLAN:

I don't think an index could be used in this instance. You aren't
finding specific rows, you are finding how many total rows there are.

It's kind of dumb that PostgreSQL can't figure this out more quickly,
since it must have internal records of the table's size already
calculated. I know MySQL has specific optimizations for this case
("SELECT count(*) FROM blah"). If you figure something out, be sure to
get back to us; maybe there's something that can be added to SQLObject
to fix this.

--
Ian Bicking | ia...@co... | http://blog.ianbicking.org
From: J-P L. <jp...@si...> - 2003-12-04 16:45:42
This is a common problem with most DBs. select count(id) is usually the
preferred way to do it. (Although I haven't tested the speed diff.)

Ian Bicking wrote:
> On Dec 4, 2003, at 9:10 AM, Frank Barknecht wrote:
>
>> Hallo,
>>
>> getting at the length of a result with SQLResult's count() function
>> issues a "select count(*) from X". In my PostgreSQL table, this is
>> quite expensive, and I am wondering why. Analyzing the query shows
>> that no index is used:
>>
>> => explain SELECT count(*) from normal;
>> NOTICE:  QUERY PLAN:
>
> I don't think an index could be used in this instance. You aren't
> finding specific rows, you are finding how many total rows there are.
>
> It's kind of dumb that PostgreSQL can't figure this out more quickly,
> since it must have internal records of the table's size already
> calculated. I know MySQL has specific optimizations for this case
> ("SELECT count(*) FROM blah"). If you figure something out, be sure
> to get back to us; maybe there's something that can be added to
> SQLObject to fix this.
From: Frank B. <fb...@fo...> - 2003-12-04 16:52:01
Hallo,
J-P Lee wrote:

> This is a common problem with most DBs. select count(id) is usually
> the preferred way to do it. (Although I haven't tested the speed
> diff.)

I did, it's largely the same as count(*) on Postgres, and also uses a
sequential scan through the whole DB. I'm just reading through this
looong thread on pgsql-performance:

http://archives.postgresql.org/pgsql-performance/2003-10/msg00058.php

ciao
--
 Frank Barknecht                               _ ______footils.org__
From: J-P L. <jp...@si...> - 2003-12-04 16:59:46
Frank Barknecht wrote:
> I did, it's largely the same as count(*) on Postgres, and also uses a
> sequential scan through the whole DB. I'm just reading through this
> looong thread on pgsql-performance:
> http://archives.postgresql.org/pgsql-performance/2003-10/msg00058.php

Hmm.. the thread seems to suggest the same thing. I'm not sure if you
can get around the sequential scan. count(id) saves you the time of
fetching all field contents. As the thread points out, this time diff
is significant for large tables. For small ones, though, it's probably
the same as count(*).

--
J-P
From: Frank B. <fb...@fo...> - 2003-12-04 17:26:16
Hallo,
J-P Lee wrote:

> Hmm.. the thread seems to suggest the same thing. I'm not sure if you
> can get around the sequential scan. count(id) saves you the time of
> fetching all field contents.

Fetching field contents is not the problem, I guess, because a count
doesn't (or shouldn't) fetch any contents (we didn't ask for them).
This message, however:

http://archives.postgresql.org/pgsql-performance/2003-10/msg00093.php

suggests doing a "count(pkey)" instead:

<quote>
Have you tried doing

    SELECT count(pkey)

rather than count(*) where pkey is the primary key (assuming you have
a single field that is a primary key or a unique indexed key). This is
MUCH faster in my experience.
</quote>

Note that the pkey is not the id-column, but an index on this column.
I couldn't try this yet, but it might be worth a try.

ciao
--
 Frank Barknecht                               _ ______footils.org__
From: Frank B. <fb...@fo...> - 2003-12-04 17:31:00
Hallo,
Frank Barknecht wrote:

> This message, however:
> http://archives.postgresql.org/pgsql-performance/2003-10/msg00093.php
> suggests doing a "count(pkey)" instead:
>
> <quote>
> Have you tried doing
>
>     SELECT count(pkey)
>
> rather than count(*) where pkey is the primary key (assuming you
> have a single field that is a primary key or a unique indexed key).
> This is MUCH faster in my experience.
> </quote>
>
> Note that the pkey is not the id-column, but an index on this column.

Ah, it seems I probably misread that message, and the author was
indeed talking about the id-column. Well, at least with my
60,000-record table it wasn't faster.

ciao
--
 Frank Barknecht                               _ ______footils.org__
From: Ian B. <ia...@co...> - 2003-12-04 17:42:46
On Dec 4, 2003, at 11:00 AM, J-P Lee wrote:
> Hmm.. the thread seems to suggest the same thing. I'm not sure if you
> can get around the sequential scan. count(id) saves you the time of
> fetching all field contents. As the thread points out, this time diff
> is significant for large tables. For small ones, though, it's
> probably the same as count(*).

count(*) at least has the potential to be more easily optimized (and
now that I think about it, I believe MySQL optimizes all instances of
count(*)). If you are counting any particular field, you have to look
for NULLs (which don't get counted). If you're clever you can tell
that the primary key column can't be NULL; with count(*) you don't
need to be clever at all. And of course there's the optimization that
you don't have to keep any information during your scan, except to
increment a counter for each result (which is true of any count,
except count(distinct ...)).

Oh, but I see from the discussion that it's all about transactions,
and that MySQL does the same thing with transactions as well. Anyway,
found this in that thread:

>>> example: On our message boards each post is a row. The powers that
be like to know how many posts there are total (in addition to
'today'). select count(*) from posts is how it has been done on our
informix db. With our port to PG I instead select reltuples pg_class.
<<<

pg_class is a mysterious table indeed, but it's got lots of good magic
in it. (I'm not even sure if this is the right magic, but worth a
try.)

--
Ian Bicking | ia...@co... | http://blog.ianbicking.org
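Ian's point about NULLs is easy to check; here is a minimal sketch
using Python's stdlib sqlite3 (not PostgreSQL, but the counting
semantics are the same), with a toy stand-in for the `normal` table:

```python
import sqlite3

# Toy stand-in for the "normal" table, with one NULL artist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE normal (id INTEGER PRIMARY KEY, artist TEXT)")
conn.executemany(
    "INSERT INTO normal (artist) VALUES (?)",
    [("Britney",), (None,), ("Someone",)],
)

# count(*) counts rows; count(column) has to skip NULLs in that column;
# count(id) matches count(*) only because a primary key can't be NULL.
total = conn.execute("SELECT count(*) FROM normal").fetchone()[0]
by_artist = conn.execute("SELECT count(artist) FROM normal").fetchone()[0]
by_pkey = conn.execute("SELECT count(id) FROM normal").fetchone()[0]

print(total, by_artist, by_pkey)  # 3 2 3
```

The NULL row is counted by count(*) and count(id), but not by
count(artist), which is exactly the extra cleverness an optimizer
needs when counting a particular column.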
From: Sidnei da S. <si...@aw...> - 2003-12-04 17:55:33
| >>> example: On our message boards each post is a row. The powers
| that be like to know how many posts there are total (in addition to
| 'today'). select count(*) from posts is how it has been done on our
| informix db. With our port to PG I instead select reltuples pg_class.
| <<<
|
| pg_class is a mysterious table indeed, but it's got lots of good
| magic in it. (I'm not even sure if this is the right magic, but
| worth a try.)

Humm... It seems to be the right one, but how can I measure which one
is faster? In my 11k table, both are blazingly fast :)

    zope3=# select reltuples from pg_class where relname = 'aliquot';
     reltuples
    -----------
         11801
    (1 row)

    zope3=# select count(*) from aliquot;
     count
    -------
     11801
    (1 row)

--
Sidnei da Silva <si...@aw...>
http://awkly.org - dreamcatching :: making your dreams come true
http://plone.org/about/team#dreamcatcher

The trouble with computers is that they do what you tell them, not
what you want.
        -- D. Cohen
From: Frank B. <fb...@fo...> - 2003-12-04 22:02:55
Hallo,
Sidnei da Silva wrote:

> Humm... It seems to be the right one, but how can I measure which
> one is faster? In my 11k table, both are blazingly fast :)

Wow, this runs at the speed of light:

    fbar=> explain select reltuples from pg_class where relname = 'normal';
    NOTICE:  QUERY PLAN:

    Seq Scan on pg_class  (cost=0.00..4.49 rows=1 width=4)

    EXPLAIN
    fbar=> explain select count(*) from normal;
    NOTICE:  QUERY PLAN:

    Aggregate  (cost=13239.34..13239.34 rows=1 width=0)
      ->  Seq Scan on normal  (cost=0.00..13098.07 rows=56507 width=0)

    EXPLAIN

Doing both selects by hand, the one with pg_class returns practically
immediately; the standard count has a duration of about 3.4 seconds,
as far as my rhythm feel tells me (but I'm a musician, so that's quite
accurate ;)

Now, how to put this to use?

ciao
--
 Frank Barknecht                               _ ______footils.org__
From: Frank B. <fb...@fo...> - 2003-12-04 22:05:36
Hallo,
Frank Barknecht wrote:

> Now, how to put this to use?

And related: this does not count the results from WHERE-claused
queries like "SELECT COUNT(*) FROM normal WHERE artist LIKE
'%Britney%';", or does it (somehow)?

ciao
--
 Frank Barknecht                               _ ______footils.org__
From: Ian B. <ia...@co...> - 2003-12-04 22:41:40
On Dec 4, 2003, at 4:05 PM, Frank Barknecht wrote:
> Hallo,
> Frank Barknecht wrote:
>
>> Now, how to put this to use?
>
> And related: this does not count the results from WHERE-claused
> queries like "SELECT COUNT(*) FROM normal WHERE artist LIKE
> '%Britney%';", or does it (somehow)?

My impression is that this is an internal value kept to do query
optimization (since table size affects that considerably). It may not
be entirely accurate, and doesn't get updated immediately. It can't be
used with any query.

If you want a query to be fast, you may want to use full-text search,
or otherwise partition your table more, e.g., an artist table, where
you do something like:

    SELECT COUNT(normal.id) FROM normal, artist
    WHERE normal.artist_id = artist.id
      AND (artist.first_name = 'Britney' OR artist.last_name = 'Britney')

Everything in that query can be indexed, and so it will be quite fast
(and more importantly, scale to large databases). LIKE doesn't scale
well (but full text scales better).

If you are doing some sort of paging, you might simply want to cache
the results of the count. Also, if you are doing paging with ordering
(which you probably want), then paging and slicing will only save you
from fetching and constructing all the objects, but the database still
has to look at all the rows (because it can't sort them until it does
that). So it might be faster to pull and cache the full results once,
on the assumption that the user will be looking at later results
eventually.

Hmm... a useful feature would be to implement equality and hashing for
query objects (which are immutable). Then you could construct a query
and use that as a dictionary/cache key for the select results. That
might be too difficult, though, because AND(a, b) == AND(b, a), but
those logic relationships are hard to find. You can always get the
query's string/SQL representation and cache based on that.

--
Ian Bicking | ia...@co... | http://blog.ianbicking.org
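Ian's last suggestion, caching count results keyed on the query's
string/SQL representation, can be sketched in a few lines. This is a
hypothetical illustration, not SQLObject code; CountCache and
run_count are invented names, and fake_count stands in for the real
(expensive) database call:

```python
class CountCache:
    """Cache count results, keyed on the query's SQL string."""

    def __init__(self, run_count):
        self._run_count = run_count  # callable taking SQL, returning an int
        self._cache = {}

    def count(self, sql):
        # The SQL string stands in for query-object equality/hashing,
        # as suggested above; AND(a, b) vs AND(b, a) would miss.
        if sql not in self._cache:
            self._cache[sql] = self._run_count(sql)
        return self._cache[sql]

    def invalidate(self):
        self._cache.clear()  # call after writes that change the counts


# Demo with a stand-in for the real database round-trip.
calls = []

def fake_count(sql):
    calls.append(sql)
    return 56507

cache = CountCache(fake_count)
sql = "SELECT count(*) FROM normal"
first = cache.count(sql)   # hits the "database"
second = cache.count(sql)  # served from the cache; no second call
```

The trade-off is staleness: like reltuples, the cached value lags
behind writes until invalidate() is called.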
From: Frank B. <fb...@fo...> - 2003-12-05 10:58:21
Hallo,
Ian Bicking wrote:

> My impression is that this is an internal value kept to do query
> optimization (since table size affects that considerably). It may
> not be entirely accurate, and doesn't get updated immediately. It
> can't be used with any query.

Yes, you're right. A pity. ;)

> If you are doing some sort of paging, you might simply want to cache
> the results of the count. [...]

Thanks for this tip. I did something similar now for my shop
(normalmailorder.de). As I translate from URLs to SQL queries in a
constant way, and as the product DB is never written partly but always
replaced as a whole by a single CSV file import once per week, I can
be sure to always get the same count for a given HTTP request URL.

So I now introduced a simple CountCache table that has two fields,
"url" and "count". If there's no CountCache.byUrl(some_url), the count
results are fetched from the DB and written to the CountCache. If the
URL is already in the cache, that one gets used. On CSV import the
table is emptied. I could also rebuild it from the URLs used so far;
maybe I'll do that later.

ciao
--
 Frank Barknecht                               _ ______footils.org__
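Frank's persistent CountCache table might look roughly like this; a
minimal sketch with the stdlib sqlite3 standing in for PostgreSQL, and
count_from_db as a placeholder for the real URL-to-SQL translation and
the expensive count query (all names here are guesses, not Frank's
actual code):

```python
import sqlite3

# The cache table: one row per request URL, as described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE count_cache (url TEXT PRIMARY KEY, count INTEGER)")

def count_from_db(url):
    # Placeholder: translate the request URL into a SQL query and run
    # the expensive SELECT count(*) against the product table.
    return 56507

def cached_count(url):
    row = conn.execute(
        "SELECT count FROM count_cache WHERE url = ?", (url,)
    ).fetchone()
    if row is not None:
        return row[0]  # URL already in the cache
    n = count_from_db(url)
    conn.execute(
        "INSERT INTO count_cache (url, count) VALUES (?, ?)", (url, n)
    )
    return n

def on_csv_import():
    # The weekly CSV import replaces the product table wholesale, so
    # every cached count becomes stale at once: empty the cache.
    conn.execute("DELETE FROM count_cache")

hit1 = cached_count("/category/pop")  # computed and stored
hit2 = cached_count("/category/pop")  # read back from count_cache
on_csv_import()
remaining = conn.execute("SELECT count(*) FROM count_cache").fetchone()[0]
```

This works precisely because the mapping from URL to result set is
constant between imports; with ad-hoc writes, the cache would need
finer-grained invalidation.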