From: Derryck w. <der...@ce...> - 2008-06-20 14:05:48
Hi,

I always assumed that using an index is faster (well, it is faster when the counted volume is low), but this case proved otherwise (when the counted volume gets very high):

select count(*) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - 90;

With an index on l_shipdate:

PLAN (LINEITEM INDEX (LINEITEM_SHIPDATE))
Current memory = 4633132
Delta memory = 3696420
Max memory = 4633132
Elapsed time = 39.14 sec
Buffers = 75
Reads = 111237
Writes 0
Fetches = 11837255

Without the index it works faster!

PLAN (LINEITEM NATURAL)
Current memory = 1009304
Delta memory = 72592
Max memory = 1156720
Elapsed time = 35.66 sec
Buffers = 75
Reads = 107563
Writes 0
Fetches = 12217691

Tested on 2.1 and 2.1.1 RC1.

Can I put this in the tracker?

Regards,
Derryck
From: Dmitry Y. <fir...@ya...> - 2008-06-25 17:37:31
Derryck welas wrote:
>
> I always assumed that using an index is faster

Generally, this is a false assumption. An index is useful only if it's selective enough.

> select count(*) as count_order
> from lineitem
> where l_shipdate <= date '1998-12-01' - 90;

How many rows are in lineitem: (a) in total and (b) matching your predicate? I suppose the difference between these two numbers is not very big.

> Can I put this in the tracker?

That depends on what you want us to improve :-) Do you want the optimizer to ignore that index in your query? But this is impossible without value-distribution histograms in the table/index statistics.

Dmitry
From: Derryck w. <der...@ce...> - 2008-06-26 11:22:52
Hi All,

Dmitry Yemanov wrote:
> How many rows are in lineitem: (a) in total and (b) matching your
> predicate? I suppose the difference between these two numbers is not
> very big.

In this case: a = 6001215, b = 5916591.

> That depends on what you want us to improve :-)

In general: make count faster :-) As for the details, I don't know. Scan the index for the count? I can already see there is less fetching with an index, but the end result is more time for the count operation.

What would really help is to have a parameter to cache the counted value (hashed per unique query) for a certain time, for example:

SELECT CACHED <seconds> COUNT(*) FROM ...

If this count is called again within 1 minute it just returns the cached value; after the time has elapsed it does a real count again. I do something like this with stored procedures and a temp table, but it is a little bit messy: for example, if 5 users start the procedure at once (my actual counts run for more than a minute), I still get 5 simultaneous counts performed on the server. OK, the cached thing maybe looks horrible :-) but for very large counts it's the only solution.

Adam wrote:
> How did you test this? You need to remember that caching happens both
> within Firebird and also your operating system, so the query that runs
> first will have to wait for disks, whereas the query that runs second
> can often just read from the cache.

Paul wrote:
> It might also be interesting (and very quick to verify) to just reverse
> the order of the tests. On the evidence presented it looks like the
> second test is benefiting from the cache created by the first test.

Don't worry about the testing; I rebooted to give each test identical conditions.

Misc info: FB CS 2.1.1 / XP / 1 GB mem / dual-core Pentium 1.8 GHz

Steps:

1) reboot
2) PLAN (LINEITEM INDEX (LINEITEM_SHIPDATE))

COUNT_ORDER
============
5916591

Current memory = 4633152
Delta memory = 3696412
Max memory = 4633152
Elapsed time = 39.03 sec
Buffers = 75
Reads = 111237
Writes 0
Fetches = 11837255

3) Current memory = 4633156
Delta memory = 3696412
Max memory = 4633156
Elapsed time = 37.87 sec
Buffers = 75
Reads = 111237
Writes 0
Fetches = 11837255

4) Current memory = 4633156
Delta memory = 3696412
Max memory = 4633156
Elapsed time = 37.78 sec
Buffers = 75
Reads = 111237
Writes 0
Fetches = 11837255

1) reboot
2) PLAN (LINEITEM NATURAL)

COUNT_ORDER
============
5916591

Current memory = 1009508
Delta memory = 72772
Max memory = 1156796
Elapsed time = 37.87 sec
Buffers = 75
Reads = 107563
Writes 0
Fetches = 12217691

3) Current memory = 1009512
Delta memory = 72772
Max memory = 1156800
Elapsed time = 37.21 sec
Buffers = 75
Reads = 107563
Writes 0
Fetches = 12217691

4) Current memory = 1009512
Delta memory = 72772
Max memory = 1156800
Elapsed time = 36.16 sec
Buffers = 75
Reads = 107563
Writes 0
Fetches = 12217691

To me this is closer to a production environment: alternate between the two queries, Q1 = with index, Q2 = without index.

Q1     Q2     Q1     Q2     Q1     Q2
38.06  36.14  37.29  36.18  37.32  36.17

Derryck
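[Editorial note] The cached-count idea above could be prototyped entirely on the client side. This is a minimal sketch, assuming the cache key is a hash of the query text and that `run_query` is whatever callable performs the real count; all names here are hypothetical:

```python
import hashlib
import time


class CachedCounter:
    """Cache COUNT(*) results per unique query text for a fixed TTL.

    A client-side sketch of the proposed SELECT CACHED <seconds> COUNT(*)
    behaviour: within the TTL the cached value is returned, afterwards a
    real count is run again.
    """

    def __init__(self, run_query, ttl_seconds=60):
        self.run_query = run_query
        self.ttl = ttl_seconds
        self._cache = {}  # query hash -> (timestamp, value)

    def count(self, sql):
        key = hashlib.sha1(sql.encode()).hexdigest()
        hit = self._cache.get(key)
        now = time.monotonic()
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]            # still fresh: return cached value
        value = self.run_query(sql)  # expensive real count
        self._cache[key] = (now, value)
        return value
```

Note this sketch still has the stampede problem described above: if five users miss the cache at once, five real counts run. A lock around the refresh would be needed to serialize them.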
From: Paul R. <pr...@ib...> - 2008-07-03 09:46:22
On Thursday 26 June 2008, Derryck welas wrote:
>
> What would really help is to have a parameter to cache the counted value
> (hashed per unique query) for a certain time.

I like the idea, but I wonder how it could be implemented. I also wonder just how useful it is. After all, if it is cached it is possibly/probably inaccurate, so the real value may not matter. In which case, a fuzzy count may be better: approximating to the nearest 100 or 1,000 or 1,000,000 etc., using a scale set by the query. Something along the lines of:

select fuzzy count(*, 3) from mytable

to return a value approximating to the nearest thousand, or just

select fuzzy count(*) from mytable

and let the engine decide the scale.

It might also make sense to add a floor or ceiling parameter to fuzzy count, so that a result could be interpreted as 'less than 100' or 'more than 5,000,000'. Google do something like that when the hit count is large.

> Paul wrote:
> > It might also be interesting (and very quick to verify) to just reverse
> > the order of the tests. On the evidence presented it looks like the
> > second test is benefiting from the cache created by the first test.

Of course, I was thinking that you actually had a cache to influence the result, but I now realise you don't. The default cache for Classic is 75 pages and that is next to useless in this case. That said, it does show up the anomaly quite well.

Paul
--
Paul Reeves
http://www.ibphoenix.com
Specialists in Firebird support
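[Editorial note] The proposed `fuzzy count(*, scale)` might behave like the sketch below. The auto-scale rule (keep roughly two significant digits) is purely an assumption for illustration; nothing in the proposal fixes it:

```python
def fuzzy_count(exact, scale=None):
    """Round a count to the nearest power of ten, as in the proposed
    'select fuzzy count(*, scale)'.

    scale=3 rounds to the nearest thousand; scale=None picks the scale
    automatically, here by keeping roughly two significant digits.
    """
    if scale is None:
        # e.g. an 8-digit count gets scale 6: round to the nearest million
        scale = max(len(str(abs(exact))) - 2, 0)
    unit = 10 ** scale
    return round(exact / unit) * unit
```

With the 98,433,123 example from later in this thread, `fuzzy_count(98433123, 3)` gives 98,433,000 and the automatic scale gives 98,000,000.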
From: Ann W. H. <aha...@ib...> - 2008-07-03 20:08:07
Derryck welas wrote:
>
> In general: make count faster :-)
>
> As for the details, I don't know. Scan the index for the count?

One problem with count in a pure MVCC engine is that different transactions get different values for a count over the same range of values, depending on their age. Firebird indexes don't include transaction information, so the only way to know whether a particular record is valid for a particular transaction is to read the record itself.

So, you say, why not put transaction information in the indexes to speed up count? To be sure that a particular entry is valid for a particular transaction, you need to have both the id of the transaction that created the version of the record with that value, plus the id of the transaction that removed that value from the index. Transactions older than the creator can't see the value. Transactions that started after the creator committed but before the destroyer committed do see the value. Transactions started after the destroyer committed don't see the value.

Currently, Firebird has one index entry for each distinct key value in a record: if the record has 14 versions but only two values for the key, there are only two index entries.

So, to make count faster, you add 8 bytes to every index entry and add more index entries, making indexes larger, deeper and slower. You could reduce the amount of space used by removing the id of the creator when an entry is fully mature, that is, when all running transactions started after the creator. I'll leave the exercise of estimating the CPU and I/O cost of contracting and expanding indexes to someone else.

> What would really help is to have a parameter to cache the counted value
> (hashed per unique query) for a certain time.

Doesn't work: there is no one right answer.

> SELECT CACHED <seconds> COUNT(*) FROM ...
> If this count is called again within 1 minute it just returns the
> cached value.

Ah! You don't want the right answer, you just want an answer. That does make it easier, I guess, as long as we can tell the difference between counts that should be correct and those that are approximate.

Cheers,

Ann
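[Editorial note] The visibility rule spelled out above can be written down directly. A sketch, using commit "timestamps" in place of transaction ids for simplicity; the function and its arguments are hypothetical, not Firebird internals:

```python
def entry_visible(snapshot_start, creator_commit, destroyer_commit=None):
    """An index entry carrying creator/destroyer transaction info would be
    visible to a snapshot iff the creator committed before the snapshot
    started and the destroyer (if any) committed after the snapshot started.

    creator_commit=None means the creating transaction has not committed.
    """
    if creator_commit is None or creator_commit >= snapshot_start:
        return False  # creator uncommitted, or committed after we started
    if destroyer_commit is not None and destroyer_commit < snapshot_start:
        return False  # value was removed before our snapshot began
    return True
```

To count through such an index, every entry in the scanned range would still need this check against the counting transaction's snapshot, which is exactly why the entries must carry both transaction ids.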
From: Adam G. <s30...@ya...> - 2008-07-03 23:44:42
"Ann W. Harrison" wrote:
> Ah! You don't want the right answer, you just want an answer. That
> does make it easier, I guess, as long as we can tell the difference
> between counts that should be correct and those that are approximate.

If one is only interested in a guesstimate of the count with no particular accuracy, there are a couple of methods. The simplest is to have a generator value, increment it in an AFTER INSERT trigger and decrement it in an AFTER DELETE trigger. This will be inaccurate because:

* uncommitted transactions count in the estimation
* rolled-back transactions count in the estimation

But for the right purpose it would be a very fast compromise.

Also, you can get a closer estimation by using the selectivity of the index (this includes uncommitted transactions but not rolled-back transactions once garbage collection has taken place), providing the record has a primary key.

http://www.firebirdfaq.org/faq5/

But there is a conceptual problem with counts: most of the time people don't care. I might care whether there are 200 or 100,000,000, but the fact that there are exactly 98,433,123 is no more helpful for the end user than an estimation of 98,500,000. Certainly it might be helpful to provide a fast shortcut to an estimation, providing the syntax was such that it was obvious that the result is not exact.

Adam
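[Editorial note] The generator-based estimate above can be simulated to show exactly where the drift comes from. A toy model (the class is hypothetical; in Firebird the increments would live in AFTER INSERT / AFTER DELETE triggers using GEN_ID, which is not undone by rollback):

```python
class GeneratorCount:
    """Simulates a trigger-maintained generator: incremented on insert,
    decremented on delete. Like a real Firebird generator, it is not
    transactional, so rolled-back work still moves the value.
    """

    def __init__(self):
        self.value = 0

    def after_insert(self):
        self.value += 1  # fires even if the transaction later rolls back

    def after_delete(self):
        self.value -= 1


counter = GeneratorCount()
for _ in range(3):
    counter.after_insert()   # three committed inserts
counter.after_insert()       # a fourth insert that is later rolled back
# counter.value is now 4, while the committed row count is 3
```

The estimate is instant (a generator read), at the cost of exactly the two inaccuracies listed above.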
From: Nikolay S. <nik...@re...> - 2008-07-04 00:20:45
Adam, Ann,

> But there is a conceptual problem with counts: most of the time people
> don't care. I might care whether there are 200 or 100,000,000, but the
> fact that there are exactly 98,433,123 is no more helpful for the end
> user than an estimation of 98,500,000. Certainly it might be helpful to
> provide a fast shortcut to an estimation, providing the syntax was such
> that it was obvious that the result is not exact.

One way to show COUNT(*) and other aggregates quickly and accurately is to materialize views with aggregates computed for the relevant groups of rows. Materialized views can be updated on each transaction if necessary, and used "like an index" to evaluate queries quickly. Replication techniques can be used to maintain materialized views efficiently.

Queries can then be rewritten by the engine to execute against the materialized view instead of the original table.

We use this approach in production in our MDX engine for Firebird (based on Mondrian).

--
Nikolay Samofatov, MBA
Red Soft International
+1 416 710 6854
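[Editorial note] The aggregate-table idea can be sketched as a per-group counter maintained alongside the base table, so a grouped COUNT(*) becomes a lookup rather than a scan. A toy model, not Firebird syntax; all names are made up:

```python
from collections import defaultdict


class AggregateTable:
    """A per-group count maintained incrementally on every insert/delete,
    standing in for a materialized view holding COUNT(*) per group.
    """

    def __init__(self, group_key):
        self.group_key = group_key     # function: row -> group value
        self.counts = defaultdict(int)

    def on_insert(self, row):
        self.counts[self.group_key(row)] += 1

    def on_delete(self, row):
        self.counts[self.group_key(row)] -= 1

    def count(self, group):
        return self.counts[group]      # O(1) lookup instead of a table scan
```

The engine-side query rewrite Nikolay describes would then map `SELECT COUNT(*) ... GROUP BY x` onto lookups in this structure instead of scanning the base table.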
From: Thomas S. <ts...@ib...> - 2008-07-04 05:32:11
Hi,

>> But there is a conceptual problem with counts: most of the time people
>> don't care. I might care whether there are 200 or 100,000,000, but the
>> fact that there are exactly 98,433,123 is no more helpful for the end
>> user than an estimation of 98,500,000. Certainly it might be helpful to
>> provide a fast shortcut to an estimation, providing the syntax was such
>> that it was obvious that the result is not exact.
>
> One way to show COUNT(*) and other aggregates quickly and accurately
> is to materialize views with aggregates computed for the relevant groups
> of rows. Materialized views can be updated on each transaction if
> necessary, and used "like an index" to evaluate queries quickly.
> Replication techniques can be used to maintain materialized views
> efficiently.
>
> Queries can then be rewritten by the engine to execute against the
> materialized view instead of the original table.

In the case of Firebird, "by the engine" means the Mondrian OLAP server, which supports query rewriting, not Firebird itself.

> We use this approach in production in our MDX engine for Firebird (based
> on Mondrian).

Aggregate tables are a pretty neat mechanism, and in combination with an OLAP server which supports query rewriting they are even more powerful.

I have a German article about aggregate tables with Firebird in the queue, soon to be published in a German magazine.

--
Best Regards,
Thomas Steinmaurer
LogManager Series - Logging/Auditing Suites supporting
InterBase, Firebird, Advantage Database, MS SQL Server and NexusDB V2
Upscene Productions
http://www.upscene.com
From: Milan B. <mi...@pa...> - 2008-07-04 10:44:56
Thomas Steinmaurer wrote:
> Aggregate tables are a pretty neat mechanism, and in combination with an
> OLAP server which supports query rewriting they are even more powerful.
>
> I have a German article about aggregate tables with Firebird in the
> queue, soon to be published in a German magazine.

Looks like a possible good topic for the Conference. Any plans on that?

--
Milan Babuskov
http://www.guacosoft.com
From: Thomas S. <ts...@ib...> - 2008-07-04 14:19:43
> Thomas Steinmaurer wrote:
>> Aggregate tables are a pretty neat mechanism, and in combination with an
>> OLAP server which supports query rewriting they are even more powerful.
>>
>> I have a German article about aggregate tables with Firebird in the
>> queue, soon to be published in a German magazine.
>
> Looks like a possible good topic for the Conference. Any plans on that?

If I were able to be there, I would have proposed two sessions in this direction, but I won't make it this year.

--
Best Regards,
Thomas Steinmaurer
LogManager Series - Logging/Auditing Suites supporting
InterBase, Firebird, Advantage Database, MS SQL Server and NexusDB V2
Upscene Productions
http://www.upscene.com
From: Adam G. <s30...@ya...> - 2008-06-25 23:43:54
"Derryck welas" wrote:
> Hi,
>
> I always assumed that using an index is faster (well, it is faster
> when the counted volume is low), but this case proved otherwise (when
> the counted volume gets very high).

Derryck,

How did you test this? You need to remember that caching happens both within Firebird and also your operating system, so the query that runs first will have to wait for disks, whereas the query that runs second can often just read from the cache.

To get an honest profile, use two copies of the database (as two different aliases). Run one test against one database and the other against the other database.

Adam
From: Paul R. <pr...@ib...> - 2008-06-26 06:19:34
On Thursday 26 June 2008, Adam Gardner wrote:
>
> To get an honest profile, use two copies of the database (as two
> different aliases). Run one test against one database and the other
> against the other database.

Agreed. Perhaps even restarting Firebird in between.

I would also suggest running the tests, say, five times and taking an average. It is reasonable to assume that under most usage scenarios some or all of the data will already be in the cache anyway.

It might also be interesting (and very quick to verify) to just reverse the order of the tests. On the evidence presented it looks like the second test is benefiting from the cache created by the first test.

Paul
--
Paul Reeves
http://www.ibphoenix.com
Specialists in Firebird support
From: Leyne, S. <Se...@br...> - 2008-06-26 15:25:06
> On Thursday 26 June 2008, Adam Gardner wrote:
> >
> > To get an honest profile, use two copies of the database (as two
> > different aliases). Run one test against one database and the other
> > against the other database.
>
> Agreed. Perhaps even restarting Firebird in between.
>
> I would also suggest running the tests, say, five times and taking an
> average.

Or simply throw out the timing of the first run.

Sean