From: Fred T. <fr...@us...> - 2011-11-25 12:27:57
The new script is doing a partial table scan for each chunk of the update
procedure, but the scan size goes up in each iteration. Scanning 500,000 rows
takes about a minute, and the time scales almost linearly with the row count.
My suggestion was to find the chunks in several partial scans, but with fixed
scan sizes. This may not be useful with your data change pattern if one of the
PK columns is changing.

I have not found any issue with cache management. Your selects are simply
reading more rows in each iteration.

In any case, the kind of condition that does not cause big scans in SELECT and
UPDATE is:

    KEY1 = 'val' AND KEY2 > 'val2' AND ROWNUM() < chunksize

Anything using an IN subquery that scans a lot of rows slows down the operation.

Fred

On Friday, November 25, 2011 11:20 AM, "Renaud Pelissier" <ren...@gm...> wrote:

I forgot to attach the new version of the script:

*****************************************************************
\p Dropping previous data and conf
DROP INDEX IDX_test_bool IF EXISTS;
DROP PROCEDURE populate_proc IF EXISTS;
DROP TABLE test IF EXISTS;

\p Creating table
CREATE CACHED TABLE test(key1 VARCHAR(64), key2 VARCHAR(1024),
    bool BOOLEAN DEFAULT false, PRIMARY KEY (key1, key2));

\p Creating index
--CREATE INDEX IDX_test_bool ON test(bool);

\p Creating procedure
CREATE PROCEDURE populate_proc(IN key1_var VARCHAR(64), IN max_var INT, IN iter_var INT)
MODIFIES SQL DATA
BEGIN ATOMIC
    DECLARE count_var INT;
    DECLARE batch_size_var INT;
    SET count_var = 0;
    loop_label: WHILE count_var < max_var DO
        INSERT INTO test VALUES(key1_var,
            CONCAT('key2_', CAST(iter_var AS VARCHAR(32)), '_', CAST(count_var AS VARCHAR(32))),
            CAST(MOD(iter_var, 2) AS BOOLEAN));
        SET count_var = count_var + 1;
    END WHILE loop_label;
END;
.;

SET FILES LOG FALSE;
SET DATABASE REFERENTIAL INTEGRITY FALSE;
* timestamp = 0
* delta_time = 0
* count = 0
* nchunk = 100
* chunksize = 1000
* i = 1
\p Starting *{nchunk} insert iterations with key1-A by chunks of *{chunksize} rows
* while (*i <= *nchunk)
CALL populate_proc('key1-A', *{chunksize}, *{i});
CALL UNIX_TIMESTAMP();
* ((delta_time = ? - timestamp))
* timestamp = *{?}
COMMIT;
* ((i += 1))
* ((count += chunksize))
\p >*{timestamp};*{delta_time};*{count}
* end while
SET FILES LOG TRUE;
SET DATABASE REFERENTIAL INTEGRITY TRUE;

SET FILES LOG FALSE;
SET DATABASE REFERENTIAL INTEGRITY FALSE;
* count = 0
* nchunk = 400
* chunksize = 1000
* i = 1
\p Starting *{nchunk} insert iterations with key1-B by chunks of *{chunksize} rows
* while (*i <= *nchunk)
CALL populate_proc('key1-B', *{chunksize}, *{i});
CALL UNIX_TIMESTAMP();
* ((delta_time = ? - timestamp))
* timestamp = *{?}
COMMIT;
* ((i += 1))
* ((count += chunksize))
\p >*{timestamp};*{delta_time};*{count}
* end while
SET FILES LOG TRUE;
SET DATABASE REFERENTIAL INTEGRITY TRUE;

SET FILES LOG FALSE;
SET DATABASE REFERENTIAL INTEGRITY FALSE;
\p Starting update
* timestamp = 0
* delta_time = 0
* chunksize = 100000
* nupdated = 1
* count = 0
CALL UNIX_TIMESTAMP();
* timestamp = *{?}
* while (*nupdated > 0)
--UPDATE test SET bool=true WHERE rownum()<=*{chunksize} AND bool=false AND key1='key1-B';
--UPDATE test SET bool=true WHERE rownum()<=*{chunksize} AND key1='key1-B' AND key2>=(SELECT key2 FROM test WHERE key1='key1-B' LIMIT 1 OFFSET *{count});
--UPDATE test SET bool=true WHERE key1='key1-B' AND key2 IN (SELECT key2 FROM test WHERE key1='key1-B' LIMIT *{chunksize} OFFSET *{count});
--SELECT count(*) FROM test WHERE key1='key1-B' AND key2 IN (SELECT key2 FROM test WHERE key1='key1-B' LIMIT *{chunksize} OFFSET *{count});
SELECT count(*) FROM test WHERE rownum()<=*{chunksize} AND key1='key1-B' AND key2>=(SELECT key2 FROM test WHERE key1='key1-B' LIMIT 1 OFFSET *{count});
* ((nupdated = ?))
CALL UNIX_TIMESTAMP();
* ((delta_time = ? - timestamp))
* timestamp = *{?}
COMMIT;
* ((count += nupdated))
\p >*{timestamp};*{delta_time};*{count}
* end while
\p Done updating
SET FILES LOG TRUE;
SET DATABASE REFERENTIAL INTEGRITY TRUE;
*****************************************************************
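Fred's recommended condition amounts to a keyset-style chunked update: the application remembers the highest key2 value it has processed and uses it as the lower bound of the next pass, so each pass scans a bounded number of rows instead of walking past an OFFSET. A minimal sketch of one such pass against the test table above (the literal bound 'key2_200_0' stands in for the value carried over from the previous chunk; it is illustrative, not taken from the original script):

    -- One chunk of a keyset-style update: the lower bound comes from the
    -- previous pass, so the scan never exceeds the chunk size.
    UPDATE test SET bool = true
     WHERE key1 = 'key1-B'
       AND key2 > 'key2_200_0'      -- last key2 processed in the previous chunk
       AND ROWNUM() <= 100000;
    COMMIT;

Because each pass scans at most the chunk size, total work stays linear in the table size, whereas the OFFSET-based variants re-scan all previously processed rows on every pass.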
Good morning Fred,

Thank you again for your time and consideration! I did some additional testing
this morning following your advice. I reduced the number of rows by a factor
of 10 to make the script faster and raised the chunk size to 100,000. Now it
takes less than a minute to run the test.

Your suggestions didn't lead to better results, so I changed the UPDATE
statement to a SELECT COUNT(*) statement, and I was very surprised to discover
that the performance problem was still there. So the problem has nothing to do
with either the index on the "bool" column or the UPDATE statement! I think it
has something to do with the cache management, which is getting slower and
slower over time.

*************** FIRST SELECT REQUEST USING 'key2 IN (SELECT key2 LIMIT batch_size OFFSET count)' ***************
100000
1322218278
>1322218278;5;100000
100000
1322218292
>1322218292;14;200000
100000
1322218325
>1322218325;33;300000
100000
1322218381
>1322218381;56;400000
0
1322218385
>1322218385;4;400000

*************** SECOND SELECT REQUEST USING 'rownum()<=batch_size AND key2>(SELECT key2 LIMIT 1 OFFSET count)' ***************
100000
1322218646
>1322218646;1;100000
100000
1322218650
>1322218650;4;200000
100000
1322218666
>1322218666;16;300000
100000
1322218699
>1322218699;33;400000
0
1322218704
>1322218704;5;400000

_______________________________________________
Hsqldb-user mailing list
Hsq...@li...
https://lists.sourceforge.net/lists/listinfo/hsqldb-user
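The growth pattern in these timings is consistent with the cost of the OFFSET subquery rather than with cache degradation: `LIMIT 1 OFFSET count` has to walk past `count` rows before returning, so each pass scans roughly one chunk's worth of rows more than the previous one. A back-of-the-envelope estimate, using the script's chunk size of 100,000:

    rows walked in pass k  ≈ OFFSET + chunksize = k * 100,000
    passes 1..4            ≈ 100,000 + 200,000 + 300,000 + 400,000
                           = 1,000,000 rows walked for a 400,000-row table

This matches the per-pass times in the logs above (5, 14, 33, 56 s in the first run; 1, 4, 16, 33 s in the second) growing roughly linearly, which makes the total time quadratic in the number of chunks.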