[PyIndexer] MySQL searching: performance of IN for large sets
From: Marcus C. <ma...@wr...> - 2001-12-17 18:54:10
I was stunned. Passing a list of 10K document ids to MySQL hardly made it
sweat.

Testing environment:

- FreeBSD 4.4-STABLE
- MySQL 3.23.46, compiled from source, all knobs set to default
- Python 2.1.1, compiled from source
- GCC 2.95.3
- MySQLdb 0.9.0
- Celeron 500, 384MB SDRAM
- Seagate Barracuda IDE 28GB / 7 200 RPM / 2MB cache

Test table:

    CREATE TABLE document (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
        foo VARCHAR(255)
    ) TYPE=BDB

Test data: 1M rows, with sequential ids and a 16-character random string
for foo.

Method: I generated a list of random ids and built an SQL list from it. A
simple SELECT, returning the whole row, was used; the WHERE clause
restricted results to those with id IN (the_random_list). The standard
cursor class's fetchall() was then used to retrieve the results. The list
length and the percentage of ID hits were varied -- the percentage of ID
hits being the percentage of the random IDs in the list which actually
appear in the table. Ideally, one would want to vary the number of table
rows as well. All tests were repeated 1 000 times, and times are wallclock
times. Load from other processes was reasonably constant, accounting for
< 5% of total CPU. (A simplified sketch of the test script appears after
the notes below.)

Results:

    Table rows  IN list length  % ID hits  Time / select  Parse time / select
     1 000 000              10       10 %       2.472 ms             2.068 ms
     1 000 000              10       50 %       2.524 ms
     1 000 000              10      100 %       2.800 ms
     1 000 000              50       10 %       4.183 ms             2.661 ms
     1 000 000              50       50 %       5.253 ms
     1 000 000              50      100 %       6.366 ms
     1 000 000             100       10 %       6.918 ms             3.367 ms
     1 000 000             100       50 %       9.242 ms
     1 000 000             100      100 %      11.817 ms
     1 000 000             500       10 %      28.891 ms            12.129 ms
     1 000 000             500       50 %      40.089 ms
     1 000 000             500      100 %      54.389 ms
     1 000 000           1 000       10 %      57.018 ms            21.876 ms
     1 000 000           1 000       50 %      81.902 ms
     1 000 000           1 000      100 %     108.954 ms
     1 000 000           5 000       10 %     301.138 ms           118.765 ms
     1 000 000           5 000       50 %     719.837 ms
     1 000 000           5 000      100 %   1 168.819 ms
     1 000 000          10 000       10 %     628.415 ms           255.822 ms
     1 000 000          10 000       50 %   1 425.179 ms
     1 000 000          10 000      100 %   1 962.224 ms

Some notes:

- The parse / select time is iffy -- it includes the time taken to send the
  query to the server and return the EXPLAIN output. It is listed only once
  for each combination of table size and list length, as the % ID hits is
  irrelevant to the time taken by the parser. It should probably be thought
  of more properly as MySQL overhead time.

- Python accounts for 10 - 15% of CPU usage while huge result sets are
  being returned, and system CPU utilisation shoots up to as much as 30%.
  Both these factors indicate that dividing the load between app and db
  server over separate machines might be quite worthwhile, although neither
  process was starved for CPU.

- Both the time taken to parse the query and the overall time are almost
  linear functions of the number of list items. This will probably become
  less relevant with real-world data, especially in the case of the
  parsing, but it's a Good Thing!

- The process is not as scientific as it might be. For one thing, later
  queries might benefit from MySQL or OS caching, as I did not restart the
  MySQL server or the machine between tests. However, since 1 000 tests are
  run in each case, this might not affect the overall picture
  significantly. I did test a few "dry runs" and found a variance of less
  than 5%.

- No joins or other comparisons are done in the test, so it is not a very
  real-world example (in fact, it's basically selecting values we already
  know ;-). We still need to determine how it will run alongside joins.

- It might be useful to compare this to a "recipe" SELECT returning the
  same number of rows using a range for the id.

- I should have varied the length of the random data inserted into the foo
  column. That would possibly have exercised the table handler and/or the
  I/O subsystem more.
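For reference, here is roughly what the test boils down to. This is a
simplified sketch rather than the exact code: the connection details, the
way the "miss" ids are generated and the helper names are placeholders.

    # Sketch of the benchmark (Python 2.1 / MySQLdb 0.9.0).
    import random, string, time, MySQLdb

    TABLE_ROWS = 1000000
    REPEATS    = 1000

    def populate(cursor):
        """Fill the table with sequential ids and 16-char random strings."""
        chars = string.letters + string.digits
        for i in xrange(TABLE_ROWS):
            foo = ''.join([random.choice(chars) for j in range(16)])
            cursor.execute("INSERT INTO document (foo) VALUES ('%s')" % foo)

    def make_id_list(length, hit_pct):
        """Build the comma-separated id list for the IN clause.

        hit_pct of the ids are drawn from the range that exists in the
        table; the rest come from well above the AUTO_INCREMENT range so
        they can never match (any scheme guaranteeing misses would do).
        """
        hits = length * hit_pct / 100
        ids  = [random.randrange(1, TABLE_ROWS + 1) for i in range(hits)]
        ids += [random.randrange(TABLE_ROWS * 10, TABLE_ROWS * 20)
                for i in range(length - hits)]
        return ','.join(map(str, ids))

    def time_selects(cursor, length, hit_pct):
        """Average wallclock time per SELECT, in milliseconds."""
        total = 0.0
        for i in range(REPEATS):
            sql = ("SELECT * FROM document WHERE id IN (%s)"
                   % make_id_list(length, hit_pct))
            start = time.time()
            cursor.execute(sql)
            cursor.fetchall()
            total = total + (time.time() - start)
        return total / REPEATS * 1000.0

    def time_explains(cursor, length):
        """'Parse time': round-trip for EXPLAIN on an equivalent query."""
        total = 0.0
        for i in range(REPEATS):
            sql = ("EXPLAIN SELECT * FROM document WHERE id IN (%s)"
                   % make_id_list(length, 100))
            start = time.time()
            cursor.execute(sql)
            cursor.fetchall()
            total = total + (time.time() - start)
        return total / REPEATS * 1000.0

    conn = MySQLdb.connect(db='test')      # placeholder connection details
    cursor = conn.cursor()
    for length in (10, 50, 100, 500, 1000, 5000, 10000):
        print "IN list length %6d: parse %9.3f ms" \
              % (length, time_explains(cursor, length))
        for hit_pct in (10, 50, 100):
            print "  %3d%% hits: %10.3f ms / select" \
                  % (hit_pct, time_selects(cursor, length, hit_pct))

The two timing loops correspond to the two time columns in the table above:
the plain SELECT loop gives "Time / select", and the EXPLAIN loop is what I
called "Parse time / select" (really the MySQL round-trip overhead).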
I guess the next step would be to incorporate this in the searching proper,
unless there are other tests I should run first?

Cheers

-- Marcus