[PyIndexer] MySQL searching: performance of IN for large sets
From: Marcus C. <ma...@wr...> - 2001-12-17 18:54:10
I was stunned.
Passing a list of 10K document ids to MySQL hardly made it sweat.
Testing environment:
FreeBSD 4.4-STABLE
MySQL 3.23.46, compiled from source, all knobs set to default
Python 2.1.1, compiled from source
GCC 2.95.3
MySQLdb 0.9.0
Celeron 500
384MB SDRAM
Seagate Barracuda IDE 28GB / 7 200 RPM / 2MB cache
Test table:
CREATE TABLE document (
    id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
    foo VARCHAR(255)
) TYPE=BDB
Test data:
1M rows with sequential ids and a 16-character random string for foo
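A minimal sketch of the data-generation step, assuming MySQLdb and
placeholder connection parameters (not the exact script used):

import random, string
import MySQLdb

# Placeholder connection parameters -- adjust to your setup.
conn = MySQLdb.connect(db="test")
cur = conn.cursor()

def random_foo(length=16):
    # 16-character random string for the foo column
    return "".join([random.choice(string.letters) for i in range(length)])

# 1M inserts; AUTO_INCREMENT provides the sequential ids.
for i in xrange(1000000):
    cur.execute("INSERT INTO document (foo) VALUES (%s)", (random_foo(),))
conn.commit()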
Method:
I generated a list of random ids and built an SQL list from it. A
simple SELECT returning the whole row was used, with a WHERE clause
restricting results to rows whose id was IN (the_random_list).
The standard cursor class's fetchall() was then used to retrieve the
results.
The list length and percentage of ID hits were varied. The percentage
of ID hits is the percentage of the random ids in the list which
actually appear in the table. Ideally, one would want to vary the
number of table rows as well...
All tests were repeated 1 000 times, and times are wallclock times.
Load from other processes was reasonably constant, accounting for < 5%
of total CPU.
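A minimal sketch of the benchmark loop, again with placeholder
connection parameters; drawing the deliberate misses from ids beyond
the table's range is an assumption about how the hit percentage could
be controlled, not necessarily how these numbers were produced:

import random, time
import MySQLdb

conn = MySQLdb.connect(db="test")   # placeholder connection parameters
cur = conn.cursor()
TABLE_ROWS = 1000000

def timed_select(list_length, hit_fraction):
    # Mix ids that exist (1..TABLE_ROWS) with ids that are guaranteed
    # misses (above TABLE_ROWS) to get the desired hit percentage.
    hits = int(list_length * hit_fraction)
    ids = [random.randrange(1, TABLE_ROWS + 1) for i in range(hits)]
    ids = ids + [random.randrange(TABLE_ROWS + 1, 2 * TABLE_ROWS)
                 for i in range(list_length - hits)]
    sql = ("SELECT * FROM document WHERE id IN (%s)"
           % ",".join(map(str, ids)))
    t0 = time.time()
    cur.execute(sql)
    cur.fetchall()
    return time.time() - t0

# 1 000 repetitions for one combination, wallclock time in ms
total = 0.0
for i in range(1000):
    total = total + timed_select(100, 0.5)
print "mean time per select: %.3f ms" % (total / 1000 * 1000.0)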
Results:
Table rows   IN list length   % ID hits   Time / select   Parse time / select
1 000 000                10        10 %        2.472 ms              2.068 ms
1 000 000                10        50 %        2.524 ms
1 000 000                10       100 %        2.800 ms
1 000 000                50        10 %        4.183 ms              2.661 ms
1 000 000                50        50 %        5.253 ms
1 000 000                50       100 %        6.366 ms
1 000 000               100        10 %        6.918 ms              3.367 ms
1 000 000               100        50 %        9.242 ms
1 000 000               100       100 %       11.817 ms
1 000 000               500        10 %       28.891 ms             12.129 ms
1 000 000               500        50 %       40.089 ms
1 000 000               500       100 %       54.389 ms
1 000 000             1 000        10 %       57.018 ms             21.876 ms
1 000 000             1 000        50 %       81.902 ms
1 000 000             1 000       100 %      108.954 ms
1 000 000             5 000        10 %      301.138 ms            118.765 ms
1 000 000             5 000        50 %      719.837 ms
1 000 000             5 000       100 %    1 168.819 ms
1 000 000            10 000        10 %      628.415 ms            255.822 ms
1 000 000            10 000        50 %    1 425.179 ms
1 000 000            10 000       100 %    1 962.224 ms
Some notes:
- The parse time per select is iffy -- it includes the time taken to
send the query to the server and return the EXPLAIN output. It is
listed only once for each combination of table size and list length,
since the % ID hits is irrelevant to the time taken by the parser.
It should probably be thought of as MySQL overhead time rather than
pure parse time.
- Python accounts for 10 - 15% of CPU usage while huge result sets are
being returned, and system CPU utilisation shoots up to as much as 30%.
Both factors indicate that splitting the load between app and db
servers on separate machines might be quite worthwhile, although
neither process was starved for CPU.
- Both the time taken to parse the query and the overall time are
almost linear functions of the number of list items. This will
probably become less relevant with real-world data, especially in the
case of the parsing, but it's a Good Thing!
- The process is not as scientific as it might be. For one thing, later
queries might benefit from MySQL or OS caching, as I did not restart
the MySQL server or the machine between tests. However, since 1 000
tests are run in each case, this might not affect the overall picture
significantly. I did test a few "dry runs", and found a variance of
less than 5%.
- No joins or other comparisons are done in the test. It is therefore
not a very real-world example (in fact, it's basically selecting
values we already know ;-). We still need to determine how it will run
along with joins.
- It might be useful to compare this to a "recipe" SELECT returning the
same number of rows using a range for the id (a rough sketch follows
these notes).
- I should have varied the length of the random data inserted into the
foo column. That would possibly have exercised the table handler
and/or I/O subsystem more.
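For the range-based comparison mentioned above, a sketch of what such a
"recipe" SELECT might look like -- the BETWEEN form and the random
starting point are assumptions, and the connection parameters are again
placeholders:

import random, time
import MySQLdb

conn = MySQLdb.connect(db="test")   # placeholder connection parameters
cur = conn.cursor()
TABLE_ROWS = 1000000

def range_select(n_rows):
    # Select a contiguous block of ids of the requested size,
    # starting at an arbitrary point in the table.
    start = random.randrange(1, TABLE_ROWS - n_rows + 2)
    t0 = time.time()
    cur.execute("SELECT * FROM document WHERE id BETWEEN %d AND %d"
                % (start, start + n_rows - 1))
    cur.fetchall()
    return time.time() - t0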
I guess the next step would be to incorporate this in the searching
proper, unless there are other tests I should run first?
Cheers
-- Marcus