[PyIndexer] MySQL searching: performance of IN for large sets
From: Marcus C. <ma...@wr...> - 2001-12-17 18:54:10
I was stunned. Passing a list of 10K document ids to MySQL hardly made it
sweat.

Testing environment:

- FreeBSD 4.4-STABLE
- MySQL 3.23.46, compiled from source, all knobs set to default
- Python 2.1.1, compiled from source
- GCC 2.95.3
- MySQLdb 0.9.0
- Celeron 500, 384MB SDRAM
- Seagate Barracuda IDE 28GB / 7 200 RPM / 2MB cache

Test table:

    CREATE TABLE document (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
        foo VARCHAR(255)
    ) TYPE=BDB

Test data: 1M rows, with sequential ids and a 16-character random string
for foo.

Method: I generated a list of random ids and built an SQL list from it. A
simple SELECT, returning the whole row, was used; the WHERE clause
restricted results to those with id IN (the_random_list). The standard
cursor class's fetchall() was then used to retrieve the results. The list
length and the percentage of ID hits were varied -- the percentage of ID
hits being the percentage of the random IDs in the list which actually
appear in the table. Ideally, one would want to vary the number of table
rows as well. All tests were repeated 1 000 times, and times are wallclock
times. Load from other processes was reasonably constant, accounting for
< 5% of total CPU. (A simplified sketch of the test script appears after
the notes below.)

Results:

    Table rows  IN list length  % ID hits  Time / select  Parse time / select
     1 000 000              10       10 %       2.472 ms             2.068 ms
     1 000 000              10       50 %       2.524 ms
     1 000 000              10      100 %       2.800 ms
     1 000 000              50       10 %       4.183 ms             2.661 ms
     1 000 000              50       50 %       5.253 ms
     1 000 000              50      100 %       6.366 ms
     1 000 000             100       10 %       6.918 ms             3.367 ms
     1 000 000             100       50 %       9.242 ms
     1 000 000             100      100 %      11.817 ms
     1 000 000             500       10 %      28.891 ms            12.129 ms
     1 000 000             500       50 %      40.089 ms
     1 000 000             500      100 %      54.389 ms
     1 000 000           1 000       10 %      57.018 ms            21.876 ms
     1 000 000           1 000       50 %      81.902 ms
     1 000 000           1 000      100 %     108.954 ms
     1 000 000           5 000       10 %     301.138 ms           118.765 ms
     1 000 000           5 000       50 %     719.837 ms
     1 000 000           5 000      100 %   1 168.819 ms
     1 000 000          10 000       10 %     628.415 ms           255.822 ms
     1 000 000          10 000       50 %   1 425.179 ms
     1 000 000          10 000      100 %   1 962.224 ms

Some notes:

- The parse / select time is iffy -- it includes the time taken to send the
  query to the server and return the EXPLAIN output. It is listed only once
  for each combination of table size and list length, as the % ID hits is
  irrelevant to the time taken by the parser. It should probably be thought
  of more properly as MySQL overhead time.

- Python accounts for 10 - 15% of CPU usage while huge result sets are
  being returned, and system CPU utilisation shoots up to as much as 30%.
  Both these factors indicate that dividing the load between app and db
  server over separate machines might be quite worthwhile, although neither
  process was starved for CPU.

- Both the time taken to parse the query and the overall time are almost
  linear functions of the number of list items. This will probably become
  less relevant with real-world data, especially in the case of the
  parsing, but it's a Good Thing!

- The process is not as scientific as it might be. For one thing, later
  queries might benefit from MySQL or OS caching, as I did not restart the
  MySQL server or the machine between tests. However, since 1 000 tests are
  run in each case, this might not affect the overall picture
  significantly. I did test a few "dry runs" and found a variance of less
  than 5%.

- No joins or other comparisons are done in the test, so it is not a very
  real-world example (in fact, it's basically selecting values we already
  know ;-). We still need to determine how it will run alongside joins.

- It might be useful to compare this to a "recipe" SELECT returning the
  same number of rows using a range for the id.

- I should have varied the length of the random data inserted into the foo
  column. That would possibly have exercised the table handler and/or the
  I/O subsystem more.
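For reference, here is roughly what the test boils down to. This is a
simplified sketch rather than the exact code: the connection details, the
way the "miss" ids are generated and the helper names are placeholders.

    # Sketch of the benchmark (Python 2.1 / MySQLdb 0.9.0).
    import random, string, time, MySQLdb

    TABLE_ROWS = 1000000
    REPEATS    = 1000

    def populate(cursor):
        """Fill the table with sequential ids and 16-char random strings."""
        chars = string.letters + string.digits
        for i in xrange(TABLE_ROWS):
            foo = ''.join([random.choice(chars) for j in range(16)])
            cursor.execute("INSERT INTO document (foo) VALUES ('%s')" % foo)

    def make_id_list(length, hit_pct):
        """Build the comma-separated id list for the IN clause.

        hit_pct of the ids are drawn from the range that exists in the
        table; the rest come from well above the AUTO_INCREMENT range so
        they can never match (any scheme guaranteeing misses would do).
        """
        hits = length * hit_pct / 100
        ids  = [random.randrange(1, TABLE_ROWS + 1) for i in range(hits)]
        ids += [random.randrange(TABLE_ROWS * 10, TABLE_ROWS * 20)
                for i in range(length - hits)]
        return ','.join(map(str, ids))

    def time_selects(cursor, length, hit_pct):
        """Average wallclock time per SELECT, in milliseconds."""
        total = 0.0
        for i in range(REPEATS):
            sql = ("SELECT * FROM document WHERE id IN (%s)"
                   % make_id_list(length, hit_pct))
            start = time.time()
            cursor.execute(sql)
            cursor.fetchall()
            total = total + (time.time() - start)
        return total / REPEATS * 1000.0

    def time_explains(cursor, length):
        """'Parse time': round-trip for EXPLAIN on an equivalent query."""
        total = 0.0
        for i in range(REPEATS):
            sql = ("EXPLAIN SELECT * FROM document WHERE id IN (%s)"
                   % make_id_list(length, 100))
            start = time.time()
            cursor.execute(sql)
            cursor.fetchall()
            total = total + (time.time() - start)
        return total / REPEATS * 1000.0

    conn = MySQLdb.connect(db='test')      # placeholder connection details
    cursor = conn.cursor()
    for length in (10, 50, 100, 500, 1000, 5000, 10000):
        print "IN list length %6d: parse %9.3f ms" \
              % (length, time_explains(cursor, length))
        for hit_pct in (10, 50, 100):
            print "  %3d%% hits: %10.3f ms / select" \
                  % (hit_pct, time_selects(cursor, length, hit_pct))

The two timing loops correspond to the two time columns in the table above:
the plain SELECT loop gives "Time / select", and the EXPLAIN loop is what I
called "Parse time / select" (really the MySQL round-trip overhead).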
I guess the next step would be to incorporate this in the searching proper,
unless there are other tests I should run first?

Cheers

-- Marcus