From: Jeff A. <ja...@fa...> - 2014-08-17 08:05:05
In addressing #2100 I'm keen to avoid increasing the execution time of operations on unicode objects that *don't* contain supplementary characters, and to minimise time overall. I believe I know how to do that, but I want to be able to prove it, so I've devised a benchmark. This may be of interest to others.

When benchmarking Java it seems almost impossible to get repeatable results. It is common practice to run each test multiple times and take a minimum (see the timeit module), to compensate for transient scheduling disturbances. This hasn't worked for me with the JVM, which steps in when it feels like it and changes the code being measured. Two repeat-blocks running the same test may yield radically different times, because the JVM adds a permanent optimisation some time during the second block.

I seem to have found an approach that works: I repeat the entire sequence of tests, and take corresponding minimum times (there's a sketch of this scheme after the Jython figures below). It means the code is likely to be in the same state for successive tests, rather than adapting to the test currently being repeated. Now I think about it, randomising the order ought to help too. Turning off the JIT compiler also produces fairly consistent times, but they're about forty times slower, and those are not the conditions under which we most need to understand our code.

My benchmark is at:
https://bitbucket.org/tournesol/jython-ja/commits/aed06364c44d7089d58affa92c0502f477274b8f#chg-tests/python/unicode_index_times.py
I'll add this to the central repo in due course: I'm just trying to avoid creating merges.

The figures are quite interesting, as is the comparison with CPython. I'm doing this on Windows 7 x64 (occupying a single AMD CPU at 2.9GHz). This is for Jython (microseconds per operation):

>%JT%\dist\bin\jython unicode_index_times.py
time (us)        len      [i]   [i:i+n]  find(c)  find(str)  rfind(c)  rfind(str)
single bmp         1    0.269     3.086    0.279      0.122     0.228       0.121
single             1    0.259     2.381    0.275      0.122     0.226       0.122
short bmp         10    0.155     0.418    0.279      0.159     0.216       0.145
short 50%         10    0.188     0.591    0.290      0.160     0.213       0.149
medium bmp       100    0.144     0.195    0.255      0.250     0.172       0.284
medium 10%       100    0.316     0.718    0.274      0.274     0.168       0.260
long bmp        1000    0.143     0.211    0.259      1.229     0.152       1.218
long 1%         1000    1.390     4.261    0.275      1.225     0.151       1.240
long 10%        1000    1.935     2.976    0.291      1.264     0.144       1.249
long 10% low    1000    1.600     2.931    0.285      1.267     0.145       1.241
long 10% high   1000    1.385     2.534    0.282      1.223     0.145       1.302
long 50%        1000    2.450     3.737    0.301      1.321     0.124       1.367
huge bmp       10000    0.139     0.464    0.235      7.111     0.142       7.661
huge 10%       10000   11.113    39.374    0.247     11.305     0.136      11.358

The percentages are the proportion of non-BMP characters in the string, and "low" and "high" refer to concentrating them in the first or last quarter of the string. (I expect that to make a difference in one implementation I have in mind; a sketch of how such strings might be built also follows below.)

The presence of non-BMP characters makes a striking difference to index and slice operations in long strings, because these have to count along the string to convert the character (code point) index to the implementation (UTF-16) index. The operations find() and rfind() don't show such a big difference, but they also get the wrong answer, by failing to perform the reverse translation once they have found the correct UTF-16 index. find(c) is roughly independent of string length because the operation timed is the incremental find of a single letter, and there's nearly always a match about 26 cells away. It's odd that a reverse find should be noticeably quicker.
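To make that translation concrete, here is a minimal sketch of the two conversions involved, assuming the string is held as a sequence of UTF-16 code units; the helper names are mine for illustration, not anything in PyUnicode:

    # Sketch only: in UTF-16 a supplementary character occupies two code
    # units (a surrogate pair), so code point and code unit indices diverge.

    def utf16_index(units, cp_index):
        """Translate a code point index to a UTF-16 code unit index by
        counting along the string (high surrogates are U+D800..U+DBFF)."""
        u = 0
        for _ in range(cp_index):
            u += 2 if 0xD800 <= units[u] <= 0xDBFF else 1
        return u

    def code_point_index(units, u_index):
        """The reverse translation that find()/rfind() should apply to
        the code unit index they locate."""
        cp, u = 0, 0
        while u < u_index:
            u += 2 if 0xD800 <= units[u] <= 0xDBFF else 1
            cp += 1
        return cp

    # For example, u"a\U0001D400b" is stored as [0x61, 0xD835, 0xDC00, 0x62],
    # so utf16_index(units, 2) == 3 and code_point_index(units, 3) == 2.

Both directions are linear counts along the string, which is exactly what the long and huge rows above are paying for.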
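On building the test strings: the benchmark's own construction is in unicode_index_times.py, but a hypothetical version of the idea, covering the percentages and the low/high variants, might look like this:

    # Hypothetical sketch (not the benchmark's actual code): build a string
    # of a given length with a given fraction of supplementary characters,
    # optionally concentrated in the first or last quarter.
    import random

    def make_text(length, fraction=0.0, zone=None):
        chars = [unichr(ord('a') + i % 26) for i in range(length)]
        if zone == 'low':
            positions = range(length // 4)               # first quarter
        elif zone == 'high':
            positions = range(3 * length // 4, length)   # last quarter
        else:
            positions = range(length)
        # the fraction must be small enough to fit within the chosen zone
        for i in random.sample(positions, int(length * fraction)):
            chars[i] = u'\U0001D400'                     # a supplementary character
        return u''.join(chars)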
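And here is the promised sketch of the measurement scheme: repeat the whole sequence of tests and take per-test minima, so the JIT-compiled state of the code is likely to be the same each time a given test runs. The names are illustrative, not the actual structure of unicode_index_times.py:

    import random
    from timeit import default_timer as timer

    def run_suite(tests, repeats=10, calls=1000, shuffle=True):
        """tests is a list of (name, zero-argument callable) pairs.
        Returns the minimum observed per-call time (seconds) per test."""
        best = dict((name, float('inf')) for name, _ in tests)
        for _ in range(repeats):
            order = list(tests)
            if shuffle:
                random.shuffle(order)  # randomising the order should help too
            for name, func in order:
                t0 = timer()
                for _ in range(calls):
                    func()
                best[name] = min(best[name], (timer() - t0) / calls)
        return best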
The following is for CPython 2.7.6. On Windows this means we're confined to basic-plane characters. Index and slice operations are a little slower than in the Jython version (for BMP strings, and once the JIT has done its work):

>python unicode_index_times.py
Narrow build: all characters from BMP
time (us)        len      [i]   [i:i+n]  find(c)  find(str)  rfind(c)  rfind(str)
single bmp         1    0.447     1.465    0.633      0.502     0.674       0.501
single             1    0.444     1.457    0.635      0.501     0.671       0.501
short bmp         10    0.204     0.382    0.669      0.549     0.667       0.555
short 50%         10    0.203     0.379    0.670      0.552     0.669       0.558
medium bmp       100    0.179     0.233    0.708      0.606     0.627       0.610
medium 10%       100    0.180     0.231    0.714      0.588     0.626       0.586
long bmp        1000    0.185     0.268    0.743      1.020     0.629       1.034
long 1%         1000    0.186     0.271    0.744      1.010     0.630       1.062
long 10%        1000    0.186     0.264    0.745      1.223     0.630       1.165
long 10% low    1000    0.186     0.265    0.758      1.084     0.630       1.024
long 10% high   1000    0.186     0.264    0.745      1.018     0.628       1.135
long 50%        1000    0.186     0.263    0.746      1.243     0.630       1.149
huge bmp       10000    0.186     0.390    0.750      5.097     0.622       4.394
huge 10%       10000    0.186     0.394    0.747      8.275     0.626       7.291

And this is for CPython 3.4.1 (where we have the full Unicode range):

>python unicode_index_times.py
time (us)        len      [i]   [i:i+n]  find(c)  find(str)  rfind(c)  rfind(str)
single bmp         1    0.703     2.101    0.901      0.707     0.932       0.693
single             1    0.701     2.098    0.891      0.708     0.933       0.696
short bmp         10    0.310     0.613    0.942      0.783     0.907       0.777
short 50%         10    0.337     0.650    1.095      1.115     1.043       1.070
medium bmp       100    0.271     0.423    1.036      0.853     0.792       0.833
medium 10%       100    0.288     0.460    1.404      1.153     1.009       1.119
long bmp        1000    0.300     0.479    1.162      1.396     0.789       1.302
long 1%         1000    0.311     0.575    1.635      1.709     1.081       1.725
long 10%        1000    0.302     0.562    1.610      1.919     1.041       1.769
long 10% low    1000    0.309     0.589    1.597      1.787     1.054       1.479
long 10% high   1000    0.316     0.610    1.605      1.764     1.060       1.642
long 50%        1000    0.373     0.573    1.481      1.819     0.925       1.735
huge bmp       10000    0.301     0.577    1.144      5.596     0.800       4.464
huge 10%       10000    0.315     0.817    1.645      9.053     0.985       6.550

Now onwards and upwards, to do some good to PyUnicode itself.

--
Jeff Allen