From: Jeff A. <ja...@fa...> - 2014-08-17 08:05:05
In addressing #2100 I'm keen to avoid increasing the execution time of operations on unicode objects that *don't* contain supplementary characters, and to minimise time overall. I believe I know how to do that, but I want to be able to prove it, so I've devised a benchmark. This may be of interest to others.

When benchmarking Java it seems almost impossible to get repeatable results. It is common practice to run each test multiple times and take a minimum (see the timeit module), to compensate for transient scheduling disturbances. This hasn't worked for me with the JVM, which steps in when it feels like it and changes the code being measured. Two repeat-blocks running the same test may yield radically different times, because the JVM adds a permanent optimisation some time during the second block.

I seem to have found an approach that works: I repeat the entire sequence of tests, and take corresponding minimum times (there's a sketch of this scheme after the Jython figures below). It means the code is likely to be in the same state for successive tests, rather than adapting to the test currently being repeated. Now I think about it, randomising the order ought to help too. Turning off the JIT compiler also produces fairly consistent times, but they're about forty times slower, and those are not the conditions under which we most need to understand our code.

My benchmark is at:
https://bitbucket.org/tournesol/jython-ja/commits/aed06364c44d7089d58affa92c0502f477274b8f#chg-tests/python/unicode_index_times.py
I'll add this to the central repo in due course: I'm just trying to avoid creating merges.

The figures are quite interesting, as is the comparison with CPython. I'm doing this on Windows 7 x64 (occupying a single AMD CPU at 2.9GHz). This is for Jython (microseconds per operation):

>%JT%\dist\bin\jython unicode_index_times.py
time (us)        len      [i]   [i:i+n]  find(c)  find(str)  rfind(c)  rfind(str)
single bmp         1    0.269     3.086    0.279      0.122     0.228       0.121
single             1    0.259     2.381    0.275      0.122     0.226       0.122
short bmp         10    0.155     0.418    0.279      0.159     0.216       0.145
short 50%         10    0.188     0.591    0.290      0.160     0.213       0.149
medium bmp       100    0.144     0.195    0.255      0.250     0.172       0.284
medium 10%       100    0.316     0.718    0.274      0.274     0.168       0.260
long bmp        1000    0.143     0.211    0.259      1.229     0.152       1.218
long 1%         1000    1.390     4.261    0.275      1.225     0.151       1.240
long 10%        1000    1.935     2.976    0.291      1.264     0.144       1.249
long 10% low    1000    1.600     2.931    0.285      1.267     0.145       1.241
long 10% high   1000    1.385     2.534    0.282      1.223     0.145       1.302
long 50%        1000    2.450     3.737    0.301      1.321     0.124       1.367
huge bmp       10000    0.139     0.464    0.235      7.111     0.142       7.661
huge 10%       10000   11.113    39.374    0.247     11.305     0.136      11.358

The percentages are the proportion of non-BMP characters in the string, and "low" and "high" refer to concentrating them in the first or last quarter of the string. (I expect that to make a difference in one implementation I have in mind; a sketch of how such strings might be built also follows below.)

The presence of non-BMP characters makes a striking difference to index and slice operations in long strings, because these have to count along the string to convert the character (code point) index to the implementation (UTF-16) index. The operations find() and rfind() don't show such a big difference, but they also get the wrong answer, by failing to perform the reverse translation once they have found the correct UTF-16 index. find(c) is roughly independent of string length because the operation timed is the incremental find of a single letter, and there's nearly always a match about 26 cells away. It's odd that a reverse find should be noticeably quicker.
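To make that translation concrete, here is a minimal sketch of the two conversions involved, assuming the string is held as a sequence of UTF-16 code units; the helper names are mine for illustration, not anything in PyUnicode:

    # Sketch only: in UTF-16 a supplementary character occupies two code
    # units (a surrogate pair), so code point and code unit indices diverge.

    def utf16_index(units, cp_index):
        """Translate a code point index to a UTF-16 code unit index by
        counting along the string (high surrogates are U+D800..U+DBFF)."""
        u = 0
        for _ in range(cp_index):
            u += 2 if 0xD800 <= units[u] <= 0xDBFF else 1
        return u

    def code_point_index(units, u_index):
        """The reverse translation that find()/rfind() should apply to
        the code unit index they locate."""
        cp, u = 0, 0
        while u < u_index:
            u += 2 if 0xD800 <= units[u] <= 0xDBFF else 1
            cp += 1
        return cp

    # For example, u"a\U0001D400b" is stored as [0x61, 0xD835, 0xDC00, 0x62],
    # so utf16_index(units, 2) == 3 and code_point_index(units, 3) == 2.

Both directions are linear counts along the string, which is exactly what the long and huge rows above are paying for.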
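On building the test strings: the benchmark's own construction is in unicode_index_times.py, but a hypothetical version of the idea, covering the percentages and the low/high variants, might look like this:

    # Hypothetical sketch (not the benchmark's actual code): build a string
    # of a given length with a given fraction of supplementary characters,
    # optionally concentrated in the first or last quarter.
    import random

    def make_text(length, fraction=0.0, zone=None):
        chars = [unichr(ord('a') + i % 26) for i in range(length)]
        if zone == 'low':
            positions = range(length // 4)               # first quarter
        elif zone == 'high':
            positions = range(3 * length // 4, length)   # last quarter
        else:
            positions = range(length)
        # the fraction must be small enough to fit within the chosen zone
        for i in random.sample(positions, int(length * fraction)):
            chars[i] = u'\U0001D400'                     # a supplementary character
        return u''.join(chars)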
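And here is the promised sketch of the measurement scheme: repeat the whole sequence of tests and take per-test minima, so the JIT-compiled state of the code is likely to be the same each time a given test runs. The names are illustrative, not the actual structure of unicode_index_times.py:

    import random
    from timeit import default_timer as timer

    def run_suite(tests, repeats=10, calls=1000, shuffle=True):
        """tests is a list of (name, zero-argument callable) pairs.
        Returns the minimum observed per-call time (seconds) per test."""
        best = dict((name, float('inf')) for name, _ in tests)
        for _ in range(repeats):
            order = list(tests)
            if shuffle:
                random.shuffle(order)  # randomising the order should help too
            for name, func in order:
                t0 = timer()
                for _ in range(calls):
                    func()
                best[name] = min(best[name], (timer() - t0) / calls)
        return best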
The following is for CPython 2.7.6. On Windows this means we're confined to basic-plane characters. Index and slice operations are a little slower than in the Jython version (for BMP strings, and once the JIT has done its work):

>python unicode_index_times.py
Narrow build: all characters from BMP
time (us)        len      [i]   [i:i+n]  find(c)  find(str)  rfind(c)  rfind(str)
single bmp         1    0.447     1.465    0.633      0.502     0.674       0.501
single             1    0.444     1.457    0.635      0.501     0.671       0.501
short bmp         10    0.204     0.382    0.669      0.549     0.667       0.555
short 50%         10    0.203     0.379    0.670      0.552     0.669       0.558
medium bmp       100    0.179     0.233    0.708      0.606     0.627       0.610
medium 10%       100    0.180     0.231    0.714      0.588     0.626       0.586
long bmp        1000    0.185     0.268    0.743      1.020     0.629       1.034
long 1%         1000    0.186     0.271    0.744      1.010     0.630       1.062
long 10%        1000    0.186     0.264    0.745      1.223     0.630       1.165
long 10% low    1000    0.186     0.265    0.758      1.084     0.630       1.024
long 10% high   1000    0.186     0.264    0.745      1.018     0.628       1.135
long 50%        1000    0.186     0.263    0.746      1.243     0.630       1.149
huge bmp       10000    0.186     0.390    0.750      5.097     0.622       4.394
huge 10%       10000    0.186     0.394    0.747      8.275     0.626       7.291

And this is for CPython 3.4.1 (where we have the full Unicode range):

>python unicode_index_times.py
time (us)        len      [i]   [i:i+n]  find(c)  find(str)  rfind(c)  rfind(str)
single bmp         1    0.703     2.101    0.901      0.707     0.932       0.693
single             1    0.701     2.098    0.891      0.708     0.933       0.696
short bmp         10    0.310     0.613    0.942      0.783     0.907       0.777
short 50%         10    0.337     0.650    1.095      1.115     1.043       1.070
medium bmp       100    0.271     0.423    1.036      0.853     0.792       0.833
medium 10%       100    0.288     0.460    1.404      1.153     1.009       1.119
long bmp        1000    0.300     0.479    1.162      1.396     0.789       1.302
long 1%         1000    0.311     0.575    1.635      1.709     1.081       1.725
long 10%        1000    0.302     0.562    1.610      1.919     1.041       1.769
long 10% low    1000    0.309     0.589    1.597      1.787     1.054       1.479
long 10% high   1000    0.316     0.610    1.605      1.764     1.060       1.642
long 50%        1000    0.373     0.573    1.481      1.819     0.925       1.735
huge bmp       10000    0.301     0.577    1.144      5.596     0.800       4.464
huge 10%       10000    0.315     0.817    1.645      9.053     0.985       6.550

Now onwards and upwards, to do some good to PyUnicode itself.

--
Jeff Allen