From: <me...@ho...> - 2005-11-09 14:27:56
This is a report of benchmarking x86 inline allocation, following up on Juho's great improvement of the same on the x86-64. Three versions were tested, all of them based on 0.9.5.85 (= 0.9.6):

* pristine: this one has inline allocation disabled
* xor-swap: use the xor swap trick instead of XCHG
* stack-temporary: use a stack temporary and mov instead of XCHG

Each was tested on uni- and multithread, so that's six configurations. Oh, in the comparison matrix there is one more, just to see how bad it was:

* inline-alloc: this one simply has inline allocation enabled on a multithreaded build

Executive summary: inline-alloc is slow; xor-swap and stack-temporary are significantly faster in some tests, the only difference between them being the strange peaks in the multithread-stack-temporary column (puzzle, string-concat, ...).

I'm leaning towards merging xor-swap if you don't have better solutions in mind.

Cheers, Gábor

MULTITHREAD
---------

* pristine
** core file size: 26406912
** cl-bench total (two runs)

real 3m49.590s
user 3m5.146s
sys 0m29.147s

real 3m48.000s
user 3m4.493s
sys 0m29.094s

* xor-swap
** core file size: 26804224 (+ 1.28742%)
** cl-bench total (two runs)

real 3m41.404s
user 3m1.428s
sys 0m28.740s

real 3m41.538s
user 3m0.900s
sys 0m28.973s

* stack-temporary
** core file size: 26746880 (+ 1.0%)
** cl-bench total (two runs)

real 3m48.753s
user 3m5.587s
sys 0m29.175s

real 3m46.859s
user 3m5.970s
sys 0m28.845s

UNITHREAD
---------

* pristine
** core file size: 25272320
** cl-bench total

real 3m40.151s
user 2m58.790s
sys 0m29.081s

* xor-swap
** core file size: 25612288 (+ 1.3%)
** cl-bench total

real 3m56.109s
user 2m56.367s
sys 0m29.022s

* stack-temporary
** core file size: 25563136 (+ 1.1%)
** cl-bench total

real 3m43.939s
user 2m57.387s
sys 0m29.103s
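For readers unfamiliar with the two XCHG replacements named above, here is a minimal sketch of the underlying idea. It is not the SBCL allocation sequence: the function names and the 32-bit mask are assumptions made for illustration, and Python integers stand in for x86 registers. (XCHG with a memory operand is implicitly locked on x86, which is a common reason to avoid it; whether that is the motivation here is an assumption.)

```python
MASK32 = 0xFFFFFFFF  # assumption: model values as 32-bit x86 words


def xor_swap(a, b):
    """Swap two values with three XORs and no scratch location."""
    a ^= b  # a now holds a0 ^ b0
    b ^= a  # b = b0 ^ (a0 ^ b0) = a0
    a ^= b  # a = (a0 ^ b0) ^ a0 = b0
    return a & MASK32, b & MASK32


def temp_swap(a, b):
    """Swap through a temporary, as a stack slot plus MOVs would."""
    tmp = a  # save the first value in the temporary
    a = b
    b = tmp
    return a & MASK32, b & MASK32


if __name__ == "__main__":
    print(xor_swap(0x1234, 0xBEEF))
    print(temp_swap(0x1234, 0xBEEF))
```

Both functions return the same pair; the difference the benchmark measures lies in the machine instructions each approach needs, which this sketch does not model.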
From: <me...@ho...> - 2005-11-12 20:00:57
Attachments:
report
In 0.9.6.38 I committed an updated version of the patch that is more efficient and doesn't produce extreme slowdowns (1.4-2.2) on some unlucky cl-bench tests even on Pentium M. These slowdowns are likely related to cache size. I can only guess what the results on P3 (walrus) will look like.

For reference, here is the cl-bench report on Pentium M comparing:

  0.9.6.37
  0.9.6.37.better-smart-alloc: this was committed (/2 is another run)
  0.9.6.37.smart-alloc: another version with different branching
  0.9.6.37.xor-swap (in the previous mail)
  0.9.6.37.p-a: a small, unrelated pseudo-atomic optimization for x86/x86-64

And here are the totals for builds and cl-bench runs:

[sbcl-pristine]$
real 30m8.331s
user 25m42.969s
sys 1m3.826s

real 8m38.480s
user 7m1.191s
sys 0m57.427s

[sbcl-xor-swap]$
real 27m29.413s
user 25m19.798s
sys 1m2.387s

real 8m56.104s
user 6m58.611s
sys 0m56.130s

[sbcl-smart-alloc]$
real 29m32.288s
user 25m30.392s
sys 1m4.207s

real 8m27.872s
user 6m51.039s
sys 0m55.618s

[sbcl-better-smart-alloc]$
real 30m7.539s
user 25m36.266s
sys 1m3.763s

real 8m35.257s
user 6m49.203s
sys 0m56.165s
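To compare the totals above more easily, the `time`-style figures can be converted to seconds and expressed relative to the pristine build. The sketch below assumes that the second (shorter) timing in each pair is the cl-bench run and uses its `user` line; the helper names `to_seconds` and `percent_change` are made up for this illustration.

```python
import re


def to_seconds(t):
    """Convert a `time`-style string such as '7m1.191s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", t)
    if m is None:
        raise ValueError(f"unrecognised time format: {t!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


def percent_change(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    base = to_seconds(baseline)
    return (to_seconds(candidate) - base) / base * 100.0


# 'user' times of the second timing in each pair, copied from the mail
# (assumed here to be the cl-bench run rather than the build).
CL_BENCH_USER = {
    "pristine": "7m1.191s",
    "xor-swap": "6m58.611s",
    "smart-alloc": "6m51.039s",
    "better-smart-alloc": "6m49.203s",
}

if __name__ == "__main__":
    baseline = CL_BENCH_USER["pristine"]
    for name, t in CL_BENCH_USER.items():
        print(f"{name:20s} {to_seconds(t):8.3f}s "
              f"{percent_change(baseline, t):+.2f}%")
```

A negative percentage means the variant used less user time than pristine. Single runs like these are noisy, so small differences should be read with caution.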