From: Thad G. <tha...@gm...> - 2021-01-06 18:22:53
Good job, Jeff! Glad this is useful in the long term for you, and perhaps for other
contributors later on, to help with optimization in particular areas. Yeap, you got
it... there is `movapd` using the xmm3 and xmm0 registers respectively. In other
words, you have verified that this particular operation is indeed using SSE2
instructions on your test Athlon CPU, with the corresponding latencies and
throughput as described in its reference, appendix C.8 for SSE2, here:
https://www.amd.com/system/files/TechDocs/25112.PDF#G14.232935

Note:

> Although one SSE2 instruction can operate on twice as much data as an MMX
> instruction, performance might not increase significantly. Two major
> reasons are: accessing SSE2 data in memory not aligned
> <https://en.wikipedia.org/wiki/Data_structure_alignment> to a 16-byte
> boundary can incur significant penalty, and the throughput
> <https://en.wikipedia.org/wiki/Throughput> of SSE2 instructions in older
> x86 <https://en.wikipedia.org/wiki/X86> implementations was half that for
> MMX instructions. Intel <https://en.wikipedia.org/wiki/Intel> addressed
> the first problem by adding an instruction in SSE3
> <https://en.wikipedia.org/wiki/SSE3> to reduce the overhead of accessing
> unaligned data and improving the overall performance of misaligned loads,
> and the last problem by widening the execution engine in their Core
> microarchitecture <https://en.wikipedia.org/wiki/Core_(microarchitecture)>
> in Core 2 Duo and later products.

https://en.wikipedia.org/wiki/SSE2

Anyway, you now have a better understanding of what might be happening behind
the scenes with intrinsic functions (used or not) and the HotSpot VM. I noticed
you have your source targeting OpenJDK 11, so if you want to see the mapping of
the vmSymbols for the intrinsic functions, scroll through this file or just
search (here's jdk10):
http://hg.openjdk.java.net/jdk10/jdk10/hotspot/file/5ab7a67bc155/src/share/vm/classfile/vmSymbols.hpp
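
If you want to watch one of those intrinsics fire on your own machine, a minimal
sketch along these lines should do it. The class name here is made up for
illustration, and Math.sqrt is used only because it is a well-known example of a
method HotSpot replaces with an intrinsic; the flags are the usual HotSpot
diagnostic ones:

    // IntrinsicProbe.java -- illustrative only. A tight loop around a method
    // that HotSpot is known to replace with an intrinsic, so the decision
    // shows up in the diagnostic output.
    public class IntrinsicProbe {
        public static void main(String[] args) {
            double acc = 0.0;
            // Enough iterations for the loop to be JIT-compiled.
            for (int i = 1; i < 5_000_000; i++) {
                acc += Math.sqrt(i);
            }
            // Print the result so the loop cannot be optimized away.
            System.out.println(acc);
        }
    }
    //
    // Run with (the unlock flag must come before the diagnostic flag):
    //   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintIntrinsics IntrinsicProbe

The exact output format varies between JDK versions, but once the loop is hot the
sqrt call should show up marked as an intrinsic.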
Thad
https://www.linkedin.com/in/thadguidry/

On Wed, Jan 6, 2021 at 10:28 AM Jeff Allen <ja...@fa...> wrote:

> On 06/01/2021 02:33, Thad Guidry wrote:
>
> Hi Jeff!
>
> I'm from the OpenRefine team, where we are constantly watching the future
> of Jython, since we use it as an expression language within OpenRefine,
> along with Clojure. We've talked on the mailing list in the past, I think,
> or perhaps not.
>
> I think we have. Thanks for your continued interest in Jython.
>
> Regarding the microbenchmarks and your analysis... and some of the
> anomalies you found... I'm wondering if you verified whether SIMD, SSE,
> etc. intrinsics were sometimes being used or not?
> https://www.amd.com/system/files/TechDocs/25112.PDF#G14.232935
>
> Yes, I found similar information: that's what led to my conclusions about
> the quartic test. I'm impressed HotSpot is able to use them.
>
> And to see if intrinsic methods are being utilized or not, and where in
> the compiled code, you can add:
> -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
>
> Unlock has to come first, it seems. I've experimented with those options
> and found what they produced was pretty incomprehensible. I never made the
> disassembly option work.
>
> Going back and trying a little harder, thanks to your suggestion, I got
> further this morning. The output remains too complex for me to follow (so
> many jumps!), but a superficial inspection supports the conjectures I made
> based only on timing. In particular, of the three fixtures, only for
> Jython 2 does the JVM manage to in-line the floating point arithmetic into
> quartic(). It contains this in what I assume is the fast path:
>
> 0x00000202187735df: movapd xmm3,xmm0
> 0x00000202187735e3: addsd  xmm3,xmm2
> 0x00000202187735e7: subsd  xmm2,xmm0  ;*dsub {reexecute=0 rethrow=0 return_oop=0}
>                                       ; - org.python.core.PyFloat::float___sub__@23 (line 486)
>                                       ; - org.python.core.PyFloat::__sub__@2 (line 477)
>                                       ; - org.python.core.PyObject::_basic_sub@2 (line 2192)
>                                       ; - org.python.core.PyObject::_sub@31 (line 2177)
>                                       ; - uk.co.farowl.jy2bm.PyFloatBinary::quartic@33 (line 86)
> 0x00000202187735eb: mulsd  xmm3,xmm1
>
> I added a task to the Gradle scripts that dumps the compiled code (if one
> has the hsdis-amd64 plug-in), as I'm sure to forget how I did this:
> https://github.com/jeff5/very-slow-jython/blob/333f61d54787f7499ec8141eafe6b8c5c04f0cea/jy2bm/jy2bm.gradle#L74
>
> You might also find some thought-provoking extra information in this JEP:
> https://bugs.openjdk.java.net/browse/JDK-8205637
>
> Some Java JVM compilers, and many of Java's robust libraries, miss the
> point of using intrinsic functions as often as possible. For example, SSE 4.2:
> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=898,2862,2861,2860,2863,2864,2865&techs=SSE4_2
> The reason Azul's Zing JVM is fast is that it DOES use intrinsic functions
> as much as possible. Kris Mok (Azul Systems) did a great presentation on
> this back in 2013:
> https://www.slideshare.net/RednaxelaFX/green-teajug-hotspotintrinsics02232013
>
> One of the nice things about a JVM language is that it gets better when
> other people do clever things.
>
> Jeff
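
The fast path quoted above can be approximated with a plain-double stand-in.
What follows is only a sketch, not the jy2bm fixture itself: the class and
method names are invented, and whether HotSpot emits exactly the
addsd/subsd/mulsd sequence shown depends on the CPU and the JVM build.

    // QuarticSketch.java -- an invented stand-in (not the jy2bm benchmark) for
    // the kind of unboxed double arithmetic discussed above: one add, one
    // subtract and one multiply, which HotSpot can compile down to scalar
    // SSE2 instructions such as addsd, subsd and mulsd on x86-64.
    public class QuarticSketch {

        static double quartic(double a, double b) {
            return (a + b) * (b - a);
        }

        public static void main(String[] args) {
            double acc = 0.0;
            // Call the method enough times for the JIT to compile it.
            for (int i = 0; i < 10_000_000; i++) {
                acc += quartic(i * 0.5, i * 0.25);
            }
            System.out.println(acc);
        }
    }
    //
    // Compilation and inlining decisions (unlock flag first, as noted above):
    //   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining QuarticSketch
    //
    // With the hsdis-amd64 plug-in on the JVM's library path, the generated
    // code for just this one method can be dumped with something like:
    //   java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,QuarticSketch::quartic QuarticSketch

Without the hsdis plug-in, HotSpot typically just warns that it cannot load the
disassembler, which is why the Gradle task above checks for it.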