From: Thad G. <tha...@gm...> - 2021-01-06 18:38:09
Oops, here is a better link, directly to section C.8 (SSE2), for your test
Athlon CPU:
https://www.amd.com/system/files/TechDocs/25112.PDF#G18.1592237

Thad
https://www.linkedin.com/in/thadguidry/

On Wed, Jan 6, 2021 at 12:22 PM Thad Guidry <tha...@gm...> wrote:

> Good job Jeff!
>
> Glad this is useful in the long term for you, and perhaps for other
> contributors later on, to help with optimization in various areas.
>
> Yep, you got it... there is `movapd` using the xmm3 and xmm0 registers
> respectively. In other words, you have verified that this particular
> method is indeed using SSE2 instructions on your test Athlon CPU, with
> the corresponding latencies and throughput described in its reference,
> appendix C.8 for SSE2, here:
> https://www.amd.com/system/files/TechDocs/25112.PDF#G14.232935
>
> Note:
>
>> Although one SSE2 instruction can operate on twice as much data as an
>> MMX instruction, performance might not increase significantly. Two
>> major reasons are: accessing SSE2 data in memory not aligned
>> <https://en.wikipedia.org/wiki/Data_structure_alignment> to a 16-byte
>> boundary can incur a significant penalty, and the throughput
>> <https://en.wikipedia.org/wiki/Throughput> of SSE2 instructions in
>> older x86 <https://en.wikipedia.org/wiki/X86> implementations was half
>> that of MMX instructions. Intel <https://en.wikipedia.org/wiki/Intel>
>> addressed the first problem by adding an instruction in SSE3
>> <https://en.wikipedia.org/wiki/SSE3> to reduce the overhead of
>> accessing unaligned data and to improve the overall performance of
>> misaligned loads, and the last problem by widening the execution
>> engine in their Core microarchitecture
>> <https://en.wikipedia.org/wiki/Core_(microarchitecture)> in Core 2 Duo
>> and later products.
>
> https://en.wikipedia.org/wiki/SSE2
>
> Anyway, you now have a better understanding of what might be happening
> behind the scenes with intrinsic functions (used or not) and the
> HotSpot VM. I noticed your source targets OpenJDK 11, so if you want to
> see the mapping of the vmSymbols for the intrinsic functions, scroll
> through or just search this file (here is the JDK 10 version):
>
> http://hg.openjdk.java.net/jdk10/jdk10/hotspot/file/5ab7a67bc155/src/share/vm/classfile/vmSymbols.hpp
>
> Thad
> https://www.linkedin.com/in/thadguidry/
>
>
> On Wed, Jan 6, 2021 at 10:28 AM Jeff Allen <ja...@fa...> wrote:
>
>> On 06/01/2021 02:33, Thad Guidry wrote:
>>
>> Hi Jeff!
>>
>> I'm from the OpenRefine team, where we are constantly watching the
>> future of Jython, since we use it as an expression language within
>> OpenRefine, along with Clojure. I think we've talked on the mailing
>> list in the past, but perhaps not.
>>
>> I think we have. Thanks for your continued interest in Jython.
>>
>> Regarding the microbenchmarks, your analysis, and some of the
>> anomalies you found... I'm wondering whether you verified that the
>> SIMD, SSE, etc. intrinsics were being used in some cases?
>> https://www.amd.com/system/files/TechDocs/25112.PDF#G14.232935
>>
>> Yes, I found similar information: that's what led to my conclusions
>> about the quartic test. I'm impressed HotSpot is able to use them.
>>
>> And to see whether intrinsic methods are being utilized, and where in
>> the compiled code, you can add:
>> -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
>>
>> Unlock has to come first, it seems. I've experimented with those
>> options and found what they produced was pretty incomprehensible. I
>> never made the disassembly option work.
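>> For anyone repeating this, a minimal sanity check that the options are
>> working might look something like the class below (illustrative only,
>> not part of the benchmark code). Math.sqrt is a well-known HotSpot
>> intrinsic, so it makes a convenient test subject:
>>
>>   public class IntrinsicCheck {
>>       public static void main(String[] args) {
>>           double acc = 0.0;
>>           for (int i = 0; i < 5_000_000; i++) {
>>               // Math.sqrt is on HotSpot's intrinsic list; once the loop
>>               // is hot, the call compiles to a single sqrtsd instruction.
>>               acc += Math.sqrt(i * 0.5);
>>           }
>>           // Use the result so the loop cannot be dead-code eliminated.
>>           System.out.println(acc);
>>       }
>>   }
>>
>> Run with:
>> java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining IntrinsicCheck
>>
>> Calls the JIT has replaced should show up tagged "(intrinsic)" in the
>> inlining output.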
>> Going back and trying a little harder, thanks to your suggestion, I
>> got further this morning. The output remains too complex for me to
>> follow (so many jumps!), but a superficial inspection supports the
>> conjectures I made based only on timing. In particular, of the three
>> fixtures, only for Jython 2 does the JVM manage to inline the
>> floating-point arithmetic into quartic(). It contains this in what I
>> assume is the fast path:
>>
>>   0x00000202187735df: movapd xmm3,xmm0
>>   0x00000202187735e3: addsd  xmm3,xmm2
>>   0x00000202187735e7: subsd  xmm2,xmm0  ;*dsub {reexecute=0 rethrow=0 return_oop=0}
>>                                         ; - org.python.core.PyFloat::float___sub__@23 (line 486)
>>                                         ; - org.python.core.PyFloat::__sub__@2 (line 477)
>>                                         ; - org.python.core.PyObject::_basic_sub@2 (line 2192)
>>                                         ; - org.python.core.PyObject::_sub@31 (line 2177)
>>                                         ; - uk.co.farowl.jy2bm.PyFloatBinary::quartic@33 (line 86)
>>   0x00000202187735eb: mulsd  xmm3,xmm1
>>
>> I added a task to the Gradle scripts that dumps the compiled code (if
>> one has the hsdis-amd64 plug-in), as I'm sure to forget how I did this:
>>
>> https://github.com/jeff5/very-slow-jython/blob/333f61d54787f7499ec8141eafe6b8c5c04f0cea/jy2bm/jy2bm.gradle#L74
>>
>> You might also find some thought-provoking extra information in this
>> JEP: https://bugs.openjdk.java.net/browse/JDK-8205637
>>
>> Some JVM compilers, and many of Java's otherwise robust libraries,
>> miss opportunities to use intrinsic functions as often as possible,
>> for example SSE 4.2:
>> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=898,2862,2861,2860,2863,2864,2865&techs=SSE4_2
>> One reason Azul's Zing JVM is fast is that it DOES use intrinsic
>> functions as much as possible. Kris Mok (Azul Systems) gave a great
>> presentation on this back in 2013:
>> https://www.slideshare.net/RednaxelaFX/green-teajug-hotspotintrinsics02232013
>>
>> One of the nice things about a JVM language is that it gets better
>> when other people do clever things.
>>
>> Jeff
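>> P.S. For anyone reading along without the repository to hand, the
>> shape of the call chain being inlined above is roughly as follows.
>> This is a hypothetical sketch, not the actual PyFloatBinary source
>> (see the GitHub link above for the real fixture): boxed PyFloat
>> operands combined through Jython 2's generic binary-op entry points.
>>
>>   import org.python.core.Py;
>>   import org.python.core.PyObject;
>>
>>   public class QuarticSketch {
>>       // Stand-in for uk.co.farowl.jy2bm.PyFloatBinary.quartic: each
>>       // operator goes through PyObject._add/_sub/_mul, which dispatch
>>       // through _basic_sub and PyFloat.__sub__ down to float___sub__,
>>       // exactly the chain named in the disassembly comments above.
>>       static PyObject quartic(PyObject x, PyObject a, PyObject b) {
>>           PyObject x2 = x._mul(x);
>>           return x2._mul(x2)._add(a._mul(x2))._sub(b); // x**4 + a*x**2 - b
>>       }
>>
>>       public static void main(String[] args) {
>>           PyObject v = quartic(Py.newFloat(1.5), Py.newFloat(2.0),
>>                                Py.newFloat(3.0));
>>           System.out.println(v); // prints the PyFloat's string form
>>       }
>>   }
>>
>> The interesting part is that every operand stays a boxed PyObject at
>> the source level, yet the JIT still reduces the arithmetic to bare
>> SSE2 instructions in the fast path.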