From: Thad G. <tha...@gm...> - 2021-01-06 18:22:53
Good job, Jeff! Glad this is useful in the long term for you, and perhaps for other
contributors later on, to help with optimization in particular areas. Yeap, you got
it... there is `movapd` using the xmm3 and xmm0 registers respectively. In other
words, you have verified that this particular operation is indeed using SSE2
instructions on your test Athlon CPU, with the corresponding latencies and
throughput as described in its reference, appendix C.8 for SSE2, here:
https://www.amd.com/system/files/TechDocs/25112.PDF#G14.232935

Note:

> Although one SSE2 instruction can operate on twice as much data as an MMX
> instruction, performance might not increase significantly. Two major
> reasons are: accessing SSE2 data in memory not aligned
> <https://en.wikipedia.org/wiki/Data_structure_alignment> to a 16-byte
> boundary can incur significant penalty, and the throughput
> <https://en.wikipedia.org/wiki/Throughput> of SSE2 instructions in older
> x86 <https://en.wikipedia.org/wiki/X86> implementations was half that for
> MMX instructions. Intel <https://en.wikipedia.org/wiki/Intel> addressed
> the first problem by adding an instruction in SSE3
> <https://en.wikipedia.org/wiki/SSE3> to reduce the overhead of accessing
> unaligned data and improving the overall performance of misaligned loads,
> and the last problem by widening the execution engine in their Core
> microarchitecture <https://en.wikipedia.org/wiki/Core_(microarchitecture)>
> in Core 2 Duo and later products.

https://en.wikipedia.org/wiki/SSE2

Anyway, you now have a better understanding of what might be happening behind
the scenes with intrinsic functions (used or not) and the HotSpot VM. I noticed
you have your source targeting OpenJDK 11, so if you want to see the mapping of
the vmSymbols for the intrinsic functions, scroll through this file or just
search (here's jdk10):
http://hg.openjdk.java.net/jdk10/jdk10/hotspot/file/5ab7a67bc155/src/share/vm/classfile/vmSymbols.hpp
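
If you want to watch one of those intrinsics fire on your own machine, a minimal
sketch along these lines should do it. The class name here is made up for
illustration, and Math.sqrt is used only because it is a well-known example of a
method HotSpot replaces with an intrinsic; the flags are the usual HotSpot
diagnostic ones:

    // IntrinsicProbe.java -- illustrative only. A tight loop around a method
    // that HotSpot is known to replace with an intrinsic, so the decision
    // shows up in the diagnostic output.
    public class IntrinsicProbe {
        public static void main(String[] args) {
            double acc = 0.0;
            // Enough iterations for the loop to be JIT-compiled.
            for (int i = 1; i < 5_000_000; i++) {
                acc += Math.sqrt(i);
            }
            // Print the result so the loop cannot be optimized away.
            System.out.println(acc);
        }
    }
    //
    // Run with (the unlock flag must come before the diagnostic flag):
    //   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintIntrinsics IntrinsicProbe

The exact output format varies between JDK versions, but once the loop is hot the
sqrt call should show up marked as an intrinsic.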
Thad
https://www.linkedin.com/in/thadguidry/

On Wed, Jan 6, 2021 at 10:28 AM Jeff Allen <ja...@fa...> wrote:

> On 06/01/2021 02:33, Thad Guidry wrote:
>
> Hi Jeff!
>
> I'm from the OpenRefine team, where we are constantly watching the future
> of Jython, since we use it as an expression language within OpenRefine,
> along with Clojure. We've talked on the mailing list in the past, I think,
> or perhaps not.
>
> I think we have. Thanks for your continued interest in Jython.
>
> Regarding the microbenchmarks and your analysis... and some of the
> anomalies you found... I'm wondering if you verified whether SIMD, SSE,
> etc. intrinsics were sometimes being used or not?
> https://www.amd.com/system/files/TechDocs/25112.PDF#G14.232935
>
> Yes, I found similar information: that's what led to my conclusions about
> the quartic test. I'm impressed HotSpot is able to use them.
>
> And to see if intrinsic methods are being utilized or not, and where in
> the compiled code, you can add:
> -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
>
> Unlock has to come first, it seems. I've experimented with those options
> and found what they produced was pretty incomprehensible. I never made the
> disassembly option work.
>
> Going back and trying a little harder, thanks to your suggestion, I got
> further this morning. The output remains too complex for me to follow (so
> many jumps!), but a superficial inspection supports the conjectures I made
> based only on timing. In particular, of the three fixtures, only for
> Jython 2 does the JVM manage to in-line the floating point arithmetic into
> quartic(). It contains this in what I assume is the fast path:
>
> 0x00000202187735df: movapd xmm3,xmm0
> 0x00000202187735e3: addsd  xmm3,xmm2
> 0x00000202187735e7: subsd  xmm2,xmm0  ;*dsub {reexecute=0 rethrow=0 return_oop=0}
>                                       ; - org.python.core.PyFloat::float___sub__@23 (line 486)
>                                       ; - org.python.core.PyFloat::__sub__@2 (line 477)
>                                       ; - org.python.core.PyObject::_basic_sub@2 (line 2192)
>                                       ; - org.python.core.PyObject::_sub@31 (line 2177)
>                                       ; - uk.co.farowl.jy2bm.PyFloatBinary::quartic@33 (line 86)
> 0x00000202187735eb: mulsd  xmm3,xmm1
>
> I added a task to the Gradle scripts that dumps the compiled code (if one
> has the hsdis-amd64 plug-in), as I'm sure to forget how I did this:
> https://github.com/jeff5/very-slow-jython/blob/333f61d54787f7499ec8141eafe6b8c5c04f0cea/jy2bm/jy2bm.gradle#L74
>
> You might also find some thought-provoking extra information in this JEP:
> https://bugs.openjdk.java.net/browse/JDK-8205637
>
> Some Java JVM compilers, and many of Java's robust libraries, miss the
> point of using intrinsic functions as often as possible. For example, SSE 4.2:
> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=898,2862,2861,2860,2863,2864,2865&techs=SSE4_2
> The reason Azul's Zing JVM is fast is that it DOES use intrinsic functions
> as much as possible. Kris Mok (Azul Systems) did a great presentation on
> this back in 2013:
> https://www.slideshare.net/RednaxelaFX/green-teajug-hotspotintrinsics02232013
>
> One of the nice things about a JVM language is that it gets better when
> other people do clever things.
>
> Jeff
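
The fast path quoted above can be approximated with a plain-double stand-in.
What follows is only a sketch, not the jy2bm fixture itself: the class and
method names are invented, and whether HotSpot emits exactly the
addsd/subsd/mulsd sequence shown depends on the CPU and the JVM build.

    // QuarticSketch.java -- an invented stand-in (not the jy2bm benchmark) for
    // the kind of unboxed double arithmetic discussed above: one add, one
    // subtract and one multiply, which HotSpot can compile down to scalar
    // SSE2 instructions such as addsd, subsd and mulsd on x86-64.
    public class QuarticSketch {

        static double quartic(double a, double b) {
            return (a + b) * (b - a);
        }

        public static void main(String[] args) {
            double acc = 0.0;
            // Call the method enough times for the JIT to compile it.
            for (int i = 0; i < 10_000_000; i++) {
                acc += quartic(i * 0.5, i * 0.25);
            }
            System.out.println(acc);
        }
    }
    //
    // Compilation and inlining decisions (unlock flag first, as noted above):
    //   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining QuarticSketch
    //
    // With the hsdis-amd64 plug-in on the JVM's library path, the generated
    // code for just this one method can be dumped with something like:
    //   java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,QuarticSketch::quartic QuarticSketch

Without the hsdis plug-in, HotSpot typically just warns that it cannot load the
disassembler, which is why the Gradle task above checks for it.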