From: Thad G. <tha...@gm...> - 2021-01-06 18:38:09
Oops, here is a better link, directly to section C.8 (SSE2), for your test
Athlon CPU:
https://www.amd.com/system/files/TechDocs/25112.PDF#G18.1592237

Thad
https://www.linkedin.com/in/thadguidry/

On Wed, Jan 6, 2021 at 12:22 PM Thad Guidry <tha...@gm...> wrote:

> Good job Jeff!
>
> Glad this is useful in the long term for you, and perhaps for other
> contributors later on, to help with optimization in various areas.
>
> Yep, you got it... there is `movapd` using the xmm3 and xmm0 registers
> respectively. In other words, you have verified that this particular
> method is indeed using SSE2 instructions on your test Athlon CPU, with
> the corresponding latencies and throughput described in its reference,
> appendix C.8 for SSE2, here:
> https://www.amd.com/system/files/TechDocs/25112.PDF#G14.232935
>
> Note:
>
>> Although one SSE2 instruction can operate on twice as much data as an
>> MMX instruction, performance might not increase significantly. Two
>> major reasons are: accessing SSE2 data in memory not aligned
>> <https://en.wikipedia.org/wiki/Data_structure_alignment> to a 16-byte
>> boundary can incur a significant penalty, and the throughput
>> <https://en.wikipedia.org/wiki/Throughput> of SSE2 instructions in
>> older x86 <https://en.wikipedia.org/wiki/X86> implementations was half
>> that of MMX instructions. Intel <https://en.wikipedia.org/wiki/Intel>
>> addressed the first problem by adding an instruction in SSE3
>> <https://en.wikipedia.org/wiki/SSE3> to reduce the overhead of
>> accessing unaligned data and to improve the overall performance of
>> misaligned loads, and the last problem by widening the execution
>> engine in their Core microarchitecture
>> <https://en.wikipedia.org/wiki/Core_(microarchitecture)> in Core 2 Duo
>> and later products.
>
> https://en.wikipedia.org/wiki/SSE2
>
> Anyway, you now have a better understanding of what might be happening
> behind the scenes with intrinsic functions (used or not) and the
> HotSpot VM. I noticed your source targets OpenJDK 11, so if you want to
> see the mapping of the vmSymbols for the intrinsic functions, scroll
> through or just search this file (here is the JDK 10 version):
>
> http://hg.openjdk.java.net/jdk10/jdk10/hotspot/file/5ab7a67bc155/src/share/vm/classfile/vmSymbols.hpp
>
> Thad
> https://www.linkedin.com/in/thadguidry/
>
>
> On Wed, Jan 6, 2021 at 10:28 AM Jeff Allen <ja...@fa...> wrote:
>
>> On 06/01/2021 02:33, Thad Guidry wrote:
>>
>> Hi Jeff!
>>
>> I'm from the OpenRefine team, where we are constantly watching the
>> future of Jython, since we use it as an expression language within
>> OpenRefine, along with Clojure. I think we've talked on the mailing
>> list in the past, but perhaps not.
>>
>> I think we have. Thanks for your continued interest in Jython.
>>
>> Regarding the microbenchmarks, your analysis, and some of the
>> anomalies you found... I'm wondering whether you verified that the
>> SIMD, SSE, etc. intrinsics were being used in some cases?
>> https://www.amd.com/system/files/TechDocs/25112.PDF#G14.232935
>>
>> Yes, I found similar information: that's what led to my conclusions
>> about the quartic test. I'm impressed HotSpot is able to use them.
>>
>> And to see whether intrinsic methods are being utilized, and where in
>> the compiled code, you can add:
>> -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
>>
>> Unlock has to come first, it seems. I've experimented with those
>> options and found what they produced was pretty incomprehensible. I
>> never made the disassembly option work.
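>> For anyone repeating this, a minimal sanity check that the options are
>> working might look something like the class below (illustrative only,
>> not part of the benchmark code). Math.sqrt is a well-known HotSpot
>> intrinsic, so it makes a convenient test subject:
>>
>>   public class IntrinsicCheck {
>>       public static void main(String[] args) {
>>           double acc = 0.0;
>>           for (int i = 0; i < 5_000_000; i++) {
>>               // Math.sqrt is on HotSpot's intrinsic list; once the loop
>>               // is hot, the call compiles to a single sqrtsd instruction.
>>               acc += Math.sqrt(i * 0.5);
>>           }
>>           // Use the result so the loop cannot be dead-code eliminated.
>>           System.out.println(acc);
>>       }
>>   }
>>
>> Run with:
>> java -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+PrintInlining IntrinsicCheck
>>
>> Calls the JIT has replaced should show up tagged "(intrinsic)" in the
>> inlining output.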
>> Going back and trying a little harder, thanks to your suggestion, I
>> got further this morning. The output remains too complex for me to
>> follow (so many jumps!), but a superficial inspection supports the
>> conjectures I made based only on timing. In particular, of the three
>> fixtures, only for Jython 2 does the JVM manage to inline the
>> floating-point arithmetic into quartic(). It contains this in what I
>> assume is the fast path:
>>
>>   0x00000202187735df: movapd xmm3,xmm0
>>   0x00000202187735e3: addsd  xmm3,xmm2
>>   0x00000202187735e7: subsd  xmm2,xmm0  ;*dsub {reexecute=0 rethrow=0 return_oop=0}
>>                                         ; - org.python.core.PyFloat::float___sub__@23 (line 486)
>>                                         ; - org.python.core.PyFloat::__sub__@2 (line 477)
>>                                         ; - org.python.core.PyObject::_basic_sub@2 (line 2192)
>>                                         ; - org.python.core.PyObject::_sub@31 (line 2177)
>>                                         ; - uk.co.farowl.jy2bm.PyFloatBinary::quartic@33 (line 86)
>>   0x00000202187735eb: mulsd  xmm3,xmm1
>>
>> I added a task to the Gradle scripts that dumps the compiled code (if
>> one has the hsdis-amd64 plug-in), as I'm sure to forget how I did this:
>>
>> https://github.com/jeff5/very-slow-jython/blob/333f61d54787f7499ec8141eafe6b8c5c04f0cea/jy2bm/jy2bm.gradle#L74
>>
>> You might also find some thought-provoking extra information in this
>> JEP: https://bugs.openjdk.java.net/browse/JDK-8205637
>>
>> Some JVM compilers, and many of Java's otherwise robust libraries,
>> miss opportunities to use intrinsic functions as often as possible,
>> for example SSE 4.2:
>> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=898,2862,2861,2860,2863,2864,2865&techs=SSE4_2
>> One reason Azul's Zing JVM is fast is that it DOES use intrinsic
>> functions as much as possible. Kris Mok (Azul Systems) gave a great
>> presentation on this back in 2013:
>> https://www.slideshare.net/RednaxelaFX/green-teajug-hotspotintrinsics02232013
>>
>> One of the nice things about a JVM language is that it gets better
>> when other people do clever things.
>>
>> Jeff
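>> P.S. For anyone reading along without the repository to hand, the
>> shape of the call chain being inlined above is roughly as follows.
>> This is a hypothetical sketch, not the actual PyFloatBinary source
>> (see the GitHub link above for the real fixture): boxed PyFloat
>> operands combined through Jython 2's generic binary-op entry points.
>>
>>   import org.python.core.Py;
>>   import org.python.core.PyObject;
>>
>>   public class QuarticSketch {
>>       // Stand-in for uk.co.farowl.jy2bm.PyFloatBinary.quartic: each
>>       // operator goes through PyObject._add/_sub/_mul, which dispatch
>>       // through _basic_sub and PyFloat.__sub__ down to float___sub__,
>>       // exactly the chain named in the disassembly comments above.
>>       static PyObject quartic(PyObject x, PyObject a, PyObject b) {
>>           PyObject x2 = x._mul(x);
>>           return x2._mul(x2)._add(a._mul(x2))._sub(b); // x**4 + a*x**2 - b
>>       }
>>
>>       public static void main(String[] args) {
>>           PyObject v = quartic(Py.newFloat(1.5), Py.newFloat(2.0),
>>                                Py.newFloat(3.0));
>>           System.out.println(v); // prints the PyFloat's string form
>>       }
>>   }
>>
>> The interesting part is that every operand stays a boxed PyObject at
>> the source level, yet the JIT still reduces the arithmetic to bare
>> SSE2 instructions in the fast path.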