From: Nicholas C. <nic...@uc...> - 2018-02-08 17:46:30
Hi all,

I am actually experiencing the same problem as Timo (admittedly, for a much more complicated set of kernels). I've been treating POCL as a useful tool for validation -- i.e., it's easy to install via conda, and unlike some other OpenCL runtimes (i.e., Intel... not that I mean to offend Jeff), you guys are very responsive about acknowledging / fixing bugs, or at least explaining why I shouldn't be doing something that breaks POCL. However, if I could actually get POCL to vectorize my code, that would be even better!

I've been following this thread to see if I can figure it out, but trying to turn on some debugging info in my compilation process yields a segfault:

POCL_DEBUG=1 POCL_DEBUG_LLVM_PASSES=1 POCL_VERBOSE=1 ./libocl_pyjac.so 4 1
[2018-01-08 17:14:11.140720878650861]POCL: in fn pocl_init_devices at line 398: | GENERAL | Installing SIGFPE handler...
[2018-01-08 17:14:11.654271269]POCL: in fn POclCreateCommandQueue at line 41: | GENERAL | Create Command queue on device 0
[2018-01-08 17:14:11.140720962851560]POCL: in fn compile_and_link_program at line 506: | GENERAL | building program with options -I/home/ncurtis/spyjac-test/out -cl-std=CL1.2
[2018-01-08 17:14:11.94610193975812]POCL: in fn compile_and_link_program at line 561: | GENERAL | building from sources for device 0
Segmentation fault (core dumped)

Does LLVM have to be compiled in Debug mode in order for the output to work? I can upload a kernel example if desired.

Nick
From: Pekka J. <pek...@tu...> - 2018-02-08 05:58:33
Hi Timo,

Too bad I personally cannot spend more time on this due to urgent deadlines, but some quick insights:

I added a ticket so we remember to check why you didn't get vectorizer remarks, which can be really useful: https://github.com/pocl/pocl/issues/613

Do you use FP relaxation flags with clBuildProgram? Strict FP reordering rules sometimes prevent vectorization.

If you aim for horizontal (work-group) vectorization of your kernel loops, the debug output below can indeed indicate a reason. I haven't followed the progress of outer-loop vectorization in upstream LLVM, but the way pocl tries to enforce it now is to force the parallel WI loop inside your kernel loops. It does that by trying to add an implicit barrier inside your loop, which results in that effect.

It cannot do that if it doesn't know whether it's legal to do so (all WIs have to go through all kernel loop iterations). In this case the analysis to figure that out failed to prove that's the case. It might be worthwhile to try to track the reason for that. I think upstream LLVM also has a divergence analysis which might now be adopted in pocl.

VariableUniformityAnalysis.cc is the one that analyses whether a variable is "uniform" (known to always contain the same value for all WIs) or not. There are also debug outputs that can be enabled to figure out why your loop iteration variables were not detected as such.

The early exit might cause difficulties for various analyses:

if (myTrialIndex - trialOffset >= nTrial) return;

In fact, that could cause all sorts of trouble for static fine-grained parallelization, as it can mean WI divergence at the end of the grid (even if it really doesn't, it's not possible for the kcompiler to prove that, since nTrial is a kernel argument variable).

So, if you can avoid this by specializing your kernel into an edge kernel and one which is known not to go out of bounds, it might help the pocl kcompiler cope with this case. All of this could be done by the kcompiler, but it currently isn't. If someone would like to add handling for this, it would be really useful, as this is quite a common pattern in OpenCL C kernels.

I hope these insights help,
Pekka
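For readers unfamiliar with the FP relaxation flags Pekka asks about, here is a minimal host-side sketch. It assumes a cl_program and cl_device_id created elsewhere, and the helper name is illustrative, not from the thread; the build options shown (-cl-std=CL1.2, -cl-fast-relaxed-math) are standard OpenCL build options, and whether relaxed math is acceptable depends on the kernel's accuracy requirements.

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Illustrative helper: build "program" for "device" with relaxed-FP
 * options and dump the build log on failure.  -cl-fast-relaxed-math
 * lifts the strict FP reordering/accuracy rules that can block
 * vectorization; -cl-mad-enable alone is a milder alternative. */
cl_int build_with_relaxed_fp(cl_program program, cl_device_id device)
{
    const char *opts = "-cl-std=CL1.2 -cl-fast-relaxed-math";
    cl_int err = clBuildProgram(program, 1, &device, opts, NULL, NULL);
    if (err != CL_SUCCESS) {
        size_t log_size = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              0, NULL, &log_size);
        char *log = (char *)malloc(log_size);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              log_size, log, NULL);
        fprintf(stderr, "build log:\n%s\n", log);
        free(log);
    }
    return err;
}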
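A rough OpenCL C sketch of the specialization Pekka suggests. The kernel names, the trimmed-down argument list, and the compute_pair placeholder are illustrative stand-ins for Timo's evaluate_regular, not code from the thread; the point is only the split: an interior kernel with no bounds check, enqueued over the part of the grid known to be in range, plus an edge kernel that keeps the check for the remainder.

#define REALTYPE float

/* Placeholder for the real quadrature loop nest of evaluate_regular;
 * the actual body lives in Timo's kernel and is omitted here. */
REALTYPE compute_pair(size_t testIndex, size_t trialIndex)
{
    return (REALTYPE)(testIndex + trialIndex);
}

/* Interior kernel: the host enqueues it only over trial indices known
 * to be in range, so the early return, and the work-item divergence it
 * implies at the end of the grid, is gone. */
__kernel void evaluate_regular_interior(__global REALTYPE *globalResult,
                                        int nTrial)
{
    size_t myTestIndex = get_global_id(0);
    size_t myTrialIndex = get_global_id(1);

    globalResult[myTestIndex * nTrial + myTrialIndex] =
        compute_pair(myTestIndex, myTrialIndex);
}

/* Edge kernel: launched with a global offset in dimension 1 to cover
 * the leftover trial indices; it keeps the original guard, but it only
 * touches a small remainder of the grid. */
__kernel void evaluate_regular_edge(__global REALTYPE *globalResult,
                                    int nTrial)
{
    size_t myTestIndex = get_global_id(0);
    size_t myTrialIndex = get_global_id(1);
    size_t trialOffset = get_global_offset(1);

    if (myTrialIndex - trialOffset >= nTrial) return;

    globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] =
        compute_pair(myTestIndex, myTrialIndex);
}

The host would enqueue the interior kernel over the largest multiple of the work-group size that fits within nTrial and the edge kernel over the remaining trial indices; the result indexing has to mirror however the grid is actually split, which is glossed over in this sketch.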
From: Timo B. <tim...@gm...> - 2018-02-08 00:24:25
Hi,

One more hint. I followed Pekka's suggestion to enable debug output in ImplicitLoopBarriers.cc and ImplicitConditionalBarriers.cc. There is some interesting output generated. It states that:

### ILB: The kernel has no barriers, let's not add implicit ones either to avoid WI context switch overheads
### ILB: The kernel has no barriers, let's not add implicit ones either to avoid WI context switch overheads
### trying to add a loop barrier to force horizontal parallelization
### the loop is not uniform because loop entry '' is not uniform
### trying to add a loop barrier to force horizontal parallelization
### the loop is not uniform because loop entry '' is not uniform

What does this mean, and does it prevent work-group level parallelization?

Best wishes

Timo
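To make the ILB messages above more concrete, here is a conceptual plain-C illustration, based on Pekka's description in the reply above, of what the implicit loop barrier is meant to achieve; it is not actual pocl output, and the names and the body are placeholders. The barrier lets the work-item loop be sunk inside the kernel loop, which is only legal when every work-item runs every iteration, i.e. when the loop is uniform.

#define WG_SIZE 8
#define N_ITER  3   /* e.g. numQuadPoints */

/* Placeholder for one work-item's share of a kernel loop iteration. */
void work_item_body(int wi, int iter, float *acc)
{
    acc[wi] += (float)(wi + iter);
}

/* Without the implicit barrier: a single work-item loop wraps the whole
 * kernel body, so the kernel loop stays innermost and horizontal
 * (across-work-item) vectorization of the inner loop is not available. */
void wg_function_without_barrier(float *acc)
{
    for (int wi = 0; wi < WG_SIZE; ++wi)
        for (int iter = 0; iter < N_ITER; ++iter)
            work_item_body(wi, iter, acc);
}

/* With an implicit barrier inside the kernel loop (legal only if every
 * work-item executes every iteration), the work-item loop ends up inside
 * the kernel loop, giving an inner loop over work-items that is a
 * natural target for horizontal vectorization. */
void wg_function_with_barrier(float *acc)
{
    for (int iter = 0; iter < N_ITER; ++iter)
        for (int wi = 0; wi < WG_SIZE; ++wi)
            work_item_body(wi, iter, acc);
}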
From: Timo B. <tim...@gm...> - 2018-02-07 23:41:17
Hi,

I have tried to dive a bit more into the code now and used Pekka's and Jeff's hints. Analyzing with VTune showed that no AVX2 code is generated in POCL, which I already suspected. I tried POCL_VECTORIZER_REMARKS=1 to activate vectorizer remarks, but it does not create any kind of output. However, I could create the LLVM-generated code using POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES=1. I am not experienced with LLVM IR, but it seems that it does not create vectorized code. I have uploaded a gist with the disassembled output here:

https://gist.github.com/tbetcke/c5f71dca27cc20c611c35b67f5faa36b

The question is what prevents the auto-vectorizer from working at all. The code seems quite straightforward, with very simple for-loops with hard-coded bounds (numQuadPoints is a compiler macro, set to 3 in the experiments). I would be grateful for any pointers on how to proceed to figure out what is going on with the vectorizer.

By the way, I have recompiled pocl with LLVM 6. There was no change in behavior from versions 4 and 5.

Best wishes

Timo
From: Timo B. <tim...@gm...> - 2018-02-07 16:38:03
Dear Jeff,

thanks for the explanations. I have now installed pocl on my Xeon W workstation, and the benchmarks are as follows (pure kernel runtime via event timers this time, to exclude Python overhead):

1.) Intel OpenCL Driver: 0.0965s
2.) POCL: 0.937s
3.) AMD CPU OpenCL Driver: 0.64s

The CPU is a Xeon W-2155 with 3.3GHz and 10 cores. I have not had time to investigate the LLVM IR code as suggested, but will do so as soon as possible. AMD is included because I have a Radeon Pro card, which automatically also installed OpenCL CPU drivers.

Best wishes

Timo
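Timo's numbers above are pure kernel runtimes taken from OpenCL event timers. For reference, a minimal host-side sketch of that kind of measurement follows; it assumes a command queue created with CL_QUEUE_PROFILING_ENABLE and a kernel plus work sizes set up elsewhere, and the function name is illustrative.

#include <CL/cl.h>

/* Illustrative sketch: time one 2D kernel launch with OpenCL event
 * profiling and return the device-side runtime in seconds. */
double time_kernel(cl_command_queue queue, cl_kernel kernel,
                   const size_t *global, const size_t *local)
{
    cl_event evt;
    cl_ulong start = 0, end = 0;

    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local,
                           0, NULL, &evt);
    clWaitForEvents(1, &evt);

    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(evt);

    return (double)(end - start) * 1e-9;  /* timestamps are in nanoseconds */
}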
From: Jeff H. <jef...@gm...> - 2018-02-07 16:04:09
On Wed, Feb 7, 2018 at 2:41 AM, Michal Babej <Fra...@ru...> wrote:

> Hi,
>
> > we noticed for one of our OpenCL kernels that pocl is over 4 times
> > slower than the Intel OpenCL runtime on a Xeon W processor.
>
> 1) If I googled correctly, Xeon W has AVX-512, which the Intel runtime
> is likely fully using. LLVM 4 has absolutely horrible AVX-512 support,
> LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for
> AVX-512 to work (at least I know they fixed the few AVX-512 bugs I
> found; I don't have a machine anymore to test it).

Indeed, Xeon W [1] is a sibling of Xeon Scalable and the Core X-series of the Skylake generation, which I'll refer to as SKX since they are microarchitecturally the same. All of these support AVX-512, which I'm going to refer to as AVX3 in the following, for reasons that will become clear.

An important detail when evaluating vectorization on these processors is that the frequency drops when transitioning from scalar/SSE2 code to AVX2 code to AVX3 (i.e. AVX-512) code [2], which corresponds to the use of xmm (128b), ymm (256b), and zmm (512b) registers respectively. AVX3 instructions with ymm registers should run at AVX2 frequency.

While most (but not all - see [3]) parts have 2 VPUs, the first of these is implemented via port fusion [4]. What this means is that the core can dispatch 2 512b AVX3 instructions on ports 0+1 and 5, or it can dispatch 3 256b instructions (AVX2 or AVX3) on ports 0, 1 and 5. Thus, one can get 1024b throughput at one frequency or 768b throughput at a slightly higher frequency. What this means is that 512b vectorization pays off for code that is thoroughly compute-bound and heavily vectorized (e.g. dense linear algebra and molecular dynamics), but that 256b vectorization is likely better for code that is more memory-bound or doesn't vectorize as well.

The Intel C/C++ compiler has a flag -qopt-zmm-usage={low,high} to address this, where "-xCORE-AVX512 -qopt-zmm-usage=low" is going to take advantage of all the AVX3 instructions but favor 256b ymm registers, which will behave exactly like AVX2 in some cases (i.e. ones where the AVX3 instruction features aren't used).

Anyways, the short version of this story is that you should not assume 512b SIMD code generation is the reason for a performance benefit from the Intel OpenCL compiler, since it may in fact not generate those instructions if it thinks that 256b is better. It would be useful to force both POCL and Intel OpenCL to use SSE2 and AVX2, respectively, in experiments, to see how they compare when targeting the same vector ISA. This sort of comparison would also be helpful to resolve an older bug report of a similar nature [5].

What I wrote here is one engineer's attempt to summarize a large amount of information in a user-friendly format. I apologize for any errors - they are certainly not intentional.

[1] https://ark.intel.com/products/series/125035/Intel-Xeon-Processor-W-Family
[2] https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf
[3] https://github.com/jeffhammond/vpu-count
[4] https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Scheduler_.26_512-SIMD_addition
[5] https://github.com/pocl/pocl/issues/292

> 2) It could be the autovectorizer, or it could be something else. Are
> your machines NUMA? If so, you'll likely see very bad performance, as
> pocl has no NUMA tuning currently. Also I've seen occasionally that pocl
> unrolls too much and overflows L1 caches (you could try experimenting
> with various local WG sizes to clEnqueueNDRK). Unfortunately this part
> of pocl has received little attention lately...

I don't know what POCL uses for threading, but Intel OpenCL uses the TBB runtime [6]. The TBB runtime has some very smart features for load-balancing and automatic cache blocking that are not implemented in OpenMP and are hard to implement by hand in Pthreads.

[6] https://software.intel.com/en-us/articles/whats-new-opencl-runtime-1611

Jeff

--
Jeff Hammond
jef...@gm...
http://jeffhammond.github.io/
From: Timo B. <tim...@gm...> - 2018-02-07 15:55:19
Sorry. It was still set to private. I have set it to public now.

Timo

On 7 February 2018 at 15:52, Jeff Hammond <jef...@gm...> wrote:

> On Wed, Feb 7, 2018 at 3:34 AM, Timo Betcke <tim...@gm...> wrote:
>
>> Dear All,
>>
>> thanks for the responses. I am posting the kernel below. You can also find it at
>>
>> https://bitbucket.org/bemppsolutions/bempp-cl/src/master/bempp/core/opencl/sources/kernels/laplace_bnd_slp_0.cl?at=master&fileviewer=file-view-default
>
> I get an "access denied" error.
>
> Jeff
>
>> It is the first kernel (evaluate_regular). The second one is less performance critical. First of all, regarding the processor: I mistyped. I actually made the experiments on my Kaby Lake i7 dual-core laptop, but will redo the experiments today on my Xeon W workstation. I recompiled yesterday already with LLVM 5, but the results are similar: Intel OpenCL takes around 0.9 seconds and pocl takes around 5 seconds for this setup.
>>
>> I will follow Jeff's and Pekka's advice today to get some more information on the compiled kernels (might be slow going, as it is the first time I am diving into OpenCL profiling).
>>
>> I just also tried setting the workgroup sizes manually, but this did not change anything in the pocl performance. The actual kernel is below. It consists of fairly simple for loops. The parameter REALTYPE is controlled by a macro in the header file and set to 'float' for the benchmark tests.
>>
>> A little explanation is in order. Each work item takes two triangles (a test and a trial triangle), computes some geometric properties and integrates the 3d Laplace potential operator 1/r across them, where r is the distance between a quadrature point on the test and on the trial triangle. So, it is a fairly simple operation that accelerates very well on the Intel runtime.
>>
>> I am quite motivated to get this issue sorted. We are rewriting an open-source software package (Bempp, www.bempp.com) to be based on PyOpenCL with OpenCL kernels. The first performance benchmarks gave very good speed-ups compared to our old C++ code (for both Intel and pocl). But for later deployment we want to be able to target pocl with minimal performance penalties compared to the Intel runtime (if possible), as it is by default available in Ubuntu and can also be simply installed via conda-forge.
>>
>> Best wishes
>>
>> Timo
>>
>> ---------------------------------------------------------------------------------------------
>>
>> __kernel void evaluate_regular(__constant REALTYPE3 *testElements,
>>                                __constant REALTYPE3 *trialElements,
>>                                __constant REALTYPE2 *quadPoints,
>>                                __constant REALTYPE *quadWeights,
>>                                __global REALTYPE *globalResult,
>>                                int nTrial)
>> {
>>     /* Variable declarations */
>>
>>     const int numQuadPoints = N_QUAD_POINTS;
>>
>>     size_t myTestIndex = get_global_id(0);
>>     size_t myTrialIndex = get_global_id(1);
>>
>>     size_t trialOffset = get_global_offset(1);
>>
>>     size_t testQuadIndex;
>>     size_t trialQuadIndex;
>>     size_t i;
>>
>>     REALTYPE3 testGlobalPoint;
>>     REALTYPE3 trialGlobalPoint;
>>
>>     REALTYPE3 testCorners[3];
>>     REALTYPE3 trialCorners[3];
>>
>>     REALTYPE3 testJac[2];
>>     REALTYPE3 trialJac[2];
>>
>>     REALTYPE2 testPoint;
>>     REALTYPE2 trialPoint;
>>
>>     REALTYPE dist;
>>     REALTYPE testIntElem;
>>     REALTYPE trialIntElem;
>>
>>     REALTYPE shapeIntegral = 0;
>>
>>     if (myTrialIndex - trialOffset >= nTrial) return;
>>
>>     for (i = 0; i < 3; ++i){
>>         testCorners[i] = testElements[3 * myTestIndex + i];
>>         trialCorners[i] = trialElements[3 * myTrialIndex + i];
>>     }
>>
>>     testJac[0] = testCorners[1] - testCorners[0];
>>     testJac[1] = testCorners[2] - testCorners[0];
>>
>>     trialJac[0] = trialCorners[1] - trialCorners[0];
>>     trialJac[1] = trialCorners[2] - trialCorners[0];
>>
>>     testIntElem = length(cross(testJac[0], testJac[1]));
>>     trialIntElem = length(cross(trialJac[0], trialJac[1]));
>>
>>     for (testQuadIndex = 0; testQuadIndex < numQuadPoints; ++testQuadIndex){
>>
>>         testPoint = quadPoints[testQuadIndex];
>>
>>         testGlobalPoint = testCorners[0] + testJac[0] * testPoint.x + testJac[1] * testPoint.y;
>>         //testGlobalPoint = (M_ONE - testPoint.x - testPoint.y) * testCorners[0] +
>>         //    testPoint.x * testCorners[1] + testPoint.y * testCorners[2];
>>
>>         for (trialQuadIndex = 0; trialQuadIndex < numQuadPoints; ++trialQuadIndex){
>>
>>             trialPoint = quadPoints[trialQuadIndex];
>>             trialGlobalPoint = trialCorners[0] + trialJac[0] * trialPoint.x + trialJac[1] * trialPoint.y;
>>             //trialGlobalPoint = (M_ONE - trialPoint.x - trialPoint.y) * trialCorners[0] +
>>             //    trialPoint.x * trialCorners[1] + trialPoint.y * trialCorners[2];
>>
>>             dist = distance(testGlobalPoint, trialGlobalPoint);
>>             shapeIntegral += quadWeights[testQuadIndex] * quadWeights[trialQuadIndex] / dist;
>>         }
>>     }
>>
>>     shapeIntegral *= testIntElem * trialIntElem * M_INV_4PI;
>>     globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] = shapeIntegral;
>> }
From: Jeff H. <jef...@gm...> - 2018-02-07 15:52:32
|
On Wed, Feb 7, 2018 at 3:34 AM, Timo Betcke <tim...@gm...> wrote: > Dear All, > > thanks for the responses. I am posting the kernel below. You can also find > it > at > > https://bitbucket.org/bemppsolutions/bempp-cl/src/ > master/bempp/core/opencl/sources/kernels/laplace_bnd_ > slp_0.cl?at=master&fileviewer=file-view-default > > I get an "access denied" error. Jeff > It is the first kernel (evaluate_regular). The second one is less > performance critical. First of all, regarding processor. I mistyped. I > actually made the experiments on my Kaby Lake i7 > dual core laptop, but will redo the experiments today on my Xeon W > workstation. I recompiled yesterday already with llvm 5. But the results > are similar. > Intel OpenCL takes around 0.9 seconds and pocl takes around 5 seconds for > this setup. > > I will follow Jeff's and Pekka's advice today to get some more infomation > on the compiled kernels (might be slow going, as it is the first time I am > diving into OpenCL profiling). > > I just also tried setting the workgroup sizes manually. But this did not > change anything in the pocl performance. The actual kernel is below. It > consists of fairly simple for loops. > The parameter REALTYPE is controlled by a macro in the header file and set > to 'float' for the benchmark tests. > > A little explanation is in order. Each work item takes two triangles ( a > test and a trial triangle), computes some geometric properties and > integrates the 3d Laplace potential operator 1/r > across them, where r is the distance between a quadrature point on the > test and on the trial triangle. So, it is a fairly simple operation that > accelerates very well on the Intel runtime. > > I am quite motivated to get this issue sorted. We are rewriting an > open-source software package (Bempp, www.bempp.com) to be based on > PyOpenCL with OpenCL kernels. > The first performance benchmarks gave very good speed-ups compared to our > old C++ code (for both, Intel and pocl). But for later deployment we want > to be able to target pocl with minimial > performance penalties compared to the Intel runtime (if possible) as it is > by default available in Ubuntu and can also be simply installed via > conda-forge. 
> > Best wishes > > Timo > > > ------------------------------------------------------------ > --------------------------------- > > __kernel void evaluate_regular(__constant REALTYPE3 *testElements, > __constant REALTYPE3 *trialElements, > __constant REALTYPE2 *quadPoints, > __constant REALTYPE *quadWeights, > __global REALTYPE *globalResult, > int nTrial) > { > > /* Variable declarations */ > > const int numQuadPoints = N_QUAD_POINTS; > > > size_t myTestIndex = get_global_id(0); > size_t myTrialIndex = get_global_id(1); > > size_t trialOffset = get_global_offset(1); > > > size_t testQuadIndex; > size_t trialQuadIndex; > size_t i; > > REALTYPE3 testGlobalPoint; > REALTYPE3 trialGlobalPoint; > > REALTYPE3 testCorners[3]; > REALTYPE3 trialCorners[3]; > > REALTYPE3 testJac[2]; > REALTYPE3 trialJac[2]; > > REALTYPE2 testPoint; > REALTYPE2 trialPoint; > > REALTYPE dist; > REALTYPE testIntElem; > REALTYPE trialIntElem; > > REALTYPE shapeIntegral = 0; > > if (myTrialIndex - trialOffset >= nTrial) return; > > for (i = 0; i < 3; ++i){ > testCorners[i] = testElements[3 * myTestIndex + i]; > trialCorners[i] = trialElements[3 * myTrialIndex + i]; > } > > testJac[0] = testCorners[1] - testCorners[0]; > testJac[1] = testCorners[2] - testCorners[0]; > > trialJac[0] = trialCorners[1] - trialCorners[0]; > trialJac[1] = trialCorners[2] - trialCorners[0]; > > testIntElem = length(cross(testJac[0], testJac[1])); > trialIntElem = length(cross(trialJac[0], trialJac[1])); > > for (testQuadIndex = 0; testQuadIndex < numQuadPoints; > ++testQuadIndex){ > > testPoint = quadPoints[testQuadIndex]; > > testGlobalPoint = testCorners[0] + testJac[0] * testPoint.x + > testJac[1] * testPoint.y; > //testGlobalPoint = (M_ONE - testPoint.x - testPoint.y) * > testCorners[0] + > // testPoint.x * testCorners[1] + testPoint.y * testCorners[2]; > > for (trialQuadIndex = 0; trialQuadIndex < numQuadPoints; > ++trialQuadIndex){ > > trialPoint = quadPoints[trialQuadIndex]; > trialGlobalPoint = trialCorners[0] + trialJac[0] * > trialPoint.x + trialJac[1] * trialPoint.y; > //trialGlobalPoint = (M_ONE - trialPoint.x - trialPoint.y) * > trialCorners[0] + > // trialPoint.x * trialCorners[1] + trialPoint.y * > trialCorners[2]; > > dist = distance(testGlobalPoint, trialGlobalPoint); > shapeIntegral += quadWeights[testQuadIndex] * > quadWeights[trialQuadIndex] / dist; > } > > } > > shapeIntegral *= testIntElem * trialIntElem * M_INV_4PI; > globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] = > shapeIntegral; > > > > > } > > > > > > On 7 February 2018 at 10:41, Michal Babej <Fra...@ru...> > wrote: > > > > Hi, > > > > > we noticed for one of our OpenCL kernels that pocl is over 4 times > > > slower than the Intel OpenCL runtime on a Xeon W processor. > > > > 1) If i googled correctly, Xeon W has AVX-512, which the intel runtime > > is likely fully using. LLVM 4 has absolutely horrible AVX512 support, > > LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for > > AVX-512 to work (at least i know they fixed the AVX-512 few bugs i > > found, i don't have a machine anymore to test it). > > > > 2) It could be the autovectorizer, or it could be something else. Are > > your machines NUMA ? if so, you'll likely see very bad performance, as > > pocl has no NUMA tuning currently. Also i've seen occasionally that pocl > > unrolls too much and overflows L1 caches (you could try experimenting > > with various local WG sizes to clEnqueueNDRK). Unfortunately > > this part of pocl has received little attention lately... 
> > > > Cheers, > > -- mb > > > > ------------------------------------------------------------ > ------------------ > > Check out the vibrant tech community on one of the world's most > > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > > _______________________________________________ > > pocl-devel mailing list > > poc...@li... > > https://lists.sourceforge.net/lists/listinfo/pocl-devel > > > > > -- > Dr. Timo Betcke > Reader in Mathematics > University College London > Department of Mathematics > E-Mail: t.b...@uc... > Tel.: +44 (0) 20-3108-4068 > Fax.: +44 (0) 20-7383-5519 > > ------------------------------------------------------------ > ------------------ > Check out the vibrant tech community on one of the world's most > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > _______________________________________________ > pocl-devel mailing list > poc...@li... > https://lists.sourceforge.net/lists/listinfo/pocl-devel > > -- Jeff Hammond jef...@gm... http://jeffhammond.github.io/ |
From: Pekka J. <pek...@tu...> - 2018-02-07 14:36:31
|
Dear Timo, Also Michal's advice of testing with LLVM 6.0 is a good idea. Given you don't seem to have barriers in your kernel, you might want to check if any of the implicit barriers we inject confuse the vectorization. These are inserted at least in passes: ImplicitConditionalBarriers.cc ImplicitLoopBarriers.cc They have some debug output which you can enable via a macro define which can help you to the right direction. BR, Pekka On 02/07/2018 01:34 PM, Timo Betcke wrote: > Dear All, > > thanks for the responses. I am posting the kernel below. You can also find it > at > > https://bitbucket.org/bemppsolutions/bempp-cl/src/master/bempp/core/opencl/sources/kernels/laplace_bnd_slp_0.cl?at=master&fileviewer=file-view-default > > It is the first kernel (evaluate_regular). The second one is less > performance critical. First of all, regarding processor. I mistyped. I > actually made the experiments on my Kaby Lake i7 > dual core laptop, but will redo the experiments today on my Xeon W > workstation. I recompiled yesterday already with llvm 5. But the results are > similar. > Intel OpenCL takes around 0.9 seconds and pocl takes around 5 seconds for > this setup. > > I will follow Jeff's and Pekka's advice today to get some more infomation on > the compiled kernels (might be slow going, as it is the first time I am > diving into OpenCL profiling). > > I just also tried setting the workgroup sizes manually. But this did not > change anything in the pocl performance. The actual kernel is below. It > consists of fairly simple for loops. > The parameter REALTYPE is controlled by a macro in the header file and set > to 'float' for the benchmark tests. > > A little explanation is in order. Each work item takes two triangles ( a > test and a trial triangle), computes some geometric properties and > integrates the 3d Laplace potential operator 1/r > across them, where r is the distance between a quadrature point on the test > and on the trial triangle. So, it is a fairly simple operation that > accelerates very well on the Intel runtime. > > I am quite motivated to get this issue sorted. We are rewriting an > open-source software package (Bempp, www.bempp.com <http://www.bempp.com>) > to be based on PyOpenCL with OpenCL kernels. > The first performance benchmarks gave very good speed-ups compared to our > old C++ code (for both, Intel and pocl). But for later deployment we want to > be able to target pocl with minimial > performance penalties compared to the Intel runtime (if possible) as it is > by default available in Ubuntu and can also be simply installed via conda-forge. 
> > Best wishes > > Timo > > > --------------------------------------------------------------------------------------------- > > __kernel void evaluate_regular(__constant REALTYPE3 *testElements, > __constant REALTYPE3 *trialElements, > __constant REALTYPE2 *quadPoints, > __constant REALTYPE *quadWeights, > __global REALTYPE *globalResult, > int nTrial) > { > > /* Variable declarations */ > > const int numQuadPoints = N_QUAD_POINTS; > > > size_t myTestIndex = get_global_id(0); > size_t myTrialIndex = get_global_id(1); > > size_t trialOffset = get_global_offset(1); > > > size_t testQuadIndex; > size_t trialQuadIndex; > size_t i; > > REALTYPE3 testGlobalPoint; > REALTYPE3 trialGlobalPoint; > > REALTYPE3 testCorners[3]; > REALTYPE3 trialCorners[3]; > > REALTYPE3 testJac[2]; > REALTYPE3 trialJac[2]; > > REALTYPE2 testPoint; > REALTYPE2 trialPoint; > > REALTYPE dist; > REALTYPE testIntElem; > REALTYPE trialIntElem; > > REALTYPE shapeIntegral = 0; > > if (myTrialIndex - trialOffset >= nTrial) return; > > for (i = 0; i < 3; ++i){ > testCorners[i] = testElements[3 * myTestIndex + i]; > trialCorners[i] = trialElements[3 * myTrialIndex + i]; > } > > testJac[0] = testCorners[1] - testCorners[0]; > testJac[1] = testCorners[2] - testCorners[0]; > > trialJac[0] = trialCorners[1] - trialCorners[0]; > trialJac[1] = trialCorners[2] - trialCorners[0]; > > testIntElem = length(cross(testJac[0], testJac[1])); > trialIntElem = length(cross(trialJac[0], trialJac[1])); > > for (testQuadIndex = 0; testQuadIndex < numQuadPoints; ++testQuadIndex){ > > testPoint = quadPoints[testQuadIndex]; > > testGlobalPoint = testCorners[0] + testJac[0] * testPoint.x + > testJac[1] * testPoint.y; > //testGlobalPoint = (M_ONE - testPoint.x - testPoint.y) * > testCorners[0] + > // testPoint.x * testCorners[1] + testPoint.y * testCorners[2]; > > for (trialQuadIndex = 0; trialQuadIndex < numQuadPoints; > ++trialQuadIndex){ > > trialPoint = quadPoints[trialQuadIndex]; > trialGlobalPoint = trialCorners[0] + trialJac[0] * trialPoint.x > + trialJac[1] * trialPoint.y; > //trialGlobalPoint = (M_ONE - trialPoint.x - trialPoint.y) * > trialCorners[0] + > // trialPoint.x * trialCorners[1] + trialPoint.y * > trialCorners[2]; > > dist = distance(testGlobalPoint, trialGlobalPoint); > shapeIntegral += quadWeights[testQuadIndex] * > quadWeights[trialQuadIndex] / dist; > } > > } > > shapeIntegral *= testIntElem * trialIntElem * M_INV_4PI; > globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] = > shapeIntegral; > > > > } > > > > > > On 7 February 2018 at 10:41, Michal Babej <Fra...@ru... > <mailto:Fra...@ru...>> wrote: > > > > Hi, > > > > > we noticed for one of our OpenCL kernels that pocl is over 4 times > > > slower than the Intel OpenCL runtime on a Xeon W processor. > > > > 1) If i googled correctly, Xeon W has AVX-512, which the intel runtime > > is likely fully using. LLVM 4 has absolutely horrible AVX512 support, > > LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for > > AVX-512 to work (at least i know they fixed the AVX-512 few bugs i > > found, i don't have a machine anymore to test it). > > > > 2) It could be the autovectorizer, or it could be something else. Are > > your machines NUMA ? if so, you'll likely see very bad performance, as > > pocl has no NUMA tuning currently. Also i've seen occasionally that pocl > > unrolls too much and overflows L1 caches (you could try experimenting > > with various local WG sizes to clEnqueueNDRK). 
Unfortunately > > this part of pocl has received little attention lately... > > > > Cheers, > > -- mb > > > > > ------------------------------------------------------------------------------ > > Check out the vibrant tech community on one of the world's most > > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > > _______________________________________________ > > pocl-devel mailing list > > poc...@li... <mailto:poc...@li...> > > https://lists.sourceforge.net/lists/listinfo/pocl-devel > > > > > -- > Dr. Timo Betcke > Reader in Mathematics > University College London > Department of Mathematics > E-Mail: t.b...@uc... <mailto:t.b...@uc...> > Tel.: +44 (0) 20-3108-4068 > Fax.: +44 (0) 20-7383-5519 > > > ------------------------------------------------------------------------------ > Check out the vibrant tech community on one of the world's most > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > > > > _______________________________________________ > pocl-devel mailing list > poc...@li... > https://lists.sourceforge.net/lists/listinfo/pocl-devel > -- Pekka |
From: Timo B. <tim...@gm...> - 2018-02-07 11:34:16
|
Dear All,

thanks for the responses. I am posting the kernel below. You can also find it at

https://bitbucket.org/bemppsolutions/bempp-cl/src/master/bempp/core/opencl/sources/kernels/laplace_bnd_slp_0.cl?at=master&fileviewer=file-view-default

It is the first kernel (evaluate_regular). The second one is less performance critical. First of all, regarding the processor: I mistyped. I actually ran the experiments on my Kaby Lake i7 dual-core laptop, but will redo them today on my Xeon W workstation. I already recompiled yesterday with LLVM 5, but the results are similar: Intel OpenCL takes around 0.9 seconds and pocl takes around 5 seconds for this setup.

I will follow Jeff's and Pekka's advice today to get some more information on the compiled kernels (it might be slow going, as this is the first time I am diving into OpenCL profiling). I also tried setting the workgroup sizes manually, but this did not change anything in the pocl performance. The actual kernel is below. It consists of fairly simple for loops. The parameter REALTYPE is controlled by a macro in the header file and is set to 'float' for the benchmark tests.

A little explanation is in order. Each work item takes two triangles (a test and a trial triangle), computes some geometric properties and integrates the 3d Laplace potential operator 1/r across them, where r is the distance between a quadrature point on the test triangle and one on the trial triangle. So it is a fairly simple operation that accelerates very well on the Intel runtime.

I am quite motivated to get this issue sorted. We are rewriting an open-source software package (Bempp, www.bempp.com) to be based on PyOpenCL with OpenCL kernels. The first performance benchmarks gave very good speed-ups compared to our old C++ code (for both Intel and pocl). But for later deployment we want to be able to target pocl with minimal performance penalties compared to the Intel runtime (if possible), as it is available by default in Ubuntu and can also be simply installed via conda-forge.
Best wishes

Timo

---------------------------------------------------------------------------------------------

__kernel void evaluate_regular(__constant REALTYPE3 *testElements,
                               __constant REALTYPE3 *trialElements,
                               __constant REALTYPE2 *quadPoints,
                               __constant REALTYPE *quadWeights,
                               __global REALTYPE *globalResult,
                               int nTrial)
{

    /* Variable declarations */

    const int numQuadPoints = N_QUAD_POINTS;

    size_t myTestIndex = get_global_id(0);
    size_t myTrialIndex = get_global_id(1);

    size_t trialOffset = get_global_offset(1);

    size_t testQuadIndex;
    size_t trialQuadIndex;
    size_t i;

    REALTYPE3 testGlobalPoint;
    REALTYPE3 trialGlobalPoint;

    REALTYPE3 testCorners[3];
    REALTYPE3 trialCorners[3];

    REALTYPE3 testJac[2];
    REALTYPE3 trialJac[2];

    REALTYPE2 testPoint;
    REALTYPE2 trialPoint;

    REALTYPE dist;
    REALTYPE testIntElem;
    REALTYPE trialIntElem;

    REALTYPE shapeIntegral = 0;

    if (myTrialIndex - trialOffset >= nTrial) return;

    for (i = 0; i < 3; ++i){
        testCorners[i] = testElements[3 * myTestIndex + i];
        trialCorners[i] = trialElements[3 * myTrialIndex + i];
    }

    testJac[0] = testCorners[1] - testCorners[0];
    testJac[1] = testCorners[2] - testCorners[0];

    trialJac[0] = trialCorners[1] - trialCorners[0];
    trialJac[1] = trialCorners[2] - trialCorners[0];

    testIntElem = length(cross(testJac[0], testJac[1]));
    trialIntElem = length(cross(trialJac[0], trialJac[1]));

    for (testQuadIndex = 0; testQuadIndex < numQuadPoints; ++testQuadIndex){

        testPoint = quadPoints[testQuadIndex];

        testGlobalPoint = testCorners[0] + testJac[0] * testPoint.x + testJac[1] * testPoint.y;
        //testGlobalPoint = (M_ONE - testPoint.x - testPoint.y) * testCorners[0] +
        //    testPoint.x * testCorners[1] + testPoint.y * testCorners[2];

        for (trialQuadIndex = 0; trialQuadIndex < numQuadPoints; ++trialQuadIndex){

            trialPoint = quadPoints[trialQuadIndex];
            trialGlobalPoint = trialCorners[0] + trialJac[0] * trialPoint.x + trialJac[1] * trialPoint.y;
            //trialGlobalPoint = (M_ONE - trialPoint.x - trialPoint.y) * trialCorners[0] +
            //    trialPoint.x * trialCorners[1] + trialPoint.y * trialCorners[2];

            dist = distance(testGlobalPoint, trialGlobalPoint);
            shapeIntegral += quadWeights[testQuadIndex] * quadWeights[trialQuadIndex] / dist;
        }

    }

    shapeIntegral *= testIntElem * trialIntElem * M_INV_4PI;
    globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] = shapeIntegral;

}

On 7 February 2018 at 10:41, Michal Babej <Fra...@ru...> wrote:
>
> Hi,
>
> > we noticed for one of our OpenCL kernels that pocl is over 4 times
> > slower than the Intel OpenCL runtime on a Xeon W processor.
>
> 1) If i googled correctly, Xeon W has AVX-512, which the intel runtime
> is likely fully using. LLVM 4 has absolutely horrible AVX512 support,
> LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for
> AVX-512 to work (at least i know they fixed the AVX-512 few bugs i
> found, i don't have a machine anymore to test it).
>
> 2) It could be the autovectorizer, or it could be something else. Are
> your machines NUMA ? if so, you'll likely see very bad performance, as
> pocl has no NUMA tuning currently. Also i've seen occasionally that pocl
> unrolls too much and overflows L1 caches (you could try experimenting
> with various local WG sizes to clEnqueueNDRK). Unfortunately
> this part of pocl has received little attention lately...
>
> Cheers,
> -- mb
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> pocl-devel mailing list
> poc...@li...
> https://lists.sourceforge.net/lists/listinfo/pocl-devel

--
Dr. Timo Betcke
Reader in Mathematics
University College London
Department of Mathematics
E-Mail: t.b...@uc...
Tel.: +44 (0) 20-3108-4068
Fax.: +44 (0) 20-7383-5519
|
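A quick experiment suggested by the kernel above (a sketch only, untested, and assuming nTrial > 0): the early "if (... >= nTrial) return;" makes control flow non-uniform for the tail of the grid, which can make it harder for a work-group vectorizer to prove that all work-items execute the quadrature loops. Dropping the return, clamping the index used for the loads, and guarding only the final store keeps every work-item on the same path:

    int validWorkItem = (myTrialIndex - trialOffset) < (size_t)nTrial;
    /* Clamp so out-of-range work-items still do in-bounds (but unused) loads. */
    size_t safeTrialIndex = validWorkItem ? myTrialIndex : trialOffset;

    for (i = 0; i < 3; ++i){
        testCorners[i] = testElements[3 * myTestIndex + i];
        trialCorners[i] = trialElements[3 * safeTrialIndex + i];
    }

    /* ... Jacobians and the two quadrature loops exactly as in the kernel above ... */

    shapeIntegral *= testIntElem * trialIntElem * M_INV_4PI;
    if (validWorkItem)
        globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] = shapeIntegral;

Whether this helps pocl in practice has to be measured; it changes only the control flow, not the arithmetic.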
From: Michal B. <Fra...@ru...> - 2018-02-07 10:41:49
|
Hi,

> we noticed for one of our OpenCL kernels that pocl is over 4 times
> slower than the Intel OpenCL runtime on a Xeon W processor.

1) If I googled correctly, Xeon W has AVX-512, which the Intel runtime is likely using fully. LLVM 4 has absolutely horrible AVX-512 support, LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for AVX-512 to work (at least I know they fixed the few AVX-512 bugs I found; I don't have a machine anymore to test it).

2) It could be the autovectorizer, or it could be something else. Are your machines NUMA? If so, you'll likely see very bad performance, as pocl has no NUMA tuning currently. Also, I've occasionally seen that pocl unrolls too much and overflows the L1 caches (you could try experimenting with various local WG sizes passed to clEnqueueNDRangeKernel). Unfortunately this part of pocl has received little attention lately...

Cheers,
-- mb
|
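For reference, a minimal host-side sketch of the experiment suggested above, passing an explicit local work-group size to clEnqueueNDRangeKernel (the queue, kernel and extents are placeholders, not names taken from the thread; note that in OpenCL 1.x the global size must be divisible by the local size in each dimension):

    size_t global[2] = { numTestTriangles, numTrialTriangles };  /* placeholder extents */
    size_t local[2]  = { 1, 8 };                                 /* one candidate WG shape */

    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        2,       /* work_dim */
                                        NULL,    /* global_work_offset */
                                        global,
                                        local,   /* pass NULL to let the runtime choose */
                                        0, NULL, NULL);

Sweeping a handful of local shapes such as {1,4}, {1,8} and {4,4} and timing each run is usually enough to see whether the work-group shape is the limiting factor.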
From: Pekka J. <pek...@tu...> - 2018-02-07 07:11:25
|
Hello Timo,

I'm glad to hear you are willing to contribute to the cause of open and performance-portable OpenCL. Beware, though: parts of the kernel compiler need major rewrites for clarity, and unfortunately there are only a few people working on it. But hopefully soon we can count you in as one :)

This reminds me that I should really write the "how to tune and hack the pocl kernel compiler" document. Maybe this is a starter for that:

There are several useful environment variables for debugging and analyzing the kernel compiler optimizations: http://portablecl.org/docs/html/env_variables.html

First, you can make pocl dump more debug output from LLVM and its vectorizer:

* POCL_DEBUG_LLVM_PASSES
  When set to 1, enables debug output from LLVM passes during optimization.

* POCL_VECTORIZER_REMARKS
  When set to 1, prints out remarks produced by the loop vectorizer of LLVM during kernel compilation.

To analyze the kernel compiler's intermediate results more closely, you can instruct pocl to leave the temporary LLVM bitcode files behind (normally it deletes them once they are no longer needed):

* POCL_CACHE_DIR
  It's useful to set this to a local temp dir which you can clear between trials.

* POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES=1

Then, after executing your OpenCL app, you will find .bc files under your temp dir, the most interesting one being parallel.bc, which is the final IR produced by pocl and LLVM before codegen. If you don't see vector LLVM IR there, it won't likely appear in your final binary either.

To start hacking: http://portablecl.org/docs/html/kernel_compiler.html

Also our pocl paper might provide additional help, but the above link should give a good overview, although it might be outdated (I've added it to my task list to update it). The LLVM passes are under lib/llvmopencl. The layer between the OpenCL runtime and the kernel compiler is in the files lib/CL/pocl_llvm*.c.

Please don't hesitate to ask for further instructions here or in IRC.

BR,
Pekka

On 02/07/2018 02:20 AM, Timo Betcke wrote:
> Hi,
>
> we noticed for one of our OpenCL kernels that pocl is over 4 times slower
> than the Intel OpenCL runtime on a Xeon W processor. I am assuming it is the
> auto vectorizer. How can I debug this and figure out if vectorization across
> work items is being performed with pocl? The kernels are running under
> PyOpenCL on Ubuntu 16.04 with LLVM 4 and pocl 1.0.
>
> We are planning to distribute our software and would prefer to have good
> performance on pocl and not have to rely on the Intel environment.
>
> Best wishes
>
> Timo
>
> --
> Dr. Timo Betcke
> Reader in Mathematics
> University College London
> Department of Mathematics
> E-Mail: t.b...@uc... <mailto:t.b...@uc...>
> Tel.: +44 (0) 20-3108-4068
> Fax.: +44 (0) 20-7383-5519
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>
> _______________________________________________
> pocl-devel mailing list
> poc...@li...
> https://lists.sourceforge.net/lists/listinfo/pocl-devel
>

-- Pekka
|
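The variables above are normally exported in the shell before launching the application. Purely as a convenience, a host program can also set them itself, as long as this happens before the first OpenCL call that reaches pocl; a minimal sketch (the cache path is a placeholder):

    #include <stdlib.h>
    #include <CL/cl.h>

    int main(void)
    {
        /* Must be set before pocl's runtime and kernel compiler are first used. */
        setenv("POCL_DEBUG_LLVM_PASSES", "1", 1);
        setenv("POCL_VECTORIZER_REMARKS", "1", 1);
        setenv("POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES", "1", 1);
        setenv("POCL_CACHE_DIR", "/tmp/pocl-cache", 1);   /* placeholder path */

        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, NULL);
        /* ... create the context, clBuildProgram, run the kernel, then inspect
           the parallel.bc files left under /tmp/pocl-cache ... */
        return 0;
    }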
From: Jeff H. <jef...@gm...> - 2018-02-07 01:01:49
|
Can you share details eg small reproducer? Or post the OpenCL kernel source? You can use Linux perf or Vtune to access hardware counters to get an idea what vector code is being generated. Jeff (I work for Intel) Sent from my iPhone > On Feb 6, 2018, at 4:20 PM, Timo Betcke <tim...@gm...> wrote: > > Hi, > > we noticed for one of our OpenCL kernels that pocl is over 4 times slower than the Intel OpenCL runtime on a Xeon W processor. I am assuming it is the auto vectorizer. How can I debug this and figure out if vectorization across work items is being performed with pocl? The kernels are running under PyOpenCL on Ubuntu 16.04 with LLVM 4 and pocl 1.0. > > We are planning to distribute our software and would prefer to have good performance on pocl and not have to rely on the Intel environment. > > Best wishes > > Timo > > -- > Dr. Timo Betcke > Reader in Mathematics > University College London > Department of Mathematics > E-Mail: t.b...@uc... > Tel.: +44 (0) 20-3108-4068 > Fax.: +44 (0) 20-7383-5519 > ------------------------------------------------------------------------------ > Check out the vibrant tech community on one of the world's most > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > _______________________________________________ > pocl-devel mailing list > poc...@li... > https://lists.sourceforge.net/lists/listinfo/pocl-devel |
From: Timo B. <tim...@gm...> - 2018-02-07 00:20:36
|
Hi, we noticed for one of our OpenCL kernels that pocl is over 4 times slower than the Intel OpenCL runtime on a Xeon W processor. I am assuming it is the auto vectorizer. How can I debug this and figure out if vectorization across work items is being performed with pocl? The kernels are running under PyOpenCL on Ubuntu 16.04 with LLVM 4 and pocl 1.0. We are planning to distribute our software and would prefer to have good performance on pocl and not have to rely on the Intel environment. Best wishes Timo -- Dr. Timo Betcke Reader in Mathematics University College London Department of Mathematics E-Mail: t.b...@uc... Tel.: +44 (0) 20-3108-4068 Fax.: +44 (0) 20-7383-5519 |
From: Pekka J. <pek...@tu...> - 2017-12-27 15:59:13
|
Hello O. Hartmann, I didn't mean to be "rude" when closing that pocl issue report, but there just seems to be no doubt that the problem is in the LLVM-side and has wider impact than pocl or other OpenCL implementations. Thus, the issue is better discussed in the LLVM bugzilla entries instead of pocl's issue tracker / mailing list because other LLVM devs are better be in the loop. One easy "workaround fix" might be what I mentioned in https://bugs.llvm.org/show_bug.cgi?id=30587: "This error would probably just go away if the command line handler just ignored multiple identical command line switch registrations silently." Someone just needs to try something like that out and submit a patch to LLVM. However, if the old dynlib linking doesn't work around it, there might be a more serious / another issue now. And I'm not aware of other client-side workarounds other than all clients dynamic linking to the same libLLVM*.so, unfortunately. BR, Pekka On 27.12.2017 15:19, O. Hartmann wrote: > Hello List. > > Running pocl 0.14 and/or pocl-1.0 on FreeBSD CURRENT, using > CLANG/LLVM 4.0.1, code generated using pocl in combination with > ocl-icd 2.2.11 and intel-beignet and clover installed as additional > OpenCL ICDs, any code/binary using ocl-icd is bailing out when pocl > is installed dropping: > > : CommandLine Error: Option 'enable-value-profiling' registered more > than once! LLVM ERROR: inconsistency in registered CommandLine > options > > > to the console. > > Searching the net for some answers or bugfixes lead me to this bug > report: > > https://github.com/pocl/pocl/issues/474 > > which has been closed (in a rude way for my taste). > > There are reports similar to my experiences at the LLVM bug report > site: > > Bug 30587 - Inconsistency in commandline options with multiple OpenCL > vendor libraries installed > https://bugs.llvm.org/show_bug.cgi?id=30587 > > with some comments from some fellows well known to this list and this > one: > > Bug 22952 - cl::opt + LLVM_BUILD_LLVM_DYLIB is completely broken > https://bugs.llvm.org/show_bug.cgi?id=22952 > > I think I'm in the same boat now and want to ask if there is any > solution to this problem apart from staically linking llvm? > > Kind regard, > > Oh > > > > ------------------------------------------------------------------------------ > > Check out the vibrant tech community on one of the world's most > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > > > > _______________________________________________ pocl-devel mailing > list poc...@li... > https://lists.sourceforge.net/lists/listinfo/pocl-devel > -- Pekka |
From: O. H. <oha...@wa...> - 2017-12-27 13:33:04
|
Hello List.

Running pocl 0.14 and/or pocl 1.0 on FreeBSD CURRENT with CLANG/LLVM 4.0.1, ocl-icd 2.2.11, and intel-beignet and clover installed as additional OpenCL ICDs, any code/binary using ocl-icd bails out whenever pocl is installed, printing to the console:

CommandLine Error: Option 'enable-value-profiling' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options

Searching the net for answers or bugfixes led me to this bug report:

https://github.com/pocl/pocl/issues/474

which has been closed (in a rude way for my taste).

There are reports similar to my experience at the LLVM bug report site:

Bug 30587 - Inconsistency in commandline options with multiple OpenCL vendor libraries installed
https://bugs.llvm.org/show_bug.cgi?id=30587

with some comments from fellows well known to this list, and this one:

Bug 22952 - cl::opt + LLVM_BUILD_LLVM_DYLIB is completely broken
https://bugs.llvm.org/show_bug.cgi?id=22952

I think I'm in the same boat now and want to ask if there is any solution to this problem apart from statically linking LLVM?

Kind regards,

Oh

--
O. Hartmann
I object to the use or transfer of my data for advertising purposes or for market or opinion research (§ 28 Abs. 4 BDSG).
|
From: Michal B. <Fra...@ru...> - 2017-12-19 12:02:12
|
Hello, After fixing a few bugs in RC1, we're finally ready for release. Release highlights: * Support for LLVM/Clang 5.0 and 4.0. * Support for NVIDIA GPUs via a new CUDA backend (currently experimental) * Full conformance with OpenCL 1.2 standard on CPU backend (with some limitations, see the documentation for details) You can download from the usual location: http://portablecl.org/download.html or from Github: https://github.com/pocl/pocl/releases/tag/v1.0 Regards, -- mb |
From: Andreas K. <li...@in...> - 2017-12-09 19:31:44
|
Michal Babej <Fra...@ru...> writes:
> Hi,
>
> That's great to hear.
>
>> long-standing bugs that we encountered with 0.14:
>
> One seems like a cache issue, the other like math precision
> issue, yes ?

Both of them struck me as cache issues, though I might be wrong. AFAIR, there isn't much in the way of special functions in these kernels.

> Unfortunately i still see some occasional problems with the cache, but
> i'll try to resolve them before release. It should be better than 0.14
> still.

That'd be fantastic. For what it's worth, PyOpenCL caches OpenCL binaries (which seems faster than relying entirely on pocl's from-source cache, which seems to do a full preprocessor run on the code before hashing).

> As for math, the precision is mandated by OpenCL - and pocl even
> documents the actual ULPs, scroll down in this document:
> https://github.com/pocl/pocl/blob/master/doc/sphinx/source/conformance.rst
> previously (with VML) pocl was very non-conformant.

We did see a few things that could conceivably be traced back to loss of FP digits. I'm excited to hear that improvements have been made, and I'm looking forward to seeing whether that translates to improvements in our results.

>> I've also started this here:
>> https://github.com/conda-forge/pocl-feedstock/pull/11
>
> I'm unfamiliar with conda, but i forgot to mention there are now Docker
> files in tools/docker, which might be useful even if one's not using
> Docker, to see the exact steps required to setup Pocl.

Conda is a user-level package manager. Install anywhere in your home directory, no root required. It originated within the scientific Python community but has expanded far beyond that, to Julia, R, and many others. I've packaged pocl, ocl-icd, and pyopencl for it, with the goal of making the installation of a working OpenCL environment a question of copy-pasting three shell commands.

Among other things, it takes care of making binaries with baked-in paths relocatable (by compiling with a very long path and then hacking the binary). Here's the patch that makes that work with pocl:

https://github.com/inducer/pocl-feedstock/blob/b7f3702df3d7888ee13a51cc2a93adf31489d771/recipe/paths-in-separate-compilation-unit.patch

Let me know if that's something you could see accepting upstream.

Andreas
|
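For readers unfamiliar with the relocation trick mentioned above, the rough idea (an illustration only; the identifiers below are hypothetical and not the ones used by the actual patch) is to keep the baked-in prefix in its own compilation unit as an oversized, NUL-padded buffer, so the package manager can overwrite it in the installed binary with a shorter real path without disturbing the binary layout:

    /* install_paths.c - hypothetical illustration of the technique */

    /* Built with a deliberately long placeholder prefix; a post-install step
       (e.g. conda's relocation machinery) rewrites the contents of this array
       in the shipped binary. */
    static const char INSTALL_PREFIX[4096] =
        "/placeholder/extremely/long/prefix/only/used/while/building/the/package";

    const char *get_install_prefix(void)
    {
        return INSTALL_PREFIX;
    }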
From: Andreas K. <li...@in...> - 2017-12-07 20:49:53
|
Michal Babej <Fra...@ru...> writes: > Hello, > > It took a while, but finally it's ready. Please test with either > LLVM 4 or 5, and report any issues. Thanks > > Instructions: > > https://github.com/pocl/pocl/wiki/Release-testing-of-pocl-1.0 > > Sources: > > https://github.com/pocl/pocl/releases/tag/v1.0-RC1 This is looking great from our end. I've run the tests for many of our packages (pyopencl, loopy, sumpy, pytential, grudge) on this version, and encountered no issues that were pocl's fault. As far as I can tell, this also seems to have resolved two long-standing bugs that we encountered with 0.14: https://gitlab.tiker.net/inducer/pytential/issues/64 https://gitlab.tiker.net/inducer/pytential/issues/75 Plus the new pocl *feels* a fair bit faster in our use. So all in all, this is shaping up to be an awesome release! Thank you all for the work you have put into it. I've also started this here: https://github.com/conda-forge/pocl-feedstock/pull/11 so that conda install pocl should suffice to install an up-to-date pocl for those using the conda user-level package manager in conjunction with conda forge. Andreas |
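As background for the binary-caching remark above: the mechanism such a cache builds on is the standard OpenCL program-binary API. A condensed host-side sketch follows (single device, error handling omitted, assumes the usual program/context/device setup; this is not PyOpenCL's actual implementation):

    /* After a successful clBuildProgram(program, ...): */
    size_t binSize;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(binSize), &binSize, NULL);

    unsigned char *bin = (unsigned char *)malloc(binSize);
    unsigned char *bins[1] = { bin };
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);
    /* ... write bin/binSize to disk, keyed by source, build options and device ... */

    /* On a later run, rebuild from the cached binary instead of the source: */
    cl_int binStatus, err;
    cl_program cached = clCreateProgramWithBinary(context, 1, &device, &binSize,
                                                  (const unsigned char **)bins,
                                                  &binStatus, &err);
    err = clBuildProgram(cached, 1, &device, NULL, NULL, NULL);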
From: Michal B. <Fra...@ru...> - 2017-12-06 12:09:02
|
Hello, It took a while, but finally it's ready. Please test with either LLVM 4 or 5, and report any issues. Thanks Instructions: https://github.com/pocl/pocl/wiki/Release-testing-of-pocl-1.0 Sources: https://github.com/pocl/pocl/releases/tag/v1.0-RC1 Regards, -- mb |
From: Pekka J. <pek...@tu...> - 2017-11-16 12:43:34
|
On 16.11.2017 08:46, Wuweijia wrote: > I do not how the pocl add the definition of _local_id_x , > _local_id_y ... to the final shared object. Can you show me where or > how. I check the code that _local_id_x _local_id_y variable is > external, not the definition. They are placeholder declarations which are converted to "context array" field accesses by the LLVM pass in lib/llvmopencl/Workgroup.cc (privatizeContext). The "context array" is given as a hidden extra argument to the kernel and contains ids etc. related to the launch of the kernel. For non-GPUs, the ids are iterated by the parallel work item loops generated to execute the whole work-group's WIs in a single function. BR, -- Pekka |
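To make that a bit more concrete, here is a purely conceptual sketch of what a work-group function looks like after the transformation; the struct layout and the names are invented for illustration and do not match pocl's real generated code:

    /* Hypothetical shape of the generated work-group function. */
    typedef struct {
        size_t local_size[3];
        size_t group_id[3];
        size_t global_offset[3];
        /* ... further launch state ... */
    } wg_context_t;   /* invented name */

    void scarlar_add_workgroup(void *kernel_args, wg_context_t *ctx)
    {
        for (size_t z = 0; z < ctx->local_size[2]; ++z)
          for (size_t y = 0; y < ctx->local_size[1]; ++y)
            for (size_t x = 0; x < ctx->local_size[0]; ++x) {
                /* the kernel body, with every use of _local_id_x/_local_id_y/
                   _local_id_z rewritten to x, y and z of this loop nest */
            }
    }

In other words, the _local_id_* placeholders only disappear because the workgroup-related passes rewrite them into loop variables and context fields; skipping those passes leaves the declarations unresolved, which is exactly the undefined-symbol situation reported in the next message.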
From: Wuweijia <wuw...@hu...> - 2017-11-16 12:40:45
|
Hi #else passes.push_back("workitem-handler-chooser"); passes.push_back("workgroup"); passes.push_back("globaldce"); passes.push_back("flatten"); passes.push_back("wi-aa"); passes.push_back("workitemrepl"); // passes.push_back("STANDARD_OPTS"); const std::string wg_method = pocl_get_string_option("POCL_WORK_GROUP_METHOD", "loopvec"); #endif I commented all the passes and only push these these passes (as above), when I run the cl file compile into the so. The so is lack of some variable that is undefined . These the output, that some global variable is undefined. localhost:/system/bin # nm /system/lib64/crtend_so.o -o /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so | grep " U " /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U __cxa_atexit@@LIBC /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U __cxa_finalize@@LIBC /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U __register_atfork@@LIBC /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U _group_id_x /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U _group_id_y /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U _group_id_z /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U _local_id_x /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U _local_id_y /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U _local_id_z localhost:/system/bin # I do not how to handle it . can you tell me the which passes generate it why it failed, I can change the code . Environment: Arm64 server, non-GPUs BR Owen -----邮件原件----- 发件人: Pekka Jääskeläinen [mailto:pek...@tu...] 发送时间: 2017年11月16日 20:29 收件人: Wuweijia <wuw...@hu...>; Portable Computing Language development discussion <poc...@li...> 抄送: Fanbohao <fan...@hu...> 主题: Re: 答复: 答复: [pocl-devel] [POCL_DBG] How to debug the cl file. On 16.11.2017 08:46, Wuweijia wrote: > I do not how the pocl add the definition of _local_id_x , > _local_id_y ... to the final shared object. Can you show me where or > how. I check the code that _local_id_x _local_id_y variable is > external, not the definition. They are placeholder declarations which are converted to "context array" field accesses by the LLVM pass in lib/llvmopencl/Workgroup.cc (privatizeContext). The "context array" is given as a hidden extra argument to the kernel and contains ids etc. related to the launch of the kernel. For non-GPUs, the ids are iterated by the parallel work item loops generated to execute the whole work-group's WIs in a single function. BR, -- Pekka |
From: Pekka J. <pek...@tu...> - 2017-11-16 09:02:03
|
Hi, This would be an extremely useful feature, but I don't know of anyone working on it at the moment. Contributions welcome. BR, Pekka On 16.11.2017 02:43, Wuweijia wrote: > hi > Is there any milestone about how to release that function? > BR > Owen > > -----邮件原件----- > 发件人: Pekka Jääskeläinen [mailto:pek...@tu...] > 发送时间: 2017年11月15日 19:32 > 收件人: Portable Computing Language development discussion <poc...@li...>; Wuweijia <wuw...@hu...> > 抄送: Fanbohao <fan...@hu...> > 主题: Re: [pocl-devel] [POCL_DBG] How to debug the cl file. > > Hi, > > I think the debug info generation and preservation across WGF passes is unfinished. > > I typically end up using printf() when debugging kernel code, and look at the disassembly in gdb, which usually helps to spot the kernel lines where the crash happens. > > On 10.11.2017 10:33, Wuweijia wrote: >> I write cl file , and ran it with the pocl in arm64 >> server, compilation is ok. In the running there is some bug in the >> cl file, and application crash in the cl file . How can I debug the cl file with gdb. >> >> I think the pocl compile the cl file with some optimized >> options, no debug info. Show the gdb can not show me the callstack >> where cl file crash. > > -- > Pekka > -- Pekka |
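As a concrete example of the printf() approach mentioned above (the kernel body and signature here are invented for illustration; the real scarlar_add kernel is not shown in this thread), OpenCL 1.2's built-in printf can trace a single work-item without any debug-info support:

    __kernel void scarlar_add(__global const float *a,
                              __global const float *b,
                              __global float *out)
    {
        size_t gid = get_global_id(0);
        out[gid] = a[gid] + b[gid];

        /* Restrict the output to one work-item so the log stays readable. */
        if (gid == 0)
            printf("gid %d: a=%f b=%f out=%f\n",
                   (int)gid, a[gid], b[gid], out[gid]);
    }

Combined with looking at the disassembly in gdb, this is usually enough to narrow a crash down to a specific kernel line.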
From: Wuweijia <wuw...@hu...> - 2017-11-16 07:47:27
|
Hi: I do not how the pocl add the definition of _local_id_x , _local_id_y ... to the final shared object. Can you show me where or how. I check the code that _local_id_x _local_id_y variable is external, not the definition. BR Owen -----邮件原件----- 发件人: Pekka Jääskeläinen [mailto:pek...@tu...] 发送时间: 2017年11月16日 15:30 收件人: Wuweijia <wuw...@hu...>; Portable Computing Language development discussion <poc...@li...> 抄送: Fanbohao <fan...@hu...> 主题: Re: 答复: [pocl-devel] [POCL_DBG] How to debug the cl file. Hi, This would be an extremely useful feature, but I don't know of anyone working on it at the moment. Contributions welcome. BR, Pekka On 16.11.2017 02:43, Wuweijia wrote: > hi > Is there any milestone about how to release that function? > BR > Owen > > -----邮件原件----- > 发件人: Pekka Jääskeläinen [mailto:pek...@tu...] > 发送时间: 2017年11月15日 19:32 > 收件人: Portable Computing Language development discussion > <poc...@li...>; Wuweijia <wuw...@hu...> > 抄送: Fanbohao <fan...@hu...> > 主题: Re: [pocl-devel] [POCL_DBG] How to debug the cl file. > > Hi, > > I think the debug info generation and preservation across WGF passes is unfinished. > > I typically end up using printf() when debugging kernel code, and look at the disassembly in gdb, which usually helps to spot the kernel lines where the crash happens. > > On 10.11.2017 10:33, Wuweijia wrote: >> I write cl file , and ran it with the pocl in arm64 >> server, compilation is ok. In the running there is some bug in the >> cl file, and application crash in the cl file . How can I debug the cl file with gdb. >> >> I think the pocl compile the cl file with some >> optimized options, no debug info. Show the gdb can not show me the >> callstack where cl file crash. > > -- > Pekka > -- Pekka |
From: Wuweijia <wuw...@hu...> - 2017-11-16 01:43:48
|
hi Is there any milestone about how to release that function? BR Owen -----邮件原件----- 发件人: Pekka Jääskeläinen [mailto:pek...@tu...] 发送时间: 2017年11月15日 19:32 收件人: Portable Computing Language development discussion <poc...@li...>; Wuweijia <wuw...@hu...> 抄送: Fanbohao <fan...@hu...> 主题: Re: [pocl-devel] [POCL_DBG] How to debug the cl file. Hi, I think the debug info generation and preservation across WGF passes is unfinished. I typically end up using printf() when debugging kernel code, and look at the disassembly in gdb, which usually helps to spot the kernel lines where the crash happens. On 10.11.2017 10:33, Wuweijia wrote: > I write cl file , and ran it with the pocl in arm64 > server, compilation is ok. In the running there is some bug in the > cl file, and application crash in the cl file . How can I debug the cl file with gdb. > > I think the pocl compile the cl file with some optimized > options, no debug info. Show the gdb can not show me the callstack > where cl file crash. -- Pekka |