From: Nicholas C. <nic...@uc...> - 2018-02-08 17:46:30
Hi all,

I am actually experiencing the same problem as Timo (admittedly, for a much more complicated set of kernels). I've been treating POCL as a useful tool for validation -- i.e., it's easy to install via conda, and unlike some other OpenCL runtimes (i.e., Intel... not that I mean to offend Jeff), you guys are very responsive about acknowledging / fixing bugs, or at least explaining why I shouldn't be doing something that breaks POCL. However, if I could actually get POCL to vectorize my code, that would be even better!

I've been following this thread to see if I can figure it out, but trying to turn on some debugging info in my compilation process yields a segfault:

POCL_DEBUG=1 POCL_DEBUG_LLVM_PASSES=1 POCL_VERBOSE=1 ./libocl_pyjac.so 4 1
[2018-01-08 17:14:11.140720878650861]POCL: in fn pocl_init_devices at line 398: | GENERAL | Installing SIGFPE handler...
[2018-01-08 17:14:11.654271269]POCL: in fn POclCreateCommandQueue at line 41: | GENERAL | Create Command queue on device 0
[2018-01-08 17:14:11.140720962851560]POCL: in fn compile_and_link_program at line 506: | GENERAL | building program with options -I/home/ncurtis/spyjac-test/out -cl-std=CL1.2
[2018-01-08 17:14:11.94610193975812]POCL: in fn compile_and_link_program at line 561: | GENERAL | building from sources for device 0
Segmentation fault (core dumped)

Does LLVM have to be compiled in Debug mode in order for the output to work? I can upload a kernel example if desired.

Nick
From: Pekka J. <pek...@tu...> - 2018-02-08 05:58:33
Hi Timo,

Too bad I personally cannot spend more time on this due to urgent deadlines, but some quick insights:

I added a ticket so we remember to check why you didn't get vectorizer remarks, which can be really useful: https://github.com/pocl/pocl/issues/613

Do you use FP relaxation flags with clBuildProgram? Strict FP reordering rules sometimes prevent vectorization.

If you aim for horizontal (work-group) vectorization of your kernel loops, the debug output below can indeed indicate a reason. I haven't followed the progress of outer-loop vectorization in upstream LLVM, but the way pocl tries to enforce it now is to force the parallel WI loop inside your kernel loops. It does that by trying to add an implicit barrier inside your loop, which results in that effect.

It cannot do that if it doesn't know whether it's legal to do so (all WIs have to go through all kernel loop iterations). In this case the analysis to figure that out failed to prove that's the case. It might be worthwhile to try to track the reason for that. I think upstream LLVM also has a divergence analysis which might now be adopted in pocl.

VariableUniformityAnalysis.cc is the one that analyses whether a variable is "uniform" (known to always contain the same value for all WIs) or not. There are also debug outputs that can be enabled to figure out why your loop iteration variables were not detected as such.

The early exit might cause difficulties for various analyses:

if (myTrialIndex - trialOffset >= nTrial) return;

In fact, that could cause all sorts of trouble for static fine-grained parallelization, as it can mean WI divergence at the end of the grid (even if it really doesn't, it's not possible for the kcompiler to prove that, since nTrial is a kernel argument variable).

So, if you can avoid this by specializing your kernel into an edge kernel and one which is known not to go out of bounds, it might help the pocl kcompiler cope with this case. All of this could be done by the kcompiler, but it currently isn't. If someone would like to add handling for this, it would be really useful, as this is quite a common pattern in OpenCL C kernels.

I hope these insights help,
Pekka
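For readers unfamiliar with the FP relaxation flags Pekka asks about, here is a minimal host-side sketch. It assumes a cl_program and cl_device_id created elsewhere, and the helper name is illustrative, not from the thread; the build options shown (-cl-std=CL1.2, -cl-fast-relaxed-math) are standard OpenCL build options, and whether relaxed math is acceptable depends on the kernel's accuracy requirements.

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Illustrative helper: build "program" for "device" with relaxed-FP
 * options and dump the build log on failure.  -cl-fast-relaxed-math
 * lifts the strict FP reordering/accuracy rules that can block
 * vectorization; -cl-mad-enable alone is a milder alternative. */
cl_int build_with_relaxed_fp(cl_program program, cl_device_id device)
{
    const char *opts = "-cl-std=CL1.2 -cl-fast-relaxed-math";
    cl_int err = clBuildProgram(program, 1, &device, opts, NULL, NULL);
    if (err != CL_SUCCESS) {
        size_t log_size = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              0, NULL, &log_size);
        char *log = (char *)malloc(log_size);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              log_size, log, NULL);
        fprintf(stderr, "build log:\n%s\n", log);
        free(log);
    }
    return err;
}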
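A rough OpenCL C sketch of the specialization Pekka suggests. The kernel names, the trimmed-down argument list, and the compute_pair placeholder are illustrative stand-ins for Timo's evaluate_regular, not code from the thread; the point is only the split: an interior kernel with no bounds check, enqueued over the part of the grid known to be in range, plus an edge kernel that keeps the check for the remainder.

#define REALTYPE float

/* Placeholder for the real quadrature loop nest of evaluate_regular;
 * the actual body lives in Timo's kernel and is omitted here. */
REALTYPE compute_pair(size_t testIndex, size_t trialIndex)
{
    return (REALTYPE)(testIndex + trialIndex);
}

/* Interior kernel: the host enqueues it only over trial indices known
 * to be in range, so the early return, and the work-item divergence it
 * implies at the end of the grid, is gone. */
__kernel void evaluate_regular_interior(__global REALTYPE *globalResult,
                                        int nTrial)
{
    size_t myTestIndex = get_global_id(0);
    size_t myTrialIndex = get_global_id(1);

    globalResult[myTestIndex * nTrial + myTrialIndex] =
        compute_pair(myTestIndex, myTrialIndex);
}

/* Edge kernel: launched with a global offset in dimension 1 to cover
 * the leftover trial indices; it keeps the original guard, but it only
 * touches a small remainder of the grid. */
__kernel void evaluate_regular_edge(__global REALTYPE *globalResult,
                                    int nTrial)
{
    size_t myTestIndex = get_global_id(0);
    size_t myTrialIndex = get_global_id(1);
    size_t trialOffset = get_global_offset(1);

    if (myTrialIndex - trialOffset >= nTrial) return;

    globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] =
        compute_pair(myTestIndex, myTrialIndex);
}

The host would enqueue the interior kernel over the largest multiple of the work-group size that fits within nTrial and the edge kernel over the remaining trial indices; the result indexing has to mirror however the grid is actually split, which is glossed over in this sketch.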
From: Timo B. <tim...@gm...> - 2018-02-08 00:24:25
Hi,

One more hint. I followed Pekka's suggestion to enable debug output in ImplicitLoopBarriers.cc and ImplicitConditionalBarriers.cc. There is some interesting output generated. It states that:

### ILB: The kernel has no barriers, let's not add implicit ones either to avoid WI context switch overheads
### ILB: The kernel has no barriers, let's not add implicit ones either to avoid WI context switch overheads
### trying to add a loop barrier to force horizontal parallelization
### the loop is not uniform because loop entry '' is not uniform
### trying to add a loop barrier to force horizontal parallelization
### the loop is not uniform because loop entry '' is not uniform

What does this mean, and does it prevent work-group level parallelization?

Best wishes

Timo
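To make the ILB messages above more concrete, here is a conceptual plain-C illustration, based on Pekka's description in the reply above, of what the implicit loop barrier is meant to achieve; it is not actual pocl output, and the names and the body are placeholders. The barrier lets the work-item loop be sunk inside the kernel loop, which is only legal when every work-item runs every iteration, i.e. when the loop is uniform.

#define WG_SIZE 8
#define N_ITER  3   /* e.g. numQuadPoints */

/* Placeholder for one work-item's share of a kernel loop iteration. */
void work_item_body(int wi, int iter, float *acc)
{
    acc[wi] += (float)(wi + iter);
}

/* Without the implicit barrier: a single work-item loop wraps the whole
 * kernel body, so the kernel loop stays innermost and horizontal
 * (across-work-item) vectorization of the inner loop is not available. */
void wg_function_without_barrier(float *acc)
{
    for (int wi = 0; wi < WG_SIZE; ++wi)
        for (int iter = 0; iter < N_ITER; ++iter)
            work_item_body(wi, iter, acc);
}

/* With an implicit barrier inside the kernel loop (legal only if every
 * work-item executes every iteration), the work-item loop ends up inside
 * the kernel loop, giving an inner loop over work-items that is a
 * natural target for horizontal vectorization. */
void wg_function_with_barrier(float *acc)
{
    for (int iter = 0; iter < N_ITER; ++iter)
        for (int wi = 0; wi < WG_SIZE; ++wi)
            work_item_body(wi, iter, acc);
}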
From: Timo B. <tim...@gm...> - 2018-02-07 23:41:17
Hi,

I have tried to dive a bit more into the code now and used Pekka's and Jeff's hints. Analyzing with VTune showed that no AVX2 code is generated in POCL, which I already suspected. I tried POCL_VECTORIZER_REMARKS=1 to activate vectorizer remarks, but it does not create any kind of output. However, I could create the LLVM-generated code using POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES=1. I am not experienced with LLVM IR, but it seems that it does not create vectorized code. I have uploaded a gist with the disassembled output here:

https://gist.github.com/tbetcke/c5f71dca27cc20c611c35b67f5faa36b

The question is what prevents the auto-vectorizer from working at all. The code seems quite straightforward, with very simple for-loops with hard-coded bounds (numQuadPoints is a compiler macro, set to 3 in the experiments). I would be grateful for any pointers on how to proceed to figure out what is going on with the vectorizer.

By the way, I have recompiled pocl with LLVM 6. There was no change in behavior from versions 4 and 5.

Best wishes

Timo
From: Timo B. <tim...@gm...> - 2018-02-07 16:38:03
Dear Jeff,

thanks for the explanations. I have now installed pocl on my Xeon W workstation, and the benchmarks are as follows (pure kernel runtime via event timers this time, to exclude Python overhead):

1.) Intel OpenCL Driver: 0.0965s
2.) POCL: 0.937s
3.) AMD CPU OpenCL Driver: 0.64s

The CPU is a Xeon W-2155 with 3.3GHz and 10 cores. I have not had time to investigate the LLVM IR code as suggested, but will do so as soon as possible. AMD is included because I have a Radeon Pro card, which automatically also installed OpenCL CPU drivers.

Best wishes

Timo
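Timo's numbers above are pure kernel runtimes taken from OpenCL event timers. For reference, a minimal host-side sketch of that kind of measurement follows; it assumes a command queue created with CL_QUEUE_PROFILING_ENABLE and a kernel plus work sizes set up elsewhere, and the function name is illustrative.

#include <CL/cl.h>

/* Illustrative sketch: time one 2D kernel launch with OpenCL event
 * profiling and return the device-side runtime in seconds. */
double time_kernel(cl_command_queue queue, cl_kernel kernel,
                   const size_t *global, const size_t *local)
{
    cl_event evt;
    cl_ulong start = 0, end = 0;

    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local,
                           0, NULL, &evt);
    clWaitForEvents(1, &evt);

    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(evt);

    return (double)(end - start) * 1e-9;  /* timestamps are in nanoseconds */
}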
From: Jeff H. <jef...@gm...> - 2018-02-07 16:04:09
On Wed, Feb 7, 2018 at 2:41 AM, Michal Babej <Fra...@ru...> wrote:

> Hi,
>
> > we noticed for one of our OpenCL kernels that pocl is over 4 times
> > slower than the Intel OpenCL runtime on a Xeon W processor.
>
> 1) If I googled correctly, Xeon W has AVX-512, which the Intel runtime
> is likely fully using. LLVM 4 has absolutely horrible AVX-512 support,
> LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for
> AVX-512 to work (at least I know they fixed the few AVX-512 bugs I
> found; I don't have a machine anymore to test it).

Indeed, Xeon W [1] is a sibling of Xeon Scalable and the Core X-series of the Skylake generation, which I'll refer to as SKX since they are microarchitecturally the same. All of these support AVX-512, which I'm going to refer to as AVX3 in the following, for reasons that will become clear.

An important detail when evaluating vectorization on these processors is that the frequency drops when transitioning from scalar/SSE2 code to AVX2 code to AVX3 (i.e. AVX-512) code [2], which corresponds to the use of xmm (128b), ymm (256b), and zmm (512b) registers respectively. AVX3 instructions with ymm registers should run at AVX2 frequency.

While most (but not all - see [3]) parts have 2 VPUs, the first of these is implemented via port fusion [4]. What this means is that the core can dispatch 2 512b AVX3 instructions on ports 0+1 and 5, or it can dispatch 3 256b instructions (AVX2 or AVX3) on ports 0, 1 and 5. Thus, one can get 1024b throughput at one frequency or 768b throughput at a slightly higher frequency. What this means is that 512b vectorization pays off for code that is thoroughly compute-bound and heavily vectorized (e.g. dense linear algebra and molecular dynamics), but that 256b vectorization is likely better for code that is more memory-bound or doesn't vectorize as well.

The Intel C/C++ compiler has a flag -qopt-zmm-usage={low,high} to address this, where "-xCORE-AVX512 -qopt-zmm-usage=low" is going to take advantage of all the AVX3 instructions but favor 256b ymm registers, which will behave exactly like AVX2 in some cases (i.e. ones where the AVX3 instruction features aren't used).

Anyways, the short version of this story is that you should not assume 512b SIMD code generation is the reason for a performance benefit from the Intel OpenCL compiler, since it may in fact not generate those instructions if it thinks that 256b is better. It would be useful to force both POCL and Intel OpenCL to use SSE2 and AVX2, respectively, in experiments, to see how they compare when targeting the same vector ISA. This sort of comparison would also be helpful to resolve an older bug report of a similar nature [5].

What I wrote here is one engineer's attempt to summarize a large amount of information in a user-friendly format. I apologize for any errors - they are certainly not intentional.

[1] https://ark.intel.com/products/series/125035/Intel-Xeon-Processor-W-Family
[2] https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf
[3] https://github.com/jeffhammond/vpu-count
[4] https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Scheduler_.26_512-SIMD_addition
[5] https://github.com/pocl/pocl/issues/292

> 2) It could be the autovectorizer, or it could be something else. Are
> your machines NUMA? If so, you'll likely see very bad performance, as
> pocl has no NUMA tuning currently. Also I've seen occasionally that pocl
> unrolls too much and overflows L1 caches (you could try experimenting
> with various local WG sizes to clEnqueueNDRK). Unfortunately this part
> of pocl has received little attention lately...

I don't know what POCL uses for threading, but Intel OpenCL uses the TBB runtime [6]. The TBB runtime has some very smart features for load-balancing and automatic cache blocking that are not implemented in OpenMP and are hard to implement by hand in Pthreads.

[6] https://software.intel.com/en-us/articles/whats-new-opencl-runtime-1611

Jeff

--
Jeff Hammond
jef...@gm...
http://jeffhammond.github.io/
From: Timo B. <tim...@gm...> - 2018-02-07 15:55:19
Sorry. It was still set to private. I have set it to public now.

Timo

On 7 February 2018 at 15:52, Jeff Hammond <jef...@gm...> wrote:

> On Wed, Feb 7, 2018 at 3:34 AM, Timo Betcke <tim...@gm...> wrote:
>
>> Dear All,
>>
>> thanks for the responses. I am posting the kernel below. You can also find it at
>>
>> https://bitbucket.org/bemppsolutions/bempp-cl/src/master/bempp/core/opencl/sources/kernels/laplace_bnd_slp_0.cl?at=master&fileviewer=file-view-default
>
> I get an "access denied" error.
>
> Jeff
>
>> It is the first kernel (evaluate_regular). The second one is less performance critical. First of all, regarding the processor: I mistyped. I actually made the experiments on my Kaby Lake i7 dual-core laptop, but will redo the experiments today on my Xeon W workstation. I recompiled yesterday already with LLVM 5, but the results are similar: Intel OpenCL takes around 0.9 seconds and pocl takes around 5 seconds for this setup.
>>
>> I will follow Jeff's and Pekka's advice today to get some more information on the compiled kernels (might be slow going, as it is the first time I am diving into OpenCL profiling).
>>
>> I just also tried setting the workgroup sizes manually, but this did not change anything in the pocl performance. The actual kernel is below. It consists of fairly simple for loops. The parameter REALTYPE is controlled by a macro in the header file and set to 'float' for the benchmark tests.
>>
>> A little explanation is in order. Each work item takes two triangles (a test and a trial triangle), computes some geometric properties and integrates the 3d Laplace potential operator 1/r across them, where r is the distance between a quadrature point on the test and on the trial triangle. So, it is a fairly simple operation that accelerates very well on the Intel runtime.
>>
>> I am quite motivated to get this issue sorted. We are rewriting an open-source software package (Bempp, www.bempp.com) to be based on PyOpenCL with OpenCL kernels. The first performance benchmarks gave very good speed-ups compared to our old C++ code (for both Intel and pocl). But for later deployment we want to be able to target pocl with minimal performance penalties compared to the Intel runtime (if possible), as it is by default available in Ubuntu and can also be simply installed via conda-forge.
>>
>> Best wishes
>>
>> Timo
>>
>> ---------------------------------------------------------------------------------------------
>>
>> __kernel void evaluate_regular(__constant REALTYPE3 *testElements,
>>                                __constant REALTYPE3 *trialElements,
>>                                __constant REALTYPE2 *quadPoints,
>>                                __constant REALTYPE *quadWeights,
>>                                __global REALTYPE *globalResult,
>>                                int nTrial)
>> {
>>     /* Variable declarations */
>>
>>     const int numQuadPoints = N_QUAD_POINTS;
>>
>>     size_t myTestIndex = get_global_id(0);
>>     size_t myTrialIndex = get_global_id(1);
>>
>>     size_t trialOffset = get_global_offset(1);
>>
>>     size_t testQuadIndex;
>>     size_t trialQuadIndex;
>>     size_t i;
>>
>>     REALTYPE3 testGlobalPoint;
>>     REALTYPE3 trialGlobalPoint;
>>
>>     REALTYPE3 testCorners[3];
>>     REALTYPE3 trialCorners[3];
>>
>>     REALTYPE3 testJac[2];
>>     REALTYPE3 trialJac[2];
>>
>>     REALTYPE2 testPoint;
>>     REALTYPE2 trialPoint;
>>
>>     REALTYPE dist;
>>     REALTYPE testIntElem;
>>     REALTYPE trialIntElem;
>>
>>     REALTYPE shapeIntegral = 0;
>>
>>     if (myTrialIndex - trialOffset >= nTrial) return;
>>
>>     for (i = 0; i < 3; ++i){
>>         testCorners[i] = testElements[3 * myTestIndex + i];
>>         trialCorners[i] = trialElements[3 * myTrialIndex + i];
>>     }
>>
>>     testJac[0] = testCorners[1] - testCorners[0];
>>     testJac[1] = testCorners[2] - testCorners[0];
>>
>>     trialJac[0] = trialCorners[1] - trialCorners[0];
>>     trialJac[1] = trialCorners[2] - trialCorners[0];
>>
>>     testIntElem = length(cross(testJac[0], testJac[1]));
>>     trialIntElem = length(cross(trialJac[0], trialJac[1]));
>>
>>     for (testQuadIndex = 0; testQuadIndex < numQuadPoints; ++testQuadIndex){
>>
>>         testPoint = quadPoints[testQuadIndex];
>>
>>         testGlobalPoint = testCorners[0] + testJac[0] * testPoint.x + testJac[1] * testPoint.y;
>>         //testGlobalPoint = (M_ONE - testPoint.x - testPoint.y) * testCorners[0] +
>>         //    testPoint.x * testCorners[1] + testPoint.y * testCorners[2];
>>
>>         for (trialQuadIndex = 0; trialQuadIndex < numQuadPoints; ++trialQuadIndex){
>>
>>             trialPoint = quadPoints[trialQuadIndex];
>>             trialGlobalPoint = trialCorners[0] + trialJac[0] * trialPoint.x + trialJac[1] * trialPoint.y;
>>             //trialGlobalPoint = (M_ONE - trialPoint.x - trialPoint.y) * trialCorners[0] +
>>             //    trialPoint.x * trialCorners[1] + trialPoint.y * trialCorners[2];
>>
>>             dist = distance(testGlobalPoint, trialGlobalPoint);
>>             shapeIntegral += quadWeights[testQuadIndex] * quadWeights[trialQuadIndex] / dist;
>>         }
>>     }
>>
>>     shapeIntegral *= testIntElem * trialIntElem * M_INV_4PI;
>>     globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] = shapeIntegral;
>> }
From: Jeff H. <jef...@gm...> - 2018-02-07 15:52:32
|
On Wed, Feb 7, 2018 at 3:34 AM, Timo Betcke <tim...@gm...> wrote: > Dear All, > > thanks for the responses. I am posting the kernel below. You can also find > it > at > > https://bitbucket.org/bemppsolutions/bempp-cl/src/ > master/bempp/core/opencl/sources/kernels/laplace_bnd_ > slp_0.cl?at=master&fileviewer=file-view-default > > I get an "access denied" error. Jeff > It is the first kernel (evaluate_regular). The second one is less > performance critical. First of all, regarding processor. I mistyped. I > actually made the experiments on my Kaby Lake i7 > dual core laptop, but will redo the experiments today on my Xeon W > workstation. I recompiled yesterday already with llvm 5. But the results > are similar. > Intel OpenCL takes around 0.9 seconds and pocl takes around 5 seconds for > this setup. > > I will follow Jeff's and Pekka's advice today to get some more infomation > on the compiled kernels (might be slow going, as it is the first time I am > diving into OpenCL profiling). > > I just also tried setting the workgroup sizes manually. But this did not > change anything in the pocl performance. The actual kernel is below. It > consists of fairly simple for loops. > The parameter REALTYPE is controlled by a macro in the header file and set > to 'float' for the benchmark tests. > > A little explanation is in order. Each work item takes two triangles ( a > test and a trial triangle), computes some geometric properties and > integrates the 3d Laplace potential operator 1/r > across them, where r is the distance between a quadrature point on the > test and on the trial triangle. So, it is a fairly simple operation that > accelerates very well on the Intel runtime. > > I am quite motivated to get this issue sorted. We are rewriting an > open-source software package (Bempp, www.bempp.com) to be based on > PyOpenCL with OpenCL kernels. > The first performance benchmarks gave very good speed-ups compared to our > old C++ code (for both, Intel and pocl). But for later deployment we want > to be able to target pocl with minimial > performance penalties compared to the Intel runtime (if possible) as it is > by default available in Ubuntu and can also be simply installed via > conda-forge. 
> > Best wishes > > Timo > > > ------------------------------------------------------------ > --------------------------------- > > __kernel void evaluate_regular(__constant REALTYPE3 *testElements, > __constant REALTYPE3 *trialElements, > __constant REALTYPE2 *quadPoints, > __constant REALTYPE *quadWeights, > __global REALTYPE *globalResult, > int nTrial) > { > > /* Variable declarations */ > > const int numQuadPoints = N_QUAD_POINTS; > > > size_t myTestIndex = get_global_id(0); > size_t myTrialIndex = get_global_id(1); > > size_t trialOffset = get_global_offset(1); > > > size_t testQuadIndex; > size_t trialQuadIndex; > size_t i; > > REALTYPE3 testGlobalPoint; > REALTYPE3 trialGlobalPoint; > > REALTYPE3 testCorners[3]; > REALTYPE3 trialCorners[3]; > > REALTYPE3 testJac[2]; > REALTYPE3 trialJac[2]; > > REALTYPE2 testPoint; > REALTYPE2 trialPoint; > > REALTYPE dist; > REALTYPE testIntElem; > REALTYPE trialIntElem; > > REALTYPE shapeIntegral = 0; > > if (myTrialIndex - trialOffset >= nTrial) return; > > for (i = 0; i < 3; ++i){ > testCorners[i] = testElements[3 * myTestIndex + i]; > trialCorners[i] = trialElements[3 * myTrialIndex + i]; > } > > testJac[0] = testCorners[1] - testCorners[0]; > testJac[1] = testCorners[2] - testCorners[0]; > > trialJac[0] = trialCorners[1] - trialCorners[0]; > trialJac[1] = trialCorners[2] - trialCorners[0]; > > testIntElem = length(cross(testJac[0], testJac[1])); > trialIntElem = length(cross(trialJac[0], trialJac[1])); > > for (testQuadIndex = 0; testQuadIndex < numQuadPoints; > ++testQuadIndex){ > > testPoint = quadPoints[testQuadIndex]; > > testGlobalPoint = testCorners[0] + testJac[0] * testPoint.x + > testJac[1] * testPoint.y; > //testGlobalPoint = (M_ONE - testPoint.x - testPoint.y) * > testCorners[0] + > // testPoint.x * testCorners[1] + testPoint.y * testCorners[2]; > > for (trialQuadIndex = 0; trialQuadIndex < numQuadPoints; > ++trialQuadIndex){ > > trialPoint = quadPoints[trialQuadIndex]; > trialGlobalPoint = trialCorners[0] + trialJac[0] * > trialPoint.x + trialJac[1] * trialPoint.y; > //trialGlobalPoint = (M_ONE - trialPoint.x - trialPoint.y) * > trialCorners[0] + > // trialPoint.x * trialCorners[1] + trialPoint.y * > trialCorners[2]; > > dist = distance(testGlobalPoint, trialGlobalPoint); > shapeIntegral += quadWeights[testQuadIndex] * > quadWeights[trialQuadIndex] / dist; > } > > } > > shapeIntegral *= testIntElem * trialIntElem * M_INV_4PI; > globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] = > shapeIntegral; > > > > > } > > > > > > On 7 February 2018 at 10:41, Michal Babej <Fra...@ru...> > wrote: > > > > Hi, > > > > > we noticed for one of our OpenCL kernels that pocl is over 4 times > > > slower than the Intel OpenCL runtime on a Xeon W processor. > > > > 1) If i googled correctly, Xeon W has AVX-512, which the intel runtime > > is likely fully using. LLVM 4 has absolutely horrible AVX512 support, > > LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for > > AVX-512 to work (at least i know they fixed the AVX-512 few bugs i > > found, i don't have a machine anymore to test it). > > > > 2) It could be the autovectorizer, or it could be something else. Are > > your machines NUMA ? if so, you'll likely see very bad performance, as > > pocl has no NUMA tuning currently. Also i've seen occasionally that pocl > > unrolls too much and overflows L1 caches (you could try experimenting > > with various local WG sizes to clEnqueueNDRK). Unfortunately > > this part of pocl has received little attention lately... 
> > > > Cheers, > > -- mb > > > > ------------------------------------------------------------ > ------------------ > > Check out the vibrant tech community on one of the world's most > > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > > _______________________________________________ > > pocl-devel mailing list > > poc...@li... > > https://lists.sourceforge.net/lists/listinfo/pocl-devel > > > > > -- > Dr. Timo Betcke > Reader in Mathematics > University College London > Department of Mathematics > E-Mail: t.b...@uc... > Tel.: +44 (0) 20-3108-4068 > Fax.: +44 (0) 20-7383-5519 > > ------------------------------------------------------------ > ------------------ > Check out the vibrant tech community on one of the world's most > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > _______________________________________________ > pocl-devel mailing list > poc...@li... > https://lists.sourceforge.net/lists/listinfo/pocl-devel > > -- Jeff Hammond jef...@gm... http://jeffhammond.github.io/ |
From: Pekka J. <pek...@tu...> - 2018-02-07 14:36:31
|
Dear Timo, Also Michal's advice of testing with LLVM 6.0 is a good idea. Given you don't seem to have barriers in your kernel, you might want to check if any of the implicit barriers we inject confuse the vectorization. These are inserted at least in passes: ImplicitConditionalBarriers.cc ImplicitLoopBarriers.cc They have some debug output which you can enable via a macro define which can help you to the right direction. BR, Pekka On 02/07/2018 01:34 PM, Timo Betcke wrote: > Dear All, > > thanks for the responses. I am posting the kernel below. You can also find it > at > > https://bitbucket.org/bemppsolutions/bempp-cl/src/master/bempp/core/opencl/sources/kernels/laplace_bnd_slp_0.cl?at=master&fileviewer=file-view-default > > It is the first kernel (evaluate_regular). The second one is less > performance critical. First of all, regarding processor. I mistyped. I > actually made the experiments on my Kaby Lake i7 > dual core laptop, but will redo the experiments today on my Xeon W > workstation. I recompiled yesterday already with llvm 5. But the results are > similar. > Intel OpenCL takes around 0.9 seconds and pocl takes around 5 seconds for > this setup. > > I will follow Jeff's and Pekka's advice today to get some more infomation on > the compiled kernels (might be slow going, as it is the first time I am > diving into OpenCL profiling). > > I just also tried setting the workgroup sizes manually. But this did not > change anything in the pocl performance. The actual kernel is below. It > consists of fairly simple for loops. > The parameter REALTYPE is controlled by a macro in the header file and set > to 'float' for the benchmark tests. > > A little explanation is in order. Each work item takes two triangles ( a > test and a trial triangle), computes some geometric properties and > integrates the 3d Laplace potential operator 1/r > across them, where r is the distance between a quadrature point on the test > and on the trial triangle. So, it is a fairly simple operation that > accelerates very well on the Intel runtime. > > I am quite motivated to get this issue sorted. We are rewriting an > open-source software package (Bempp, www.bempp.com <http://www.bempp.com>) > to be based on PyOpenCL with OpenCL kernels. > The first performance benchmarks gave very good speed-ups compared to our > old C++ code (for both, Intel and pocl). But for later deployment we want to > be able to target pocl with minimial > performance penalties compared to the Intel runtime (if possible) as it is > by default available in Ubuntu and can also be simply installed via conda-forge. 
> > Best wishes > > Timo > > > --------------------------------------------------------------------------------------------- > > __kernel void evaluate_regular(__constant REALTYPE3 *testElements, > __constant REALTYPE3 *trialElements, > __constant REALTYPE2 *quadPoints, > __constant REALTYPE *quadWeights, > __global REALTYPE *globalResult, > int nTrial) > { > > /* Variable declarations */ > > const int numQuadPoints = N_QUAD_POINTS; > > > size_t myTestIndex = get_global_id(0); > size_t myTrialIndex = get_global_id(1); > > size_t trialOffset = get_global_offset(1); > > > size_t testQuadIndex; > size_t trialQuadIndex; > size_t i; > > REALTYPE3 testGlobalPoint; > REALTYPE3 trialGlobalPoint; > > REALTYPE3 testCorners[3]; > REALTYPE3 trialCorners[3]; > > REALTYPE3 testJac[2]; > REALTYPE3 trialJac[2]; > > REALTYPE2 testPoint; > REALTYPE2 trialPoint; > > REALTYPE dist; > REALTYPE testIntElem; > REALTYPE trialIntElem; > > REALTYPE shapeIntegral = 0; > > if (myTrialIndex - trialOffset >= nTrial) return; > > for (i = 0; i < 3; ++i){ > testCorners[i] = testElements[3 * myTestIndex + i]; > trialCorners[i] = trialElements[3 * myTrialIndex + i]; > } > > testJac[0] = testCorners[1] - testCorners[0]; > testJac[1] = testCorners[2] - testCorners[0]; > > trialJac[0] = trialCorners[1] - trialCorners[0]; > trialJac[1] = trialCorners[2] - trialCorners[0]; > > testIntElem = length(cross(testJac[0], testJac[1])); > trialIntElem = length(cross(trialJac[0], trialJac[1])); > > for (testQuadIndex = 0; testQuadIndex < numQuadPoints; ++testQuadIndex){ > > testPoint = quadPoints[testQuadIndex]; > > testGlobalPoint = testCorners[0] + testJac[0] * testPoint.x + > testJac[1] * testPoint.y; > //testGlobalPoint = (M_ONE - testPoint.x - testPoint.y) * > testCorners[0] + > // testPoint.x * testCorners[1] + testPoint.y * testCorners[2]; > > for (trialQuadIndex = 0; trialQuadIndex < numQuadPoints; > ++trialQuadIndex){ > > trialPoint = quadPoints[trialQuadIndex]; > trialGlobalPoint = trialCorners[0] + trialJac[0] * trialPoint.x > + trialJac[1] * trialPoint.y; > //trialGlobalPoint = (M_ONE - trialPoint.x - trialPoint.y) * > trialCorners[0] + > // trialPoint.x * trialCorners[1] + trialPoint.y * > trialCorners[2]; > > dist = distance(testGlobalPoint, trialGlobalPoint); > shapeIntegral += quadWeights[testQuadIndex] * > quadWeights[trialQuadIndex] / dist; > } > > } > > shapeIntegral *= testIntElem * trialIntElem * M_INV_4PI; > globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] = > shapeIntegral; > > > > } > > > > > > On 7 February 2018 at 10:41, Michal Babej <Fra...@ru... > <mailto:Fra...@ru...>> wrote: > > > > Hi, > > > > > we noticed for one of our OpenCL kernels that pocl is over 4 times > > > slower than the Intel OpenCL runtime on a Xeon W processor. > > > > 1) If i googled correctly, Xeon W has AVX-512, which the intel runtime > > is likely fully using. LLVM 4 has absolutely horrible AVX512 support, > > LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for > > AVX-512 to work (at least i know they fixed the AVX-512 few bugs i > > found, i don't have a machine anymore to test it). > > > > 2) It could be the autovectorizer, or it could be something else. Are > > your machines NUMA ? if so, you'll likely see very bad performance, as > > pocl has no NUMA tuning currently. Also i've seen occasionally that pocl > > unrolls too much and overflows L1 caches (you could try experimenting > > with various local WG sizes to clEnqueueNDRK). 
Unfortunately > > this part of pocl has received little attention lately... > > > > Cheers, > > -- mb > > > > > ------------------------------------------------------------------------------ > > Check out the vibrant tech community on one of the world's most > > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > > _______________________________________________ > > pocl-devel mailing list > > poc...@li... <mailto:poc...@li...> > > https://lists.sourceforge.net/lists/listinfo/pocl-devel > > > > > -- > Dr. Timo Betcke > Reader in Mathematics > University College London > Department of Mathematics > E-Mail: t.b...@uc... <mailto:t.b...@uc...> > Tel.: +44 (0) 20-3108-4068 > Fax.: +44 (0) 20-7383-5519 > > > ------------------------------------------------------------------------------ > Check out the vibrant tech community on one of the world's most > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > > > > _______________________________________________ > pocl-devel mailing list > poc...@li... > https://lists.sourceforge.net/lists/listinfo/pocl-devel > -- Pekka |
From: Timo B. <tim...@gm...> - 2018-02-07 11:34:16
|
Dear All,

thanks for the responses. I am posting the kernel below. You can also find it at

https://bitbucket.org/bemppsolutions/bempp-cl/src/master/bempp/core/opencl/sources/kernels/laplace_bnd_slp_0.cl?at=master&fileviewer=file-view-default

It is the first kernel (evaluate_regular). The second one is less performance critical. First of all, regarding the processor: I mistyped. I actually ran the experiments on my Kaby Lake i7 dual-core laptop, but will redo them today on my Xeon W workstation. I already recompiled yesterday with LLVM 5, but the results are similar: Intel OpenCL takes around 0.9 seconds and pocl takes around 5 seconds for this setup.

I will follow Jeff's and Pekka's advice today to get some more information on the compiled kernels (it might be slow going, as this is the first time I am diving into OpenCL profiling). I also tried setting the workgroup sizes manually, but this did not change anything in the pocl performance. The actual kernel is below. It consists of fairly simple for loops. The parameter REALTYPE is controlled by a macro in the header file and is set to 'float' for the benchmark tests.

A little explanation is in order. Each work item takes two triangles (a test and a trial triangle), computes some geometric properties and integrates the 3d Laplace potential operator 1/r across them, where r is the distance between a quadrature point on the test triangle and one on the trial triangle. So it is a fairly simple operation that accelerates very well on the Intel runtime.

I am quite motivated to get this issue sorted. We are rewriting an open-source software package (Bempp, www.bempp.com) to be based on PyOpenCL with OpenCL kernels. The first performance benchmarks gave very good speed-ups compared to our old C++ code (for both Intel and pocl). But for later deployment we want to be able to target pocl with minimal performance penalties compared to the Intel runtime (if possible), as it is available by default in Ubuntu and can also be simply installed via conda-forge.
Best wishes

Timo

---------------------------------------------------------------------------------------------

__kernel void evaluate_regular(__constant REALTYPE3 *testElements,
                               __constant REALTYPE3 *trialElements,
                               __constant REALTYPE2 *quadPoints,
                               __constant REALTYPE *quadWeights,
                               __global REALTYPE *globalResult,
                               int nTrial)
{

    /* Variable declarations */

    const int numQuadPoints = N_QUAD_POINTS;

    size_t myTestIndex = get_global_id(0);
    size_t myTrialIndex = get_global_id(1);

    size_t trialOffset = get_global_offset(1);

    size_t testQuadIndex;
    size_t trialQuadIndex;
    size_t i;

    REALTYPE3 testGlobalPoint;
    REALTYPE3 trialGlobalPoint;

    REALTYPE3 testCorners[3];
    REALTYPE3 trialCorners[3];

    REALTYPE3 testJac[2];
    REALTYPE3 trialJac[2];

    REALTYPE2 testPoint;
    REALTYPE2 trialPoint;

    REALTYPE dist;
    REALTYPE testIntElem;
    REALTYPE trialIntElem;

    REALTYPE shapeIntegral = 0;

    if (myTrialIndex - trialOffset >= nTrial) return;

    for (i = 0; i < 3; ++i){
        testCorners[i] = testElements[3 * myTestIndex + i];
        trialCorners[i] = trialElements[3 * myTrialIndex + i];
    }

    testJac[0] = testCorners[1] - testCorners[0];
    testJac[1] = testCorners[2] - testCorners[0];

    trialJac[0] = trialCorners[1] - trialCorners[0];
    trialJac[1] = trialCorners[2] - trialCorners[0];

    testIntElem = length(cross(testJac[0], testJac[1]));
    trialIntElem = length(cross(trialJac[0], trialJac[1]));

    for (testQuadIndex = 0; testQuadIndex < numQuadPoints; ++testQuadIndex){

        testPoint = quadPoints[testQuadIndex];

        testGlobalPoint = testCorners[0] + testJac[0] * testPoint.x + testJac[1] * testPoint.y;
        //testGlobalPoint = (M_ONE - testPoint.x - testPoint.y) * testCorners[0] +
        //    testPoint.x * testCorners[1] + testPoint.y * testCorners[2];

        for (trialQuadIndex = 0; trialQuadIndex < numQuadPoints; ++trialQuadIndex){

            trialPoint = quadPoints[trialQuadIndex];
            trialGlobalPoint = trialCorners[0] + trialJac[0] * trialPoint.x + trialJac[1] * trialPoint.y;
            //trialGlobalPoint = (M_ONE - trialPoint.x - trialPoint.y) * trialCorners[0] +
            //    trialPoint.x * trialCorners[1] + trialPoint.y * trialCorners[2];

            dist = distance(testGlobalPoint, trialGlobalPoint);
            shapeIntegral += quadWeights[testQuadIndex] * quadWeights[trialQuadIndex] / dist;
        }

    }

    shapeIntegral *= testIntElem * trialIntElem * M_INV_4PI;
    globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] = shapeIntegral;

}

On 7 February 2018 at 10:41, Michal Babej <Fra...@ru...> wrote:
>
> Hi,
>
> > we noticed for one of our OpenCL kernels that pocl is over 4 times
> > slower than the Intel OpenCL runtime on a Xeon W processor.
>
> 1) If i googled correctly, Xeon W has AVX-512, which the intel runtime
> is likely fully using. LLVM 4 has absolutely horrible AVX512 support,
> LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for
> AVX-512 to work (at least i know they fixed the AVX-512 few bugs i
> found, i don't have a machine anymore to test it).
>
> 2) It could be the autovectorizer, or it could be something else. Are
> your machines NUMA ? if so, you'll likely see very bad performance, as
> pocl has no NUMA tuning currently. Also i've seen occasionally that pocl
> unrolls too much and overflows L1 caches (you could try experimenting
> with various local WG sizes to clEnqueueNDRK). Unfortunately
> this part of pocl has received little attention lately...
>
> Cheers,
> -- mb
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> pocl-devel mailing list
> poc...@li...
> https://lists.sourceforge.net/lists/listinfo/pocl-devel

--
Dr. Timo Betcke
Reader in Mathematics
University College London
Department of Mathematics
E-Mail: t.b...@uc...
Tel.: +44 (0) 20-3108-4068
Fax.: +44 (0) 20-7383-5519
|
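A quick experiment suggested by the kernel above (a sketch only, untested, and assuming nTrial > 0): the early "if (... >= nTrial) return;" makes control flow non-uniform for the tail of the grid, which can make it harder for a work-group vectorizer to prove that all work-items execute the quadrature loops. Dropping the return, clamping the index used for the loads, and guarding only the final store keeps every work-item on the same path:

    int validWorkItem = (myTrialIndex - trialOffset) < (size_t)nTrial;
    /* Clamp so out-of-range work-items still do in-bounds (but unused) loads. */
    size_t safeTrialIndex = validWorkItem ? myTrialIndex : trialOffset;

    for (i = 0; i < 3; ++i){
        testCorners[i] = testElements[3 * myTestIndex + i];
        trialCorners[i] = trialElements[3 * safeTrialIndex + i];
    }

    /* ... Jacobians and the two quadrature loops exactly as in the kernel above ... */

    shapeIntegral *= testIntElem * trialIntElem * M_INV_4PI;
    if (validWorkItem)
        globalResult[myTestIndex * nTrial + (myTrialIndex - trialOffset)] = shapeIntegral;

Whether this helps pocl in practice has to be measured; it changes only the control flow, not the arithmetic.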
From: Michal B. <Fra...@ru...> - 2018-02-07 10:41:49
|
Hi,

> we noticed for one of our OpenCL kernels that pocl is over 4 times
> slower than the Intel OpenCL runtime on a Xeon W processor.

1) If I googled correctly, Xeon W has AVX-512, which the Intel runtime is likely using fully. LLVM 4 has absolutely horrible AVX-512 support, LLVM 5 is better but there are still bugs, and you'll want LLVM 6 for AVX-512 to work (at least I know they fixed the few AVX-512 bugs I found; I don't have a machine anymore to test it).

2) It could be the autovectorizer, or it could be something else. Are your machines NUMA? If so, you'll likely see very bad performance, as pocl has no NUMA tuning currently. Also, I've occasionally seen that pocl unrolls too much and overflows the L1 caches (you could try experimenting with various local WG sizes passed to clEnqueueNDRangeKernel). Unfortunately this part of pocl has received little attention lately...

Cheers,
-- mb
|
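For reference, a minimal host-side sketch of the experiment suggested above, passing an explicit local work-group size to clEnqueueNDRangeKernel (the queue, kernel and extents are placeholders, not names taken from the thread; note that in OpenCL 1.x the global size must be divisible by the local size in each dimension):

    size_t global[2] = { numTestTriangles, numTrialTriangles };  /* placeholder extents */
    size_t local[2]  = { 1, 8 };                                 /* one candidate WG shape */

    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        2,       /* work_dim */
                                        NULL,    /* global_work_offset */
                                        global,
                                        local,   /* pass NULL to let the runtime choose */
                                        0, NULL, NULL);

Sweeping a handful of local shapes such as {1,4}, {1,8} and {4,4} and timing each run is usually enough to see whether the work-group shape is the limiting factor.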
From: Pekka J. <pek...@tu...> - 2018-02-07 07:11:25
|
Hello Timo,

I'm glad to hear you are willing to contribute to the cause of open and performance-portable OpenCL. Beware, though: parts of the kernel compiler need major rewrites for clarity, and unfortunately there are only a few people working on it. But hopefully soon we can count you in as one :)

This reminds me that I should really write the "how to tune and hack the pocl kernel compiler" document. Maybe this is a starter for that:

There are several useful environment variables for debugging and analyzing the kernel compiler optimizations: http://portablecl.org/docs/html/env_variables.html

First, you can make pocl dump more debug output from LLVM and its vectorizer:

* POCL_DEBUG_LLVM_PASSES
  When set to 1, enables debug output from LLVM passes during optimization.

* POCL_VECTORIZER_REMARKS
  When set to 1, prints out remarks produced by the loop vectorizer of LLVM during kernel compilation.

To analyze the kernel compiler's intermediate results more closely, you can instruct pocl to leave the temporary LLVM bitcode files behind (normally it deletes them once they are no longer needed):

* POCL_CACHE_DIR
  It's useful to set this to a local temp dir which you can clear between trials.

* POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES=1

Then, after executing your OpenCL app, you will find .bc files under your temp dir, the most interesting one being parallel.bc, which is the final IR produced by pocl and LLVM before codegen. If you don't see vector LLVM IR there, it won't likely appear in your final binary either.

To start hacking: http://portablecl.org/docs/html/kernel_compiler.html

Also our pocl paper might provide additional help, but the above link should give a good overview, although it might be outdated (I've added it to my task list to update it). The LLVM passes are under lib/llvmopencl. The layer between the OpenCL runtime and the kernel compiler is in the files lib/CL/pocl_llvm*.c.

Please don't hesitate to ask for further instructions here or in IRC.

BR,
Pekka

On 02/07/2018 02:20 AM, Timo Betcke wrote:
> Hi,
>
> we noticed for one of our OpenCL kernels that pocl is over 4 times slower
> than the Intel OpenCL runtime on a Xeon W processor. I am assuming it is the
> auto vectorizer. How can I debug this and figure out if vectorization across
> work items is being performed with pocl? The kernels are running under
> PyOpenCL on Ubuntu 16.04 with LLVM 4 and pocl 1.0.
>
> We are planning to distribute our software and would prefer to have good
> performance on pocl and not have to rely on the Intel environment.
>
> Best wishes
>
> Timo
>
> --
> Dr. Timo Betcke
> Reader in Mathematics
> University College London
> Department of Mathematics
> E-Mail: t.b...@uc... <mailto:t.b...@uc...>
> Tel.: +44 (0) 20-3108-4068
> Fax.: +44 (0) 20-7383-5519
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>
> _______________________________________________
> pocl-devel mailing list
> poc...@li...
> https://lists.sourceforge.net/lists/listinfo/pocl-devel
>

-- Pekka
|
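The variables above are normally exported in the shell before launching the application. Purely as a convenience, a host program can also set them itself, as long as this happens before the first OpenCL call that reaches pocl; a minimal sketch (the cache path is a placeholder):

    #include <stdlib.h>
    #include <CL/cl.h>

    int main(void)
    {
        /* Must be set before pocl's runtime and kernel compiler are first used. */
        setenv("POCL_DEBUG_LLVM_PASSES", "1", 1);
        setenv("POCL_VECTORIZER_REMARKS", "1", 1);
        setenv("POCL_LEAVE_KERNEL_COMPILER_TEMP_FILES", "1", 1);
        setenv("POCL_CACHE_DIR", "/tmp/pocl-cache", 1);   /* placeholder path */

        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, NULL);
        /* ... create the context, clBuildProgram, run the kernel, then inspect
           the parallel.bc files left under /tmp/pocl-cache ... */
        return 0;
    }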
From: Jeff H. <jef...@gm...> - 2018-02-07 01:01:49
|
Can you share details eg small reproducer? Or post the OpenCL kernel source? You can use Linux perf or Vtune to access hardware counters to get an idea what vector code is being generated. Jeff (I work for Intel) Sent from my iPhone > On Feb 6, 2018, at 4:20 PM, Timo Betcke <tim...@gm...> wrote: > > Hi, > > we noticed for one of our OpenCL kernels that pocl is over 4 times slower than the Intel OpenCL runtime on a Xeon W processor. I am assuming it is the auto vectorizer. How can I debug this and figure out if vectorization across work items is being performed with pocl? The kernels are running under PyOpenCL on Ubuntu 16.04 with LLVM 4 and pocl 1.0. > > We are planning to distribute our software and would prefer to have good performance on pocl and not have to rely on the Intel environment. > > Best wishes > > Timo > > -- > Dr. Timo Betcke > Reader in Mathematics > University College London > Department of Mathematics > E-Mail: t.b...@uc... > Tel.: +44 (0) 20-3108-4068 > Fax.: +44 (0) 20-7383-5519 > ------------------------------------------------------------------------------ > Check out the vibrant tech community on one of the world's most > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > _______________________________________________ > pocl-devel mailing list > poc...@li... > https://lists.sourceforge.net/lists/listinfo/pocl-devel |
From: Timo B. <tim...@gm...> - 2018-02-07 00:20:36
|
Hi, we noticed for one of our OpenCL kernels that pocl is over 4 times slower than the Intel OpenCL runtime on a Xeon W processor. I am assuming it is the auto vectorizer. How can I debug this and figure out if vectorization across work items is being performed with pocl? The kernels are running under PyOpenCL on Ubuntu 16.04 with LLVM 4 and pocl 1.0. We are planning to distribute our software and would prefer to have good performance on pocl and not have to rely on the Intel environment. Best wishes Timo -- Dr. Timo Betcke Reader in Mathematics University College London Department of Mathematics E-Mail: t.b...@uc... Tel.: +44 (0) 20-3108-4068 Fax.: +44 (0) 20-7383-5519 |
From: Pekka J. <pek...@tu...> - 2017-12-27 15:59:13
|
Hello O. Hartmann, I didn't mean to be "rude" when closing that pocl issue report, but there just seems to be no doubt that the problem is in the LLVM-side and has wider impact than pocl or other OpenCL implementations. Thus, the issue is better discussed in the LLVM bugzilla entries instead of pocl's issue tracker / mailing list because other LLVM devs are better be in the loop. One easy "workaround fix" might be what I mentioned in https://bugs.llvm.org/show_bug.cgi?id=30587: "This error would probably just go away if the command line handler just ignored multiple identical command line switch registrations silently." Someone just needs to try something like that out and submit a patch to LLVM. However, if the old dynlib linking doesn't work around it, there might be a more serious / another issue now. And I'm not aware of other client-side workarounds other than all clients dynamic linking to the same libLLVM*.so, unfortunately. BR, Pekka On 27.12.2017 15:19, O. Hartmann wrote: > Hello List. > > Running pocl 0.14 and/or pocl-1.0 on FreeBSD CURRENT, using > CLANG/LLVM 4.0.1, code generated using pocl in combination with > ocl-icd 2.2.11 and intel-beignet and clover installed as additional > OpenCL ICDs, any code/binary using ocl-icd is bailing out when pocl > is installed dropping: > > : CommandLine Error: Option 'enable-value-profiling' registered more > than once! LLVM ERROR: inconsistency in registered CommandLine > options > > > to the console. > > Searching the net for some answers or bugfixes lead me to this bug > report: > > https://github.com/pocl/pocl/issues/474 > > which has been closed (in a rude way for my taste). > > There are reports similar to my experiences at the LLVM bug report > site: > > Bug 30587 - Inconsistency in commandline options with multiple OpenCL > vendor libraries installed > https://bugs.llvm.org/show_bug.cgi?id=30587 > > with some comments from some fellows well known to this list and this > one: > > Bug 22952 - cl::opt + LLVM_BUILD_LLVM_DYLIB is completely broken > https://bugs.llvm.org/show_bug.cgi?id=22952 > > I think I'm in the same boat now and want to ask if there is any > solution to this problem apart from staically linking llvm? > > Kind regard, > > Oh > > > > ------------------------------------------------------------------------------ > > Check out the vibrant tech community on one of the world's most > engaging tech sites, Slashdot.org! http://sdm.link/slashdot > > > > _______________________________________________ pocl-devel mailing > list poc...@li... > https://lists.sourceforge.net/lists/listinfo/pocl-devel > -- Pekka |
From: O. H. <oha...@wa...> - 2017-12-27 13:33:04
|
Hello List.

Running pocl 0.14 and/or pocl 1.0 on FreeBSD CURRENT with CLANG/LLVM 4.0.1, ocl-icd 2.2.11, and intel-beignet and clover installed as additional OpenCL ICDs, any code/binary using ocl-icd bails out whenever pocl is installed, printing to the console:

CommandLine Error: Option 'enable-value-profiling' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options

Searching the net for answers or bugfixes led me to this bug report:

https://github.com/pocl/pocl/issues/474

which has been closed (in a rude way for my taste).

There are reports similar to my experience at the LLVM bug report site:

Bug 30587 - Inconsistency in commandline options with multiple OpenCL vendor libraries installed
https://bugs.llvm.org/show_bug.cgi?id=30587

with some comments from fellows well known to this list, and this one:

Bug 22952 - cl::opt + LLVM_BUILD_LLVM_DYLIB is completely broken
https://bugs.llvm.org/show_bug.cgi?id=22952

I think I'm in the same boat now and want to ask if there is any solution to this problem apart from statically linking LLVM?

Kind regards,

Oh

--
O. Hartmann
I object to the use or transfer of my data for advertising purposes or for market or opinion research (§ 28 Abs. 4 BDSG).
|
From: Michal B. <Fra...@ru...> - 2017-12-19 12:02:12
|
Hello, After fixing a few bugs in RC1, we're finally ready for release. Release highlights: * Support for LLVM/Clang 5.0 and 4.0. * Support for NVIDIA GPUs via a new CUDA backend (currently experimental) * Full conformance with OpenCL 1.2 standard on CPU backend (with some limitations, see the documentation for details) You can download from the usual location: http://portablecl.org/download.html or from Github: https://github.com/pocl/pocl/releases/tag/v1.0 Regards, -- mb |
From: Andreas K. <li...@in...> - 2017-12-09 19:31:44
|
Michal Babej <Fra...@ru...> writes:
> Hi,
>
> That's great to hear.
>
>> long-standing bugs that we encountered with 0.14:
>
> One seems like a cache issue, the other like math precision
> issue, yes ?

Both of them struck me as cache issues, though I might be wrong. AFAIR, there isn't much in the way of special functions in these kernels.

> Unfortunately i still see some occasional problems with the cache, but
> i'll try to resolve them before release. It should be better than 0.14
> still.

That'd be fantastic. For what it's worth, PyOpenCL caches OpenCL binaries (which seems faster than relying entirely on pocl's from-source cache, which seems to do a full preprocessor run on the code before hashing).

> As for math, the precision is mandated by OpenCL - and pocl even
> documents the actual ULPs, scroll down in this document:
> https://github.com/pocl/pocl/blob/master/doc/sphinx/source/conformance.rst
> previously (with VML) pocl was very non-conformant.

We did see a few things that could conceivably be traced back to loss of FP digits. I'm excited to hear that improvements have been made, and I'm looking forward to seeing whether that translates to improvements in our results.

>> I've also started this here:
>> https://github.com/conda-forge/pocl-feedstock/pull/11
>
> I'm unfamiliar with conda, but i forgot to mention there are now Docker
> files in tools/docker, which might be useful even if one's not using
> Docker, to see the exact steps required to setup Pocl.

Conda is a user-level package manager. Install anywhere in your home directory, no root required. It originated within the scientific Python community but has expanded far beyond that, to Julia, R, and many others. I've packaged pocl, ocl-icd, and pyopencl for it, with the goal of making the installation of a working OpenCL environment a question of copy-pasting three shell commands.

Among other things, it takes care of making binaries with baked-in paths relocatable (by compiling with a very long path and then hacking the binary). Here's the patch that makes that work with pocl:

https://github.com/inducer/pocl-feedstock/blob/b7f3702df3d7888ee13a51cc2a93adf31489d771/recipe/paths-in-separate-compilation-unit.patch

Let me know if that's something you could see accepting upstream.

Andreas
|
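For readers unfamiliar with the relocation trick mentioned above, the rough idea (an illustration only; the identifiers below are hypothetical and not the ones used by the actual patch) is to keep the baked-in prefix in its own compilation unit as an oversized, NUL-padded buffer, so the package manager can overwrite it in the installed binary with a shorter real path without disturbing the binary layout:

    /* install_paths.c - hypothetical illustration of the technique */

    /* Built with a deliberately long placeholder prefix; a post-install step
       (e.g. conda's relocation machinery) rewrites the contents of this array
       in the shipped binary. */
    static const char INSTALL_PREFIX[4096] =
        "/placeholder/extremely/long/prefix/only/used/while/building/the/package";

    const char *get_install_prefix(void)
    {
        return INSTALL_PREFIX;
    }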
From: Andreas K. <li...@in...> - 2017-12-07 20:49:53
|
Michal Babej <Fra...@ru...> writes: > Hello, > > It took a while, but finally it's ready. Please test with either > LLVM 4 or 5, and report any issues. Thanks > > Instructions: > > https://github.com/pocl/pocl/wiki/Release-testing-of-pocl-1.0 > > Sources: > > https://github.com/pocl/pocl/releases/tag/v1.0-RC1 This is looking great from our end. I've run the tests for many of our packages (pyopencl, loopy, sumpy, pytential, grudge) on this version, and encountered no issues that were pocl's fault. As far as I can tell, this also seems to have resolved two long-standing bugs that we encountered with 0.14: https://gitlab.tiker.net/inducer/pytential/issues/64 https://gitlab.tiker.net/inducer/pytential/issues/75 Plus the new pocl *feels* a fair bit faster in our use. So all in all, this is shaping up to be an awesome release! Thank you all for the work you have put into it. I've also started this here: https://github.com/conda-forge/pocl-feedstock/pull/11 so that conda install pocl should suffice to install an up-to-date pocl for those using the conda user-level package manager in conjunction with conda forge. Andreas |
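As background for the binary-caching remark above: the mechanism such a cache builds on is the standard OpenCL program-binary API. A condensed host-side sketch follows (single device, error handling omitted, assumes the usual program/context/device setup; this is not PyOpenCL's actual implementation):

    /* After a successful clBuildProgram(program, ...): */
    size_t binSize;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(binSize), &binSize, NULL);

    unsigned char *bin = (unsigned char *)malloc(binSize);
    unsigned char *bins[1] = { bin };
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);
    /* ... write bin/binSize to disk, keyed by source, build options and device ... */

    /* On a later run, rebuild from the cached binary instead of the source: */
    cl_int binStatus, err;
    cl_program cached = clCreateProgramWithBinary(context, 1, &device, &binSize,
                                                  (const unsigned char **)bins,
                                                  &binStatus, &err);
    err = clBuildProgram(cached, 1, &device, NULL, NULL, NULL);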
From: Michal B. <Fra...@ru...> - 2017-12-06 12:09:02
|
Hello, It took a while, but finally it's ready. Please test with either LLVM 4 or 5, and report any issues. Thanks Instructions: https://github.com/pocl/pocl/wiki/Release-testing-of-pocl-1.0 Sources: https://github.com/pocl/pocl/releases/tag/v1.0-RC1 Regards, -- mb |
From: Pekka J. <pek...@tu...> - 2017-11-16 12:43:34
|
On 16.11.2017 08:46, Wuweijia wrote: > I do not how the pocl add the definition of _local_id_x , > _local_id_y ... to the final shared object. Can you show me where or > how. I check the code that _local_id_x _local_id_y variable is > external, not the definition. They are placeholder declarations which are converted to "context array" field accesses by the LLVM pass in lib/llvmopencl/Workgroup.cc (privatizeContext). The "context array" is given as a hidden extra argument to the kernel and contains ids etc. related to the launch of the kernel. For non-GPUs, the ids are iterated by the parallel work item loops generated to execute the whole work-group's WIs in a single function. BR, -- Pekka |
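To make that a bit more concrete, here is a purely conceptual sketch of what a work-group function looks like after the transformation; the struct layout and the names are invented for illustration and do not match pocl's real generated code:

    /* Hypothetical shape of the generated work-group function. */
    typedef struct {
        size_t local_size[3];
        size_t group_id[3];
        size_t global_offset[3];
        /* ... further launch state ... */
    } wg_context_t;   /* invented name */

    void scarlar_add_workgroup(void *kernel_args, wg_context_t *ctx)
    {
        for (size_t z = 0; z < ctx->local_size[2]; ++z)
          for (size_t y = 0; y < ctx->local_size[1]; ++y)
            for (size_t x = 0; x < ctx->local_size[0]; ++x) {
                /* the kernel body, with every use of _local_id_x/_local_id_y/
                   _local_id_z rewritten to x, y and z of this loop nest */
            }
    }

In other words, the _local_id_* placeholders only disappear because the workgroup-related passes rewrite them into loop variables and context fields; skipping those passes leaves the declarations unresolved, which is exactly the undefined-symbol situation reported in the next message.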
From: Wuweijia <wuw...@hu...> - 2017-11-16 12:40:45
|
Hi #else passes.push_back("workitem-handler-chooser"); passes.push_back("workgroup"); passes.push_back("globaldce"); passes.push_back("flatten"); passes.push_back("wi-aa"); passes.push_back("workitemrepl"); // passes.push_back("STANDARD_OPTS"); const std::string wg_method = pocl_get_string_option("POCL_WORK_GROUP_METHOD", "loopvec"); #endif I commented all the passes and only push these these passes (as above), when I run the cl file compile into the so. The so is lack of some variable that is undefined . These the output, that some global variable is undefined. localhost:/system/bin # nm /system/lib64/crtend_so.o -o /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so | grep " U " /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U __cxa_atexit@@LIBC /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U __cxa_finalize@@LIBC /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U __register_atfork@@LIBC /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U _group_id_x /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U _group_id_y /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U _group_id_z /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U _local_id_x /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U _local_id_y /sdcard/pocl/kcache/JE/DDOPBDCLJEBGFADLGMMPEJPMKKFBGLFFAELBI/scarlar_add/1-1-1/scarlar_add.so: U _local_id_z localhost:/system/bin # I do not how to handle it . can you tell me the which passes generate it why it failed, I can change the code . Environment: Arm64 server, non-GPUs BR Owen -----邮件原件----- 发件人: Pekka Jääskeläinen [mailto:pek...@tu...] 发送时间: 2017年11月16日 20:29 收件人: Wuweijia <wuw...@hu...>; Portable Computing Language development discussion <poc...@li...> 抄送: Fanbohao <fan...@hu...> 主题: Re: 答复: 答复: [pocl-devel] [POCL_DBG] How to debug the cl file. On 16.11.2017 08:46, Wuweijia wrote: > I do not how the pocl add the definition of _local_id_x , > _local_id_y ... to the final shared object. Can you show me where or > how. I check the code that _local_id_x _local_id_y variable is > external, not the definition. They are placeholder declarations which are converted to "context array" field accesses by the LLVM pass in lib/llvmopencl/Workgroup.cc (privatizeContext). The "context array" is given as a hidden extra argument to the kernel and contains ids etc. related to the launch of the kernel. For non-GPUs, the ids are iterated by the parallel work item loops generated to execute the whole work-group's WIs in a single function. BR, -- Pekka |
From: Pekka J. <pek...@tu...> - 2017-11-16 09:02:03
|
Hi, This would be an extremely useful feature, but I don't know of anyone working on it at the moment. Contributions welcome. BR, Pekka On 16.11.2017 02:43, Wuweijia wrote: > hi > Is there any milestone about how to release that function? > BR > Owen > > -----邮件原件----- > 发件人: Pekka Jääskeläinen [mailto:pek...@tu...] > 发送时间: 2017年11月15日 19:32 > 收件人: Portable Computing Language development discussion <poc...@li...>; Wuweijia <wuw...@hu...> > 抄送: Fanbohao <fan...@hu...> > 主题: Re: [pocl-devel] [POCL_DBG] How to debug the cl file. > > Hi, > > I think the debug info generation and preservation across WGF passes is unfinished. > > I typically end up using printf() when debugging kernel code, and look at the disassembly in gdb, which usually helps to spot the kernel lines where the crash happens. > > On 10.11.2017 10:33, Wuweijia wrote: >> I write cl file , and ran it with the pocl in arm64 >> server, compilation is ok. In the running there is some bug in the >> cl file, and application crash in the cl file . How can I debug the cl file with gdb. >> >> I think the pocl compile the cl file with some optimized >> options, no debug info. Show the gdb can not show me the callstack >> where cl file crash. > > -- > Pekka > -- Pekka |
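As a concrete example of the printf() approach mentioned above (the kernel body and signature here are invented for illustration; the real scarlar_add kernel is not shown in this thread), OpenCL 1.2's built-in printf can trace a single work-item without any debug-info support:

    __kernel void scarlar_add(__global const float *a,
                              __global const float *b,
                              __global float *out)
    {
        size_t gid = get_global_id(0);
        out[gid] = a[gid] + b[gid];

        /* Restrict the output to one work-item so the log stays readable. */
        if (gid == 0)
            printf("gid %d: a=%f b=%f out=%f\n",
                   (int)gid, a[gid], b[gid], out[gid]);
    }

Combined with looking at the disassembly in gdb, this is usually enough to narrow a crash down to a specific kernel line.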
From: Wuweijia <wuw...@hu...> - 2017-11-16 07:47:27
|
Hi: I do not how the pocl add the definition of _local_id_x , _local_id_y ... to the final shared object. Can you show me where or how. I check the code that _local_id_x _local_id_y variable is external, not the definition. BR Owen -----邮件原件----- 发件人: Pekka Jääskeläinen [mailto:pek...@tu...] 发送时间: 2017年11月16日 15:30 收件人: Wuweijia <wuw...@hu...>; Portable Computing Language development discussion <poc...@li...> 抄送: Fanbohao <fan...@hu...> 主题: Re: 答复: [pocl-devel] [POCL_DBG] How to debug the cl file. Hi, This would be an extremely useful feature, but I don't know of anyone working on it at the moment. Contributions welcome. BR, Pekka On 16.11.2017 02:43, Wuweijia wrote: > hi > Is there any milestone about how to release that function? > BR > Owen > > -----邮件原件----- > 发件人: Pekka Jääskeläinen [mailto:pek...@tu...] > 发送时间: 2017年11月15日 19:32 > 收件人: Portable Computing Language development discussion > <poc...@li...>; Wuweijia <wuw...@hu...> > 抄送: Fanbohao <fan...@hu...> > 主题: Re: [pocl-devel] [POCL_DBG] How to debug the cl file. > > Hi, > > I think the debug info generation and preservation across WGF passes is unfinished. > > I typically end up using printf() when debugging kernel code, and look at the disassembly in gdb, which usually helps to spot the kernel lines where the crash happens. > > On 10.11.2017 10:33, Wuweijia wrote: >> I write cl file , and ran it with the pocl in arm64 >> server, compilation is ok. In the running there is some bug in the >> cl file, and application crash in the cl file . How can I debug the cl file with gdb. >> >> I think the pocl compile the cl file with some >> optimized options, no debug info. Show the gdb can not show me the >> callstack where cl file crash. > > -- > Pekka > -- Pekka |
From: Wuweijia <wuw...@hu...> - 2017-11-16 01:43:48
|
hi Is there any milestone about how to release that function? BR Owen -----邮件原件----- 发件人: Pekka Jääskeläinen [mailto:pek...@tu...] 发送时间: 2017年11月15日 19:32 收件人: Portable Computing Language development discussion <poc...@li...>; Wuweijia <wuw...@hu...> 抄送: Fanbohao <fan...@hu...> 主题: Re: [pocl-devel] [POCL_DBG] How to debug the cl file. Hi, I think the debug info generation and preservation across WGF passes is unfinished. I typically end up using printf() when debugging kernel code, and look at the disassembly in gdb, which usually helps to spot the kernel lines where the crash happens. On 10.11.2017 10:33, Wuweijia wrote: > I write cl file , and ran it with the pocl in arm64 > server, compilation is ok. In the running there is some bug in the > cl file, and application crash in the cl file . How can I debug the cl file with gdb. > > I think the pocl compile the cl file with some optimized > options, no debug info. Show the gdb can not show me the callstack > where cl file crash. -- Pekka |