From: Erik S. <esc...@pe...> - 2011-10-29 16:43:20
When I use clang 3.1 (a recent snapshot) to translate e.g. the fabs
intrinsic, acting on a single floating point number, the generated x86 code
looks like

    _Z4fabsf:                            # @_Z4fabsf
            movd    %xmm0, %eax
            andl    $2147483647, %eax    # imm = 0x7FFFFFFF
            movd    %eax, %xmm0
            ret

This is not optimal, since the value is moved from xmm0 to eax and back,
which is not necessary. Instead of andl, I expect to see the andss
instruction.

How do I go about having this corrected? Is this a problem in pocl, in
clang, in llvm, or in the way one of these is used?

-erik

--
Erik Schnetter <esc...@pe...>
http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...
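For reference: a fabsf written in this bit-manipulation style typically
compiles to exactly the movd/andl/movd sequence above, because the AND is
performed on the integer view of the float. The sketch below is only
illustrative (the function name is made up); pocl's actual builtin may be
written differently.

    #include <stdint.h>
    #include <string.h>

    /* Illustrative sketch: clear the IEEE-754 sign bit through the
       integer representation of the float. */
    static inline float fabsf_bits(float x)
    {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);  /* bit-cast float -> uint32_t */
        bits &= 0x7FFFFFFFu;             /* drop the sign bit */
        memcpy(&x, &bits, sizeof x);     /* bit-cast back to float */
        return x;
    }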
From: Pekka J. <pek...@tu...> - 2011-10-30 11:01:44
Hi Erik,

On 10/29/2011 07:43 PM, Erik Schnetter wrote:
> How do I go about having this corrected? Is this a problem in pocl, in
> clang, in llvm, or in the way one of these is used?

I'm not familiar with the SSE instruction extensions, but quick googling
didn't return 'andss' for single floats. E.g.:
http://en.wikipedia.org/wiki/X86_instruction_listings

I see this absf implementation uses bit manipulation to reset the sign bit
of the float word to return the absolute value. Thus, in case SSE does not
have an 'and', it has to go back to the x86 instruction set to perform the
and that resets the sign bit.

If SSE has a suitable 'and', it should be able to operate directly on the
xmm register, in which case it's an LLVM instruction selection issue. In
that case, overriding the implementation with inline assembly can
circumvent the issue. Of course, the preferred way is to add a proper
'andss' to the instruction patterns on the LLVM side, if such an
instruction is available.

--
--Pekka
From: Erik S. <esc...@pe...> - 2011-10-30 22:21:53
Pekka,

You are right; andss does not exist, but there is an andps instead.

-erik
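For comparison, keeping the value in the SSE register the whole time would
look roughly like the following. This is hand-written for illustration (the
constant-pool label is hypothetical), not actual compiler output:

    _Z4fabsf:                               # hypothetical, for comparison
            andps   .LCPI0_0(%rip), %xmm0   # .LCPI0_0: sign-bit mask
            ret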
From: Pekka J. <pek...@tu...> - 2011-10-31 07:25:51
On 10/31/2011 12:21 AM, Erik Schnetter wrote:
> You are right; andss does not exist, but there is an andps instead.

It seems to be a SIMD instruction that performs the 'and' on 4 single
precision floats, whereas you are performing it on a single one.

I can understand that LLVM cannot select it automatically, as in that case
it would clobber all the other floats in the SIMD register too, and (at
least when inlined) they can contain live data. Thus, if it selected it
automatically, it would have to "spill" the other parts of the SIMD
register before doing so, which is quite costly.

However, if you are sure using ANDPS here is faster, you can generate an
inline asm that has a safe 'all ones' mask for the rest of the fields,
right?

--
--Pekka
From: Erik S. <esc...@pe...> - 2011-10-31 14:24:15
Pekka,

Yes, such an andps instruction is possible. However, I am quite certain
(though I can't guarantee it) that the other vector elements of the
respective xmm register are unused. That is at least the calling convention
for x86; of course, I don't know whether llvm does anything clever within a
routine, but looking at the generated code, this does not seem to be the
case.

I've just figured out the extended asm syntax for this.

-erik
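A minimal sketch of what such an extended-asm fabs could look like,
following Pekka's suggestion of an all-ones mask for the other lanes. This
is an assumption about the shape of the code, not the code Erik actually
wrote; the function and mask names are made up.

    #include <stdint.h>

    typedef float v4sf __attribute__((vector_size(16)));

    static inline float fabsf_andps(float x)
    {
        /* Lane 0 clears the sign bit; the other lanes are all ones, so
           the AND leaves whatever happens to sit in them untouched. */
        static const union { uint32_t u[4]; v4sf v; } mask =
            { { 0x7FFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu } };
        float r = x;
        __asm__ ("andps %1, %0" : "+x" (r) : "xm" (mask.v));
        return r;
    }

Because the operands use the 'x' register constraint rather than naming a
fixed xmm register, the compiler picks the register itself and already
treats it as written by the asm; the explicit-clobber concern discussed
below comes up when the asm pins a specific register.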
From: Pekka J. <pek...@tu...> - 2011-10-31 14:47:44
On 10/31/2011 04:24 PM, Erik Schnetter wrote:
> However, I am quite certain (though I can't guarantee it) that the other
> vector elements of the respective xmm register are unused. That is at
> least the calling convention for x86;

Yes, that might be true for calls. However, with OpenCL C kernels we want
to inline functions aggressively. In that case your asm clobber list has to
include the whole xmm register. This means that the code that precedes the
inline asm block has to save the XMM register if it uses the other elements
before entering your inline asm block.

--
Pekka
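Concretely (a hypothetical sketch, not code from this thread): if the asm
pins a specific register such as %xmm7, that register has to appear in the
clobber list, so the surrounding, possibly inlined, kernel code knows that
all four lanes may be destroyed and can save them first.

    #include <stdint.h>

    typedef float v4sf __attribute__((vector_size(16)));

    static const union { uint32_t u[4]; v4sf v; } fabs_mask =
        { { 0x7FFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu } };

    static inline float fabsf_fixed_reg(float x)
    {
        float r;
        __asm__ ("movss  %1, %%xmm7\n\t"
                 "andps  %2, %%xmm7\n\t"
                 "movss  %%xmm7, %0"
                 : "=x" (r)
                 : "x" (x), "m" (fabs_mask.v)
                 : "xmm7");              /* whole xmm7 clobbered */
        return r;
    }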
From: Erik S. <esc...@pe...> - 2011-10-31 15:37:12
The aggressive inlining (without having to program compiler intrinsics or
perform header file gymnastics) is one of the most compelling features of
OpenCL (and of an LLVM-based implementation).

Yes, I'm using the "correct" clobber specifiers, as you suggest.

-erik