From: Erik S. <esc...@pe...> - 2012-01-22 15:10:34

2012/1/22 Pekka Jääskeläinen <pek...@tu...>:
> Hi Erik,
>
>
> On 01/20/2012 07:05 PM, Erik Schnetter wrote:
>>
>> 3-element vectors can't be accessed directly in memory, because they
>> need to be aligned the same way as 4-vectors. One needs to use
>> vload/vstore for this, which doesn't require alignment.
>>
>> I assume that 3-element vectors would be stored packed in global
>> memory. I don't know whether this is what async_copy actually
>> expects... However, un-packed storage can be handled by passing a
>> respective 4-element vector as gentype.
>
>
> I'm not sure if I got this. The current implementation of
> the async copy is just a for loop that copies elements of
> the arrays.
>
> The elements in the copy loop are handled as elements of
> the actual gentype, e.g. float3. As far as I understood
> this means the compiler should take care of the correct alignment of
> the vectors in memory because it's stated with the alignment
> attribute (to be the same as float4) both in the host side and the
> kernel side.
>
> Did you check from the generated assembly that the copy loop does
> not produce a correct alignment when copying buffers with 3-element
> vectors?
>
> "The vload3 and vstore3 built-in functions can be used to read and
> write, respectively, 3-component vector data types from an array of
> packed scalar data type."
>
> From cl_platform.h:
>
> /* cl_float3 is identical in size, alignment and behavior to cl_float4. See
> section 6.1.5. */
> typedef cl_float4 cl_float3;
>
>
> So, my understanding of this is that it should work because both the
> host and the device code are in the same understanding of the alignment
> of the 3-element vectors. I.e. it's stored 4-aligned everywhere if
> you handle buffers that contain 3-element vectors and thus copy loop
> should be generated to code that adheres to this.
>
> If the input was packed (basically a float buffer) and one would like
> to copy it to a buffer of 3-wide vectors, then vload3 and vstore3 should
> be used.
>
> Please correct me if I understood this wrongly.

Yes, this is correct. I was assuming that packed buffers were more
common than unpacked ones (otherwise, why would one use float3?), and
that async_copy would need to handle this. I now realise that there is
no basis for this assumption of mine. Indeed there is a footnote in
the standard stating that float3 should be copied in the same way as
float4. Your implementation is correct.

-erik

-- 
Erik Schnetter <esc...@pe...>   http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...
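
For readers of the archive, the two layouts discussed in this thread can
be sketched in plain C. This is only an illustration under stated
assumptions, not the actual async copy code and not OpenCL C: the type
vec3_padded and the helpers copy_padded, load3_packed, store3_packed and
unpack_buffer are hypothetical names made up for this sketch. The padded
type mimics the cl_platform.h typedef quoted above (a 3-vector occupying
the storage of a 4-vector), and the packed helpers mimic the addressing
of vload3/vstore3, which read and write at p + 3*offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for float3: it occupies the storage of a
   4-float vector and is 16-byte aligned (assumption: mirrors the
   "typedef cl_float4 cl_float3" quoted in the thread). */
typedef struct {
    _Alignas(16) float s[4];   /* s[3] is padding */
} vec3_padded;

/* Element-wise copy loop over padded 3-vectors. Because every element
   has the size and alignment of a 4-vector, a plain loop is enough. */
static void copy_padded(vec3_padded *dst, const vec3_padded *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

/* Emulates vload3(offset, p): reads three consecutive floats starting
   at p + 3*offset from a packed scalar array. No alignment beyond that
   of float is required. */
static vec3_padded load3_packed(size_t offset, const float *p)
{
    assert(p != NULL);
    vec3_padded v = { { p[3 * offset], p[3 * offset + 1],
                        p[3 * offset + 2], 0.0f } };
    return v;
}

/* Emulates vstore3(v, offset, p): writes three floats to p + 3*offset. */
static void store3_packed(vec3_padded v, size_t offset, float *p)
{
    assert(p != NULL);
    p[3 * offset]     = v.s[0];
    p[3 * offset + 1] = v.s[1];
    p[3 * offset + 2] = v.s[2];
}

/* Converts a packed float buffer (3 floats per vector) into a buffer of
   padded 3-vectors, the case where vload3 would be needed. */
static void unpack_buffer(vec3_padded *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = load3_packed(i, src);
}
```

With this layout, a buffer of n padded vectors takes 4*n floats of
storage, while the packed form takes 3*n; only the packed form needs the
vload3/vstore3 style of access.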