Easy way to write part of 2D array to buffer

Jeff Adams
2010-04-07
2012-12-21
  • Jeff Adams
    2010-04-07

    I have a really big data set in (CPU) memory, but it's too big for the GPU.  So, I want to loop, calling a kernel with different pieces in order.

    ComputeCommandQueue.Write takes a start offset and length, which works great for my first operation where the data is basically one dimensional.

    However, when I have 2-dimensional data, I don't know if there is an easy way to put part of it into a buffer.  I'm looking for suggestions…

    Here's a concrete example.  Say the largest thing my kernel can process is 1000x1000, but my input data is 2000x2000, so I need to call the kernel four times.  The input buffer for the first call needs to contain the upper left corner.  Here's a rough drawing of the four corners and the array indexes that I need to put into the buffer for each one:

    ______0-____999  |  ___1000-___1999
    ___2000-___2999  |  ___3000-___3999
    ___4000-___4999  |  ___5000-___5999
    ______etc______  |  ______etc______
    1998000-1998999  |  1999000-1999999


    2000000-2000999  |  2001000-2001999
    2002000-2002999  |  2003000-2003999
    ______etc______  |  ______etc______
    3998000-3998999  |  3999000-3999999

    The first thing that comes to mind is something like this:
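    // Upper left corner only. array2Kx2K must already be pinned
    // (for example with GCHandle.Alloc(..., GCHandleType.Pinned)).
    for (int row = 0; row < 1000; row++) {
        // Row `row` of the corner starts at source index row * 2000.
        IntPtr source = Marshal.UnsafeAddrOfPinnedArrayElement(array2Kx2K, row * 2000);
        queue.Write(buffer1Kx1K, true, row * 1000, 1000, source, null);
    }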

    But that's 1000 calls to Write.  Probably not the fastest way I can do it.
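
    One possible way to get down to a single Write per corner is to copy the corner's rows into a contiguous host-side array first and then transfer that array in one call. The sketch below shows only the host-side copy. The names TileStaging and ExtractTile are made up for illustration, and float is an assumed element type, since the post does not say what the data is.

    ```csharp
    using System;

    static class TileStaging
    {
        // Copies one tileSize x tileSize tile out of a row-major square array
        // of width fullSize into a new contiguous array.
        // tileRow and tileCol pick the tile (each 0 or 1 in the 2000/1000 example).
        // Hypothetical helper; element type float is an assumption.
        public static float[] ExtractTile(float[] full, int fullSize, int tileSize,
                                          int tileRow, int tileCol)
        {
            var tile = new float[tileSize * tileSize];
            for (int row = 0; row < tileSize; row++)
            {
                // First element of this tile row inside the full array.
                int sourceIndex = (tileRow * tileSize + row) * fullSize + tileCol * tileSize;
                // Copy one row of the tile to its place in the contiguous tile array.
                Array.Copy(full, sourceIndex, tile, row * tileSize, tileSize);
            }
            return tile;
        }
    }
    ```

    For the example sizes the returned array holds 1000 x 1000 contiguous elements, so it could be passed to the same Write call as above with offset 0 and length 1000000. This trades 1000 transfers for 1000 in-memory row copies plus one transfer; whether that is actually faster on a given system would need measuring.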

     
  • nythrix
    2010-04-07

    1000 Writes sounds bad but if you process your data one chunk at a time you could take advantage of this:
    time -->
    queue1:_Write 1st__Write 2nd____Write 3rd_____Write 4th…
    queue2:_________Process 1st__Process 2nd__Process 3rd…

    You'll have to use the appropriate events for task chaining. This is the other strong point of OpenCL (data and TASK parallel).
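
    The timeline above can be turned into a rough timing model to see what the overlap buys. This is only a sketch of the dependency structure, not OpenCL code: PipelineTiming and its methods are hypothetical names, the write and process times are made-up inputs, and it assumes every in-flight chunk has its own device buffer and that the device can overlap transfers with kernel execution.

    ```csharp
    using System;

    static class PipelineTiming
    {
        // Finish time when each write and each kernel run happen strictly back to back.
        public static double Serial(int chunks, double writeTime, double processTime)
        {
            return chunks * (writeTime + processTime);
        }

        // Finish time when chunk i+1 is written while chunk i is being processed.
        public static double Pipelined(int chunks, double writeTime, double processTime)
        {
            double writeDone = 0.0;    // when the latest write on queue1 finishes
            double processDone = 0.0;  // when the latest kernel run on queue2 finishes
            for (int i = 0; i < chunks; i++)
            {
                // Writes run one after another on their own queue.
                writeDone += writeTime;
                // A kernel run waits for its own write and for the previous kernel run.
                processDone = Math.Max(writeDone, processDone) + processTime;
            }
            return processDone;
        }
    }
    ```

    With four chunks that each take 1 time unit to write and 1 unit to process, Serial gives 8 and Pipelined gives 5. In this model the pipelined total is bounded by the slower of the two stages.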

    Are you absolutely sure you cannot use a different chunk type? Four "slices" would be much faster:

    ______0____1999
    ___2000____3999
    _998000__999999
    ===============
    1000000_1001999
    …etc
    ===============

    ===============

    ===============
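
    With this layout each slice is one contiguous run of elements, so the per-row loop goes away. A small sketch of the offset arithmetic, assuming a row-major 2000-wide array cut into 500-row slices; SliceLayout and its methods are hypothetical helper names, not part of Cloo:

    ```csharp
    using System;

    static class SliceLayout
    {
        // Index of the first element of slice number sliceIndex when a row-major
        // array of the given width is cut into slices of rowsPerSlice full rows.
        public static long SliceOffset(int width, int rowsPerSlice, int sliceIndex)
        {
            return (long)sliceIndex * rowsPerSlice * width;
        }

        // Number of elements in one slice.
        public static long SliceCount(int width, int rowsPerSlice)
        {
            return (long)rowsPerSlice * width;
        }
    }
    ```

    For the example sizes, slice 1 starts at element 1000000 and every slice holds 1000000 elements, which matches the drawing above. The offset would serve as the source array index and the count as the Write length, giving one Write per slice.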

    You can even combine these methods, sizing the chunks so that writing one chunk takes about as long as processing the previous one. In theory, that's the fastest you can go…