From: Philippe T. <phi...@gm...> - 2014-08-14 19:16:32
Also, should we use multiple templates to test the portability of the
device-specific code? (Testing all the local/global combinations should be
enough.)

2014-08-14 21:07 GMT+02:00 Philippe Tillet <phi...@gm...>:

> Hey,
>
> The GEMM kernel(s) are getting pretty tricky, with quite a few fallbacks
> involved. This gets hard to test, so I thought it could be a good idea to
> discuss this. Basically, here is how it works:
>
> A = [A1 A2; A3 A4]
> B = [B1 B2; B3 B4]
> C = [C1 C2; C3 C4]
>
> where each block is divided according to the corresponding block size of
> the template. For example, the size of A1 is the size of A rounded down to
> the closest multiple of the size tuple (ML, KL), where ML is the number of
> rows computed by each work group, and KL is the "width step" for computing
> the inner products (if the kernel uses local memory, it will load
> successive blocks of size ML*KL in each work group).
>
> A few kernels are enqueued so that:
> C1 = A1*B1 [optimized kernel]
> C1 += A2*B3 [fallback] if needed
> C2 = A1*B2 [fallback] if needed
> C2 += A2*B4 [fallback] if needed
> etc...
>
> Basically, one optimized kernel does the bulk of the work, and the other
> ones do the "clean-up". This works well for full matrices and ranges.
> When slices are involved, things get more complicated. If the stride is on
> the non-leading dimension (stride2 for column-major matrices), then it can
> be incorporated in the optimized kernel (by adding ld *= stride2 at the
> beginning of the kernel). However, if stride1 > 1, then we need to use the
> fallback kernel. This is a reasonable thing to do: in most applications I
> know of, only one stride is used at a time (we want a set of the
> rows/columns of a given matrix).
>
> However, this becomes really messy to test! Basically, I think that, to
> have an exhaustive enough test suite, we should go for:
>
> - Matrices of complicated arbitrary sizes (143, 284, 395). It is important
>   to space them by more than 128, to be sure that A1, B1 and C1 are not
>   square.
> - Ranges of similar complicated sizes.
> - "Optimized" range: (128, 256, 384) for example
> - Matrix row-wise slices, matrix col-wise slices, matrix slices in both
>   directions.
>
> I am ready to rewrite the GEMM tests accordingly, but any thoughts on the
> procedure would be appreciated!
>
> Philippe
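
To make the block decomposition above concrete, here is a minimal sketch in plain Python, not the actual kernel-enqueue code. The helper names (split, plan_gemm, run_plan) and the toy sizes are hypothetical. The launches for C3 and C4 are filled in from the usual block-product rule, since the message only spells out C1 and C2, and letting the first non-empty launch for a C block use "=" is an assumption about how the "if needed" cases would be handled.

```python
def split(n, block):
    """Split [0, n) into a bulk part (a multiple of block) and a remainder."""
    bulk = (n // block) * block
    return [(0, bulk), (bulk, n)]

def plan_gemm(M, N, K, ML, NL, KL):
    """List the launches for C = A*B, where A is M x K and B is K x N."""
    rows, cols, inner = split(M, ML), split(N, NL), split(K, KL)
    launches = []
    for i, (r0, r1) in enumerate(rows):
        for j, (c0, c1) in enumerate(cols):
            first = True  # the first launch writing a C block assigns, later ones accumulate
            for p, (k0, k1) in enumerate(inner):
                if r0 == r1 or c0 == c1 or k0 == k1:
                    continue  # empty block: this launch is not needed
                kernel = "optimized" if (i, j, p) == (0, 0, 0) else "fallback"
                op = "=" if first else "+="
                launches.append((kernel, op, (r0, r1), (c0, c1), (k0, k1)))
                first = False
    return launches

def run_plan(A, B, M, N, launches):
    """Execute the launches on lists of lists (both kernels are simulated the same way)."""
    C = [[None] * N for _ in range(M)]
    for _, op, (r0, r1), (c0, c1), (k0, k1) in launches:
        for r in range(r0, r1):
            for c in range(c0, c1):
                acc = sum(A[r][k] * B[k][c] for k in range(k0, k1))
                C[r][c] = acc if op == "=" else C[r][c] + acc
    return C

def reference_gemm(A, B, M, N, K):
    """Naive full product used as the ground truth."""
    return [[sum(A[r][k] * B[k][c] for k in range(K)) for c in range(N)]
            for r in range(M)]

# Toy sizes chosen so that every remainder block is non-empty.
M, N, K = 7, 9, 5
ML, NL, KL = 4, 4, 2
A = [[r + 2 * k for k in range(K)] for r in range(M)]
B = [[k - c for c in range(N)] for k in range(K)]

launches = plan_gemm(M, N, K, ML, NL, KL)
C = run_plan(A, B, M, N, launches)
assert C == reference_gemm(A, B, M, N, K)
```

With sizes that are exact multiples of the block sizes, plan_gemm returns a single "optimized" launch, which corresponds to the "optimized" range case in the list above.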
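
The remark about folding stride2 into the leading dimension can be checked with a few lines of index arithmetic. This is only a sketch, assuming the usual column-major addressing where element (i, j) of a slice lives at (start1 + i*stride1) + (start2 + j*stride2)*ld. The function names and the numbers are illustrative, not taken from the kernels.

```python
def slice_index(i, j, start1, stride1, start2, stride2, ld):
    """Linear index of element (i, j) of a slice of a column-major matrix."""
    return (start1 + i * stride1) + (start2 + j * stride2) * ld

def folded_index(i, j, offset, ld_eff):
    """Index seen by a kernel that only knows an offset and a leading dimension."""
    return offset + i + j * ld_eff

ld = 10                               # leading dimension of the parent matrix
start1, start2, stride2 = 2, 1, 3     # slice with stride1 == 1
offset = start1 + start2 * ld         # position of element (0, 0) of the slice
ld_eff = ld * stride2                 # the "ld *= stride2" trick

# With stride1 == 1, both addressing schemes agree for every element.
same = all(
    slice_index(i, j, start1, 1, start2, stride2, ld)
    == folded_index(i, j, offset, ld_eff)
    for i in range(4)
    for j in range(3)
)
assert same

# With stride1 == 2 the rows are no longer contiguous, so the trick breaks.
assert slice_index(1, 0, start1, 2, start2, stride2, ld) != folded_index(1, 0, offset, ld_eff)
```

This is why a stride on the leading dimension cannot be absorbed the same way: consecutive rows of a column stop being adjacent in memory, which is what the optimized kernel relies on.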
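
As a quick sanity check on the proposed sizes, the sketch below rounds each "complicated" size down to a multiple of 128. The block size of 128 is an assumption taken from the "more than 128" remark, and the case table is only one possible reading of the list above; the sizes used for the slice cases are a guess, since the message does not give them.

```python
BLOCK = 128  # assumed largest block size, from the "more than 128" remark

def bulk(n, block=BLOCK):
    """Round n down to a multiple of block (the part handled by the optimized kernel)."""
    return (n // block) * block

arbitrary = (143, 284, 395)
optimized = tuple(bulk(n) for n in arbitrary)
remainders = tuple(n - bulk(n) for n in arbitrary)

# Every arbitrary size leaves a non-empty remainder, so the fallbacks are exercised,
# and the bulk sizes are pairwise distinct, so the bulk blocks are not square.
assert all(r > 0 for r in remainders)
assert len(set(optimized)) == 3

# One possible case table (the slice sizes are an assumption).
cases = [
    ("matrix", arbitrary),
    ("range", arbitrary),
    ("range", optimized),
    ("slice, row-wise", arbitrary),
    ("slice, col-wise", arbitrary),
    ("slice, both directions", arbitrary),
]
```

Rounding (143, 284, 395) down this way gives exactly the (128, 256, 384) triple suggested for the "optimized" range.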