From: Karl R. <ru...@iu...> - 2014-08-14 20:10:35
Hi,

> The GEMM kernel(s) are getting pretty tricky, with quite a few fallbacks
> involved. This gets hard to test, so I thought it could be a good idea
> to discuss this. Basically, here is how it works:
>
> A = [A1 A2; A3 A4]
> B = [B1 B2; B3 B4]
> C = [C1 C2; C3 C4]
>
> where each block is divided according to the corresponding block size of
> the template. For example, A1 is the closest multiple of the size tuple
> (ML, KL), where ML is the number of rows computed by each work group,
> and KL the "width step" for computing the inner products (if the kernel
> uses local memory, it will load successive blocks of size ML*KL in each
> work group).
>
> A few kernels are enqueued so that:
> C1  = A1*B1 [optimized kernel]
> C1 += A2*B3 [fallback] if needed
> C2  = A1*B2 [fallback] if needed
> C2 += A2*B4 [fallback] if needed
> etc.
>
> Basically, one optimized kernel does the bulk of the work, and the
> other ones do the "clean-up". This works well for full matrices and
> ranges. When slices are involved, things get more complicated. If the
> stride is on the non-leading dimension (stride2 for column-major
> matrices), then it can be incorporated into the optimized kernel (by
> adding ld *= stride2 at the beginning of the kernel). However, if
> stride1 > 1, then we need to use the fallback kernel. This is a
> reasonable thing to do: in most applications I know of, only one stride
> is accessed at a time (we want a subset of the rows/columns of a given
> matrix).
>
> However, this becomes really messy to test! Basically, I think that, to
> have an exhaustive enough testing suite, we should go for:
>
> - Matrices of complicated arbitrary sizes (143, 284, 395). It is
>   important to space them by more than 128, to make sure that A1, B1 and
>   C1 are not square.
> - Ranges of similar complicated sizes.
> - "Optimized" ranges: (128, 256, 384), for example.
> - Matrix row-wise slices, matrix col-wise slices, and matrix slices in
>   both directions.
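The decomposition quoted above can be sketched in plain Python (an illustration only, not ViennaCL code; the column tile size NL is an assumption here, since the quoted text only names ML and KL):

```python
# Sketch of the blocked GEMM decomposition: one "optimized" region whose
# dimensions are multiples of the tile sizes (ML, NL, KL), plus fallback
# passes for the remainders, as in the quoted email.

def matmul(A, B):
    """Naive reference product of two dense row-major matrices (lists of lists)."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def blocked_gemm(A, B, ML, NL, KL):
    """Compute C = A*B via one 'optimized' block plus fallback clean-up blocks."""
    m, k, n = len(A), len(B), len(B[0])
    # Sizes of the optimized region A1/B1/C1: largest multiples of the tiles.
    m1, k1, n1 = (m // ML) * ML, (k // KL) * KL, (n // NL) * NL
    C = [[0.0] * n for _ in range(m)]

    def accumulate(i0, i1, j0, j1, p0, p1):
        # Stands in for one enqueued kernel:
        # C[i0:i1, j0:j1] += A[i0:i1, p0:p1] * B[p0:p1, j0:j1]
        for i in range(i0, i1):
            for j in range(j0, j1):
                C[i][j] += sum(A[i][p] * B[p][j] for p in range(p0, p1))

    accumulate(0, m1, 0, n1, 0, k1)   # C1  = A1*B1          [optimized kernel]
    accumulate(0, m1, 0, n1, k1, k)   # C1 += A2*B3          [fallback]
    accumulate(0, m1, n1, n, 0, k)    # C2  = A1*B2 + A2*B4  [fallback]
    accumulate(m1, m, 0, n, 0, k)     # C3, C4               [fallback]
    return C
```

The four `accumulate` calls partition the output exactly as in the quoted scheme: when the sizes are multiples of the tiles, the remainder regions are empty and only the optimized pass does any work.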
As far as I can tell, all you need to do is adjust the matrix sizes in the existing GEMM tests? They cover all of this already. What am I missing?

> I am ready to rewrite the GEMM tests accordingly, but any thoughts on the
> procedure would be appreciated!

The GEMM tests are quite an issue already, because they consume a lot of time, particularly on weaker systems. A substantial part of the problem is the verification on the CPU with uBLAS, which both adds a uBLAS dependency and is also rather slow. The current test sizes are pretty much the minimum possible, but they still take minutes to complete. Without a proper strategy for dealing with this, chances are high that we make our test system almost unmanageable... Any clever approaches appreciated!

Best regards,
Karli
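One direction for cheaper verification (my own illustration, not something proposed in the thread) is a Freivalds-style randomized check: instead of recomputing C = A*B on the host in O(m*n*k), test C*x == A*(B*x) for a few random vectors x, which costs only O(n^2) per vector and needs no uBLAS:

```python
# Randomized verification sketch (Freivalds' check): a wrong C is caught
# with high probability without ever forming the full reference product.
import random

def matvec(M, x):
    """Dense matrix-vector product for a row-major list-of-lists matrix."""
    return [sum(row[j] * x[j] for j in range(len(x))) for row in M]

def probably_equal(A, B, C, trials=3, tol=1e-6):
    """Return True if C == A*B with high probability (one-sided check)."""
    n = len(B[0])
    for _ in range(trials):
        x = [random.uniform(-1.0, 1.0) for _ in range(n)]
        lhs = matvec(C, x)              # C*x
        rhs = matvec(A, matvec(B, x))   # A*(B*x)
        if any(abs(l - r) > tol for l, r in zip(lhs, rhs)):
            return False
    return True
```

The check is one-sided: a correct C always passes, and an incorrect C slips through only with vanishing probability per trial, so a handful of trials suffices. Whether the tolerance handling is robust enough for single-precision GPU results would need some care.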