From: Karl R. <ru...@iu...> - 2013-07-28 19:49:57
Hey,

> My preferred option is to pad by default and either to make the
> padding a multiple of four or sixteen. However, we need to maintain
> a full set of unpadded operations, because user-provided buffers
> need not be padded (and a subsequent padding may be too expensive)
>
> I think making it a multiple of 16 always is a good option, because we
> can reasonably assume that optimal performance is rarely obtained when
> a work item performs (unrolls) more than 16*16 operations, on most of
> the kernels.
> However, we have to have a clear and easily extensible dispatch
> mechanism that dispatches certain sizes to certain specific kernels,
> which is what I was talking about.
> Best {m, k, n} big block sizes for the GEMM kernel:
>
> GEMM Row-Major * Row-Major
>   AMD       : 16 *  64 * 256
>   NVidia    : 16 * 128 * 128
>   Intel CPU : 64 *  64 * 128

I expect this to also depend on the hardware generation. The best
approach that comes to my mind is to introduce a hardware descriptor
which provides nicely preprocessed information from the OpenCL backend.
A rather simple
  Vendor:     [AMD, INTEL, NVIDIA, ...]
  Type:       [CPU, GPU, MIC, ...]
  Generation: [Southern Islands, Fermi, Kepler, ..., UNKNOWN]
should give us enough dispatch possibilities for the hardware 'out
there' (a sketch of such a descriptor follows at the end of this mail).
If the detection of the hardware generation fails, we just fall back to
a compatibility kernel (and possibly ask the user to submit hardware
information when running the tuner).

> Of course, it is bound to be device-specific rather than
> vendor-specific, and once the autotuning procedure works better we
> might have block sizes such as 96, 112, etc. Furthermore, for the
> kernel to be correct, each size has to be a multiple of the block size
> (3 constraints). We can never expect the user to call the kernel on
> the proper sizes. Problem: the padding in ViennaCL is static, while
> this block size is only known at runtime... Should we just write
> somewhere in the documentation what the best kernels are?

The padding is no longer 'static'. The 'ALIGNMENT' template parameter
is now ignored (vector_base no longer holds an ALIGNMENT parameter), so
we can introduce a runtime padding without breaking old code. Thus, we
can pick a proper padding entirely at runtime, tailored to the
underlying device (also sketched below).

> Even though the number of possible kernel variations is large
> (though finite), there's only a limited set which actually gives
> good performance. These are the important kernels to be tested
> thoroughly.
>
> Yes, but this limited set is device/program-specific, and it is hard
> to know (that's what autotuning is for). I don't think anyone could
> tell me explicitly which combination of {alignment, ml, kl, nl, ms,
> ks, ns, use_lhs_shared, use_rhs_shared, unroll} gives good
> performance ;) And even if I choose only two values for each
> parameter, it leads to 2¹⁰ = 1024 tests per layout/transposition
> combination, i.e. 32 768 tests overall... which is ridiculously
> high :D
> What about integrating the test procedure into the autotuning
> procedure? It's not intuitive, but I see no better way.

Yes, a good autotuning procedure should verify the correctness of the
results obtained anyway. There may be compiler or hardware bugs which
can lead to fast, but erroneous kernels. A two-stage scheme seems best
here:
 - First, find the fastest kernel (either without checking, or just
   checking for a particular size).
 - Second, verify this kernel for a couple of different sizes. If this
   fails, pick the next kernel, etc.
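To make the descriptor idea above a bit more concrete, here is a
minimal sketch (nothing of this exists in the code base yet; the type
names and the fallback values are pure placeholders):

  // Hypothetical sketch only - none of these types exist yet.
  #include <cstddef>

  enum vendor_id     { AMD, INTEL, NVIDIA, UNKNOWN_VENDOR };
  enum device_type   { CPU, GPU, MIC, UNKNOWN_TYPE };
  enum generation_id { SOUTHERN_ISLANDS, FERMI, KEPLER, UNKNOWN_GENERATION };

  struct device_descriptor
  {
    vendor_id     vendor;
    device_type   type;
    generation_id generation;
  };

  // Big {m, k, n} block sizes for the (row-major * row-major) GEMM
  // kernel, using the numbers quoted above; everything unrecognized
  // falls back to a compatibility kernel.
  struct gemm_blocks { std::size_t m, k, n; };

  inline gemm_blocks gemm_block_sizes(device_descriptor const & dev)
  {
    gemm_blocks b = {16, 16, 16};  // conservative compatibility fallback
    if      (dev.vendor == AMD    && dev.type == GPU) { b.m = 16; b.k =  64; b.n = 256; }
    else if (dev.vendor == NVIDIA && dev.type == GPU) { b.m = 16; b.k = 128; b.n = 128; }
    else if (dev.vendor == INTEL  && dev.type == CPU) { b.m = 64; b.k =  64; b.n = 128; }
    return b;
  }

Refining the dispatch by generation (e.g. Fermi vs. Kepler) then only
means adding further branches.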
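Picking the padding at runtime then just means rounding each dimension
up to the next multiple of the block size reported for the device,
something along the lines of (again only a sketch):

  // Round 'size' up to the next multiple of the runtime padding 'pad'.
  inline std::size_t padded_size(std::size_t size, std::size_t pad)
  {
    return ((size + pad - 1) / pad) * pad;  // e.g. padded_size(1000, 16) == 1008
  }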
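And the two-stage scheme in (pseudo-)C++, with benchmark() and verify()
standing in for whatever the tuner actually provides:

  #include <algorithm>
  #include <cstddef>
  #include <stdexcept>
  #include <utility>
  #include <vector>

  struct kernel_config { /* ml, kl, nl, ms, ks, ns, use_lhs_shared, ... */ };

  double benchmark(kernel_config const & cfg);                 // placeholder: execution time
  bool   verify(kernel_config const & cfg, std::size_t size);  // placeholder: check vs. reference

  kernel_config select_kernel(std::vector<kernel_config> const & candidates)
  {
    // Stage 1: time all candidates and rank them fastest-first.
    std::vector<std::pair<double, std::size_t> > times;
    for (std::size_t i = 0; i < candidates.size(); ++i)
      times.push_back(std::make_pair(benchmark(candidates[i]), i));
    std::sort(times.begin(), times.end());

    // Stage 2: return the fastest candidate which also verifies for a
    // couple of different sizes; on failure, try the next one.
    std::size_t const sizes[] = {17, 128, 1000, 2048};
    for (std::size_t i = 0; i < times.size(); ++i)
    {
      kernel_config const & cfg = candidates[times[i].second];
      bool ok = true;
      for (std::size_t j = 0; j < sizeof(sizes) / sizeof(sizes[0]); ++j)
        ok = ok && verify(cfg, sizes[j]);
      if (ok)
        return cfg;
    }
    throw std::runtime_error("no correct kernel among the candidates");
  }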
> Sooner or later we will have to go for the runtime option anyway. I
> don't see any benefit of being overly pessimistic with 16kB if we
> have the true local memory available at runtime.
>
> Right, it's not over-complicated to do. The problem is more about
> knowing the right optimization profile used at runtime (the local
> memory used by the to-be-compiled kernel). Ok, it means that this
> optimization profile should not change (since I think we cannot
> really use global objects), so that this local memory value is
> consistent over time. Only the autotuner will be allowed to play with
> optimization profiles, then, which is fine with me.

There is no reason to expect the hardware to change during the
execution of a process. Even if a device falls off the bus because it
overheats, it doesn't come back without rebooting the machine (verified
with two SDKs).

> After the 1.5.0 release. There's too much other new functionality,
> so the release is already overdue. This gives us more time to
> design the API properly rather than coming up with some quick-fix.
>
> Ok :) However, I need these for my research, so I'll make it work for
> OpenCL just after the 1.5.0 release :)

It's very easy to add operations to the statement objects, so there's
no problem with adding more any time after the release.

> I'm not sure about what you mean by 'explicit specifications'. Could
> you please elaborate?
>
> Hmm, something like a set of all the formal restrictions:
>  - nested {inner/mat-vec/mat-mat}-products are not allowed
>  - composite operations are not allowed as LHS or RHS of a
>    matrix-matrix product node
>  - matrix-matrix product kernels can only take the standard GEMM form
>  - ...

Shouldn't this be part of the documentation anyway? ;-)

Best regards,
Karli
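PS: Regarding the 'true local memory available at runtime' from above:
this is a single query against the OpenCL API (raw API shown below; the
value is of course also exposed through the backend's device wrapper):

  #include <CL/cl.h>

  cl_ulong local_mem_size = 0;  // local memory in bytes
  clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                  sizeof(cl_ulong), &local_mem_size, NULL);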