In several papers on GPU BLAS3, I read that it is inherently faster when
all 3 matrices are row_major. To confirm this, I changed the layout in
blas3bench and observed a 30% to 50% drop in performance.
My guess is that using a single kernel for the optimized layout, and
converting the matrices to that layout before launching it, might
significantly improve performance :p
Has anybody ever tried this trick?
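
To make the idea concrete, here is a minimal CPU-side sketch in plain Python. It is only an illustration of "convert, then call the one optimized kernel", not actual blas3bench or GPU code. All function names are made up for this example, and the triple loop merely stands in for a real row_major GPU kernel.

```python
def col_to_row_major(buf, rows, cols):
    """Repack a flat column-major buffer into row-major order."""
    return [buf[j * rows + i] for i in range(rows) for j in range(cols)]


def row_to_col_major(buf, rows, cols):
    """Repack a flat row-major buffer into column-major order."""
    return [buf[i * cols + j] for j in range(cols) for i in range(rows)]


def gemm_row_major(a, b, m, k, n):
    """Stand-in for the single optimized kernel: C = A * B, all row-major.

    A is m x k, B is k x n, C is m x n, each stored as a flat list.
    """
    c = [0.0] * (m * n)
    for i in range(m):
        for p in range(k):
            a_ip = a[i * k + p]
            for j in range(n):
                c[i * n + j] += a_ip * b[p * n + j]
    return c


def gemm_col_major_via_conversion(a_cm, b_cm, m, k, n):
    """Multiply column-major inputs by converting to the optimized layout first."""
    a_rm = col_to_row_major(a_cm, m, k)  # conversion step before the kernel
    b_rm = col_to_row_major(b_cm, k, n)
    c_rm = gemm_row_major(a_rm, b_rm, m, k, n)  # the only kernel to maintain
    return row_to_col_major(c_rm, m, n)  # hand the result back in the caller's layout


if __name__ == "__main__":
    # A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], both stored column-major.
    print(gemm_col_major_via_conversion([1, 3, 2, 4], [5, 7, 6, 8], 2, 2, 2))
```

The conversions touch each element once, while the product itself needs on the order of m*k*n multiply-adds, so for large matrices the extra copies should be a small fraction of the total work. Whether that actually holds on a GPU would have to be measured.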