From: Philippe T. <phi...@gm...> - 2014-08-17 11:33:42
Hey,

2014-08-17 11:52 GMT+02:00 Karl Rupp <ru...@iu...>:

> Hi,
>
>> So it seems like most of the features are ready for ViennaCL 1.6. My
>> merge from a few days ago (finally) fully integrated the use of
>> device-specific kernels for BLAS1, BLAS2, BLAS3.
>
> hurray! :-)
>
>> The reduction API is still missing, though, but I think that the
>> priority should be to polish the code, and to ensure ViennaCL is still
>> stable despite the migration to device-specific kernels. Specifically,
>> I think that we should spend the next few weeks cleaning the code-base.
>
> Agreed. The full set of nightly test machines should be back to
> operational tomorrow, as we can finally move back into our offices in
> Vienna. These older systems should give us some more confidence in the
> stability.
>
>> I can list a few points that have caught my eye:
>>
>> - I've rewritten the GEMM test from scratch. It now uses ublas::prod 27
>> times instead of 2500, and thanks to a few macros the file is
>> substantially smaller (~250 lines vs. ~850). The test now completes
>> about 15 times faster using the single-threaded ViennaCL
>> implementation, and in the blink of an eye (~10 seconds) when OpenCL
>> is used. Hurray!
>> More importantly, this new version allowed me to spot the bug
>> responsible for the failure of libviennacl-blas3 in tonight's dash.
>> The culprit was blas3_prod-test choosing slice1 == slice2, while the
>> slices were bugged for C = row-major + sliced... This is frightening,
>> because similar glitches may be hidden here and there in the test
>> suite. For example, the matrix_vector test passes, but the
>> libviennacl-blas2 test fails, probably due to some stride issues for
>> row-major matrices. Things get more complicated now that the
>> column-major kernels are used for the row-major cases.
>> Anyhow, I think that we should somehow ensure that no such glitch
>> remains in the test suite before shipping ViennaCL 1.6 (i.e., all
>> matrix slices/ranges use different offsets/strides in each direction).
>
> Cool, thanks, that indeed looks a lot more compact now. With regard to
> tweaking the tests towards full coverage of row/column strides: this is
> something you have to add to the tests, because you know the internals
> of the kernels best and can design the tests towards the corner cases.
> What caught my attention was the use of {start_M, start_N, start_K} as
> well as {stride_M, stride_N, stride_K}: wouldn't it be better to use
> separate strides for A, B, C in both dimensions, making it six
> parameters rather than three?

It would probably be better, indeed.

>> - I really think that we should rewrite the benchmarks for the 1.6
>> release, especially since they would showcase the substantial
>> performance improvement that this release will bring. I can start
>> writing a condensed benchmark including copy, axpy, dot, gemv, gemm. I
>> think it would be cool to have sparse, solver, and qr also included in
>> that routine. I won't have the time to carry this out, though; I'm
>> moving to the United States in one week :-p
>
> I can take care of that, particularly as I'll have to adjust the tests
> in the benchmark GUI as well. However, I don't think that we should
> merge the sparse and solver routines into the same executable; these
> are two distinct fields of application (dense vs. sparse linear
> algebra). Merging too many different things into one executable also
> has some disadvantages if one piece of functionality does not work on a
> certain machine for whatever reason.

Cool! Yes, you're right, dense and sparse routines are not used for the
same purposes. Which operations should be included in each executable,
then? One for dense benchmarks, and one for sparse benchmarks?
>> - I've noticed some unsafe/faulty legacy code dating back to when the
>> layout was made a runtime parameter.
>> * nmf only implements matrix<T>, but in principle matrix_base<T>
>> should work (since no custom kernel is called, I believe).
>
> NMF uses a custom kernel and thus only works with OpenCL. A
> generalization to matrix_base should be straightforward, yes. I should
> be able to do it for the release.

The kernel it uses is:

template <typename StringType>
void generate_nmf_el_wise_mul_div(StringType & source, std::string const & numeric_string)
{
  source.append("__kernel void el_wise_mul_div( \n");
  source.append("  __global "); source.append(numeric_string); source.append(" * matrix1, \n");
  source.append("  __global const "); source.append(numeric_string); source.append(" * matrix2, \n");
  source.append("  __global const "); source.append(numeric_string); source.append(" * matrix3, \n");
  source.append("  unsigned int size) \n");
  source.append("{ \n");
  source.append("  for (unsigned int i = get_global_id(0); i < size; i += get_global_size(0)) \n");
  source.append("  { \n");
  source.append("    "); source.append(numeric_string); source.append(" val = matrix1[i] * matrix2[i]; \n");
  source.append("    "); source.append(numeric_string); source.append(" divisor = matrix3[i]; \n");
  source.append("    matrix1[i] = (divisor > ("); source.append(numeric_string); source.append(")0.00001) ? (val / divisor) : ("); source.append(numeric_string); source.append(")0; \n");
  source.append("  } \n");
  source.append("} \n");
}

So, the layout of the matrix shouldn't matter, indeed. It would be
pretty easy to have this kernel generated by the generator, too, as it
can be represented by the expression tree:
matrix1 = select(matrix3 > 0.00001, element_div(element_prod(matrix1, matrix2), matrix3), cast<T>(0)).
However, we're running out of time, so I won't port it for now. But we
should keep in mind that this would be a trivial thing to do.
>> * There was a faulty row_major(is_row_major<BaseType>::value) in
>> matrix_range and matrix_stride. This caused matrix_range<matrix_base<T> >
>> to be column-major no matter what. More generally, there are a couple
>> of places using the static is_row_major<> or alignment<> traits. I
>> thought that it could be a good idea to delete these traits to be sure
>> that there can be no such faulty code anywhere else. Am I overlooking
>> any side effect?
>
> These should be removed, yes. However, I recall that it wasn't that
> straightforward to remove back when I removed the layout template
> parameter from matrix_base, since in a few places it was still needed
> for matrix<T,F>. With the FFT code and the eigenvalue routines being
> merged in a few days, this should simplify considerably.

Yes, I noticed the current FFT code making heavy use of the legacy
layouts. It'll be much better once the new version gets merged!

>> - We should definitely have a discussion on matrix padding, which is
>> no longer required anywhere in ViennaCL, as far as I know. I am in
>> favor of making size() == internal_size() by default. That's not the
>> point of this e-mail, but we should discuss what to do with it!
>
> Getting rid of the padding would certainly remove the traps of using
> fast_copy() on a matrix. Other than that, I don't think it has a
> substantial influence on the code, because internal_size() is still
> needed for dealing with ranges.
>
> There may be an influence on certain bandwidth-limited operations,
> though, as for example a matrix addition may lead to bank conflicts (or
> channel conflicts, whatever...) when accessing GPU RAM for certain
> matrix sizes. Before making a decision on the padding issue, we should
> run some benchmarks to see whether there is an impact.
Well, one thing I'm sure of is that we should offer the possibility of
using no padding if needed (for memory constraints), or (probably even
better) of choosing the padding size. I completely agree that removing
padding would have a harmful influence for ldsize = some_weird_number.
However, we certainly don't need to pad both size1 and size2. Padding
size2() for row-major matrices, and size1() for column-major matrices,
will not cause any performance regression.

>> - Finally, there is a performance regression for GEMM with slices, due
>> to my fallback being too extreme (one element computed per work-unit).
>> I'm on it, so don't worry if you've got something like 3 GFLOP/s on
>> slices in the current blas3 benchmark.
>
> Ok, I assume that you'll fix this, and I won't worry about it
> further. :-)
>
>> Okay, that's pretty much everything I'm worried about, I think!
>
> There are a couple more things for the release to be completed. They
> are essentially all listed in the issue tracker and have the 1.6.0
> milestone assigned to them, except for the unification of coding style.
> When are you available for tackling that together? I'm available after
> Monday.

I'm available from today to Friday. I'll be unavailable for quite some
time afterwards for any significant work. I will still be available for
critical work such as fixing correctness issues in the generated code,
but overall I'll be busy designing my PhD course/research plans. What I
plan to do before leaving:
- Fix the GEMM performance regression of the fallback kernel
- Refurbish the benchmark code for dense operations
- Rewrite the matrix-vector tests
Not much more. This is my last week in France, so I want to spend some
time with my family. I've also been having a really hard time lately
adding support for vector types, ranges and strides inside the
generated kernels, so I feel like taking a short break before my PhD
begins...

Philippe

> Best regards,
> Karli