From: Karl R. <ru...@iu...> - 2012-08-23 15:09:59
Dear ViennaCL users and developers, NVIDIA has just released CUDA 5, but it lacks any reference to OpenCL, including the OpenCL samples. This might be a first step toward dropping OpenCL support altogether, which would certainly be unacceptable for the GPGPU world. Please consider signing the following petition: "OpenCL samples in CUDA 5 SDK" http://www.ipetitions.com/petition/opencl-examples-in-cuda-5-sdk/ Thanks and best regards, Karli
From: Philippe T. <phi...@gm...> - 2012-08-23 13:56:19
Hello everybody! Browsing through the internals, I have found that template <typename SCALARTYPE, typename F, unsigned int ALIGNMENT> void fast_copy(SCALARTYPE * cpu_matrix_begin, SCALARTYPE * cpu_matrix_end, matrix<SCALARTYPE, F, ALIGNMENT> & gpu_matrix) might not have the intended behavior: it reallocates the internal gpu_buffer of gpu_matrix to a buffer of size (cpu_matrix_end - cpu_matrix_begin) * sizeof(SCALARTYPE) and copies the data. A user might, however, want to copy just a part of the cpu_matrix to the GPU. Besides, reallocating the matrix without changing its sizes sounds a bit weird. Instead, wouldn't it be more intuitive to call clEnqueueWriteBuffer, and to add a constructor matrix(size1, size2, cpu_matrix_begin), which would allocate with the CL_MEM_COPY_HOST_PTR flag? Such a constructor also covers a const-correctness case I've faced: creating a GPU matrix from CPU data without writing it back. In that case, the gpu_matrix is const, so it is not possible to call fast_copy on it! Best regards, Philippe
From: Philippe T. <phi...@gm...> - 2012-08-20 12:37:37
Hello, 2012/8/20 Karl Rupp <ru...@iu...> > Hello, > > > For now, there is absolutely no data locality control, which means I end >> up with relativity slow performances for now (250GFLops on GTX260 + >> Tesla C2050), which is still a bit faster than using only 1GPU. I'll >> now focus on improving data locality... If it is possible to implement >> such a locality control, doing lazy transfer might just be a pretty >> effective approach :) >> > > Considering the additional functionality in the OpenCL 1.2 standard, I > think that this locality control is available (soon). > > Has anybody done (or read in a paper) a comparison of the performances >> of images and buffer ? >> > > There is no general recommendation possible, because the different SDKs > have (you guess it) different behavior. As there is currently no support > for the image type, I would refrain from using images right now. The > matrix-matrix multiplication kernels for buffers are pretty much efficient > right now, so I don't think it makes sense right now to play many hours > with images just to get a 10 percent performance increase. > > > Also, now that I have done a fully multi-threaded program, I have >> started to see some little weaknesses in the backend's design, when it >> comes to multi-threading. >> I think this would be a better idea to set a device argument ( or even a >> queue argument, but that may be a bit overkill.) to all the functions >> that enqueue something. For example : >> >> mutex_.lock(); >> viennacl::ocl::switch_device( device_of_my_thread ); >> prod(A,B,C); >> mutex_.unlock(); >> >> does not really have the expected behavior, as prod(A,B,C) does not >> always immediately returns. Indeed, it returns when the data are >> migrated on the device. (yes, clEnqueueNDRangeKernel does that...). 
In >> my very specific case, I bypassed this issue by doing something like >> transfer_memory(device_of_my_thread,A); >> transfer_memory(device_of_my_thread,B); >> transfer_memory(device_of_my_thread,C); >> mutex_.lock(); >> viennacl::ocl::switch_device( device_of_my_thread ); >> prod(A,B,C); >> mutex_.unlock(); >> Doing the transfers out of the mutex, the enqueueing returns immediately >> and the other threads do not have to wait forever. This is also a >> possible cause of why I had so much troubles before, but I don't feel >> like git-reverting for double-checking :p >> > > As the memory transfers prior to the computations seem to be reasonable, > the second option is more appealing in my view. > > > > Plus, now the standard guarantees all the API functions are thread-safe, >> except for clSetKernelArg when called on the same kernel object by two >> different thread. But from what I have seen ViennaCL uses different >> kernel objects so there shouldn't be any problem :p >> Maybe adding an argument to almost all the existing function is probably >> a bit harsh, but do you think it is necessary to solve that >> multi-threading issue? >> > > The problem seems to be the convenience layer in the OpenCL backend. I > don't see a simple way to extend the API suitably. A nice-to-have is > something like > > exec_on_device( A = prod(B, C), > device_id); > > but in order to still provide the simple > > A = prod(B, C); > > one would have to call all operations in the destructor of some temporary > proxy object generated by operator=(). However, throwing an exception in a > destructor is a no-go... > > The other option is to implicitly bind the computation to the device the > lvalue resides on. Thus, the computation of A = prod(B, C) would be handled > on the device where A is currently located. However, as OpenCL might move > memory buffers around, I don't think this is the cleanest solution either... > Oh, yes ! I like (love !)
the last option :) It should be possible to do one context per device (now that the actual data is stored in RAM, it doesn't matter :p). It should then be possible in viennacl::ocl::enqueue(KernelType const & k) to do something like : assert(all the handles are on the same device) clEnqueueNDRangeKernel( viennacl::ocl::get_queue( handle.device() ).handle().get(), blablabla...) It is rather clean and only requires a few lines of code to be changed in the current codebase. > Best regards, > Karli > > Best regards, Phil |
From: Karl R. <ru...@iu...> - 2012-08-20 11:59:44
Hello, > For now, there is absolutely no data locality control, which means I end > up with relativity slow performances for now (250GFLops on GTX260 + > Tesla C2050), which is still a bit faster than using only 1GPU. I'll > now focus on improving data locality... If it is possible to implement > such a locality control, doing lazy transfer might just be a pretty > effective approach :) Considering the additional functionality in the OpenCL 1.2 standard, I think that this locality control is available (soon). > Has anybody done (or read in a paper) a comparison of the performances > of images and buffer ? There is no general recommendation possible, because the different SDKs have (you guess it) different behavior. As there is currently no support for the image type, I would refrain from using images right now. The matrix-matrix multiplication kernels for buffers are pretty much efficient right now, so I don't think it makes sense right now to play many hours with images just to get a 10 percent performance increase. > Also, now that I have done a fully multi-threaded program, I have > started to see some little weaknesses in the backend's design, when it > comes to multi-threading. > I think this would be a better idea to set a device argument ( or even a > queue argument, but that may be a bit overkill.) to all the functions > that enqueue something. For example : > > mutex_.lock(); > viennacl::ocl::switch_device( device_of_my_thread ); > prod(A,B,C); > mutex_.unlock(); > > does not really have the expected behavior, as prod(A,B,C) does not > always immediately returns. Indeed, it returns when the data are > migrated on the device. (yes, clEnqueueNDRangeKernel does that...). 
In > my very specific case, I bypassed this issue by doing something like > transfer_memory(device_of_my_thread,A); > transfer_memory(device_of_my_thread,B); > transfer_memory(device_of_my_thread,C); > mutex_.lock(); > viennacl::ocl::switch_device( device_of_my_thread ); > prod(A,B,C); > mutex_.unlock(); > Doing the transfers out of the mutex, the enqueueing returns immediately > and the other threads do not have to wait forever. This is also a > possible cause of why I had so much troubles before, but I don't feel > like git-reverting for double-checking :p As the memory transfers prior to the computations seem to be reasonable, the second option is more appealing in my view. > Plus, now the standard guarantees all the API functions are thread-safe, > except for clSetKernelArg when called on the same kernel object by two > different thread. But from what I have seen ViennaCL uses different > kernel objects so there shouldn't be any problem :p > Maybe adding an argument to almost all the existing function is probably > a bit harsh, but do you think it is necessary to solve that > multi-threading issue? The problem seems to be the convenience layer in the OpenCL backend. I don't see a simple way to extend the API suitably. A nice-to-have is something like exec_on_device( A = prod(B, C), device_id); but in order to still provide the simple A = prod(B, C); one would have to call all operations in the destructor of some temporary proxy object generated by operator=(). However, throwing an exception in a destructor is a no-go... The other option is to implicitly bind the computation to the device the lvalue resides on. Thus, the computation of A = prod(B, C) would be handled on the device where A is currently located. However, as OpenCL might move memory buffers around, I don't think this is the cleanest solution either... Best regards, Karli |
From: Philippe T. <phi...@gm...> - 2012-08-18 20:53:41
Hello everybody :) After a discussion with karl and many troubles due to memory transfers, we've decided to do ourself the memory transfers. That is, as of now, the scheduler transfers the data from CPU to GPU at the time of enqueueing a task, and the multi_matrix is stored in RAM : Pros : * Makes it possible to use multiple devices from different platforms. For example, 1 NVidia GPU + 1 Integrated GPU from an APU or an Ivy Bridge. Yes, multi-gpu for everyone ! :) * The matrix stored can be very big! (it can fill the RAM!) * With this method, it is now possible to implement linalg operations with images, as it is now possible to distinguish between read-only and read-write according to the constness of the parameters. For some operations (non inplace), read-write is about the same as write-only . * We do the transfers ourselves Cons : * We do the transfers ourselves :) For now, there is absolutely no data locality control, which means I end up with relativity slow performances for now (250GFLops on GTX260 + Tesla C2050), which is still a bit faster than using only 1GPU. I'll now focus on improving data locality... If it is possible to implement such a locality control, doing lazy transfer might just be a pretty effective approach :) Has anybody done (or read in a paper) a comparison of the performances of images and buffer ? Also, now that I have done a fully multi-threaded program, I have started to see some little weaknesses in the backend's design, when it comes to multi-threading. I think this would be a better idea to set a device argument ( or even a queue argument, but that may be a bit overkill.) to all the functions that enqueue something. For example : mutex_.lock(); viennacl::ocl::switch_device( device_of_my_thread ); prod(A,B,C); mutex_.unlock(); does not really have the expected behavior, as prod(A,B,C) does not always immediately returns. Indeed, it returns when the data are migrated on the device. (yes, clEnqueueNDRangeKernel does that...). 
In my very specific case, I bypassed this issue by doing something like transfer_memory(device_of_my_thread,A); transfer_memory(device_of_my_thread,B); transfer_memory(device_of_my_thread,C); mutex_.lock(); viennacl::ocl::switch_device( device_of_my_thread ); prod(A,B,C); mutex_.unlock(); Doing the transfers out of the mutex, the enqueueing returns immediately and the other threads do not have to wait forever. This is also a possible cause of why I had so much troubles before, but I don't feel like git-reverting for double-checking :p Plus, now the standard guarantees all the API functions are thread-safe, except for clSetKernelArg when called on the same kernel object by two different thread. But from what I have seen ViennaCL uses different kernel objects so there shouldn't be any problem :p Maybe adding an argument to almost all the existing function is probably a bit harsh, but do you think it is necessary to solve that multi-threading issue? Best regards, Philippe |
From: Philippe T. <phi...@gm...> - 2012-08-11 16:11:51
Hello everybody, There is an interesting news item on hgpu.org; I forward it :D It could be very useful for testing multi-GPU stuff or multi-platform stuff! I'll sure give it a try :D Node 1 is OpenCL 1.2 compatible ^^

Free GPU computing node at hgpu.org: Registered users can now run their OpenCL application at hgpu.org. We provide 1 minute of computer time per run on two nodes with two AMD and one NVIDIA graphics processing unit, respectively. There are no restrictions on the number of starts. The platforms are:

Node 1
- GPU device 0: AMD/ATI Radeon HD 5870 2GB, 850MHz
- GPU device 1: AMD/ATI Radeon HD 6970 2GB, 880MHz
- CPU: AMD Phenom II X6 1055T @ 2.8GHz
- RAM: 12GB
- HDD: 2TB, RAID-0
- OS: OpenSUSE (http://www.opensuse.org/) 11.4
- SDK: AMD APP SDK (http://developer.amd.com/gpu/AMDAPPSDK/Pages/default.aspx) 2.7

Node 2
- GPU device 0: AMD/ATI Radeon HD 7970 3GB, 1000MHz
- GPU device 1: NVIDIA GeForce GTX 560 Ti 2GB, 822MHz
- CPU: Intel Core i7-2600 @ 3.4GHz
- RAM: 16GB
- HDD: 2TB, RAID-0
- OS: OpenSUSE (http://www.opensuse.org/) 11.4
- SDK: NVIDIA CUDA Toolkit (http://developer.nvidia.com/cuda-toolkit-40) 4.2.9, AMD APP SDK (http://developer.amd.com/gpu/AMDAPPSDK/Pages/default.aspx) 2.7

Completed OpenCL projects should be uploaded via the user dashboard (http://hgpu.org/?page_id=3314; see instructions and example there); compilation and execution terminal output logs will be provided to the user. The information sent to hgpu.org will be treated according to our Privacy Policy.
From: Karl R. <ru...@iu...> - 2012-08-09 21:06:58
Dear ViennaCL users, ViennaCL 1.3.1 is now available for download! Among smaller improvements, the noteworthy changes are:
- Extended flexibility of submatrix and subvector proxies.
- Block-ILU for compressed_matrix is now applied on the GPU during the solver cycle phase.
- SVD now supports double precision.
- Fixed a problem with matrix-matrix products if the result matrix is not initialized properly (thanks to Laszlo Marak).
Thanks to all contributors :-) Best regards, Karl Rupp
From: <ru...@iu...> - 2012-08-08 15:07:32
Dear developers and contributors, I've just pushed the release candidate for version 1.3.1 to the sf.net repository. Please check it out and test the compilation on your system. In particular, a verification of successful compilation with GCC 4.7 is appreciated. All bugs reported and changes suggested on SourceForge are fixed. Meanwhile, I'll verify compilation on Windows XP/7. Please report any issues by tomorrow (Thu, Aug 9, 18:00 UTC). Yes, the time window is rather small, but this will provide more time for the stabilization of 1.4.0. Thanks and best regards, Karli
From: Philippe T. <phi...@gm...> - 2012-08-05 23:51:11
2012/8/5 Karl Rupp <ru...@iu...> > Hi, > > > > I have (finally !) been able to make the multi-gpu thing work in >> parallel on all devices. I achieve for now 313GFlops on GTX470 + Tesla >> C2050 on a 9600*9600 matrix (each GPU perform ~200GFlops individually). >> > > Very nice! :-) > > > > The OpenCL implementations usually don't like when the same handle has >> to be dealt with in two different queues. In my case, >> clEnqueueNDRangeKernel didn't return until the buffer became available >> for transfer (which is weird, as this function is somewhat guaranteed to >> return immediatly !). Therefore, it was necessary to call the copy >> constructor of the matrices which had to be used in two different kernels. >> > > Does the problem only show up with two queues on two different devices, or > is the same true with two queues on one device? I don't really know, i have not been able to test two queues on one device :p Why? > > > > Even doing that, both the original and the copy matrix live on the same >> device : the one associated with the queue used in clEnqueueCopyBuffer . >> Therefore, once all the copies have been done, it is necessary to >> migrate all the handles to the appropriate device. For now, NVidia is >> not OpenCL 1.2 compatible, so it's not possible to use the new API call >> clEnqueueMigrateMemObjects... For now, it is necessary to use a "trick" : >> >> float* tmp = new float[size] >> clEnqueueReadBuffer //read into tmp >> delete[] tmp; >> >> This transfers the handles to the host, which is enough as long as the >> host does not have to execute a kernel (in that case everything ends up >> being serialized too :( ) >> > > Since OpenCL 1.2 'solves' the problem and we have a workaround for OpenCL > 1.1, let's just accept the current situation. It might even be the case > that the clEnqueueCopyBuffer() internally does the same thing: Copy data > between GPUs via CPU RAM. 
> > One more remark on the 'workaround code': > You better want to use a std::vector<> in order to avoid issues with > deallocation. The raw pointer can be extracted via &(tmp[0]) > > > > For now, the tasks are not arranged so as to minimize the number of copy >> performed, it should increase performance. >> > > The typical assumption in the benchmarks reported by various institutions > is that the data is already set up on the device accordingly. 313 GFLOPs > (general note: this is counting fmad as one operation, so the marketing > value would be 626 GFLOPs) are pretty good already. Yay marketing value! :D. Actually, this was kind of just a warm up. I realized my implementation was not really doing any dynamic scheduling and was only good when the two devices had the same computing power... I'll have to give my threading model another look ! I kind of refuse to fallback on static scheduling, it really does not seem flexible enough to me (plus AMD advises to use a dynamic scheduler !). Some GFlops still can be gained I think :p should be possible to reach something like 350 I think, and a better performance overall when devices are not equally powerful (which might be the case, even if the two devices are the same, as one of them might already be doing something). I also want to make the scheduler out-of-order and solving the dependancies alone, considering the equality of the input/output handles... kind of like a Parallel Compiler... ;) I definitely don't feel like writing the tree for every existing operation. There was at HPC2012 a talk of Marc Baboulin on using concurrently GPU and CPU for several tasks. I also think about associating a cl_device_type with each task. Maybe, then, in the future, it would be possible to take more benefits on heterogeneous computing with the AMD Platform... 
Also, modern GPUs are increasingly able to enqueue multiple kernels at once (HD 7950, Last NVidia generation), so in my design I have to take in account the fact that I might even want to enqueues multiple kernels before flushing in the future (for now, we do not have the hardware to test it anyway :D) Still a lot of things to do, but the first is probably to debug the scheduler to make it more dynamic, I'm going in holiday this week so it will be hard for me to work on that =) > > I fear I have also caused memory leaks on the device, as >> NVidiaVisualProfiler won't work... i'll have to debug this :p >> > > No, not necessarily. I've encountered similar problems with the > installation on the testing machine, yet I could reproduce the issue with a > rather simple clean OpenCL code. If I remember correctly, the issue was > related to some static handles and their point of destruction by the > compiler. So, it's not necessarily your fault ;-) > Hmmm, maybe :p I'll try to give it a deeper look once everything works better. > > Best regards, > Karli > > > Best regards, Philippe |
From: Karl R. <ru...@iu...> - 2012-08-05 16:51:48
Hi, > I have (finally !) been able to make the multi-gpu thing work in > parallel on all devices. I achieve for now 313GFlops on GTX470 + Tesla > C2050 on a 9600*9600 matrix (each GPU perform ~200GFlops individually). Very nice! :-) > The OpenCL implementations usually don't like when the same handle has > to be dealt with in two different queues. In my case, > clEnqueueNDRangeKernel didn't return until the buffer became available > for transfer (which is weird, as this function is somewhat guaranteed to > return immediatly !). Therefore, it was necessary to call the copy > constructor of the matrices which had to be used in two different kernels. Does the problem only show up with two queues on two different devices, or is the same true with two queues on one device? > Even doing that, both the original and the copy matrix live on the same > device : the one associated with the queue used in clEnqueueCopyBuffer . > Therefore, once all the copies have been done, it is necessary to > migrate all the handles to the appropriate device. For now, NVidia is > not OpenCL 1.2 compatible, so it's not possible to use the new API call > clEnqueueMigrateMemObjects... For now, it is necessary to use a "trick" : > > float* tmp = new float[size] > clEnqueueReadBuffer //read into tmp > delete[] tmp; > > This transfers the handles to the host, which is enough as long as the > host does not have to execute a kernel (in that case everything ends up > being serialized too :( ) Since OpenCL 1.2 'solves' the problem and we have a workaround for OpenCL 1.1, let's just accept the current situation. It might even be the case that the clEnqueueCopyBuffer() internally does the same thing: Copy data between GPUs via CPU RAM. One more remark on the 'workaround code': You better want to use a std::vector<> in order to avoid issues with deallocation. 
The raw pointer can be extracted via &(tmp[0]) > For now, the tasks are not arranged so as to minimize the number of copy > performed, it should increase performance. The typical assumption in the benchmarks reported by various institutions is that the data is already set up on the device accordingly. 313 GFLOPs (general note: this is counting fmad as one operation, so the marketing value would be 626 GFLOPs) are pretty good already. > I fear I have also caused memory leaks on the device, as > NVidiaVisualProfiler won't work... i'll have to debug this :p No, not necessarily. I've encountered similar problems with the installation on the testing machine, yet I could reproduce the issue with a rather simple clean OpenCL code. If I remember correctly, the issue was related to some static handles and their point of destruction by the compiler. So, it's not necessarily your fault ;-) Best regards, Karli |
From: Philippe T. <phi...@gm...> - 2012-08-02 14:10:04
Hello everybody ! I have (finally !) been able to make the multi-gpu thing work in parallel on all devices. I achieve for now 313GFlops on GTX470 + Tesla C2050 on a 9600*9600 matrix (each GPU perform ~200GFlops individually). Here are the main concurrency troubles I have faced, so that everybody knows about them :) The OpenCL implementations usually don't like when the same handle has to be dealt with in two different queues. In my case, clEnqueueNDRangeKernel didn't return until the buffer became available for transfer (which is weird, as this function is somewhat guaranteed to return immediatly !). Therefore, it was necessary to call the copy constructor of the matrices which had to be used in two different kernels. Even doing that, both the original and the copy matrix live on the same device : the one associated with the queue used in clEnqueueCopyBuffer . Therefore, once all the copies have been done, it is necessary to migrate all the handles to the appropriate device. For now, NVidia is not OpenCL 1.2 compatible, so it's not possible to use the new API call clEnqueueMigrateMemObjects... For now, it is necessary to use a "trick" : float* tmp = new float[size] clEnqueueReadBuffer //read into tmp delete[] tmp; This transfers the handles to the host, which is enough as long as the host does not have to execute a kernel (in that case everything ends up being serialized too :( ) The drawback is that, when the handle is already on the host, a useless copy is performed. For now, the tasks are not arranged so as to minimize the number of copy performed, it should increase performance. I fear I have also caused memory leaks on the device, as NVidiaVisualProfiler won't work... i'll have to debug this :p Best regards, Philippe |
From: Karl R. <ru...@iu...> - 2012-08-02 12:48:06
Hi, > I think that's a good way to go, except that I think the SSE and OpenMP > BLAS implementations shouldn't be separate. I'm a little bit > intimidated because all the OpenCL code would have to be translated to > CPU code in order for a CPU backend to have full functionality. This > might not be done anytime soon. That's a point, keeping SSE and OpenMP together makes sense. Not all OpenCL kernels need to be translated to SSE/OpenMP right away. Most of the operations can be handled with simple loops, possibly even in a generative way (e.g. templates). It's sufficient if you focus on the 'interesting' kernels, I can add the simpler kernels/operatins as well. > Also, do you think my sse blas and tred2 will be included in the next > release? When is the next release? The next release is expected to be next week, version 1.3.1. This is going to be a bugfix release and further stabilizes some of the new experimental features. I hope to include your SSE contributions in 1.4.0, which will also include the developments from the Google Summer of Code (generalized eigenvalue problems) and is expected to be in the second half of September. This is, however, not set in stone - as university courses usually start at this time we better bring all summer developments to a stable state. :-) Best regards, Karli > PS: reply cc'ed to viennacl-devel :-) > > On Wed, Aug 1, 2012 at 3:38 AM, Karl Rupp <ru...@iu... > <mailto:ru...@iu...>> wrote: > > Hi Alex, > > I've spent some more thoughts on how to separate the linear algebra > backends suitably. Currently, some OpenCL statements are mixed into > the vector<> and matrix<> classes, while the operations are clearly > separated via calls to externally defined functions (e.g. > prod_impl()), cf. vector_operations.hpp and matrix_operations.hpp. > > To simplify your development efforts I could continue this > separation and also move initialization routines to separate header > files. 
In the best case, all that is necessary for a CPU-only > fallback is to have e.g. in vector.hpp something like > > #ifdef VIENNACL_NO_OPENCL > #include "viennacl/linalg/vector-__operations-cpu.hpp" > #else > #include "viennacl/linalg/vector-__operations-opencl.hpp" > #endif > > Going one step further, we could even separate the convenience types > from the BLAS backend and support something like > > #if defined VIENNACL_USE_SSE_BLAS > #include "viennacl/linalg/vector-__operations-sse.hpp" > #elif defined VIENNACL_USE_OPENCL_BLAS > #include "viennacl/linalg/vector-__operations-opencl.hpp" > #elif defined VIENNACL_USE_OPENMP_BLAS > #include "viennacl/linalg/vector-__operations-openmp.hpp" > ... > #else > #include "viennacl/linalg/vector-__operations-fallback.hpp" > #endif > > We probably won't have the development resources for supporting a > whole zoo of different backends, yet I like the idea of a clean > separation. What do you think? > > Best regards, > Karli > > PS: cc'ed to viennacl-devel > > > > > On 07/29/2012 06:49 AM, Alex Christensen wrote: > > I made tred2 not copy memory, and it works with ublas matrices. > My goal > is to make a backend so that defining VIENNACL_NO_OPENCL makes > existing > code work without a gpu (or even linking to an OpenCL library). > I'll > let you know if I run into any problems. Hopefully the existing > QR code > will work with that. > > Since the LU routines don't do partial pivoting, should I > include my cpu > LU function with partial pivoting? Should I include my cholesky > function also, maybe as a separate header? The only cholesky > function I > have found in ViennaCL is in spai. > > Alex > > > |
From: Philippe T. <phi...@gm...> - 2012-08-02 02:27:11
Hi again, I have been able to observe some very funny behavior. I am using one host thread per queue, and, depending on the race condition (ie which devices takes which handle), the operation is longer or shorter to execute ! The good news is that, in some cases, I have got 0.96seconds for a 6400*6400 float matrix, which is about 6.4^3/0.96 = 273GFlop, which is slightly more than what a single GPU can do. I have just done an API for creating copies upon dependancies, but for now it is just not working... because I have still not taken care of data locality ! Going for one context per device still seems to be a bad idea, as no function of the OpenCL API will work. For example, the copy constructor of matrix() will have an undefined behavior in the cases where the two matrices are on another device, because clEnqueueCopyBuffer assumes a common context ! and i'm sure there are a lot of similar cases, unfortunately. Plus, the OpenCL Standards tend to encourage this use of multiple devices, with the introduction as new functions such as clEnqueueMigrateMemObjects (or whatever the name is :D). Anyway, i hope i'll soon be able to get out of this *&é'(_èç ! Good night everybody :) 2012/8/1 Karl Rupp <ru...@iu...> > Hi Philippe, > > thanks for the investigations. > > > kernel 1, device 1 : C(0,0) = A(0,0) * B(0,0) > > kernel 2, device 2 : C(0,1) = A(0,0) * B(0,1), > > Both the AMD and the NVidia SDK are unable to multicast A(0,0) from the > > host to the two GPUs. Even if the two kernels are enqued in parallel, > > the execution is serialized, because the 2nd device has to wait for > > A(0,0) to be available. This is exactly the behavior I feared. > > It does not happen with a simple matrix addition, where all the handles > > are independant. > > Okay, I see, so the const-qualifiers for the kernel handles are ignored > (or not abused for a more efficient implementation). 
Thus, it seems like > we have to use separate memory handles in such case and that we better > attach some meta-information ('current device') to each memory handle. > > > > I'm desesperately looking for a low-memory handle multicasting. I might > > give the Khronos forum a try, even though enqueuing the same handle on > > different queues is left implementation-defined by the standards! > > Oh dear, 'implementation-defined' is nothing I want to see at this point > :-( Seems like we should perhaps reconsider using one context per device > and benchmark memory transfers for the two options (i.e. one context for > all devices vs. one context per device). > > > But well, the good news is that the kernels are executing! > > Yep, some good news :-) > > Best regards, > Karli > > > > > > > > 2012/7/31 Karl Rupp <ru...@iu... <mailto:ru...@iu... > >> > > > > Hello again, > > > > I've justed pushed the following changes to the > sourceforge-repository: > > * operator+= and operator-= no longer create temporaries > > * A = prod(B,C) does not fail if there is garbage in A > > > > Best regards, > > Karli > > > > > > > > On 07/29/2012 03:59 PM, Philippe Tillet wrote: > > > > Hello everybody ! > > > > I'll inaugurate this mailing list with a little question. > > I have not seen any kernel for computing the operation A += > > prod(B,C) . > > Does this mean that this operation is done doing : > > > > tmp = prod(B,C) > > a+=tmp > > > > ? > > > > For computing the multi_matrix ( project i'm working on, matrix > > composed > > of multiple handles, to solve the CL_MAX_ALLOCABLE_MEMORY and > > the multi > > devices issue), I need to do several updates of this kind, in a > > block > > layout. For a 2*2 block layout : > > > > C(0,0).clear(); > > => > > C(0,0) += prod( A(0,0), B(0,0) ) > > => > > C(0,0) += prod( A(0,1), B(1,0) ) > > > > C(0,1).clear(); > > => > > C(0,1) += prod( A(0,0), B(0,1) ) > > => > > C(0,1) += prod( A(0,1), B(1,1) ) > > > > ... > > ... 
> > > > This "sort-of-rank-1-update approach" is a special case of the > > SUMMA > > Algorithm (OpenCL doing the memory transfers in the back > ground, > > for now > > at least) and seems to be efficient from a memory point of > view. > > Using > > another approach would lead to both a huge memory consumption > and > > significant memory transfers... > > > > Is there any way of doing so in ViennaCL ? > > > > Best regards ! > > Phil > > > > > > > > > ------------------------------__------------------------------__------------------ > > Live Security Virtual Conference > > Exclusive live event will cover all the ways today's security > and > > threat landscape has changed and how IT managers can respond. > > Discussions > > will include endpoint security, mobile security and the latest > > in malware > > threats. > > http://www.accelacomm.com/jaw/__sfrnl04242012/114/50122263/ > > <http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/> > > > > > > > > _________________________________________________ > > ViennaCL-devel mailing list > > ViennaCL-devel@lists.__sourceforge.net > > <mailto:Vie...@li...> > > https://lists.sourceforge.net/__lists/listinfo/viennacl-devel > > <https://lists.sourceforge.net/lists/listinfo/viennacl-devel> > > > > > > > > > > > > > > > ------------------------------------------------------------------------------ > > Live Security Virtual Conference > > Exclusive live event will cover all the ways today's security and > > threat landscape has changed and how IT managers can respond. Discussions > > will include endpoint security, mobile security and the latest in malware > > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > > _______________________________________________ > > ViennaCL-devel mailing list > > Vie...@li... 
> > https://lists.sourceforge.net/lists/listinfo/viennacl-devel > > > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > ViennaCL-devel mailing list > Vie...@li... > https://lists.sourceforge.net/lists/listinfo/viennacl-devel > |
From: Alex C. <ach...@gm...> - 2012-08-02 01:13:14
|
I think that's a good way to go, except that I think the SSE and OpenMP BLAS implementations shouldn't be separate. I'm a little bit intimidated because all the OpenCL code would have to be translated to CPU code in order for a CPU backend to have full functionality. This might not be done anytime soon. Also, do you think my sse blas and tred2 will be included in the next release? When is the next release? Alex PS: reply cc'ed to viennacl-devel On Wed, Aug 1, 2012 at 3:38 AM, Karl Rupp <ru...@iu...> wrote: > Hi Alex, > > I've spent some more thoughts on how to separate the linear algebra > backends suitably. Currently, some OpenCL statements are mixed into the > vector<> and matrix<> classes, while the operations are clearly separated > via calls to externally defined functions (e.g. prod_impl()), cf. > vector_operations.hpp and matrix_operations.hpp. > > To simplify your development efforts I could continue this separation and > also move initialization routines to separate header files. In the best > case, all that is necessary for a CPU-only fallback is to have e.g. in > vector.hpp something like > > #ifdef VIENNACL_NO_OPENCL > #include "viennacl/linalg/vector-**operations-cpu.hpp" > #else > #include "viennacl/linalg/vector-**operations-opencl.hpp" > #endif > > Going one step further, we could even separate the convenience types from > the BLAS backend and support something like > > #if defined VIENNACL_USE_SSE_BLAS > #include "viennacl/linalg/vector-**operations-sse.hpp" > #elif defined VIENNACL_USE_OPENCL_BLAS > #include "viennacl/linalg/vector-**operations-opencl.hpp" > #elif defined VIENNACL_USE_OPENMP_BLAS > #include "viennacl/linalg/vector-**operations-openmp.hpp" > ... > #else > #include "viennacl/linalg/vector-**operations-fallback.hpp" > #endif > > We probably won't have the development resources for supporting a whole > zoo of different backends, yet I like the idea of a clean separation. What > do you think? 
> > Best regards, > Karli > > PS: cc'ed to viennacl-devel > > > > > On 07/29/2012 06:49 AM, Alex Christensen wrote: > >> I made tred2 not copy memory, and it works with ublas matrices. My goal >> is to make a backend so that defining VIENNACL_NO_OPENCL makes existing >> code work without a gpu (or even linking to an OpenCL library). I'll >> let you know if I run into any problems. Hopefully the existing QR code >> will work with that. >> >> Since the LU routines don't do partial pivoting, should I include my cpu >> LU function with partial pivoting? Should I include my cholesky >> function also, maybe as a separate header? The only cholesky function I >> have found in ViennaCL is in spai. >> >> Alex >> >> > |
From: Karl R. <ru...@iu...> - 2012-08-01 10:55:50
|
Hi, my first thought was that OpenCL on the CPU would be *the* cool thing. However, running some benchmarks soon showed that OpenCL won't be the best choice for everything. The most annoying thing is the kernel launch overhead, which is even on the CPU in the range of 10 us. Even for a moderate 1 GHz CPU, this translates to 10k CPU cycles 'wasted'. Thus, for 'small' operations OpenCL won't give any good performance... :-( I agree that the current OpenCL backend requires extensions in order to handle device-specific implementations. As a (recent) prominent example, the super-fast matrix-matrix multiplication kernels for NVIDIA GPUs are not optimal for AMD GPUs and even for some older NVIDIA GPUs. The OpenCL backend, however, is essentially independent of the user types, so the ideal design in my view is the following: Layer 1: User API types (viennacl::vector<>, etc.) Layer 2: BLAS calling code (OpenCL, SSE, etc. Maybe hybrid?) Layer 3: BLAS backend details (OpenCL kernel management, etc.) The VIENNACL_USE_XYZ_BLAS defines would basically select the Layer 2 implementation to be used, while the better OpenCL kernel management with possibly several tuned kernels resides entirely in Layer 3. In a first step I think it's better to select Layer 2 statically via preprocessor defines. In a second step we might even use hybrid approaches by using SSE for small operations on OpenCL handles on APUs, thus circumventing the ugly kernel launch overheads in such cases. (Just as a remark: The scientific community is right now *not* interested in APUs due to the lack of double precision support. Due to thermal limitations I don't think an APU will beat a standalone GPU anytime soon, even though the latency problems introduced by PCI-Express are obvious.) In other words: Since there does not seem to be a single 'best programming approach', let's combine the best of different approaches. :-) Best regards, Karli On 08/01/2012 12:33 PM, Philippe Tillet wrote: > Hi everybody ! 
> > My personal opinion about that is that the capabilities on OpenCL on the > CPU should not be overlooked. Intel is also putting a lot of efforts > into getting strong OpenCL tools! > I wonder if CPU-Optimized kernels would not be able to beat an SSE > implementation. > Plus, in my opinion, we are tending to a future where everybody will > have an OpenCL-capable GPU,(as of today, the Intel HD Graphics of the > Ivy Bridge is OpenCL-capable on windows, and AMD APUs also have 2 > devices recognized on both Windows and Linux). > > Therefore, wouldn't it be better to focus on optimizing some kernels for > the CPUs, and letting the implementation redirect the computation to the > proper kernel? > Plus, it would be a chance to write an API for dealing with > platform-specific kernels, and therefore give us the possibility in the > future to optimize some kernels for either AMD CPU, AMD GPU, Intel, > NVidia GPU...! > > Best regards, > Philippe > > > 2012/8/1 Karl Rupp <ru...@iu... <mailto:ru...@iu...>> > > Hi Alex, > > I've spent some more thoughts on how to separate the linear algebra > backends suitably. Currently, some OpenCL statements are mixed into the > vector<> and matrix<> classes, while the operations are clearly > separated via calls to externally defined functions (e.g. prod_impl()), > cf. vector_operations.hpp and matrix_operations.hpp. > > To simplify your development efforts I could continue this separation > and also move initialization routines to separate header files. In the > best case, all that is necessary for a CPU-only fallback is to have e.g. 
> in vector.hpp something like > > #ifdef VIENNACL_NO_OPENCL > #include "viennacl/linalg/vector-operations-cpu.hpp" > #else > #include "viennacl/linalg/vector-operations-opencl.hpp" > #endif > > Going one step further, we could even separate the convenience types > from the BLAS backend and support something like > > #if defined VIENNACL_USE_SSE_BLAS > #include "viennacl/linalg/vector-operations-sse.hpp" > #elif defined VIENNACL_USE_OPENCL_BLAS > #include "viennacl/linalg/vector-operations-opencl.hpp" > #elif defined VIENNACL_USE_OPENMP_BLAS > #include "viennacl/linalg/vector-operations-openmp.hpp" > ... > #else > #include "viennacl/linalg/vector-operations-fallback.hpp" > #endif > > We probably won't have the development resources for supporting a whole > zoo of different backends, yet I like the idea of a clean separation. > What do you think? > > Best regards, > Karli > > PS: cc'ed to viennacl-devel > > > > > On 07/29/2012 06:49 AM, Alex Christensen wrote: > > I made tred2 not copy memory, and it works with ublas matrices. > My goal > > is to make a backend so that defining VIENNACL_NO_OPENCL makes > existing > > code work without a gpu (or even linking to an OpenCL library). I'll > > let you know if I run into any problems. Hopefully the existing > QR code > > will work with that. > > > > Since the LU routines don't do partial pivoting, should I include > my cpu > > LU function with partial pivoting? Should I include my cholesky > > function also, maybe as a separate header? The only cholesky > function I > > have found in ViennaCL is in spai. > > > > Alex > > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. > Discussions > will include endpoint security, mobile security and the latest in > malware > threats. 
http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > ViennaCL-devel mailing list > Vie...@li... > <mailto:Vie...@li...> > https://lists.sourceforge.net/lists/listinfo/viennacl-devel > > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > > > > _______________________________________________ > ViennaCL-devel mailing list > Vie...@li... > https://lists.sourceforge.net/lists/listinfo/viennacl-devel > |
From: Philippe T. <phi...@gm...> - 2012-08-01 10:33:36
|
Hi everybody ! My personal opinion is that the capabilities of OpenCL on the CPU should not be overlooked. Intel is also putting a lot of effort into building strong OpenCL tools! I wonder whether CPU-optimized kernels would not be able to beat an SSE implementation. Plus, in my opinion, we are heading toward a future where everybody will have an OpenCL-capable GPU (as of today, the Intel HD Graphics of Ivy Bridge is OpenCL-capable on Windows, and AMD APUs also have two devices recognized on both Windows and Linux). Therefore, wouldn't it be better to focus on optimizing some kernels for CPUs, and let the implementation redirect the computation to the proper kernel? Plus, it would be a chance to write an API for dealing with platform-specific kernels, and thereby give us the possibility in the future to optimize kernels for AMD CPUs, AMD GPUs, Intel, NVIDIA GPUs...! Best regards, Philippe 2012/8/1 Karl Rupp <ru...@iu...> > Hi Alex, > > I've spent some more thoughts on how to separate the linear algebra > backends suitably. Currently, some OpenCL statements are mixed into the > vector<> and matrix<> classes, while the operations are clearly > separated via calls to externally defined functions (e.g. prod_impl()), > cf. vector_operations.hpp and matrix_operations.hpp. > > To simplify your development efforts I could continue this separation > and also move initialization routines to separate header files. In the > best case, all that is necessary for a CPU-only fallback is to have e.g. 
> in vector.hpp something like > > #ifdef VIENNACL_NO_OPENCL > #include "viennacl/linalg/vector-operations-cpu.hpp" > #else > #include "viennacl/linalg/vector-operations-opencl.hpp" > #endif > > Going one step further, we could even separate the convenience types > from the BLAS backend and support something like > > #if defined VIENNACL_USE_SSE_BLAS > #include "viennacl/linalg/vector-operations-sse.hpp" > #elif defined VIENNACL_USE_OPENCL_BLAS > #include "viennacl/linalg/vector-operations-opencl.hpp" > #elif defined VIENNACL_USE_OPENMP_BLAS > #include "viennacl/linalg/vector-operations-openmp.hpp" > ... > #else > #include "viennacl/linalg/vector-operations-fallback.hpp" > #endif > > We probably won't have the development resources for supporting a whole > zoo of different backends, yet I like the idea of a clean separation. > What do you think? > > Best regards, > Karli > > PS: cc'ed to viennacl-devel > > > > > On 07/29/2012 06:49 AM, Alex Christensen wrote: > > I made tred2 not copy memory, and it works with ublas matrices. My goal > > is to make a backend so that defining VIENNACL_NO_OPENCL makes existing > > code work without a gpu (or even linking to an OpenCL library). I'll > > let you know if I run into any problems. Hopefully the existing QR code > > will work with that. > > > > Since the LU routines don't do partial pivoting, should I include my cpu > > LU function with partial pivoting? Should I include my cholesky > > function also, maybe as a separate header? The only cholesky function I > > have found in ViennaCL is in spai. > > > > Alex > > > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. 
http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > ViennaCL-devel mailing list > Vie...@li... > https://lists.sourceforge.net/lists/listinfo/viennacl-devel > |
From: Karl R. <ru...@iu...> - 2012-08-01 09:38:53
|
Hi Alex, I've spent some more thoughts on how to separate the linear algebra backends suitably. Currently, some OpenCL statements are mixed into the vector<> and matrix<> classes, while the operations are clearly separated via calls to externally defined functions (e.g. prod_impl()), cf. vector_operations.hpp and matrix_operations.hpp. To simplify your development efforts I could continue this separation and also move initialization routines to separate header files. In the best case, all that is necessary for a CPU-only fallback is to have e.g. in vector.hpp something like #ifdef VIENNACL_NO_OPENCL #include "viennacl/linalg/vector-operations-cpu.hpp" #else #include "viennacl/linalg/vector-operations-opencl.hpp" #endif Going one step further, we could even separate the convenience types from the BLAS backend and support something like #if defined VIENNACL_USE_SSE_BLAS #include "viennacl/linalg/vector-operations-sse.hpp" #elif defined VIENNACL_USE_OPENCL_BLAS #include "viennacl/linalg/vector-operations-opencl.hpp" #elif defined VIENNACL_USE_OPENMP_BLAS #include "viennacl/linalg/vector-operations-openmp.hpp" ... #else #include "viennacl/linalg/vector-operations-fallback.hpp" #endif We probably won't have the development resources for supporting a whole zoo of different backends, yet I like the idea of a clean separation. What do you think? Best regards, Karli PS: cc'ed to viennacl-devel On 07/29/2012 06:49 AM, Alex Christensen wrote: > I made tred2 not copy memory, and it works with ublas matrices. My goal > is to make a backend so that defining VIENNACL_NO_OPENCL makes existing > code work without a gpu (or even linking to an OpenCL library). I'll > let you know if I run into any problems. Hopefully the existing QR code > will work with that. > > Since the LU routines don't do partial pivoting, should I include my cpu > LU function with partial pivoting? Should I include my cholesky > function also, maybe as a separate header? 
The only cholesky function I > have found in ViennaCL is in spai. > > Alex > |
From: Karl R. <ru...@iu...> - 2012-08-01 08:39:22
|
Hi Philippe, thanks for the investigations. > kernel 1, device 1 : C(0,0) = A(0,0) * B(0,0) > kernel 2, device 2 : C(0,1) = A(0,0) * B(0,1), > Both the AMD and the NVidia SDK are unable to multicast A(0,0) from the > host to the two GPUs. Even if the two kernels are enqued in parallel, > the execution is serialized, because the 2nd device has to wait for > A(0,0) to be available. This is exactly the behavior I feared. > It does not happen with a simple matrix addition, where all the handles > are independant. Okay, I see, so the const-qualifiers for the kernel handles are ignored (or not abused for a more efficient implementation). Thus, it seems like we have to use separate memory handles in such case and that we better attach some meta-information ('current device') to each memory handle. > I'm desesperately looking for a low-memory handle multicasting. I might > give the Khronos forum a try, even though enqueuing the same handle on > different queues is left implementation-defined by the standards! Oh dear, 'implementation-defined' is nothing I want to see at this point :-( Seems like we should perhaps reconsider using one context per device and benchmark memory transfers for the two options (i.e. one context for all devices vs. one context per device). > But well, the good news is that the kernels are executing! Yep, some good news :-) Best regards, Karli > > 2012/7/31 Karl Rupp <ru...@iu... <mailto:ru...@iu...>> > > Hello again, > > I've justed pushed the following changes to the sourceforge-repository: > * operator+= and operator-= no longer create temporaries > * A = prod(B,C) does not fail if there is garbage in A > > Best regards, > Karli > > > > On 07/29/2012 03:59 PM, Philippe Tillet wrote: > > Hello everybody ! > > I'll inaugurate this mailing list with a little question. > I have not seen any kernel for computing the operation A += > prod(B,C) . > Does this mean that this operation is done doing : > > tmp = prod(B,C) > a+=tmp > > ? 
> > For computing the multi_matrix ( project i'm working on, matrix > composed > of multiple handles, to solve the CL_MAX_ALLOCABLE_MEMORY and > the multi > devices issue), I need to do several updates of this kind, in a > block > layout. For a 2*2 block layout : > > C(0,0).clear(); > => > C(0,0) += prod( A(0,0), B(0,0) ) > => > C(0,0) += prod( A(0,1), B(1,0) ) > > C(0,1).clear(); > => > C(0,1) += prod( A(0,0), B(0,1) ) > => > C(0,1) += prod( A(0,1), B(1,1) ) > > ... > ... > > This "sort-of-rank-1-update approach" is a special case of the > SUMMA > Algorithm (OpenCL doing the memory transfers in the back ground, > for now > at least) and seems to be efficient from a memory point of view. > Using > another approach would lead to both a huge memory consumption and > significant memory transfers... > > Is there any way of doing so in ViennaCL ? > > Best regards ! > Phil > > > > ------------------------------__------------------------------__------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. > Discussions > will include endpoint security, mobile security and the latest > in malware > threats. > http://www.accelacomm.com/jaw/__sfrnl04242012/114/50122263/ > <http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/> > > > > _________________________________________________ > ViennaCL-devel mailing list > ViennaCL-devel@lists.__sourceforge.net > <mailto:Vie...@li...> > https://lists.sourceforge.net/__lists/listinfo/viennacl-devel > <https://lists.sourceforge.net/lists/listinfo/viennacl-devel> > > > > > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. 
Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > ViennaCL-devel mailing list > Vie...@li... > https://lists.sourceforge.net/lists/listinfo/viennacl-devel > |
From: Karl R. <ru...@iu...> - 2012-08-01 08:25:04
|
From: Philippe Tillet <phi...@gm...> Date: Wed, 1 Aug 2012 00:22:34 +0200 To: Karl Rupp <ru...@iu...> Hello ! Thank you very much, I have been able to cherry-pick the commit onto my local branch. I can now get very interesting profiling info through the program, such as : Now, I am able to see a very interesting (and depressing) behavior. Upon memory dependencies, i.e. : kernel 1, device 1 : C(0,0) = A(0,0) * B(0,0) kernel 2, device 2 : C(0,1) = A(0,0) * B(0,1), Both the AMD and the NVidia SDK are unable to multicast A(0,0) from the host to the two GPUs. Even if the two kernels are enqueued in parallel, the execution is serialized, because the 2nd device has to wait for A(0,0) to be available. This is exactly the behavior I feared. It does not happen with a simple matrix addition, where all the handles are independent. I'm desperately looking for a low-memory handle multicast. I might give the Khronos forum a try, even though enqueuing the same handle on different queues is left implementation-defined by the standards! But well, the good news is that the kernels are executing! Good night and thanks again for the patch :) 2012/7/31 Karl Rupp <ru...@iu... <mailto:ru...@iu...>> Hello again, I've just pushed the following changes to the sourceforge repository: * operator+= and operator-= no longer create temporaries * A = prod(B,C) does not fail if there is garbage in A Best regards, Karli On 07/29/2012 03:59 PM, Philippe Tillet wrote: Hello everybody ! I'll inaugurate this mailing list with a little question. I have not seen any kernel for computing the operation A += prod(B,C) . Does this mean that this operation is done as : tmp = prod(B,C) a += tmp ? For computing the multi_matrix (a project I'm working on: a matrix composed of multiple handles, to solve the CL_MAX_ALLOCABLE_MEMORY and multi-device issues), I need to do several updates of this kind, in a block layout. 
For a 2*2 block layout : C(0,0).clear(); => C(0,0) += prod( A(0,0), B(0,0) ) => C(0,0) += prod( A(0,1), B(1,0) ) C(0,1).clear(); => C(0,1) += prod( A(0,0), B(0,1) ) => C(0,1) += prod( A(0,1), B(1,1) ) ... ... This "sort-of-rank-1-update approach" is a special case of the SUMMA Algorithm (OpenCL doing the memory transfers in the back ground, for now at least) and seems to be efficient from a memory point of view. Using another approach would lead to both a huge memory consumption and significant memory transfers... Is there any way of doing so in ViennaCL ? Best regards ! Phil ------------------------------__------------------------------__------------------ Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/__sfrnl04242012/114/50122263/ <http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/> _________________________________________________ ViennaCL-devel mailing list ViennaCL-devel@lists.__sourceforge.net <mailto:Vie...@li...> https://lists.sourceforge.net/__lists/listinfo/viennacl-devel <https://lists.sourceforge.net/lists/listinfo/viennacl-devel> |
From: Karl R. <ru...@iu...> - 2012-07-31 16:12:58
|
Hello again, I've justed pushed the following changes to the sourceforge-repository: * operator+= and operator-= no longer create temporaries * A = prod(B,C) does not fail if there is garbage in A Best regards, Karli On 07/29/2012 03:59 PM, Philippe Tillet wrote: > Hello everybody ! > > I'll inaugurate this mailing list with a little question. > I have not seen any kernel for computing the operation A += prod(B,C) . > Does this mean that this operation is done doing : > > tmp = prod(B,C) > a+=tmp > > ? > > For computing the multi_matrix ( project i'm working on, matrix composed > of multiple handles, to solve the CL_MAX_ALLOCABLE_MEMORY and the multi > devices issue), I need to do several updates of this kind, in a block > layout. For a 2*2 block layout : > > C(0,0).clear(); > => > C(0,0) += prod( A(0,0), B(0,0) ) > => > C(0,0) += prod( A(0,1), B(1,0) ) > > C(0,1).clear(); > => > C(0,1) += prod( A(0,0), B(0,1) ) > => > C(0,1) += prod( A(0,1), B(1,1) ) > > ... > ... > > This "sort-of-rank-1-update approach" is a special case of the SUMMA > Algorithm (OpenCL doing the memory transfers in the back ground, for now > at least) and seems to be efficient from a memory point of view. Using > another approach would lead to both a huge memory consumption and > significant memory transfers... > > Is there any way of doing so in ViennaCL ? > > Best regards ! > Phil > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > > > > _______________________________________________ > ViennaCL-devel mailing list > Vie...@li... > https://lists.sourceforge.net/lists/listinfo/viennacl-devel > |
From: Karl R. <ru...@iu...> - 2012-07-29 16:14:50
|
Hi, > (...) > > It would then allow some neat things such as : > viennacl::ocl::event & evt = viennacl::linalg::prod_impl(mat1, mat2, mat3); > clWaitForEvents(1, &evt.handle.get()) > std::cout << "Kernel execution time : " << > viennacl::ocl::time_of_execution_us(evt) << std::endl; this appears to be a good choice, yes. The second line should be something like evt.wait() or viennacl::ocl::wait(evt); rather than an explicit exposition of the OpenCL API, as one could think of linear algebra backends other than OpenCL. > As for now, the only use of events would be profiling and callbacks, as > the dependancies between tasks would be handled at an upper level by the > scheduler. > Maybe there is a cleaner way to do so? Events are mostly of interest if there is out-of-order execution of command queues enabled. However, I don't see any application where we really *need* to have out-of-order execution. With in-order execution, however, commands (operations *and* memory transfers) are enqueued in the correct order anyway... Thus, I can't see any good use of events either at the moment... Best regards, Karli |
From: Philippe T. <phi...@gm...> - 2012-07-29 15:16:27
|
Hello again ! :) In order to make the scheduler flexible, I am defining a task as a tuple (boost::function<void()>, viennacl::ocl::kernel & k). The function is obtained through a boost::bind() of the corresponding linalg function, and the kernel is "passed manually", but it is necessary for retrieving an event (in order to get profiling info, use callbacks... I only collect this information for now, without actually using it). I was wondering whether it would not be a better option to change the linalg functions so that they return the event associated with the enqueued kernel. A task would then have as its only member a boost::function<viennacl::ocl::event& ()>, and : template <typename T1, typename T2, typename T3 > typename viennacl::enable_if< viennacl::is_matrix<T1>::value && viennacl::is_matrix<T2>::value && viennacl::is_matrix<T3>::value >::type prod_impl(const T1 & A, const viennacl::matrix_expression< const T2, const T2, op_trans> & B, T3 & C) would just have to be replaced by : template <typename T1, typename T2, typename T3 > typename viennacl::enable_if< viennacl::is_matrix<T1>::value && viennacl::is_matrix<T2>::value && viennacl::is_matrix<T3>::value , viennacl::ocl::event& >::type prod_impl(const T1 & A, const viennacl::matrix_expression< const T2, const T2, op_trans> & B, T3 & C) It would then allow some neat things such as : viennacl::ocl::event & evt = viennacl::linalg::prod_impl(mat1, mat2, mat3); clWaitForEvents(1, &evt.handle.get()) std::cout << "Kernel execution time : " << viennacl::ocl::time_of_execution_us(evt) << std::endl; For now, the only use of events would be profiling and callbacks, as the dependencies between tasks would be handled at an upper level by the scheduler. Maybe there is a cleaner way to do so? Best regards, Philippe |
From: Karl R. <ru...@iu...> - 2012-07-29 14:43:48
|
Hi Philippe, > I have not seen any kernel for computing the operation A += prod(B,C) . > Does this mean that this operation is done doing : > > tmp = prod(B,C) > a+=tmp > > ? This is currently the case. The reason, however, is only a lack of operator overloads rather than a lack of a suitable OpenCL kernel. The matrix-matrix-product kernels are able to accomplish the DGEMM-like C <- alpha * prod(A, B) + beta * C, where alpha and beta are arbitrary scalars. As one of our users recently spotted an issue with beta == 0, I'll push the fixes and better operator overloads next week. > For computing the multi_matrix ( project i'm working on, matrix composed > of multiple handles, to solve the CL_MAX_ALLOCABLE_MEMORY and the multi > devices issue), I need to do several updates of this kind, in a block > layout. For a 2*2 block layout : > > C(0,0).clear(); > => > C(0,0) += prod( A(0,0), B(0,0) ) > => > C(0,0) += prod( A(0,1), B(1,0) ) > > C(0,1).clear(); > => > C(0,1) += prod( A(0,0), B(0,1) ) > => > C(0,1) += prod( A(0,1), B(1,1) ) > > ... > ... > > This "sort-of-rank-1-update approach" is a special case of the SUMMA > Algorithm (OpenCL doing the memory transfers in the back ground, for now > at least) and seems to be efficient from a memory point of view. Using > another approach would lead to both a huge memory consumption and > significant memory transfers... Yes, that's the way to go. :-) Just use the existing operator+=, the fix will handle that for you 'automagically' :-) Best regards, Karli |
From: Philippe T. <phi...@gm...> - 2012-07-29 14:00:25
|
Hello everybody ! I'll inaugurate this mailing list with a little question. I have not seen any kernel for computing the operation A += prod(B,C) . Does this mean that this operation is done as : tmp = prod(B,C) a += tmp ? For computing the multi_matrix (a project I'm working on: a matrix composed of multiple handles, to solve the CL_MAX_ALLOCABLE_MEMORY and multi-device issues), I need to do several updates of this kind, in a block layout. For a 2*2 block layout : C(0,0).clear(); => C(0,0) += prod( A(0,0), B(0,0) ) => C(0,0) += prod( A(0,1), B(1,0) ) C(0,1).clear(); => C(0,1) += prod( A(0,0), B(0,1) ) => C(0,1) += prod( A(0,1), B(1,1) ) ... ... This "sort-of-rank-1-update approach" is a special case of the SUMMA algorithm (OpenCL doing the memory transfers in the background, for now at least) and seems to be efficient from a memory point of view. Using another approach would lead to both huge memory consumption and significant memory transfers... Is there any way of doing so in ViennaCL ? Best regards ! Phil |
From: Developer m. S. and/or d. n. f. here.
<vie...@li...> - 2012-07-27 10:43:53
|
Dear developers, this is a warm welcome to our new mailing list (and a quick test whether everything works as expected). :-) Best regards, Karli |