From: Karl R. <ru...@iu...> - 2012-08-01 09:38:53
|
Hi Alex,

I've given some more thought to how to separate the linear algebra backends suitably. Currently, some OpenCL statements are mixed into the vector<> and matrix<> classes, while the operations are clearly separated via calls to externally defined functions (e.g. prod_impl()), cf. vector_operations.hpp and matrix_operations.hpp.

To simplify your development efforts, I could continue this separation and also move the initialization routines to separate header files. In the best case, all that is necessary for a CPU-only fallback is to have, e.g. in vector.hpp, something like

  #ifdef VIENNACL_NO_OPENCL
    #include "viennacl/linalg/vector-operations-cpu.hpp"
  #else
    #include "viennacl/linalg/vector-operations-opencl.hpp"
  #endif

Going one step further, we could even separate the convenience types from the BLAS backend and support something like

  #if defined VIENNACL_USE_SSE_BLAS
    #include "viennacl/linalg/vector-operations-sse.hpp"
  #elif defined VIENNACL_USE_OPENCL_BLAS
    #include "viennacl/linalg/vector-operations-opencl.hpp"
  #elif defined VIENNACL_USE_OPENMP_BLAS
    #include "viennacl/linalg/vector-operations-openmp.hpp"
  ...
  #else
    #include "viennacl/linalg/vector-operations-fallback.hpp"
  #endif

We probably won't have the development resources to support a whole zoo of different backends, yet I like the idea of a clean separation. What do you think?

Best regards,
Karli

PS: cc'ed to viennacl-devel

On 07/29/2012 06:49 AM, Alex Christensen wrote:
> I made tred2 not copy memory, and it works with ublas matrices. My goal
> is to make a backend so that defining VIENNACL_NO_OPENCL makes existing
> code work without a gpu (or even linking to an OpenCL library). I'll
> let you know if I run into any problems. Hopefully the existing QR code
> will work with that.
>
> Since the LU routines don't do partial pivoting, should I include my cpu
> LU function with partial pivoting? Should I include my cholesky
> function also, maybe as a separate header? The only cholesky function I
> have found in ViennaCL is in spai.
>
> Alex
|
From: Philippe T. <phi...@gm...> - 2012-08-01 10:33:36
|
Hi everybody!

In my opinion, the capabilities of OpenCL on the CPU should not be overlooked. Intel is also putting a lot of effort into building strong OpenCL tools! I wonder whether CPU-optimized kernels might be able to beat an SSE implementation. Plus, I think we are heading toward a future where everybody will have an OpenCL-capable GPU (as of today, the Intel HD Graphics of Ivy Bridge is OpenCL-capable on Windows, and AMD APUs also have 2 devices recognized on both Windows and Linux).

Therefore, wouldn't it be better to focus on optimizing some kernels for CPUs, and letting the implementation redirect the computation to the proper kernel? Plus, it would be a chance to write an API for dealing with platform-specific kernels, which would give us the possibility in the future to optimize some kernels for AMD CPUs, AMD GPUs, Intel, NVIDIA GPUs...!

Best regards,
Philippe

> _______________________________________________
> ViennaCL-devel mailing list
> Vie...@li...
> https://lists.sourceforge.net/lists/listinfo/viennacl-devel
|
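As an illustration of Philippe's idea of redirecting a computation to the proper kernel, here is a minimal sketch of a lookup table for platform-specific kernels. Everything in it is an assumption made for illustration: `device_class`, `kernel_registry`, and the kernel names are hypothetical and are not part of ViennaCL or OpenCL. A real implementation would query the device through the OpenCL runtime instead of taking an enum.

```cpp
#include <cassert>
#include <map>
#include <string>

namespace sketch {

// Hypothetical device categories, following the ones named in the message.
enum class device_class { generic, amd_cpu, amd_gpu, intel_cpu, nvidia_gpu };

// Maps (operation, device class) to the name of a tuned kernel.
class kernel_registry {
public:
  // Register a kernel name for one operation on one device class.
  void add(const std::string& op, device_class dev, const std::string& kernel_name) {
    table_[op][dev] = kernel_name;
  }

  // Return the tuned kernel for 'dev' if one is registered, otherwise the
  // generic kernel for 'op', otherwise an empty string.
  std::string lookup(const std::string& op, device_class dev) const {
    auto op_it = table_.find(op);
    if (op_it == table_.end()) return std::string();

    auto dev_it = op_it->second.find(dev);
    if (dev_it != op_it->second.end()) return dev_it->second;

    auto gen_it = op_it->second.find(device_class::generic);
    return gen_it != op_it->second.end() ? gen_it->second : std::string();
  }

private:
  std::map<std::string, std::map<device_class, std::string>> table_;
};

} // namespace sketch
```

The fallback to a generic entry means an operation only needs tuned variants for the devices where tuning has actually been done; all other devices keep working with the untuned kernel.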
From: Karl R. <ru...@iu...> - 2012-08-01 10:55:50
|
Hi,

my first thought was that OpenCL on the CPU would be *the* cool thing. However, running some benchmarks soon showed that OpenCL won't be the best choice for everything. The most annoying thing is the kernel launch overhead, which, even on the CPU, is in the range of 10 us. Even for a moderate 1 GHz CPU, this translates to 10k CPU cycles 'wasted'. Thus, for 'small' operations OpenCL won't give good performance... :-(

I agree that the current OpenCL backend requires extensions in order to handle device-specific implementations. As a (recent) prominent example, the super-fast matrix-matrix multiplication kernels for NVIDIA GPUs are not optimal for AMD GPUs and even for some older NVIDIA GPUs. The OpenCL backend, however, is essentially independent of the user types, so the ideal design in my view is the following:

  Layer 1: User API types (viennacl::vector<>, etc.)
  Layer 2: BLAS calling code (OpenCL, SSE, etc. Maybe hybrid?)
  Layer 3: BLAS backend details (OpenCL kernel management, etc.)

The VIENNACL_USE_XYZ_BLAS defines would basically select the Layer 2 implementation to be used, while the better OpenCL kernel management, possibly with several tuned kernels, resides entirely in Layer 3.

As a first step, I think it's better to select Layer 2 statically via preprocessor defines. As a second step, we might even use hybrid approaches, using SSE for small operations on OpenCL handles on APUs and thus circumventing the ugly kernel launch overhead in such cases. (Just as a remark: The scientific community is right now *not* interested in APUs due to the lack of double precision support. Due to thermal limitations I don't think an APU will beat a standalone GPU anytime soon, even though the latency problems introduced by PCI-Express are obvious.)

In other words: Since there does not seem to be a single 'best programming approach', let's combine the best of different approaches. :-)

Best regards,
Karli
|
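The launch-overhead arithmetic and the hybrid second step from Karl's message can be sketched as follows. This is only an illustration under stated assumptions: the names `wasted_cycles`, `choose_path`, `axpy_host`, and `axpy_device` are hypothetical, the threshold of 4096 elements is a made-up placeholder that would have to be measured, and the "device" path is a plain loop standing in for an OpenCL kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 1 MHz is one cycle per microsecond, so overhead [us] * clock [MHz]
// gives the number of cycles that pass during one kernel launch.
constexpr unsigned long long wasted_cycles(unsigned long long overhead_us,
                                           unsigned long long clock_mhz) {
  return overhead_us * clock_mhz;
}

namespace sketch {

enum class path { host_loop, device_kernel };

// Hypothetical crossover size; a real value would come from benchmarks.
constexpr std::size_t small_op_threshold = 4096;

// Layer 2 (hybrid variant): pick the calling code by operand size.
inline path choose_path(std::size_t n) {
  return n < small_op_threshold ? path::host_loop : path::device_kernel;
}

// Host-side calling code: a plain loop without any launch overhead.
inline void axpy_host(double alpha, const std::vector<double>& x, std::vector<double>& y) {
  assert(x.size() == y.size());
  for (std::size_t i = 0; i < x.size(); ++i) y[i] += alpha * x[i];
}

// Stand-in for Layer 3: a real backend would select and enqueue a kernel here.
inline void axpy_device(double alpha, const std::vector<double>& x, std::vector<double>& y) {
  axpy_host(alpha, x, y);
}

// Layer 1: the user-facing call. Returns the path taken so it can be inspected.
inline path axpy(double alpha, const std::vector<double>& x, std::vector<double>& y) {
  const path p = choose_path(x.size());
  if (p == path::host_loop) axpy_host(alpha, x, y);
  else                      axpy_device(alpha, x, y);
  return p;
}

} // namespace sketch
```

With the figures from the message, `wasted_cycles(10, 1000)` evaluates to 10000, i.e. the 10k cycles quoted for a 10 us launch on a 1 GHz CPU. The static first step would replace the runtime branch in `axpy` with a preprocessor-selected implementation.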
From: Alex C. <ach...@gm...> - 2012-08-02 01:13:14
|
I think that's a good way to go, except that I think the SSE and OpenMP BLAS implementations shouldn't be separate. I'm a little bit intimidated because all the OpenCL code would have to be translated to CPU code in order for a CPU backend to have full functionality. This might not be done anytime soon.

Also, do you think my SSE BLAS and tred2 will be included in the next release? When is the next release?

Alex

PS: reply cc'ed to viennacl-devel
|
From: Karl R. <ru...@iu...> - 2012-08-02 12:48:06
|
Hi,

> I think that's a good way to go, except that I think the SSE and OpenMP
> BLAS implementations shouldn't be separate. I'm a little bit
> intimidated because all the OpenCL code would have to be translated to
> CPU code in order for a CPU backend to have full functionality. This
> might not be done anytime soon.

That's a good point, keeping SSE and OpenMP together makes sense.

Not all OpenCL kernels need to be translated to SSE/OpenMP right away. Most of the operations can be handled with simple loops, possibly even in a generative way (e.g. templates). It's sufficient if you focus on the 'interesting' kernels; I can add the simpler kernels/operations myself.

> Also, do you think my sse blas and tred2 will be included in the next
> release? When is the next release?

The next release, version 1.3.1, is expected next week. It is going to be a bugfix release and further stabilizes some of the new experimental features. I hope to include your SSE contributions in 1.4.0, which will also include the developments from the Google Summer of Code (generalized eigenvalue problems) and is expected in the second half of September. This is, however, not set in stone: as university courses usually start around that time, we had better bring all summer developments to a stable state. :-)

Best regards,
Karli

> PS: reply cc'ed to viennacl-devel :-)
|
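Karl's remark that most operations can be handled with simple loops, possibly generated via templates, could look roughly like the sketch below. It is an assumption-laden illustration, not ViennaCL code: it works on `std::vector` rather than `viennacl::vector<>`, the names `elementwise`, `add`, `sub`, and `inner_prod` are hypothetical here, and it relies on C++11 lambdas.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// One generic loop; each element-wise operation is generated from a functor.
template <typename T, typename Op>
void elementwise(std::vector<T>& result,
                 const std::vector<T>& a, const std::vector<T>& b, Op op) {
  assert(a.size() == b.size());
  result.resize(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) result[i] = op(a[i], b[i]);
}

// result = a + b
template <typename T>
void add(std::vector<T>& result, const std::vector<T>& a, const std::vector<T>& b) {
  elementwise(result, a, b, [](T u, T v) { return u + v; });
}

// result = a - b
template <typename T>
void sub(std::vector<T>& result, const std::vector<T>& a, const std::vector<T>& b) {
  elementwise(result, a, b, [](T u, T v) { return u - v; });
}

// Reductions need their own loop; this one computes the dot product of a and b.
template <typename T>
T inner_prod(const std::vector<T>& a, const std::vector<T>& b) {
  assert(a.size() == b.size());
  T sum = T();
  for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
  return sum;
}

} // namespace sketch
```

Each further element-wise operation then costs one short wrapper, which is what makes the simple kernels cheap to cover and leaves the hand-tuned effort for the 'interesting' ones.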