From: Karl Rupp <rupp@iu...>  2014-02-25 11:47:54

Hi Tom,

> For the following program under version 1.4.2:
>
>     const char* const prog = "__kernel void ker() {};";
>     const char* const name = "ker";
>     viennacl::ocl::program & my_prog =
>         viennacl::ocl::current_context().add_program(prog, name);
>     my_prog.add_kernel(name);
>
>     viennacl::ocl::kernel kernel1(my_prog.get_kernel(name));
>     viennacl::ocl::kernel kernel2(kernel1);
>     kernel1 = kernel2;
>
> I get a CL_INVALID_KERNEL on the line kernel1 = kernel2;
> The stack trace is: (...)
>
> I can't think that I am doing anything wrong. Is this a bug?

Yes, I could reproduce this with 1.4.2. The code works with 1.5.1 (where you don't need my_prog.add_kernel(name)). If you want to patch 1.4.2, then replace in viennacl/ocl/handle.hpp, around line 95:

    // remove:
    cl_int err = clRetainKernel(something);
    VIENNACL_ERR_CHECK(err);

    // replace with:
    if (something)
    {
      cl_int err = clRetainKernel(something);
      VIENNACL_ERR_CHECK(err);
    }

Hope that helps :)

Best regards,
Karli
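The guard Karl describes can be illustrated outside OpenCL. The following is a minimal Python sketch (hypothetical names, not the ViennaCL or OpenCL API) of a reference-counted handle wrapper: an unconditional retain fails on an empty handle, mirroring the CL_INVALID_KERNEL above, while the patched, null-checked retain makes copy-assignment of a default-constructed handle safe.

```python
class RetainError(Exception):
    """Stands in for the CL_INVALID_KERNEL error raised by clRetainKernel."""

def retain(raw):
    # Stand-in for clRetainKernel: fails when given a null handle.
    if raw is None:
        raise RetainError("invalid kernel handle")

class Handle:
    """Hypothetical analog of viennacl::ocl::handle for illustration only."""

    def __init__(self, raw=None):
        self.raw = raw              # underlying kernel-like handle (may be None)

    def inc(self, guarded=True):
        # The unpatched code retained unconditionally; the patch
        # only retains when the handle is non-null.
        if guarded:
            if self.raw is not None:
                retain(self.raw)
        else:
            retain(self.raw)

    def assign(self, other, guarded=True):
        self.raw = other.raw
        self.inc(guarded)           # retain the newly shared handle

k1 = Handle()                       # default-constructed: null handle
k2 = Handle()

# Unguarded retain (the buggy behavior) raises on the null handle:
try:
    k1.assign(k2, guarded=False)
    unguarded_ok = True
except RetainError:
    unguarded_ok = False

# Guarded retain (the patched behavior) succeeds:
k1.assign(k2, guarded=True)
print(unguarded_ok)  # False: the unguarded path failed, as in the bug report
```

The design point is the same as in the patch: incrementing a reference count is only meaningful for a handle that actually refers to an object, so a null check belongs in front of the retain call.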
From: tom nicholson <tfwnicholson@gm...>  2014-02-25 11:17:33

Hi,

For the following program under version 1.4.2:

    const char* const prog = "__kernel void ker() {};";
    const char* const name = "ker";
    viennacl::ocl::program & my_prog =
        viennacl::ocl::current_context().add_program(prog, name);
    my_prog.add_kernel(name);

    viennacl::ocl::kernel kernel1(my_prog.get_kernel(name));
    viennacl::ocl::kernel kernel2(kernel1);
    kernel1 = kernel2;

I get a CL_INVALID_KERNEL on the line kernel1 = kernel2; The stack trace is:

    #0  0x00007ffff6378ba5 in raise () from /lib64/libc.so.6
    #1  0x00007ffff637a4bb in abort () from /lib64/libc.so.6
    #2  0x00007ffff6bcac8d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
    #3  0x00007ffff6bc8d26 in ?? () from /usr/lib64/libstdc++.so.6
    #4  0x00007ffff6bc8d53 in std::terminate() () from /usr/lib64/libstdc++.so.6
    #5  0x00007ffff6bc8f7e in __cxa_throw () from /usr/lib64/libstdc++.so.6
    #6  0x00000000007a9a67 in viennacl::ocl::error_checker<void>::raise_exception (err=-48) at thirdparty/viennacl/ocl/error.hpp:551
    #7  0x00000000007a55c9 in viennacl::ocl::error_checker<void>::checkError (err=-48) at thirdparty/viennacl/ocl/error.hpp:597
    #8  0x000000000079fe22 in viennacl::ocl::handle_inc_dec_helper<_cl_kernel*>::inc (something=@0x7fffffffe800) at thirdparty/viennacl/ocl/handle.hpp:95
    #9  0x00000000007a9ddc in viennacl::ocl::handle<_cl_kernel*>::inc (this=0x7fffffffe800) at thirdparty/viennacl/ocl/handle.hpp:191
    #10 0x00000000007a56fc in viennacl::ocl::handle<_cl_kernel*>::operator= (this=0x7fffffffe800, other=...) at thirdparty/viennacl/ocl/handle.hpp:162
    #11 0x00000000007a0cd7 in viennacl::ocl::kernel::operator= (this=0x7fffffffe800, other=...) at thirdparty/viennacl/ocl/kernel.hpp:95

I can't think that I am doing anything wrong. Is this a bug?

Cheers,
Tom
From: Toby St Clere Smithe <pyviennacl@ts...>  2014-02-20 11:54:57

Dear ViennaCL users,

If you've ever used Python for your numerical applications, you know what joy it can be. Now, the easy power of ViennaCL 1.5.1 is at last married to that experience. I am pleased to announce the first release of PyViennaCL!

Download links for source and Ubuntu binaries are found at the usual place:
  http://viennacl.sourceforge.net/viennacldownload.html

* If you are or know anyone who could help with building PyViennaCL for other systems (Windows, Mac OS X, CentOS / RHEL, Fedora, SuSE, ...), please get in touch!

See the following link for documentation and example code:
  http://viennacl.sourceforge.net/pyviennacl/doc/

PyViennaCL 1.0.0 exposes most of the functionality of ViennaCL:
 + sparse (compressed, coordinate, ELL, and hybrid) and dense (row-major and column-major) matrices, vectors and scalars on your compute device using OpenCL;
 + standard arithmetic operations and mathematical functions;
 + fast matrix products for sparse and dense matrices, and inner and outer products for vectors;
 + direct solvers for dense triangular systems;
 + iterative solvers for sparse and dense systems, using the BiCGStab, CG, and GMRES algorithms;
 + iterative algorithms for eigenvalue estimation problems.

PyViennaCL has also been designed for straightforward use in the context of NumPy and SciPy: PyViennaCL objects can be constructed using NumPy arrays, and arithmetic operations and comparisons in PyViennaCL are type-agnostic.

Some ViennaCL functionality is not yet available; these features are planned for a release in the coming months:
 + preconditioners and QR factorization;
 + additional solvers and other algorithms, such as FFT computation;
 + structured matrices;
 + CUDA support (use OpenCL for now!);
 + advanced OpenCL integration.

Spread the word!

Toby St Clere Smithe
From: Karl Rupp <rupp@iu...>  2014-02-19 22:09:40

Hi Paul,

> Thanks for your quick reply.
>
> Regarding point 1, it actually wasn't poor performance of the LU factorization code in absolute terms that I was asking you about, but rather why there is such a large *difference* in performance of the same code when run on an Nvidia GTX 680 versus a 780M, when these same two cards perform very similarly on all the other benchmarks. To recap, for the first three BLAS 3 benchmarks, the 780M performs at 80% the speed of the 680, but for the LU factorization the figure is 4.6%. My question is whether you can point to a reason for this large discrepancy.

Hmm, good point; the two are basically the same piece of hardware, so I don't have an explanation for that. Maybe different driver versions?

> Regarding point 2, I tried exactly what you described, but was then confronted with a number of unresolved symbol linker errors for functions like clRetainMemObject, when my OpenCL.lib was clearly available in the project. So I can't explain why those symbols would be unresolved.

Can you send the full error message? This is used in examples/benchmarks/generator_blas{1,2,3}.cpp, so it's supposed to be working. Maybe you forgot #define VIENNACL_WITH_OPENCL or #include "viennacl/ocl/backend.hpp"?

Best regards,
Karli
From: Dufort, Paul <Paul.Dufort@uh...>  2014-02-19 21:38:55

Hi Karl,

Thanks for your quick reply.

Regarding point 1, it actually wasn't poor performance of the LU factorization code in absolute terms that I was asking you about, but rather why there is such a large *difference* in performance of the same code when run on an Nvidia GTX 680 versus a 780M, when these same two cards perform very similarly on all the other benchmarks. To recap, for the first three BLAS 3 benchmarks, the 780M performs at 80% the speed of the 680, but for the LU factorization the figure is 4.6%. My question is whether you can point to a reason for this large discrepancy.

Regarding point 2, I tried exactly what you described, but was then confronted with a number of unresolved symbol linker errors for functions like clRetainMemObject, when my OpenCL.lib was clearly available in the project. So I can't explain why those symbols would be unresolved.

Regards,
Paul

Paul Dufort, Ph.D.
Computational Imaging Scientist
The Joint Department of Medical Imaging
Mount Sinai Hospital, University Health Network, Women's College Hospital
Room MP 14-322 Back Office, 14th Floor Main Pavilion
Department of Medical Imaging, Toronto Western Hospital
399 Bathurst Street, Toronto, ON M5T 2S8
Cell: 647-291-6180
E-Mail: paul.dufort@...

-----Original Message-----
From: Karl Rupp [mailto:rupp@...]
Sent: February 19, 2014 9:33 AM
To: Dufort, Paul; 'viennacl-support@...'
Cc: Philippe Tillet
Subject: Re: [ViennaCL-support] Performance reduction from 1.4.2 to 1.5.1

Hi Paul,

> I have a couple more ViennaCL performance testing puzzles for you, if you are interested:
>
> 1) I have now tried the latest version on my gaming laptop, an Asus G750JH with a 4 GB GTX 780M. The performance on the single precision dense matrix test was great at 480 Gflops, about 80% of my GTX 680's 600 Gflops. The range and slice tests were similarly very good. However, the LU factorization test was only 3.7 Gflops on the 780M, compared to 80 Gflops on the desktop 680! I tried the same trick as before, substituting "Kepler" and "GTX 780M" in profiles.hpp, but this time it has no effect. The two devices are quite similar in their characteristics, so I was surprised by this large difference. Any idea what could cause this?

The reason for this is a rather poor implementation of the LU factorization kernel. The issue is pretty old (cf. https://github.com/viennacl/viennacl-dev/issues/1 ), and we pretty much know how to improve it, yet our motivation is somewhat limited since dense LU factorizations are pretty much covered by the big BLAS libraries. Your feedback helps to increase the priority, as someone is using it :)

> 2) I later updated my laptop Nvidia driver, and it changed the order of the two OpenCL platforms on my system so that Intel is first. This means the blas3bench test now runs by default on my Intel hardware and not on my Nvidia GTX 780M. I tried switching platforms in the blas3bench code and recompiling, but it produced some strange unresolved symbol errors for some OpenCL API functions at the link stage. Is there a straightforward way to choose among multiple platforms for the benchmarks?

Yes, have a look at the manual on how to customize the OpenCL environment. The function you are looking for is

    viennacl::ocl::set_context_platform_index(id, platform_index);

where 'id' is usually zero and the platform index is either '0' or '1' for the two platforms on your system.

> Thanks in advance for any advice you can provide!

Best regards,
Karli

This email may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this email in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this email may not be that of the organization.
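The discrepancy Paul recaps can be checked directly from the numbers reported in this thread. A quick sketch of the arithmetic (all figures taken from the emails above):

```python
# Sanity-check of the ratios quoted in this thread: the 780M reaches roughly
# 80% of the GTX 680 on the dense matrix benchmark, but under 5% on the
# LU factorization benchmark.

gemm_780m, gemm_680 = 480.0, 600.0   # Gflops, single-precision dense matrix test
lu_780m, lu_680 = 3.7, 80.0          # Gflops, LU factorization test

gemm_ratio = gemm_780m / gemm_680
lu_ratio = lu_780m / lu_680

print(f"GEMM: {gemm_ratio:.0%}, LU: {lu_ratio:.1%}")  # GEMM: 80%, LU: 4.6%
```

So the two cards track each other closely where the tuned GEMM kernels dominate, and diverge by more than an order of magnitude only on the LU path, which is consistent with Karl's later explanation that the LU kernel itself (not the hardware) is the bottleneck.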
From: Karl Rupp <rupp@iu...>  2014-02-19 14:32:51

Hi Paul,

> I have a couple more ViennaCL performance testing puzzles for you, if you are interested:
>
> 1) I have now tried the latest version on my gaming laptop, an Asus G750JH with a 4 GB GTX 780M. The performance on the single precision dense matrix test was great at 480 Gflops, about 80% of my GTX 680's 600 Gflops. The range and slice tests were similarly very good. However, the LU factorization test was only 3.7 Gflops on the 780M, compared to 80 Gflops on the desktop 680! I tried the same trick as before, substituting "Kepler" and "GTX 780M" in profiles.hpp, but this time it has no effect. The two devices are quite similar in their characteristics, so I was surprised by this large difference. Any idea what could cause this?

The reason for this is a rather poor implementation of the LU factorization kernel. The issue is pretty old (cf. https://github.com/viennacl/viennacl-dev/issues/1 ), and we pretty much know how to improve it, yet our motivation is somewhat limited since dense LU factorizations are pretty much covered by the big BLAS libraries. Your feedback helps to increase the priority, as someone is using it :)

> 2) I later updated my laptop Nvidia driver, and it changed the order of the two OpenCL platforms on my system so that Intel is first. This means the blas3bench test now runs by default on my Intel hardware and not on my Nvidia GTX 780M. I tried switching platforms in the blas3bench code and recompiling, but it produced some strange unresolved symbol errors for some OpenCL API functions at the link stage. Is there a straightforward way to choose among multiple platforms for the benchmarks?

Yes, have a look at the manual on how to customize the OpenCL environment. The function you are looking for is

    viennacl::ocl::set_context_platform_index(id, platform_index);

where 'id' is usually zero and the platform index is either '0' or '1' for the two platforms on your system.

> Thanks in advance for any advice you can provide!

Best regards,
Karli
From: Dufort, Paul <Paul.Dufort@uh...>  2014-02-19 13:58:18

Hi Karl and Philippe,

I have a couple more ViennaCL performance testing puzzles for you, if you are interested:

1) I have now tried the latest version on my gaming laptop, an Asus G750JH with a 4 GB GTX 780M. The performance on the single precision dense matrix test was great at 480 Gflops, about 80% of my GTX 680's 600 Gflops. The range and slice tests were similarly very good. However, the LU factorization test was only 3.7 Gflops on the 780M, compared to 80 Gflops on the desktop 680! I tried the same trick as before, substituting "Kepler" and "GTX 780M" in profiles.hpp, but this time it has no effect. The two devices are quite similar in their characteristics, so I was surprised by this large difference. Any idea what could cause this?

2) I later updated my laptop Nvidia driver, and it changed the order of the two OpenCL platforms on my system so that Intel is first. This means the blas3bench test now runs by default on my Intel hardware and not on my Nvidia GTX 780M. I tried switching platforms in the blas3bench code and recompiling, but it produced some strange unresolved symbol errors for some OpenCL API functions at the link stage. Is there a straightforward way to choose among multiple platforms for the benchmarks?

Thanks in advance for any advice you can provide!

Regards,
Paul
From: Philippe Tillet <phil.tillet@gm...>  2014-02-14 14:11:08

Hi,

Thanks a lot for the report, Paul, and for the fix, Karl. I can confirm that the parameterized GEMM source code was mostly tested on Fermi (the GPU on my laptop is a Fermi) and HD5850/7970 (some GPUs I had to work on). The fallback for NVidia GPUs is by definition very conservative, since the local-memory / work-group sizes have to be valid for all the NVidia GPUs. Most of my research from now on will be focused on that GEMM tuning procedure, so the performance should improve over the incoming versions of ViennaCL! At the moment, the performance of ViennaCL for CUDA-enabled GPUs is still considerably lower than that of MAGMA(CUDA)/CuBlas, hence linking to external BLAS libraries will be provided in the near future (I'll try to do this by the end of next week). The philosophy of ViennaCL is to expose a stable interface, so expect your performance to grow even larger in the near future.

Thanks again for reporting the performance regression, this is very important for us!

Philippe

2014-02-13 22:47 GMT+01:00 Karl Rupp <rupp@...>:

> Hi Paul,
>
> > Thanks very much for responding so quickly and for your suggestion, it worked really well: with that change, the performance surpassed the 1.4.2 value of 500 Gflops and went all the way up to 600 Gflops!
>
> Great! I already pushed a patch to our repository, of course acknowledging your great help:
>
>   https://github.com/viennacl/viennacl-dev/commit/dfab050e6c9a1e7e4b604154120e7734d1dff412
>
> > Really, I can't say enough good things about ViennaCL. I've been trying to make a go of it with OpenCL because I dislike proprietary solutions like CUDA. But as you will know, the ecosystem of libraries for OpenCL is still pretty sparse compared to CUDA. So the existence of such a highly specialized, high quality library like ViennaCL is a major enabler.
>
> It is our belief that vendor-independent standards are the best long-term solution, particularly for us scientists with long-term commitments on research using these codes. During the time not spent on writing papers, we do our best to push things in the right direction. :)
>
> The best you can do to support this is to convince your colleagues and, to the degree possible, *require* companies to provide good support for open standards (i.e. tell them that you don't want to be vendor-locked). Anyway, I assume you do this already :)
>
> Best regards,
> Karli
From: Dufort, Paul <Paul.Dufort@uh...>  2014-02-13 21:48:09

Hi Karl,

Thanks very much for responding so quickly and for your suggestion, it worked really well: with that change, the performance surpassed the 1.4.2 value of 500 Gflops and went all the way up to 600 Gflops!

> Thank you very much for the positive feedback!

Really, I can't say enough good things about ViennaCL. I've been trying to make a go of it with OpenCL because I dislike proprietary solutions like CUDA. But as you will know, the ecosystem of libraries for OpenCL is still pretty sparse compared to CUDA. So the existence of such a highly specialized, high quality library like ViennaCL is a major enabler.

Regards,
Paul

-----Original Message-----
From: Karl Rupp [mailto:rupp@...]
Sent: February 13, 2014 3:49 PM
To: Dufort, Paul; 'viennacl-support@...'
Cc: Philippe Tillet
Subject: Re: [ViennaCL-support] Performance reduction from 1.4.2 to 1.5.1

Dear Paul,

> First, I want to say thank you very much for creating this library. It is extremely useful and easy to use, and I have come to rely on it a great deal in my research.

Thank you very much for the positive feedback!

> Now to the problem: I just upgraded from 1.4.2 to 1.5.1 and found that the blas3bench results have taken a substantial hit. Specifically, the basic dense matrix-matrix multiply consistently gives me 500 Gflops on my Nvidia GTX 680 using versions 1.4.1 and 1.4.2, but has now dropped to 340 Gflops with 1.5.1. I've tried fiddling with various things, but to no avail; it is a stable result. Do you have any idea why this might be happening?

I suppose that this is because of the ongoing integration of the kernel generator, which also brings a device database. So far there are only a few devices (device families) in the database; it needs to be filled incrementally. As you can see here:

  https://github.com/viennacl/viennacl-dev/blob/master/viennacl/generator/autotuning/profiles.hpp

we only have reference data for a GTX 470, which we use for all other NVIDIA Fermi GPUs (this is reasonable). We couldn't include a full tuning profile for Kepler GPUs in time for the release, hence it uses a fallback implementation. You can try an ad-hoc change in the released version as follows:

 - Edit viennacl/generator/profiles.hpp
 - Go to about line 250 and find the 22 lines for the GTX 470.
 - Copy&paste the block, replacing "viennacl::ocl::Fermi" with "viennacl::ocl::Kepler", and "GeForce GTX 470" by "GeForce GTX 680".

This should use the same Fermi kernel on a Kepler GPU, basically reproducing the 'old' behavior from 1.4.2. If the above doesn't work, I can only recommend using 1.4.2 until we have a higher device database population.

@Philippe: Do you know a better workaround?

Either way, thanks for reporting, Paul. We definitely need to get this performance regression fixed in the next release.

Best regards,
Karli
From: Karl Rupp <rupp@iu...>  2014-02-13 21:47:24

Hi Paul,

> Thanks very much for responding so quickly and for your suggestion, it worked really well: with that change, the performance surpassed the 1.4.2 value of 500 Gflops and went all the way up to 600 Gflops!

Great! I already pushed a patch to our repository, of course acknowledging your great help:

  https://github.com/viennacl/viennacl-dev/commit/dfab050e6c9a1e7e4b604154120e7734d1dff412

> Really, I can't say enough good things about ViennaCL. I've been trying to make a go of it with OpenCL because I dislike proprietary solutions like CUDA. But as you will know, the ecosystem of libraries for OpenCL is still pretty sparse compared to CUDA. So the existence of such a highly specialized, high quality library like ViennaCL is a major enabler.

It is our belief that vendor-independent standards are the best long-term solution, particularly for us scientists with long-term commitments on research using these codes. During the time not spent on writing papers, we do our best to push things in the right direction. :)

The best you can do to support this is to convince your colleagues and, to the degree possible, *require* companies to provide good support for open standards (i.e. tell them that you don't want to be vendor-locked). Anyway, I assume you do this already :)

Best regards,
Karli
From: Karl Rupp <rupp@iu...>  2014-02-13 20:49:32

Dear Paul,

> First, I want to say thank you very much for creating this library. It is extremely useful and easy to use, and I have come to rely on it a great deal in my research.

Thank you very much for the positive feedback!

> Now to the problem: I just upgraded from 1.4.2 to 1.5.1 and found that the blas3bench results have taken a substantial hit. Specifically, the basic dense matrix-matrix multiply consistently gives me 500 Gflops on my Nvidia GTX 680 using versions 1.4.1 and 1.4.2, but has now dropped to 340 Gflops with 1.5.1. I've tried fiddling with various things, but to no avail; it is a stable result. Do you have any idea why this might be happening?

I suppose that this is because of the ongoing integration of the kernel generator, which also brings a device database. So far there are only a few devices (device families) in the database; it needs to be filled incrementally. As you can see here:

  https://github.com/viennacl/viennacl-dev/blob/master/viennacl/generator/autotuning/profiles.hpp

we only have reference data for a GTX 470, which we use for all other NVIDIA Fermi GPUs (this is reasonable). We couldn't include a full tuning profile for Kepler GPUs in time for the release, hence it uses a fallback implementation. You can try an ad-hoc change in the released version as follows:

 - Edit viennacl/generator/profiles.hpp
 - Go to about line 250 and find the 22 lines for the GTX 470.
 - Copy&paste the block, replacing "viennacl::ocl::Fermi" with "viennacl::ocl::Kepler", and "GeForce GTX 470" by "GeForce GTX 680".

This should use the same Fermi kernel on a Kepler GPU, basically reproducing the 'old' behavior from 1.4.2. If the above doesn't work, I can only recommend using 1.4.2 until we have a higher device database population.

@Philippe: Do you know a better workaround?

Either way, thanks for reporting, Paul. We definitely need to get this performance regression fixed in the next release.

Best regards,
Karli
From: Dufort, Paul <Paul.Dufort@uh...>  2014-02-13 19:35:43

Hi,

First, I want to say thank you very much for creating this library. It is extremely useful and easy to use, and I have come to rely on it a great deal in my research.

Now to the problem: I just upgraded from 1.4.2 to 1.5.1 and found that the blas3bench results have taken a substantial hit. Specifically, the basic dense matrix-matrix multiply consistently gives me 500 Gflops on my Nvidia GTX 680 using versions 1.4.1 and 1.4.2, but has now dropped to 340 Gflops with 1.5.1. I've tried fiddling with various things, but to no avail; it is a stable result. Do you have any idea why this might be happening?

Thanks very much in advance for any enlightenment you can provide.

Regards,
Paul
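For readers unfamiliar with how figures like "500 Gflops" are obtained: blas3bench times the operation and divides the flop count by the elapsed time, and for a dense N x N matrix product the standard flop count is 2*N^3. A small sketch of that bookkeeping (the matrix size and timings below are illustrative assumptions, not values from this thread):

```python
# How a GEMM Gflops figure is derived: a dense N x N matrix-matrix multiply
# performs ~2*N^3 floating-point operations (N multiplies and ~N adds per
# output entry). Gflops = flops / seconds / 1e9.

def gemm_gflops(n, seconds):
    """Achieved Gflops for an n-by-n dense matrix product taking `seconds`."""
    return 2.0 * n**3 / seconds / 1e9

n = 2048                      # hypothetical benchmark size
flops = 2.0 * n**3            # ~17.2 Gflop of work per multiply

# The regression Paul reports corresponds to the same work taking longer:
t_fast = flops / 500e9        # elapsed time at 500 Gflops
t_slow = flops / 340e9        # elapsed time at 340 Gflops
print(round(gemm_gflops(n, t_fast)), round(gemm_gflops(n, t_slow)))  # 500 340
```

In other words, the drop from 500 to 340 Gflops means the same multiply takes roughly 1.47x as long under 1.5.1's fallback kernel, independent of the matrix size chosen.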
From: Karl Rupp <rupp@iu...>  2014-02-06 09:57:54

Hi Matthew,

> I don't know if this belongs as a possible addition to the roadmap, something that is currently possible, or perhaps it is completely outside the scope of this project. However, without asking we will never know! :)
> I think the easiest way to describe this would be to take a simplified example of ViennaCL's iterative solver functionality and comment it up where appropriate. My comments are in dark red.
>
> Modified example from http://viennacl.sourceforge.net/viennaclexamplesiterative.html
>
>     // Set up some ViennaCL objects:
>     viennacl::vector<ScalarType> vcl_rhs;
>     viennacl::vector<ScalarType> vcl_result;
>     viennacl::compressed_matrix<ScalarType> vcl_matrix;
>
> /* Initialize and fill all objects here:
>    This would initially be done on the CPU via any number of i/o methods and problem initializations.
>    For the sake of this example let us define the following:
>    vcl_matrix is a 3000 by 3000 band matrix of floats with a band <=40 wide except for a few outliers.
>    vcl_rhs is 3000 floats.
>    vcl_result is 3000 floats set to 0.0 at this point. */

Here is where all explicit host<->device transfer happens.

>     // Compute ILUT preconditioners for GPU objects:
>     viennacl::linalg::ilut_tag ilut_conf(10, 1e-5); // 10 entries, rel. tol. 1e-5
>     typedef viennacl::linalg::ilut_precond< ublas::compressed_matrix<ScalarType> > ublas_ilut_t;
>     typedef viennacl::linalg::ilut_precond< viennacl::compressed_matrix<ScalarType> > vcl_ilut_t;
>
>     // preconditioner for ViennaCL objects:
>     vcl_ilut_t vcl_ilut(vcl_matrix, ilut_conf);

Here is another (implicit) data transfer: data is copied from vcl_matrix back to the host, the preconditioner is set up there, and data is then prepared such that the preconditioner can be applied accordingly. For ILUT this means that all computations happen on the CPU, because there is not enough parallelism available in general.

>     // Conjugate gradient solver using ILUT preconditioner
>     vcl_result = solve(vcl_matrix,   // using viennacl objects on GPU
>                        vcl_rhs,
>                        viennacl::linalg::cg_tag(),
>                        vcl_ilut);
>
> /* Now we have solved for x in Ax=b. I need to do this 3000 more times with only minor changes to vcl_matrix and a new rhs for every iteration. The changes to vcl_matrix involve zeroing out a number of values (perhaps 5 to 10), running the solver, and then restoring those values to their original state before zeroing out another few values and repeating the process. It seems very wasteful to copy the entire matrix 3000 times to memory.

Absolutely, you definitely want to avoid unnecessary copies of the full matrix. If you only modify a few entries, you can also consider reusing the ILU preconditioner computed above for all 3000 solves, which will result in a lot of savings.

> Each uncompressed copy of the matrix takes up about 35MB, with 3,000 copies obviously exceeding even the largest GPU's memory. I say that I could copy them all at the same time (in theory) because each perturbation of vcl_matrix is independent of the others and thus able to be scheduled in parallel. The rhs will need to be sent every time, regardless, but it is much smaller.

Can all 3000 solves be run in parallel, or is there any dependence among the solves? For example, do you need the result of the second solve to set up the matrix for the third solve? If you can solve several of them in parallel, we could extend the iterative solvers to run with multiple right hand sides.

> I believe the solve function does the copy to the GPU (correct?),

Nope, solve() only takes the objects (vcl_matrix, vcl_rhs) and calls the operations on them.

> is there a way to modify the matrix in the GPU memory and let the solver know it does not need to copy the data over? Better yet, I would like to reference the original matrix and make the changes on the fly as each rhs becomes available to be solved by the CPU. Perhaps just sending a list of changes (e.g. vcl_matrix[2094,300] = 0, etc.). Any help in efficiently doing this would be greatly appreciated.

If it is only 5-10 entries, you can directly use operator():

    vcl_matrix(42, 58) = 3.1415;
    vcl_matrix(58, 42) = 1.4152;
    ...

which is certainly faster than running a full copy. Keep in mind, though, that this does not scale well if you need to adjust more than ~100 values, because each update involves a PCI-Express communication. Also, there's one more important detail to consider: systems of size 3000 are usually fairly small for GPUs, so kernel launch overheads become significant.

Best regards,
Karli
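The zero-then-restore loop Matthew describes, combined with Karl's advice to update entries in place rather than re-copy the matrix, can be sketched in a few lines. This is a self-contained Python illustration of the bookkeeping only: a dict of (row, col) entries stands in for the device-resident sparse matrix, and solve() is a placeholder, not ViennaCL code.

```python
# Sketch of the edit-and-restore pattern discussed above: instead of copying
# a ~35 MB matrix 3000 times, keep one matrix and, per solve, zero out a
# handful of entries, run the solver, then restore the saved originals.

def solve(matrix, rhs):
    # Placeholder for the (preconditioned) iterative solve; not ViennaCL code.
    return [0.0] * len(rhs)

# Tiny stand-in for the 3000 x 3000 band matrix: (row, col) -> value.
matrix = {(0, 0): 4.0, (0, 1): -1.0, (1, 0): -1.0, (1, 1): 4.0}

# Each solve zeroes a different small set of entries (5-10 in Matthew's case).
perturbations = [[(0, 1)], [(1, 0)], [(0, 1), (1, 0)]]

results = []
for entries in perturbations:
    saved = {ij: matrix[ij] for ij in entries}  # remember original values
    for ij in entries:
        matrix[ij] = 0.0                        # cheap in-place update
    results.append(solve(matrix, rhs=[1.0, 1.0]))
    matrix.update(saved)                        # restore before the next solve

print(matrix[(0, 1)], matrix[(1, 0)])  # -1.0 -1.0  (matrix fully restored)
```

Per Karl's note, in ViennaCL each such in-place entry update maps to one operator() call and hence one host-to-device transfer, so this pattern pays off only while the number of changed entries per solve stays small (on the order of tens, not hundreds).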
From: Matthew Musto <matthew.musto@gm...>  2014-02-05 23:50:35

Karl, et al.,

I don't know if this belongs as a possible addition to the roadmap, something that is currently possible, or perhaps it is completely outside the scope of this project. However, without asking we will never know! I think the easiest way to describe this would be to take a simplified example of ViennaCL's iterative solver functionality and comment it up where appropriate. My comments are in dark red.

Modified example from http://viennacl.sourceforge.net/viennaclexamplesiterative.html

    // Set up some ViennaCL objects:
    viennacl::vector<ScalarType> vcl_rhs;
    viennacl::vector<ScalarType> vcl_result;
    viennacl::compressed_matrix<ScalarType> vcl_matrix;

    /* Initialize and fill all objects here:
       This would initially be done on the CPU via any number of i/o methods and problem initializations.
       For the sake of this example let us define the following:
       vcl_matrix is a 3000 by 3000 band matrix of floats with a band <=40 wide except for a few outliers.
       vcl_rhs is 3000 floats.
       vcl_result is 3000 floats set to 0.0 at this point. */

    // Compute ILUT preconditioners for GPU objects:
    viennacl::linalg::ilut_tag ilut_conf(10, 1e-5); // 10 entries, rel. tol. 1e-5
    typedef viennacl::linalg::ilut_precond< ublas::compressed_matrix<ScalarType> > ublas_ilut_t;
    typedef viennacl::linalg::ilut_precond< viennacl::compressed_matrix<ScalarType> > vcl_ilut_t;

    // preconditioner for ViennaCL objects:
    vcl_ilut_t vcl_ilut(vcl_matrix, ilut_conf);

    // Conjugate gradient solver using ILUT preconditioner
    vcl_result = solve(vcl_matrix,   // using viennacl objects on GPU
                       vcl_rhs,
                       viennacl::linalg::cg_tag(),
                       vcl_ilut);

/* Now we have solved for x in Ax=b. I need to do this 3000 more times with only minor changes to vcl_matrix and a new rhs for every iteration. The changes to vcl_matrix involve zeroing out a number of values (perhaps 5 to 10), running the solver, and then restoring those values to their original state before zeroing out another few values and repeating the process. It seems very wasteful to copy the entire matrix 3000 times to memory. Each uncompressed copy of the matrix takes up about 35MB, with 3,000 copies obviously exceeding even the largest GPU's memory. I say that I could copy them all at the same time (in theory) because each perturbation of vcl_matrix is independent of the others and thus able to be scheduled in parallel. The rhs will need to be sent every time, regardless, but it is much smaller. I believe the solve function does the copy to the GPU (correct?); is there a way to modify the matrix in the GPU memory and let the solver know it does not need to copy the data over? Better yet, I would like to reference the original matrix and make the changes on the fly as each rhs becomes available to be solved by the CPU. Perhaps just sending a list of changes (e.g. vcl_matrix[2094,300] = 0, etc.). Any help in efficiently doing this would be greatly appreciated. */

Thank you,
Matt