From: Karl Rupp <rupp@iu...> - 2012-03-22 22:06:47

Dear ViennaCL users,

ViennaCL 1.2.1 is now available for download! Highlights of the new release are the following features:

- Fixed double precision problems on some (older) AMD GPUs
- Considerable improvements in the handling of matrix_range
- Improved performance of matrix-matrix multiplication
- Extended QR factorization, improved performance
- Direct element access to compressed_matrix now possible
- Fixed incorrect sizes when transferring data from non-square Eigen or MTL matrices

Best regards,
Karl Rupp
From: Karl Rupp <rupp@iu...> - 2012-03-21 16:21:06

Hi Evan,

if you're talking about ILU0, I have a CPU-based implementation lying around somewhere. I'll dig it out and let you know.

Best regards,
Karli

On 03/21/2012 02:58 AM, Evan Bollig wrote:
> Hey Karl, I have a need for ILU with zero fill-in. Do you have that
> available in one of your dev branches? I'll continue another direction
> in the meantime.
>
> Thanks,
> E
From: Karl Rupp <rupp@iu...> - 2012-03-21 09:33:13

Hi Krzysztof,

thanks for the suggestion; we discussed that internally a few weeks ago and are considering it for 1.3.0. The '-cl-mad-enable' flag might give a good boost in performance for BLAS level 3 operations, while BLAS level 1 and 2 won't show much of a difference.

Thanks again and best regards,
Karli

On 03/21/2012 10:21 AM, Krzysztof Bzowski wrote:
> It would be nice to have an overloaded function
> viennacl::ocl::program & add_program
> which receives the user's kernel build options (for example: -cl-mad-enable).
>
> I am referring to the fourth argument of the function clBuildProgram:
> http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clBuildProgram.html
>
> Thanks in advance
From: Krzysztof Bzowski <kbzowski@ag...> - 2012-03-21 09:21:24

It would be nice to have an overloaded function

  viennacl::ocl::program & add_program

which receives the user's kernel build options (for example: -cl-mad-enable). I am referring to the fourth argument of the function clBuildProgram:
http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clBuildProgram.html

Thanks in advance
--
Krzysztof Bzowski
From: Evan Bollig <bollig@gm...> - 2012-03-21 01:58:33

Hey Karl, I have a need for ILU with zero fill-in. Do you have that available in one of your dev branches? I'll continue another direction in the meantime.

Thanks,
E
--
Evan Bollig
bollig@...
From: Karl Rupp <rupp@iu...> - 2012-03-19 16:43:01

Hi Evan,

I'm not sure whether I've understood you correctly. Basically, size checks are performed inside the functions sub(), inplace_sub(), etc. Since these functions are called from the operator overloads, the assertions are there as well. Do you by chance have NDEBUG defined, which eliminates all these assert() statements?

Best regards,
Karli

On 03/19/2012 01:48 AM, Evan Bollig wrote:
> Karl, I was playing with some code and realized I can subtract two VCL
> vectors of different size. I remembered the assertion you have for the
> sub command, but it's not part of the operator. Wondering if this is
> intentional or not?
>
> assert( (viennacl::traits::size(vec1) == viennacl::traits::size(vec2))
>         && (viennacl::traits::size(vec1) == viennacl::traits::size(result))
>         && "Incompatible vector sizes in sub()!");
>
> Cheers,
> E
From: Evan Bollig <bollig@gm...> - 2012-03-19 00:48:49

Karl, I was playing with some code and realized I can subtract two VCL vectors of different size. I remembered the assertion you have for the sub command, but it's not part of the operator. Wondering if this is intentional or not?

assert( (viennacl::traits::size(vec1) == viennacl::traits::size(vec2))
        && (viennacl::traits::size(vec1) == viennacl::traits::size(result))
        && "Incompatible vector sizes in sub()!");

Cheers,
E
--
Evan Bollig
bollig@...
From: Karl Rupp <rupp@iu...> - 2012-03-17 19:47:34

Hi Evan,

> This leads me to believe that UBlas is already sorting the elements at
> append time. Also, the default storage ordering is row_major, which
> gives the arrangement you need. Then all you should need to do is
> access the
>
> coordinate_matrix::index1_data()
> coordinate_matrix::index2_data()
> coordinate_matrix::value_data()

This might indeed work out. I'll have a look at it and try to squeeze some more performance out of it.

Thanks and best regards,
Karli
From: Evan Bollig <bollig@gm...> - 2012-03-16 22:00:56

I'll try it out soon, Karl. In the meantime, a few more things to consider. The UBlas documentation says:

"void append_element (size_type i, size_type j, const_reference t) - Appends the value t at the j-th element of the i-th row. Duplicate elements can be appended to a coordinate_matrix. They are merged into a single arithmetically summed element by the sort function."
(http://www.boost.org/doc/libs/1_49_0/libs/numeric/ublas/doc/matrix_sparse.htm)

This leads me to believe that UBlas is already sorting the elements at append time. Also, the default storage ordering is row_major, which gives the arrangement you need. Then all you should need to do is access the

coordinate_matrix::index1_data()
coordinate_matrix::index2_data()
coordinate_matrix::value_data()

(http://www.boost.org/doc/libs/1_49_0/libs/numeric/ublas/doc/html/classboost_1_1numeric_1_1ublas_1_1coordinate__matrix.html)

E

On Fri, Mar 16, 2012 at 7:53 AM, Karl Rupp <rupp@...> wrote:
> Hi Evan,
>
> It seems to me that append_element() in the UBLAS COO matrix is indeed just
> performing push_back() to the end of a list. That would explain the fast
> assembly times. However, on the GPU we require the data to be ordered with
> respect to row indices in order to get any reasonable performance.
>
> [remainder of quoted message snipped; it appears in full elsewhere in this archive]
From: josephwinston <josephwinston@ma...> - 2012-03-16 20:32:43

Thanks. I'll do exactly that.

On Mar 16, 2012, at 03:12 PM, Karl Rupp <rupp@...> wrote:

Hello,

there is no 'direct' feature supporting this, because one cannot query the available OpenCL memory (API limitation). However, an exception is thrown if one of the memory buffers involved cannot be allocated. You can catch such an exception (defined in viennacl/ocl/error.hpp) and use a fallback mechanism in such a case.

Best regards,
Karli

[quoted original question and mailing list footer snipped]
From: Karl Rupp <rupp@iu...> - 2012-03-16 20:12:27

Hello,

there is no 'direct' feature supporting this, because one cannot query the available OpenCL memory (API limitation). However, an exception is thrown if one of the memory buffers involved cannot be allocated. You can catch such an exception (defined in viennacl/ocl/error.hpp) and use a fallback mechanism in such a case.

Best regards,
Karli

On 03/16/2012 07:59 PM, josephwinston wrote:
> With other packages, for example Paradiso, there are thresholds that can
> be set that determine if the problem is solved "in core" or out. Assume
> for the moment that I am using the ::viennacl::linalg::cg_tag in a
> ::viennacl::linalg::solve. How do I determine before making this call if
> the GPU or even the CPU has enough memory to solve the system given a
> ::viennacl::compressed_matrix<float> and ::viennacl::vector<float>?
>
> Thanks
From: josephwinston <josephwinston@ma...> - 2012-03-16 18:59:42

With other packages, for example Paradiso, there are thresholds that can be set that determine if the problem is solved "in core" or out. Assume for the moment that I am using the ::viennacl::linalg::cg_tag in a ::viennacl::linalg::solve. How do I determine before making this call if the GPU or even the CPU has enough memory to solve the system given a ::viennacl::compressed_matrix<float> and ::viennacl::vector<float>?

Thanks
From: Karl Rupp <rupp@iu...> - 2012-03-16 11:53:54

Hi Evan,

It seems to me that append_element() in the UBLAS COO matrix is indeed just performing push_back() to the end of a list. That would explain the fast assembly times. However, on the GPU we require the data to be ordered with respect to row indices in order to get any reasonable performance. Consequently, an iteration over the COO matrix in the form

  for (const_iterator1 row_it = cpu_matrix.begin1();
       row_it != cpu_matrix.end1();
       ++row_it)
  {
    for (const_iterator2 col_it = row_it.begin();
         col_it != row_it.end();
         ++col_it)
    { ... }
  }

is carried out. I think that the COO matrix is just slow on this type of operation, because it needs to resolve duplicates appended to the list of entries. Thus, the time saved in assembly is then lost in matrix multiplications and/or copy-to-GPU.

Could you please run a benchmark for one such iteration over all entries of a COO matrix? You should get an execution time of about 200 (ms?) for that.

Best regards,
Karli

On 03/16/2012 12:30 PM, Evan Bollig wrote:
> Ah, -DNDEBUG. I totally overlooked that detail. I'm accustomed to
> adding flags to enable debugging, not disable it. :)
>
> [remainder of quoted message and benchmark tables snipped; they appear
> in full elsewhere in this archive]
From: Evan Bollig <bollig@gm...> - 2012-03-16 11:31:02

Ah, -DNDEBUG. I totally overlooked that detail. I'm accustomed to adding flags to enable debugging, not disable it. :)

Here is the update. The COO to COO/CSR copy is still expensive compared to CSR to COO/CSR, but the gap definitely closed between the CPU and GPU multiply.

===================
27556  UBLAS_COO_CPU  Assemble             - avg: 15.7583   - tot: 47.2750   - count= 3
27556  UBLAS_COO_CPU  Multiply test        - avg: 99.4470   - tot: 99.4470   - count= 1
27556  UBLAS_COO_CPU  Copy To VCL_COO_GPU  - avg: 244.5460  - tot: 244.5460  - count= 1
27556  VCL_COO_GPU    Multiply test        - avg: 8.6465    - tot: 17.2930   - count= 2
27556  UBLAS_CSR_CPU  Assemble             - avg: 89.5137   - tot: 268.5410  - count= 3
27556  UBLAS_CSR_CPU  Multiply test        - avg: 5.8380    - tot: 5.8380    - count= 1
27556  UBLAS_CSR_CPU  Copy To VCL_CSR_GPU  - avg: 21.6990   - tot: 21.6990   - count= 1
27556  VCL_CSR_GPU    Multiply test        - avg: 8.0110    - tot: 16.0220   - count= 2
27556  UBLAS_COO_CPU  Copy To VCL_CSR_GPU  - avg: 234.3600  - tot: 234.3600  - count= 1
27556  UBLAS_CSR_CPU  Copy To VCL_COO_GPU  - avg: 26.9620   - tot: 26.9620   - count= 1
===================

Evan

On Fri, Mar 16, 2012 at 5:07 AM, Karl Rupp <rupp@...> wrote:
> Hi Evan,
>
> [quoted message snipped; it appears in full elsewhere in this archive]
--
Evan Bollig
bollig@...
From: Karl Rupp <rupp@iu...> - 2012-03-16 09:08:09

Hi Evan,

> First, I tested using the std::map for the CPU representation. It's not
> bad. The cost of transfer to GPU is a little high compared to the cost
> of a CPU side SpMV, but nothing of major concern.
> [STL map benchmark table snipped; it appears in full elsewhere in this archive]

Alright, these numbers are fine, similar to what I've observed here.

> Then I tested the UBLAS Compressed and Coordinate matrices to see what
> impact they would have.
> [UBLAS benchmark table snipped]

Did you compile with the NDEBUG preprocessor constant defined? If this is missing, you get really bad performance with UBLAS similar to what is shown in the table. It mostly affects the matrix-vector multiplication, but there is also quite some overhead for the copy.

> You'll notice a nice speedup in assembly from UBlas versus the
> std::map. Clearly, UBLAS is inefficient for a CPU multiply. But we're
> targeting the GPU so it's not that troubling.

It seems like there is some CPU caching effect involved. For larger matrices (say, above 100k unknowns) the STL types are usually significantly faster than the CSR. I'm not entirely sure for the COO format, since I don't know all details of the internal storage scheme in UBLAS. The matrix-vector multiplication is definitely suffering from NDEBUG not being defined; it is usually comparable to the STL case.

> My biggest concern above is the difference in cost to copy from UBlas
> to ViennaCL. When I start with a COO and copy to either COO or CSR it
> is 10x slower than starting with a CSR matrix. Correct me if I'm
> wrong, but with the COO format both the CPU and GPU need to store
> three vectors representing the row/column index and values. If that's
> the case then the vectors should be directly copied from CPU to GPU.
> Perhaps you do some interleaving of the vectors?

There is a spurious temporary array involved in both the CSR and the COO case, mostly because I was unable to get direct access to the internal arrays of the UBLAS matrices. OpenCL requires the data to be in a single piece of memory, and this may not be the case with UBLAS matrices. Moreover, access to entries without NDEBUG defined adds another serious overhead. Could you please rerun your benchmarks with NDEBUG defined?

Thanks and best regards,
Karli
From: Karl Rupp <rupp@iu...> - 2012-03-16 08:49:25

Hi Evan,

thanks for reporting this. It seems like one of the operator overloads is missing here; it will be fixed in version 1.2.1 (released in the next few days).

Best regards,
Karli

On 03/16/2012 12:21 AM, Evan Bollig wrote:
> Karl, more issues with COO.
>
> I call GMRES like this for a CSR matrix (this works):
> ==================
> viennacl::linalg::gmres_tag tag(1e-8, 100);
> VCL_CSR_Mat A_csr(U_exact.size(), U_exact.size());
> U_approx = viennacl::linalg::solve(A_csr, U_exact, tag);
> ==================
>
> When I change to COO it fails:
> ==================
> viennacl::linalg::gmres_tag tag(1e-8, 100);
> VCL_COO_Mat A_coo(U_exact.size(), U_exact.size());
> U_approx = viennacl::linalg::solve(A_coo, U_exact, tag);
>
> Error message below:
> ==================
> In file included from
> /home/code/external/opencl_libraries/viennacl-1.2.0/viennacl/coordinate_matrix.hpp:32,
> from /home/code/tests/util/viennacl_gmres_poisson/main.cpp:9:
> /home/code/external/opencl_libraries/viennacl-1.2.0/viennacl/linalg/coordinate_matrix_operations.hpp:175:
> error: 'const class viennacl::vector_expression<const
> viennacl::coordinate_matrix<double, 128u>, const
> viennacl::vector<double, 1u>, viennacl::op_prod>' has no member named
> 'get_lhs'
> [template instantiation trace snipped]
> make[2]: *** [tests/util/viennacl_gmres_poisson/CMakeFiles/viennacl_gmres_poisson.x.dir/main.cpp.o]
> Error 1
> make[1]: *** [tests/util/viennacl_gmres_poisson/CMakeFiles/viennacl_gmres_poisson.x.dir/all]
> Error 2
> ==================
From: Evan Bollig <bollig@gm...> - 2012-03-16 02:04:48

Hey Karl,

I was doing some SpMV benchmarks and noticed something interesting. I tested a range of problem sizes between N=1024 and N=27556 unknowns, with the sparse matrix containing 40 nonzeros per row (nonsymmetric).

The labels in my benchmarks below indicate:
a) the problem size (N=27556)
b) the matrix type used
c) the task
d) timing info

I always assemble the matrix on the CPU. The multiply is either CPU or GPU, as indicated by the matrix type, but always viennacl::linalg::prod(..). For the GPU multiply I also logged the time to copy the matrix to the device. I was curious to see what kind of impact a copy from a COO_CPU to a CSR_GPU would have.

These are my typedefs:

typedef std::vector< std::map< unsigned int, double> > STL_Sparse_Mat;
typedef boost::numeric::ublas::compressed_matrix<double> UBLAS_CSR_Mat;
typedef boost::numeric::ublas::coordinate_matrix<double> UBLAS_COO_Mat;
typedef viennacl::compressed_matrix<double> VCL_CSR_Mat;
typedef viennacl::coordinate_matrix<double> VCL_COO_Mat;

First, I tested using the std::map for the CPU representation. It's not bad. The cost of transfer to the GPU is a little high compared to the cost of a CPU-side SpMV, but nothing of major concern.

============= STL Map ===============
27556 STL_Sparse_Mat Assemble              avg: 133.1483  tot: 798.8900  count= 6
27556 STL_Sparse_Mat Multiply              avg: 34.1520   tot: 68.3040   count= 2
27556 STL_Sparse_Mat Copy To VCL_COO_GPU   avg: 106.2620  tot: 212.5240  count= 2
27556 VCL_COO_GPU Multiply                 avg: 8.6280    tot: 17.2560   count= 2
27556 STL_Sparse_Mat Copy To VCL_CSR_GPU   avg: 100.4015  tot: 200.8030  count= 2
27556 VCL_CSR_GPU Multiply                 avg: 7.9940    tot: 15.9880   count= 2
====================================

Then I tested the uBLAS compressed and coordinate matrices to see what impact they would have.

============= UBLAS Matrix ===============
27556 UBLAS_COO_CPU Assemble               avg: 26.7830     tot: 80.3490     count= 3
27556 UBLAS_COO_CPU Multiply test          avg: 41728.3359  tot: 41728.3359  count= 1
27556 UBLAS_COO_CPU Copy To VCL_COO_GPU    avg: 879.2710    tot: 879.2710    count= 1
27556 VCL_COO_GPU Multiply test            avg: 8.4865      tot: 16.9730     count= 2
27556 UBLAS_CSR_CPU Assemble               avg: 107.1820    tot: 321.5460    count= 3
27556 UBLAS_CSR_CPU Multiply test          avg: 4933.7212   tot: 4933.7212   count= 1
27556 UBLAS_CSR_CPU Copy To VCL_CSR_GPU    avg: 83.7720     tot: 83.7720     count= 1
27556 VCL_CSR_GPU Multiply test            avg: 7.8510      tot: 15.7020     count= 2
27556 UBLAS_COO_CPU Copy To VCL_CSR_GPU    avg: 877.8260    tot: 877.8260    count= 1
27556 UBLAS_CSR_CPU Copy To VCL_COO_GPU    avg: 83.4390     tot: 83.4390     count= 1
=======================================

I should specify that the uBLAS times for COO assembly use this type of fill:

A.append_element(i, j, val);

whereas the CSR uses:

A(i,j) = val;

You'll notice a nice speedup in assembly for uBLAS versus the std::map. Clearly, uBLAS is inefficient for a CPU multiply, but we're targeting the GPU so it's not that troubling.

My biggest concern above is the difference in cost to copy from uBLAS to ViennaCL. When I start with a COO matrix and copy to either COO or CSR, it is 10x slower than starting with a CSR matrix. Correct me if I'm wrong, but with the COO format both the CPU and GPU need to store three vectors representing the row indices, column indices, and values. If that's the case, then the vectors should be directly copyable from CPU to GPU. Perhaps you do some interleaving of the vectors?

Evan

--
Evan Bollig
bollig@...
bollig@...
From: Evan Bollig <bollig@gm...> - 2012-03-15 23:22:11

Karl, more issues with COO.

I call GMRES like this for a CSR matrix (this works):
==================
viennacl::linalg::gmres_tag tag(1e-8, 100);
VCL_CSR_Mat A_csr(U_exact.size(), U_exact.size());
U_approx = viennacl::linalg::solve(A_csr, U_exact, tag);
=========

When I change to COO it fails:
=======================
viennacl::linalg::gmres_tag tag(1e-8, 100);
VCL_COO_Mat A_coo(U_exact.size(), U_exact.size());
U_approx = viennacl::linalg::solve(A_coo, U_exact, tag);

Error Message Below:
================
In file included from /home/code/external/opencl_libraries/viennacl-1.2.0/viennacl/coordinate_matrix.hpp:32,
                 from /home/code/tests/util/viennacl_gmres_poisson/main.cpp:9:
/home/code/external/opencl_libraries/viennacl-1.2.0/viennacl/linalg/coordinate_matrix_operations.hpp:
In member function ‘viennacl::vector<SCALARTYPE, ALIGNMENT>& viennacl::vector<SCALARTYPE, ALIGNMENT>::operator=(const viennacl::vector_expression<const viennacl::coordinate_matrix<SCALARTYPE, MAT_ALIGNMENT>, const viennacl::vector<SCALARTYPE, ALIGNMENT>, viennacl::op_prod>&) [with unsigned int MAT_ALIGNMENT = 128u, SCALARTYPE = double, unsigned int ALIGNMENT = 1u]’:
/home/code/external/opencl_libraries/viennacl-1.2.0/viennacl/linalg/gmres.hpp:165:
instantiated from ‘VectorType viennacl::linalg::solve(const MatrixType&, const VectorType&, const viennacl::linalg::gmres_tag&, const PreconditionerType&) [with MatrixType = viennacl::coordinate_matrix<double, 128u>, VectorType = viennacl::vector<double, 1u>, PreconditionerType = viennacl::linalg::no_precond]’
/home/code/external/opencl_libraries/viennacl-1.2.0/viennacl/linalg/gmres.hpp:357:
instantiated from ‘VectorType viennacl::linalg::solve(const MatrixType&, const VectorType&, const viennacl::linalg::gmres_tag&) [with MatrixType = VCL_COO_Mat, VectorType = viennacl::vector<double, 1u>]’
/home/code/tests/util/viennacl_gmres_poisson/main.cpp:121:
instantiated from ‘void benchmark_GMRES_Device(MatT&, VecT&, VecT&) [with MatT = viennacl::coordinate_matrix<double, 128u>, VecT = viennacl::vector<double, 1u>]’
/home/code/tests/util/viennacl_gmres_poisson/main.cpp:216:
instantiated from ‘void run_SpMV(RBFFD&, Grid&) [with MatType = std::vector<std::map<unsigned int, double, std::less<unsigned int>, std::allocator<std::pair<const unsigned int, double> > >, std::allocator<std::map<unsigned int, double, std::less<unsigned int>, std::allocator<std::pair<const unsigned int, double> > > > >, OpMatType = viennacl::coordinate_matrix<double, 128u>, MatrixType assemble_t_e = (MatrixType)4, MatrixType operate_t_e = (MatrixType)4]’
/home/code/tests/util/viennacl_gmres_poisson/main.cpp:299:
instantiated from ‘void run_test(RBFFD&, Grid&) [with MatrixType assemble_t_e = (MatrixType)4, MatrixType operate_t_e = (MatrixType)4]’
/home/code/tests/util/viennacl_gmres_poisson/main.cpp:386:
instantiated from here
/home/code/external/opencl_libraries/viennacl-1.2.0/viennacl/linalg/coordinate_matrix_operations.hpp:175:
error: ‘const class viennacl::vector_expression<const viennacl::coordinate_matrix<double, 128u>, const viennacl::vector<double, 1u>, viennacl::op_prod>’ has no member named ‘get_lhs’
make[2]: *** [tests/util/viennacl_gmres_poisson/CMakeFiles/viennacl_gmres_poisson.x.dir/main.cpp.o] Error 1
make[1]: *** [tests/util/viennacl_gmres_poisson/CMakeFiles/viennacl_gmres_poisson.x.dir/all] Error 2
=====================

--
Evan Bollig
bollig@...
bollig@...
From: Karl Rupp <rupp@iu...> - 2012-03-10 09:35:22

Hi Evan,

we have been thinking about a git repository on SourceForge for a while already. A git repository is already in use internally, so it shouldn't be too much of an effort; still, it requires some preparation and might take a few more months (lots of conferences approaching right now...).

Best regards,
Karli

On 03/10/2012 12:29 AM, Evan Bollig wrote:
> Hey Karl, I'm working on a few projects that leverage viennacl. I'd
> like to include viennacl as a git submodule so I can easily fetch a
> new version on demand. Do you have a git repo?
>
> Evan
From: Karl Rupp <rupp@iu...> - 2012-03-10 09:23:02

Hi Evan,

thanks for reporting the issue. You are absolutely right, coordinate_matrix is supposed to work for small matrix sizes as well. I'll include a fix for this in version 1.2.1, which will be released in the next couple of days.

Best regards,
Karli

On 03/10/2012 01:03 AM, Evan Bollig wrote:
> Found the problem. I was testing a very small system: 36x36 elements.
> When I increased to 1024 the coordinate matrix started working fine.
> Not a serious bug, but it seems odd that the compressed matrix would
> work for a small system and the coordinate matrix would not.
>
> Cheers,
> Evan
>
> On Fri, Mar 9, 2012 at 7:02 PM, Evan Bollig <bollig@...> wrote:
>> Karl, I have been using the compressed matrix, but wanted to compare
>> performance to the coordinate matrix format today. I assemble matrices
>> using vectors of maps:
>>
>> =================
>> typedef std::vector< std::map< unsigned int, double> > MatType;
>> typedef viennacl::coordinate_matrix<double> MatTypeGPU;
>>
>> MatType A( N );
>>
>> for (unsigned int i = 0; i < N; i++) {
>>     StencilType& sten = grid.getStencil(i);
>>
>>     for (unsigned int j = 0; j < n; j++) {
>>         A[i][sten[j]] = 1;
>>     }
>> }
>>
>> MatTypeGPU A_gpu( N, N );
>> copy(A, A_gpu);
>>
>> std::vector<double> x_host(A.size1(), 1);
>> viennacl::vector<double> x(A.size1());
>> viennacl::copy(x_host.begin(), x_host.end(), x.begin());
>>
>> b = viennacl::linalg::prod(A_gpu, x);
>>
>> std::vector<double> b_host(A.size1(), 1);
>> viennacl::copy(b.begin(), b.end(), b_host.begin());
>>
>> ================
>>
>> When I change my MatTypeGPU to compressed_matrix this code works like
>> a charm. But for coordinate matrix it crashes with the error:
>>
>> ===========
>> terminate called after throwing an instance of
>> 'viennacl::ocl::invalid_command_queue'
>>   what(): ViennaCL: FATAL ERROR: CL_INVALID_COMMAND_QUEUE.
>> If you think that this is a bug in ViennaCL, please report it at
>> viennacl-support@... and supply at least the
>> following information:
>>  * Operating System
>>  * Which OpenCL implementation (AMD, NVIDIA, etc.)
>>  * ViennaCL version
>> Many thanks in advance!
>> Aborted (core dumped)
>> ============
>>
>> Any thoughts? The same behavior occurs if I start with a ublas matrix.
>> --
>> Evan Bollig
>> bollig@...
>> bollig@...
From: Evan Bollig <bollig@gm...> - 2012-03-10 00:04:12

Found the problem. I was testing a very small system: 36x36 elements. When I increased to 1024 the coordinate matrix started working fine. Not a serious bug, but it seems odd that the compressed matrix would work for a small system and the coordinate matrix would not.

Cheers,
Evan

On Fri, Mar 9, 2012 at 7:02 PM, Evan Bollig <bollig@...> wrote:
> Karl, I have been using the compressed matrix, but wanted to compare
> performance to the coordinate matrix format today. I assemble matrices
> using vectors of maps:
>
> =================
> typedef std::vector< std::map< unsigned int, double> > MatType;
> typedef viennacl::coordinate_matrix<double> MatTypeGPU;
>
> MatType A( N );
>
> for (unsigned int i = 0; i < N; i++) {
>     StencilType& sten = grid.getStencil(i);
>
>     for (unsigned int j = 0; j < n; j++) {
>         A[i][sten[j]] = 1;
>     }
> }
>
> MatTypeGPU A_gpu( N, N );
> copy(A, A_gpu);
>
> std::vector<double> x_host(A.size1(), 1);
> viennacl::vector<double> x(A.size1());
> viennacl::copy(x_host.begin(), x_host.end(), x.begin());
>
> b = viennacl::linalg::prod(A_gpu, x);
>
> std::vector<double> b_host(A.size1(), 1);
> viennacl::copy(b.begin(), b.end(), b_host.begin());
>
> ================
>
> When I change my MatTypeGPU to compressed_matrix this code works like
> a charm. But for coordinate matrix it crashes with the error:
>
> ===========
> terminate called after throwing an instance of
> 'viennacl::ocl::invalid_command_queue'
>   what(): ViennaCL: FATAL ERROR: CL_INVALID_COMMAND_QUEUE.
> If you think that this is a bug in ViennaCL, please report it at
> viennacl-support@... and supply at least the
> following information:
>  * Operating System
>  * Which OpenCL implementation (AMD, NVIDIA, etc.)
>  * ViennaCL version
> Many thanks in advance!
> Aborted (core dumped)
> ============
>
> Any thoughts? The same behavior occurs if I start with a ublas matrix.
> --
> Evan Bollig
> bollig@...
> bollig@...

--
Evan Bollig
bollig@...
bollig@...
From: Evan Bollig <bollig@gm...> - 2012-03-10 00:02:45

Karl,

I have been using the compressed matrix, but wanted to compare performance to the coordinate matrix format today. I assemble matrices using vectors of maps:

=================
typedef std::vector< std::map< unsigned int, double> > MatType;
typedef viennacl::coordinate_matrix<double> MatTypeGPU;

MatType A( N );

for (unsigned int i = 0; i < N; i++) {
    StencilType& sten = grid.getStencil(i);

    for (unsigned int j = 0; j < n; j++) {
        A[i][sten[j]] = 1;
    }
}

MatTypeGPU A_gpu( N, N );
copy(A, A_gpu);

std::vector<double> x_host(A.size1(), 1);
viennacl::vector<double> x(A.size1());
viennacl::copy(x_host.begin(), x_host.end(), x.begin());

b = viennacl::linalg::prod(A_gpu, x);

std::vector<double> b_host(A.size1(), 1);
viennacl::copy(b.begin(), b.end(), b_host.begin());

================

When I change my MatTypeGPU to compressed_matrix this code works like a charm. But for coordinate matrix it crashes with the error:

===========
terminate called after throwing an instance of
'viennacl::ocl::invalid_command_queue'
  what(): ViennaCL: FATAL ERROR: CL_INVALID_COMMAND_QUEUE.
If you think that this is a bug in ViennaCL, please report it at
viennacl-support@... and supply at least the
following information:
 * Operating System
 * Which OpenCL implementation (AMD, NVIDIA, etc.)
 * ViennaCL version
Many thanks in advance!
Aborted (core dumped)
============

Any thoughts? The same behavior occurs if I start with a ublas matrix.

--
Evan Bollig
bollig@...
bollig@...
From: Evan Bollig <bollig@gm...> - 2012-03-09 23:29:46

Hey Karl,

I'm working on a few projects that leverage viennacl. I'd like to include viennacl as a git submodule so I can easily fetch a new version on demand. Do you have a git repo?

Evan

--
Evan Bollig
bollig@...
bollig@...
From: Karl Rupp <rupp@iu...> - 2012-03-04 12:34:32

Hi Michael,

great - thanks a lot!

Best regards,
Karli

On 03/04/2012 01:30 PM, Michael Wild wrote:
> Hi all
>
> I'm glad to announce that ViennaCL 1.2.0 has been uploaded to Debian
> unstable:
> http://packages.qa.debian.org/v/viennacl/news/20120303T222333Z.html
>
> Cheers
>
> Michael
>
> _______________________________________________
> ViennaCL-support mailing list
> ViennaCL-support@...
> https://lists.sourceforge.net/lists/listinfo/viennacl-support
From: Michael Wild <themiwi@us...> - 2012-03-04 12:30:40

Hi all,

I'm glad to announce that ViennaCL 1.2.0 has been uploaded to Debian unstable:
http://packages.qa.debian.org/v/viennacl/news/20120303T222333Z.html

Cheers

Michael