From: Karl R. <ru...@iu...> - 2018-06-15 19:11:59

Hi,

> I have discovered something rather odd.
>
> If I run a minimal example (as shown in eigen-with-viennacl.cpp) all
> runs fine where I pass a compressed_matrix directly to the copy, e.g.
>
>   Eigen::SparseMatrix<float, Eigen::RowMajor> spAm;
>   // ... code to fill spAm;
>   viennacl::matrix<float> A = viennacl::compressed_matrix(K, M)

I assume you mean

  viennacl::compressed_matrix<float> A = viennacl::compressed_matrix<float>(K, M);

here? This may result in an unnecessary temporary object, so I'd recommend

>   viennacl::copy(spAm, A);
>
> However, if my compressed_matrix is wrapped up in a std::shared_ptr
> (C++11) I don't seem to be able to copy even when dereferencing.
>
>   std::shared_ptr<viennacl::compressed_matrix<T> > shptr =
>       std::make_shared<viennacl::compressed_matrix<float> >(viennacl::compressed_matrix<T>(K,M));
>
>   viennacl::copy(spAm, *shptr);

T == float?

> Strangely, this results in a series of memory errors; the top relevant
> ones are here:
>
>   0x00000000701CF5F6 (0x0000000018C5A670 0x000000001502E060 0x00000000043EB20B 0x00000000043EB230),
>     _ZN8viennacl7backend13memory_createERNS0_10mem_handleEyRKNS_7contextEPKv() + 0x1C6 bytes(s)
>   0x00000000701C00C0 (0x0000000000000000 0x0000000004627FD0 0x0000000000000010 0x0000000000000004),
>     _ZN8viennacl6detail9copy_implINS_5tools27const_sparse_matrix_adapterIdjEEdLj1EEEvRKT_RNS_17compressed_matrixIT0_XT1_EEEy() + 0x3A0 bytes(s)
>   0x00000000701BD721 (0x00000000043EB350 0x0000000004627FD0 0x0000000004530620 0x0000000007474070),
>     _ZN8viennacl4copyIdLi1ELj1EEEvRKN5Eigen12SparseMatrixIT_XT0_EiEERNS_17compressed_matrixIS3_XT1_EEE() + 0x321 bytes(s)
>
> which after some c++filt gives the following:
>
>   viennacl::backend::memory_create(viennacl::backend::mem_handle&,
>       unsigned long long, viennacl::context const&, void const*)
>   void viennacl::detail::copy_impl<viennacl::tools::const_sparse_matrix_adapter<float, unsigned int>, float, 1u>(
>       viennacl::tools::const_sparse_matrix_adapter<float, unsigned int> const&,
>       viennacl::compressed_matrix<float, 1u>&, unsigned long long)
>   void viennacl::copy<float, 1, 1u>(Eigen::SparseMatrix<float, 1, int> const&,
>       viennacl::compressed_matrix<float, 1u>&)
>
> Any insight as to why this happens would be appreciated.

compressed_matrix has the copy constructor implemented, so that should be
okay. Maybe it doesn't copy *all* internal members. I'll try to reproduce
the problem so that I can debug it.

Best regards,
Karli
From: Karl R. <ru...@iu...> - 2018-06-15 19:03:21

Hi Charles,

no, there is currently no way of extracting a submatrix from a sparse
matrix. I agree that it is useful in some circumstances to *extract* a
submatrix, but the type system and boilerplate required to fully deal with
things like

  prod(matrix_range(A, rows, cols), matrix_range(B, rows, cols));

is outright scary.

Best regards,
Karli

On 06/12/2018 07:24 PM, Charles Determan wrote:
> Greetings,
>
> Is there a way to take proxy (i.e. matrix_range) subsets of
> compressed_matrix objects?
>
> Currently, when I try to compile, g++ returns an error:
>
>   error: no type named 'cpu_value_type' in 'class viennacl::compressed_matrix<float>'
>     typedef matrix_base<typename MatrixType::cpu_value_type> base_type;
>
> I want to make sure whether I am making a simple mistake or if this is
> currently not possible.
>
> Thanks,
> Chaz
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>
> _______________________________________________
> ViennaCL-devel mailing list
> Vie...@li...
> https://lists.sourceforge.net/lists/listinfo/viennacl-devel
From: Charles D. <cde...@gm...> - 2018-06-12 20:58:11

Greetings,

I have discovered something rather odd.

If I run a minimal example (as shown in eigen-with-viennacl.cpp) all runs
fine where I pass a compressed_matrix directly to the copy, e.g.

  Eigen::SparseMatrix<float, Eigen::RowMajor> spAm;
  // ... code to fill spAm;
  viennacl::matrix<float> A = viennacl::compressed_matrix(K, M)
  viennacl::copy(spAm, A);

However, if my compressed_matrix is wrapped up in a std::shared_ptr (C++11)
I don't seem to be able to copy even when dereferencing.

  std::shared_ptr<viennacl::compressed_matrix<T> > shptr =
      std::make_shared<viennacl::compressed_matrix<float> >(viennacl::compressed_matrix<T>(K,M));

  viennacl::copy(spAm, *shptr);

Strangely, this results in a series of memory errors; the top relevant
ones are here:

  0x00000000701CF5F6 (0x0000000018C5A670 0x000000001502E060 0x00000000043EB20B 0x00000000043EB230),
    _ZN8viennacl7backend13memory_createERNS0_10mem_handleEyRKNS_7contextEPKv() + 0x1C6 bytes(s)
  0x00000000701C00C0 (0x0000000000000000 0x0000000004627FD0 0x0000000000000010 0x0000000000000004),
    _ZN8viennacl6detail9copy_implINS_5tools27const_sparse_matrix_adapterIdjEEdLj1EEEvRKT_RNS_17compressed_matrixIT0_XT1_EEEy() + 0x3A0 bytes(s)
  0x00000000701BD721 (0x00000000043EB350 0x0000000004627FD0 0x0000000004530620 0x0000000007474070),
    _ZN8viennacl4copyIdLi1ELj1EEEvRKN5Eigen12SparseMatrixIT_XT0_EiEERNS_17compressed_matrixIS3_XT1_EEE() + 0x321 bytes(s)

which after some c++filt gives the following:

  viennacl::backend::memory_create(viennacl::backend::mem_handle&,
      unsigned long long, viennacl::context const&, void const*)
  void viennacl::detail::copy_impl<viennacl::tools::const_sparse_matrix_adapter<float, unsigned int>, float, 1u>(
      viennacl::tools::const_sparse_matrix_adapter<float, unsigned int> const&,
      viennacl::compressed_matrix<float, 1u>&, unsigned long long)
  void viennacl::copy<float, 1, 1u>(Eigen::SparseMatrix<float, 1, int> const&,
      viennacl::compressed_matrix<float, 1u>&)

Any insight as to why this happens would be appreciated.

Regards,
Charles
From: Charles D. <cde...@gm...> - 2018-06-12 17:24:49

Greetings,

Is there a way to take proxy (i.e. matrix_range) subsets of
compressed_matrix objects?

Currently, when I try to compile, g++ returns an error:

  error: no type named 'cpu_value_type' in 'class viennacl::compressed_matrix<float>'
    typedef matrix_base<typename MatrixType::cpu_value_type> base_type;

I want to make sure whether I am making a simple mistake or if this is
currently not possible.

Thanks,
Chaz
From: Charles D. <cde...@gm...> - 2018-01-25 19:11:43

Greetings,

I am currently working on setting up an existing project, which uses
ViennaCL with the OpenCL backend, to also compile with the CUDA backend. I
have encountered two issues that I would like to ask about.

1. CUDA 'contexts'

I previously was making sure that new viennacl::matrix objects were being
created with the correct context. For example:

  // ctx_id previously passed as int
  viennacl::context ctx(viennacl::ocl::get_context(ctx_id));

  // new matrix
  viennacl::matrix<T> vclMat(K, M, ctx = ctx);

Now, obviously, if I am using the CUDA backend there is no
viennacl::ocl::get_context method. What is the recommendation in this
instance?

2. Compilation fails with Eigen::Map classes

I thought this was resolved with a previous issue
(https://github.com/viennacl/viennacl-dev/issues/137), but when I try to
compile with nvcc I get the following error (a representative example of
many):

  /usr/local/lib/R/site-library/RViennaCL/include/viennacl/traits/size.hpp(164): error:
      class "Eigen::Map<Eigen::Matrix<int, -1, -1, 0, -1, -1>, 0, Eigen::OuterStride<-1>>"
      has no member "size1"
    detected during:
      instantiation of "viennacl::vcl_size_t viennacl::traits::size1(const MatrixType &)
          [with MatrixType=Eigen::Map<Eigen::Matrix<int, -1, -1, 0, -1, -1>, 0, Eigen::OuterStride<-1>>]"
      /usr/local/lib/R/site-library/RViennaCL/include/viennacl/matrix.hpp(1178): here
      instantiation of "void viennacl::copy(const viennacl::matrix<NumericT, F, AlignmentV> &, CPUMatrixT &)
          [with CPUMatrixT=Eigen::Map<Eigen::Matrix<int, -1, -1, 0, -1, -1>, 0, Eigen::OuterStride<-1>>,
           NumericT=int, F=viennacl::row_major, AlignmentV=1U]"
      ../inst/include/gpuR/dynEigenMat.hpp(353): here
      instantiation of "void dynEigenMat<T>::to_host() [with T=int]"
      ../inst/include/gpuR/dynEigenMat.hpp(377): here

Regards,
Charles
From: Karl R. <ru...@iu...> - 2017-08-11 17:15:25

Hi Charles,

> I have another curious situation. I have installed pocl 0.14 on an
> Ubuntu 14.04 system. I can install and run clinfo without any problems.
> However, when I compile and run my context.cpp file
> (https://github.com/cdeterman/gpuR/blob/develop/src/context.cpp) and try
> to run the initContexts function I keep getting a -1001 error for the
> get_platforms call.
>
> Any idea why this script would be failing to find the platforms, whereas
> even a basic query file like this
> (https://github.com/cdeterman/pocl_test/blob/master/clDeviceQuery.cpp)
> can be compiled simply with
>
>   g++ -o clDeviceQuery clDeviceQuery.cpp -lOpenCL
>
> and run without a problem?

Whenever I've seen problems with querying platform information, it was due
to a problem with the OpenCL environment: either the GPU driver was not
correctly installed (unlikely in your case) or the wrong libOpenCL.so was
picked up (more likely here). Can you verify with ldd that the correct
libOpenCL.so is picked up? clDeviceQuery and gpuR should pick up the same
libOpenCL.so.

Best regards,
Karli
From: Charles D. <cde...@gm...> - 2017-08-11 16:03:10

Greetings,

I have another curious situation. I have installed pocl 0.14 on an Ubuntu
14.04 system. I can install and run clinfo without any problems. However,
when I compile and run my context.cpp file
(https://github.com/cdeterman/gpuR/blob/develop/src/context.cpp) and try to
run the initContexts function I keep getting a -1001 error for the
get_platforms call.

Any idea why this script would be failing to find the platforms, whereas
even a basic query file like this
(https://github.com/cdeterman/pocl_test/blob/master/clDeviceQuery.cpp)
can be compiled simply with

  g++ -o clDeviceQuery clDeviceQuery.cpp -lOpenCL

and run without a problem?

Thanks,
Charles
From: Charles D. <cde...@gm...> - 2017-05-03 19:45:25

Greetings,

I am working on implementing some conversions from complex Eigen matrices
to viennacl matrices. I understand there are no native complex data types;
instead the matrix is expanded to hold real/imaginary components in every
other column. That said, I have run into a problem when trying to use the
copy command. If I want to copy a matrix_slice object, I am able to do it
directly to an Eigen::Matrix object:

  Eigen::MatrixXf Am;
  viennacl::matrix_slice<viennacl::matrix<float> > Av;

  viennacl::copy(Av, Am);

However, if I try to copy to the 'real' component of an Eigen matrix,
which is essentially a MatrixXf object, I get an error:

  Eigen::MatrixXcf Am;
  viennacl::matrix_slice<viennacl::matrix<float> > Av;

  viennacl::copy(Av, Am.real());

Even explicitly casting the object to an Eigen Matrix type results in an
error:

  viennacl::copy(Av, (Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic>)Am.real());

  error: no matching function for call to
    ‘copy(viennacl::matrix_slice<viennacl::matrix<float, viennacl::row_major> >&,
          Eigen::Matrix<float, -1, -1>)’
    viennacl::copy(A_sub, (Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic>)(Am.real()));

Any thoughts why this would be the case?

Thanks,
Charles
From: Sensei <sen...@gm...> - 2017-02-19 08:48:48

> On Feb 19, 2017, at 12:14am, Oswin Krause <Osw...@ru...> wrote:
>
> Hi,
>
> I would argue you have some serious problem with your setup if copying
> 1MB takes time in the order of minutes (and even if it's milliseconds, I
> would consider this way too much).

Hi Oswin,

Yes, I believe there is a problem, but I don't know what to debug.

> I am also concerned that your new CPU timing is 0.

Why? Caching is a good cause for just 1M items.

> That resizing takes longer on a GPU is expected, as you are doing
> something way more complicated when allocating storage on the GPU
> compared to a CPU.
>
> Also consider doing some real work. Computing the norm of a 1MB vector
> is not really something one would consider the GPU for. All simple
> vector operations/reductions are a pain to get fast on a GPU. Compare
> matrix-matrix multiplications or similar.

Thanks, I will try ASAP with more complex operations.

Thank you!
From: Oswin K. <Osw...@ru...> - 2017-02-18 23:39:45

Hi,

I would argue you have some serious problem with your setup if copying 1MB
takes time in the order of minutes (and even if it's milliseconds, I would
consider this way too much). I am also concerned that your new CPU timing
is 0.

That resizing takes longer on a GPU is expected, as you are doing something
way more complicated when allocating storage on the GPU compared to a CPU.

Also consider doing some real work. Computing the norm of a 1MB vector is
not really something one would consider the GPU for. All simple vector
operations/reductions are a pain to get fast on a GPU. Compare
matrix-matrix multiplications or similar.

On 2017-02-18 16:02, Sensei wrote:
>> The GPU is an Intel HD Graphics 4000.
>>
>> What worries me is not only the warmup, but the copy phase. In release
>> it can take a lot of time for a simple 1M items (169 seconds!). I
>> suspect I should try to do everything on the GPU. Not only that, but
>> this call:
>>
>>   viennacl::vector<float> gpuv;
>>   gpuv.resize(size);
>>
>> is taking 6 times what it takes on the CPU.
>>
>> How can I avoid these overheads? I am quite new with OpenCL...
>>
>> Thank you!
>
> Sorry, I forgot the log for the twice-computed norm:
>
>   Computing norm_1 on GPU
>   Computing norm_1 on GPU 347
>
>   Computing norm_1 on GPU (again)
>   Computing norm_1 on GPU 2
>
>   Computing norm_1 on CPU
>   Computing norm_1 on CPU 0
>
> As you can see, yes, the OpenCL compilation takes a lot of time, but the
> GPU time is still higher than the CPU's (with cache and all, I know).
From: Sensei <sen...@gm...> - 2017-02-18 15:02:08

> The GPU is an Intel HD Graphics 4000.
>
> What worries me is not only the warmup, but the copy phase. In release
> it can take a lot of time for a simple 1M items (169 seconds!). I
> suspect I should try to do everything on the GPU. Not only that, but
> this call:
>
>   viennacl::vector<float> gpuv;
>   gpuv.resize(size);
>
> is taking 6 times what it takes on the CPU.
>
> How can I avoid these overheads? I am quite new with OpenCL...
>
> Thank you!

Sorry, I forgot the log for the twice-computed norm:

  Computing norm_1 on GPU
  Computing norm_1 on GPU 347

  Computing norm_1 on GPU (again)
  Computing norm_1 on GPU 2

  Computing norm_1 on CPU
  Computing norm_1 on CPU 0

As you can see, yes, the OpenCL compilation takes a lot of time, but the
GPU time is still higher than the CPU's (with cache and all, I know).
From: Sensei <sen...@gm...> - 2017-02-18 14:42:46

> On Feb 17, 2017, at 8:05pm, Karl Rupp <ru...@iu...> wrote:
>
> Hi,
>
> I suspect that your timings include kernel compilation times. Please
> have a 'warmup' call of norm_1() outside your timing region.
>
> Which GPU do you have on your MacBook Pro?
>
> Best regards,
> Karli

Thanks Karl & Oswin,

The GPU is an Intel HD Graphics 4000.

What worries me is not only the warmup, but the copy phase. In release it
can take a lot of time for a simple 1M items (169 seconds!). I suspect I
should try to do everything on the GPU. Not only that, but this call:

  viennacl::vector<float> gpuv;
  gpuv.resize(size);

is taking 6 times what it takes on the CPU.

How can I avoid these overheads? I am quite new with OpenCL...

Thank you!
From: Karl R. <ru...@iu...> - 2017-02-17 19:22:32

Hi,

I suspect that your timings include kernel compilation times. Please have
a 'warmup' call of norm_1() outside your timing region.

Which GPU do you have on your MacBook Pro?

Best regards,
Karli

On 02/17/2017 05:27 PM, Sensei wrote:
> Hi!
>
> I am new to the OpenCL/GPU world, and I probably expected too much from
> it. I am computing the norm_1 of a vector, on the CPU and GPU, and I had
> these results:
>
>   Platform Apple
>   Version OpenCL 1.2 (Jan 4 2017 22:35:59)
>
>   > Device type CPU
>   Version OpenCL 1.2 (Jan 4 2017 22:35:59)
>   > Device type GPU
>   Version OpenCL 1.2 (Jan 4 2017 22:35:59)
>
>   STARTING, TIMES ARE IN MILLISECONDS
>
>   Reserving CPU vector
>   Reserving CPU vector 16
>
>   Filling CPU vector
>   Filling CPU vector 9
>
>   Reserving GPU vector
>   Reserving GPU vector 82
>
>   Copying to GPU
>   Copying to GPU 158310
>
>   Computing norm_1 on GPU
>   Computing norm_1 on GPU 333
>
>   Computing norm_1 on CPU
>   Computing norm_1 on CPU 8
>
>   GPU: 5e+11 CPU: 5.00000e+11
>   Program ended with exit code: 0
>
> As you can see, the GPU times are waaaaay higher than the CPU ones. My
> code is really simple, and I am following the recommended conduct to
> build values on the CPU and then copy them.
>
> Is this bad performance due to my platform? I am running on a MacBook
> Pro now. My code follows.
>
> Thanks!
>
> [... quoted code snipped; the full program appears in the original
> message below ...]
From: Oswin K. <Osw...@ru...> - 2017-02-17 19:19:47

Hi!

Compiling the kernels for OpenCL takes time. Try to measure the second
time you compute something.

On 2017-02-17 17:27, Sensei wrote:
> Hi!
>
> I am new to the OpenCL/GPU world, and I probably expected too much from
> it. I am computing the norm_1 of a vector, on the CPU and GPU, and I had
> these results:
>
> [... quoted timing log and code snipped; the full program appears in
> the original message below ...]
From: Sensei <sen...@gm...> - 2017-02-17 16:27:17

Hi!

I am new to the OpenCL/GPU world, and I probably expected too much from
it. I am computing the norm_1 of a vector, on the CPU and GPU, and I had
these results:

  Platform Apple
  Version OpenCL 1.2 (Jan 4 2017 22:35:59)

  > Device type CPU
  Version OpenCL 1.2 (Jan 4 2017 22:35:59)
  > Device type GPU
  Version OpenCL 1.2 (Jan 4 2017 22:35:59)

  STARTING, TIMES ARE IN MILLISECONDS

  Reserving CPU vector
  Reserving CPU vector 16

  Filling CPU vector
  Filling CPU vector 9

  Reserving GPU vector
  Reserving GPU vector 82

  Copying to GPU
  Copying to GPU 158310

  Computing norm_1 on GPU
  Computing norm_1 on GPU 333

  Computing norm_1 on CPU
  Computing norm_1 on CPU 8

  GPU: 5e+11 CPU: 5.00000e+11
  Program ended with exit code: 0

As you can see, the GPU times are waaaaay higher than the CPU ones. My
code is really simple, and I am following the recommended conduct to build
values on the CPU and then copy them.

Is this bad performance due to my platform? I am running on a MacBook Pro
now. My code follows.

Thanks!

#include <iostream>
#include <vector>
#include <algorithm>
#include <cstdlib>
#include <numeric>
#include <chrono>

#define CL_USE_DEPRECATED_OPENCL_1_1_APIS
#define __CL_ENABLE_EXCEPTIONS

#define VIENNACL_WITH_OPENCL

#include "cl.hpp"
#include "viennacl/scalar.hpp"
#include "viennacl/vector.hpp"
#include "viennacl/ocl/backend.hpp"
#include "viennacl/linalg/norm_1.hpp"

int main(int argc, const char * argv[])
{
    // This is what vienna sees
    auto viennaplatforms = viennacl::ocl::get_platforms();
    auto viennadevices = viennacl::ocl::platform().devices();

    // See what standard OpenCL sees
    std::vector<cl::Platform> platforms;

    // Get platform
    cl::Platform::get(&platforms);

    // Temp
    std::string s;

    // Where the GPU lies
    cl::Device gpudevice;

    // Found a GPU
    bool gpufound = false;

    // See if we have a GPU
    for (auto p : platforms)
    {
        s.clear();
        p.getInfo(CL_PLATFORM_NAME, &s);
        std::cout << "Platform " << s << std::endl;

        s.clear();
        p.getInfo(CL_PLATFORM_VERSION, &s);
        std::cout << "Version " << s << std::endl;

        std::cout << std::endl;

        std::vector<cl::Device> devices;
        p.getDevices(CL_DEVICE_TYPE_ALL, &devices);

        for (auto d : devices)
        {
            std::size_t i = 4;
            d.getInfo(CL_DEVICE_TYPE, &i);

            std::cout << "> Device type " <<
                (i & CL_DEVICE_TYPE_CPU ? "CPU" : "") <<
                (i & CL_DEVICE_TYPE_GPU ? "GPU" : "") <<
                (i & CL_DEVICE_TYPE_ACCELERATOR ? "ACCELERATOR" : "") <<
                std::endl;

            if (i & CL_DEVICE_TYPE_GPU)
            {
                gpudevice = d;
                gpufound = true;
            }

            std::cout << "Version " << s << std::endl;
        }
    }

    if (!gpufound)
    {
        std::cout << "NO GPU FOUND. ABORTING." << std::endl;
        return 1;
    }

    // Size
    int size = 1 * 1000 * 1000;

    // Measuring time
    auto start = std::chrono::steady_clock::now();

    std::cout << std::endl << "STARTING, TIMES ARE IN MILLISECONDS"
              << std::endl << std::endl;

    std::cout << "Reserving CPU vector " << std::endl;
    start = std::chrono::steady_clock::now();
    std::vector<double> cpuv;
    cpuv.resize(size);
    std::cout << "Reserving CPU vector "
              << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start).count()
              << std::endl << std::endl;

    std::cout << "Filling CPU vector " << std::endl;
    start = std::chrono::steady_clock::now();
    std::iota(cpuv.begin(), cpuv.end(), 1.0);
    std::cout << "Filling CPU vector "
              << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start).count()
              << std::endl << std::endl;

    std::cout << "Reserving GPU vector " << std::endl;
    start = std::chrono::steady_clock::now();
    viennacl::vector<float> gpuv;
    gpuv.resize(size);
    std::cout << "Reserving GPU vector "
              << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start).count()
              << std::endl << std::endl;

    std::cout << "Copying to GPU " << std::endl;
    start = std::chrono::steady_clock::now();
    std::copy(cpuv.begin(), cpuv.end(), gpuv.begin());
    std::cout << "Copying to GPU "
              << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start).count()
              << std::endl << std::endl;

    std::cout << "Computing norm_1 on GPU " << std::endl;
    start = std::chrono::steady_clock::now();
    double gpunorm1 = viennacl::linalg::norm_1(gpuv);
    std::cout << "Computing norm_1 on GPU "
              << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start).count()
              << std::endl << std::endl;

    std::cout << "Computing norm_1 on CPU " << std::endl;
    start = std::chrono::steady_clock::now();
    double cpunorm1 = std::accumulate(cpuv.begin(), cpuv.end(), 0.0,
                                      [](double a, double b) { return a + std::abs(b); });
    std::cout << "Computing norm_1 on CPU "
              << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start).count()
              << std::endl << std::endl;

    std::cout << "GPU: " << gpunorm1 << " CPU: " << cpunorm1 << std::endl;

    return 0;
}
From: Charles D. <cde...@gm...> - 2016-12-15 14:03:00
|
Hi Karl, here is the kernel below. Regarding your second point, I would love to process all columns in one kernel, but I want to avoid initializing another entire matrix of the same size. To avoid this I am trying to only initialize a vector of size = number of rows, which can then be assigned to the source matrix. Ideally I would like to do the reordering 'inplace' so I can just pass the row indexes I want them to be in.

__kernel void set_row_order(
    __global const double *A,
    __global double *B,
    __global const int *indices,
    const int Mdim,
    const int globalCol,
    const int MdimPad)
{
    // Get the index of the elements to be processed
    const int globalRow = get_global_id(0); // C Row ID
    //const int globalCol = get_global_id(1); // C Col ID

    // Do the operation
    if((globalRow <= Mdim)){
        B[globalRow] = A[indices[globalRow] * MdimPad + globalCol];
    }
}

On Thu, Dec 15, 2016 at 5:18 AM, Karl Rupp <ru...@iu...> wrote:
> Hi Charles,
>
> can you please send us the kernel? Maybe there's something wrong with the
> thread assignment there.
>
> Also, rather than looping from 0 to P-1, it would make much more sense to
> process all columns in parallel in a single kernel.
>
> Best regards,
> Karli
>
> On 12/14/2016 06:01 PM, Charles Determan wrote:
>> A quick addition, it also only seems to crash when the number of rows in
>> the input matrix match or exceed 1000 (i.e. it works with the trivial
>> example with 100 rows).
>>
>> Charles
>>
>> [snip: earlier message with the full function source, quoted in full later in this archive]
|
From: Karl R. <ru...@iu...> - 2016-12-15 11:18:36
|
Hi Charles,

can you please send us the kernel? Maybe there's something wrong with the thread assignment there.

Also, rather than looping from 0 to P-1, it would make much more sense to process all columns in parallel in a single kernel.

Best regards,
Karli

On 12/14/2016 06:01 PM, Charles Determan wrote:
> A quick addition, it also only seems to crash when the number of rows in
> the input matrix match or exceed 1000 (i.e. it works with the trivial
> example with 100 rows).
>
> Charles
>
> On Wed, Dec 14, 2016 at 10:55 AM, Charles Determan
> <cde...@gm... <mailto:cde...@gm...>> wrote:
>
>     I have a function where I use a custom opencl kernel. The function
>     is below. The function runs without problem and provides the
>     correct result after the *first time* I call it. However, if I try
>     to call the function again it crashes right after the 'initialized'
>     output where it is trying to add the kernel program. Any idea why
>     it would be crashing here on subsequent calls? Is there some
>     cleanup I should be doing at the end of this function?
>
>     Thanks,
>     Charles
>
>     [snip: full function source, quoted in full in the original message later in this archive]
|
From: Charles D. <cde...@gm...> - 2016-12-14 17:02:04
|
A quick addition, it also only seems to crash when the number of rows in the input matrix match or exceed 1000 (i.e. it works with the trivial example with 100 rows).

Charles

On Wed, Dec 14, 2016 at 10:55 AM, Charles Determan <cde...@gm...> wrote:
> I have a function where I use a custom opencl kernel. The function is
> below. The function runs without problem and provides the correct result
> after the *first time* I call it. However, if I try to call the function
> again it crashes right after the 'initialized' output where it is trying to
> add the kernel program. Any idea why it would be crashing here on
> subsequent calls? Is there some cleanup I should be doing at the end of
> this function?
>
> Thanks,
> Charles
>
> [snip: full function source, quoted in full in the original message later in this archive]
|
From: Charles D. <cde...@gm...> - 2016-12-14 16:56:06
|
I have a function where I use a custom opencl kernel. The function is below. The function runs without problem and provides the correct result after the *first time* I call it. However, if I try to call the function again it crashes right after the 'initialized' output where it is trying to add the kernel program. Any idea why it would be crashing here on subsequent calls? Is there some cleanup I should be doing at the end of this function?

Thanks,
Charles

template<typename T>
void
cpp_vclMatrix_set_row_order(
    SEXP ptrA_,
    const bool AisVCL,
    Eigen::VectorXi indices,
    SEXP sourceCode_,
    const int max_local_size,
    const int ctx_id)
{
    std::cout << "called" << std::endl;

    std::string my_kernel = as<std::string>(sourceCode_);

    viennacl::ocl::context ctx(viennacl::ocl::get_context(ctx_id));

    viennacl::matrix<T> *vcl_A;
    // viennacl::matrix<T> *vcl_B;

    std::cout << "getting matrix" << std::endl;
    vcl_A = getVCLptr<T>(ptrA_, AisVCL, ctx_id);
    // vcl_B = getVCLptr<T>(ptrB_, BisVCL, ctx_id);

    unsigned int M = vcl_A->size1();
    // int N = vcl_B.size1();
    unsigned int P = vcl_A->size2();
    unsigned int M_internal = vcl_A->internal_size1();
    unsigned int P_internal = vcl_A->internal_size2();

    std::cout << "initialized" << std::endl;

    // add kernel to program
    viennacl::ocl::program & my_prog = ctx.add_program(my_kernel, "my_kernel");

    std::cout << "program added" << std::endl;

    // get compiled kernel function
    viennacl::ocl::kernel & set_row_order = my_prog.get_kernel("set_row_order");

    std::cout << "got kernel" << std::endl;

    // set global work sizes
    set_row_order.global_work_size(0, M_internal);
    set_row_order.global_work_size(1, P_internal);

    std::cout << "set global" << std::endl;

    // set local work sizes
    set_row_order.local_work_size(0, max_local_size);
    set_row_order.local_work_size(1, max_local_size);

    std::cout << "begin enqueue" << std::endl;

    {
        std::cout << "moving indexes" << std::endl;
        viennacl::vector<int> vcl_I(indices.size());
        viennacl::copy(indices, vcl_I);

        std::cout << "creating dummy vector" << std::endl;
        viennacl::vector<T> vcl_V = viennacl::zero_vector<T>(M);

        viennacl::matrix_base<T> vcl_B(vcl_V.handle(),
                                       M, 0, 1, M,   // row layout
                                       1, 0, 1, 1,   // column layout
                                       true);        // row-major

        viennacl::range r(0, M);

        for(unsigned int i=0; i < P; i++){
            viennacl::range c(i, i+1);

            viennacl::matrix_range<viennacl::matrix<T> > tmp(*vcl_A, r, c);

            // std::cout << tmp << std::endl;

            viennacl::ocl::enqueue(set_row_order(tmp, vcl_B, vcl_I, M, i, M_internal));

            tmp = vcl_B;
        }
    }
}
|
From: Philippe T. <phi...@gm...> - 2016-11-29 22:07:27
|
Ah, sorry about that. I thought it was still only necessary for GEMM (i.e., my fault :p). In my experience, padding of 4 along the leading dimension can be pretty useful. But the 128x128 padding required by the GEMM kernel was a big design mistake of mine, hence my desire to see this restriction annihilated.

On Tue, Nov 29, 2016 at 5:03 PM, Karl Rupp <ru...@iu...> wrote:
> Hi Philippe,
>
>> I know I mentioned this to you already Karl, but ViennaCL could really
>> benefit from using the GEMM code in Isaac, which has higher performance
>> and doesn't require padding:
>> https://github.com/ptillet/isaac/blob/master/lib/jit/generation/gemm.cpp
>> (benchmarks on the main project page).
>
> Sure - yet the world does not only consist of GEMM ;-) But yes, the
> padding was initially introduced because of GEMM. Yet it would not be wise
> to give up on padding completely. There are a bunch of cases where this is
> useful.
>
> Best regards,
> Karli
>
>> [snip: earlier messages quoted in full later in this archive]
|
From: Karl R. <ru...@iu...> - 2016-11-29 22:03:15
|
Hi Philippe,

> I know I mentioned this to you already Karl, but ViennaCL could really
> benefit from using the GEMM code in Isaac, which has higher performance
> and doesn't require padding:
> https://github.com/ptillet/isaac/blob/master/lib/jit/generation/gemm.cpp
> (benchmarks on the main project page).

Sure - yet the world does not only consist of GEMM ;-) But yes, the padding was initially introduced because of GEMM. Yet it would not be wise to give up on padding completely. There are a bunch of cases where this is useful.

Best regards,
Karli

> On Tue, Nov 29, 2016 at 4:33 PM, Karl Rupp <ru...@iu...> wrote:
> [snip: earlier messages quoted in full later in this archive]
|
From: Philippe T. <phi...@gm...> - 2016-11-29 21:39:34
|
Hi,

I know I mentioned this to you already Karl, but ViennaCL could really benefit from using the GEMM code in Isaac, which has higher performance and doesn't require padding: https://github.com/ptillet/isaac/blob/master/lib/jit/generation/gemm.cpp (benchmarks on the main project page).

Philippe

On Tue, Nov 29, 2016 at 4:33 PM, Karl Rupp <ru...@iu...> wrote:
> Hi,
>
>> I have been looking through the documentation and I can't find any
>> direction on how to create a matrix without the default padding. This
>> is proving to be a memory problem for me when working with very
>> 'narrow' matrices that are also quite long filling up the available
>> memory. If it is in the docs feel free to point it out but I have been
>> stumped at the moment.
>
> unfortunately there is no way to create a viennacl::matrix without
> padding, unless you provide your own host array, CUDA buffer, or OpenCL
> memory handle. There is not even a way of doing it through matrix_base.
> This is an oversight and will be fixed asap. Allow for one more day :-)
>
> Best regards,
> Karli
|
From: Karl R. <ru...@iu...> - 2016-11-29 21:33:43
|
Hi,

> I have been looking through the documentation and I can't find any
> direction on how to create a matrix without the default padding. This
> is proving to be a memory problem for me when working with very
> 'narrow' matrices that are also quite long filling up the available
> memory. If it is in the docs feel free to point it out but I have been
> stumped at the moment.

unfortunately there is no way to create a viennacl::matrix without padding, unless you provide your own host array, CUDA buffer, or OpenCL memory handle. There is not even a way of doing it through matrix_base. This is an oversight and will be fixed asap. Allow for one more day :-)

Best regards,
Karli
|
From: Charles D. <cde...@gm...> - 2016-11-28 18:06:15
|
I have been looking through the documentation and I can't find any direction on how to create a matrix without the default padding. This is proving to be a memory problem for me when working with very 'narrow' matrices that are also quite long, filling up the available memory. If it is in the docs feel free to point it out, but I have been stumped at the moment.

Thanks,
Charles
|
From: Charles D. <cde...@gm...> - 2016-11-28 16:04:48
|
That works perfectly, thanks Karl.

Regards,
Charles

On Wed, Nov 23, 2016 at 2:53 PM, Karl Rupp <ru...@iu...> wrote:
> Hi Charles,
>
>> Right now, if I want to take the negative of every element in a matrix I
>> end up doing the following:
>>
>> // previously assigned
>> viennacl::matrix<T> vcl_A;
>>
>> // matrix of zeros to subtract from
>> viennacl::matrix<T> vcl_Z =
>>     viennacl::zero_matrix<T>(vcl_A.size1(), vcl_A.size2());
>>
>> // subtract in-place
>> vcl_Z -= vcl_A;
>> vcl_A = vcl_Z;
>>
>> Is there a more efficient way to approach this? Allocating an
>> additional entire matrix is proving quite wasteful in some of my
>> benchmarks.
>
> What about just
>
> vcl_A = T(-1) * vcl_A;
>
> ? This is 'inplace' as requested :-)
>
> Best regards,
> Karli
|