From: Karl R. <ru...@iu...> - 2017-07-24 18:22:36

Hi,

yes, it should work for both single and double precision. Did you encounter errors?

Best regards,
Karli

On 07/24/2017 12:38 PM, Qusai Al Shidi wrote:
> Does element_fabs() work for floats and doubles?
>
> Thank you,
> --
> Qusai Al Shidi
> ( http://qalshidi.net )
>
> _______________________________________________
> ViennaCL-support mailing list
> Vie...@li...
> https://lists.sourceforge.net/lists/listinfo/viennacl-support
From: Qusai Al S. <aqu...@gm...> - 2017-07-24 17:38:23

Does element_fabs() work for floats and doubles?

Thank you,
--
Qusai Al Shidi
( http://qalshidi.net )
From: Qusai Al S. <Qus...@st...> - 2017-07-24 17:35:03

Does element_fabs work for both floats and doubles?

Thanks,
--
Qusai Al Shidi
( http://qalshidi.net )
From: Charles D. <cde...@gm...> - 2017-06-27 15:19:20

Hi Maciej,

The memory management is still something that I am trying to improve. If you use a GPU monitoring program you can watch the GPU usage fill up. You are likely seeing crashes at 100 and 1000 after the crash at 2000 because the previous data is likely still on the GPU. If you call gc() that should clear up the GPU and you can run your smaller matrices again.

You need to keep in mind a few things here:

1. Your GPU isn't very large (~2 GB from my calculations).
2. You are on Windows, which is usually already leveraging whatever card you have installed.
3. You are using 'double' precision by default.
4. Matrices are padded for performance reasons (so they are larger than they appear).
5. The RAM cleanup is only completed when R calls garbage collection or you do it manually (i.e. gc()).

I believe this should help you for now. I assume you are using the current CRAN version of gpuR. I have been working a lot on the next version, where I am leveraging smart pointers in C++ to hopefully manage the memory persistence better. You can always try it out from my github with

    devtools::install_github('cdeterman/gpuR', ref = 'develop')

Hope this helps,
Charles

On Tue, Jun 27, 2017 at 2:40 AM, Karl Rupp <ru...@iu...> wrote:
> Hi Maciej,
>
> does the problem also show up for smaller values of k? 2000 should be
> alright, but I've already seen cases where a video in the background
> caused such kind of problems (because GPU-RAM was almost exhausted).
>
> I also CC: Charles Determan, who is the author of gpuR.
>
> Best regards,
> Karli
From: Maciej J. <mj...@gm...> - 2017-06-27 08:12:30

[image: Inline image 1]

MJ

On Tue, Jun 27, 2017 at 10:08 AM, Maciej Janiec <mj...@gm...> wrote:
> Now it worked for k = 100 & 1000.
>
> After I've increased k to 2000 it crashed.
>
> It is crashing on subsequent runs at 100 or 1000 now.
>
> I've restarted the system. Nothing heavy runs in the background.
>
> MJ
From: Maciej J. <mj...@gm...> - 2017-06-27 08:08:56

Now it worked for k = 100 & 1000.

After I've increased k to 2000 it crashed.

It is crashing on subsequent runs at 100 or 1000 now.

I've restarted the system. Nothing heavy runs in the background.

MJ

On Tue, Jun 27, 2017 at 9:40 AM, Karl Rupp <ru...@iu...> wrote:
> Hi Maciej,
>
> does the problem also show up for smaller values of k? 2000 should be
> alright, but I've already seen cases where a video in the background
> caused such kind of problems (because GPU-RAM was almost exhausted).
>
> I also CC: Charles Determan, who is the author of gpuR.
>
> Best regards,
> Karli
From: Karl R. <ru...@iu...> - 2017-06-27 08:01:35

Hi,

no, these operations are not available yet.

Best regards,
Karli

On 06/27/2017 09:21 AM, 이정우 wrote:
> Hello.
>
> I'm Jungwoo Lee from South Korea and I'm interested in parallel
> computing for sparse matrices.
>
> I wonder: is there a scalar operation or a binary operation for
> compressed matrices? E.g. elementwise scalar addition, scalar
> multiplication, addition of two sparse matrices.
>
> Sincerely,
> Junwoo Lee.
From: Karl R. <ru...@iu...> - 2017-06-27 08:01:33

Hi Maciej,

does the problem also show up for smaller values of k? 2000 should be alright, but I've already seen cases where a video in the background caused this kind of problem (because GPU-RAM was almost exhausted).

I also CC: Charles Determan, who is the author of gpuR.

Best regards,
Karli

On 06/27/2017 09:28 AM, Maciej Janiec wrote:
> I was able to use the gpuR package just once. After the first time, it
> is crashing every time.
>
> System: Windows 10
> GPU: GeForce GT 730
>
> gpuMatrix is created, but the code crashed at gpuA %*% gpuA.
>
> ViennaCL: FATAL ERROR: CL_MEM_OBJECT_ALLOCATION_FAILURE
> ViennaCL could not allocate memory on the device. Most likely the
> device simply ran out of memory.
>
> MJ
From: Maciej J. <mj...@gm...> - 2017-06-27 07:29:11

I was able to use the gpuR package just once. After the first time, it is crashing every time.

System: Windows 10
GPU: GeForce GT 730

gpuMatrix is created, but the code crashed at gpuA %*% gpuA.

This works:

    > k <- 2000
    >
    > system.time( {
    +   gpuA <- gpuMatrix(rnorm(k^2), nrow=k, ncol=k)
    +   # gpuB <- gpuA %*% gpuA
    + } )
       user  system elapsed
       0.39    0.05    0.44

This crashes:

    > k <- 2000
    >
    > system.time( {
    +   gpuA <- gpuMatrix(rnorm(k^2), nrow=k, ncol=k)
    +   gpuB <- gpuA %*% gpuA
    + } )
    ViennaCL: FATAL ERROR: Kernel start failed for 'assign_cpu'.
    ViennaCL: Smaller work sizes could not solve the problem.
    Error in cpp_gpuMatrix_gemm(A@address, B@address, C@address, 8L) :
      ViennaCL: FATAL ERROR: CL_MEM_OBJECT_ALLOCATION_FAILURE
      ViennaCL could not allocate memory on the device. Most likely the
      device simply ran out of memory.
      If you think that this is a bug in ViennaCL, please report it at
      vie...@li... and supply at least the following information:
       * Operating System
       * Which OpenCL implementation (AMD, NVIDIA, etc.)
       * ViennaCL version
      Many thanks in advance!
    Timing stopped at: 0.42 0.07 0.5

System stats:

    > gpuInfo()
    $deviceName
    [1] "GeForce GT 730"

    $deviceVendor
    [1] "NVIDIA Corporation"

    $numberOfCores
    [1] 2

    $maxWorkGroupSize
    [1] 1024

    $maxWorkItemDim
    [1] 3

    $maxWorkItemSizes
    [1] 1024 1024   64

    $deviceMemory
    [1] 2147483648

    $clockFreq
    [1] 1400

    $localMem
    [1] 49152

    $maxAllocatableMem
    [1] 536870912

    $available
    [1] "yes"

    $deviceExtensions
     [1] "cl_khr_global_int32_base_atomics"    "cl_khr_global_int32_extended_atomics" "cl_khr_local_int32_base_atomics"
     [4] "cl_khr_local_int32_extended_atomics" "cl_khr_fp64"                          "cl_khr_byte_addressable_store"
     [7] "cl_khr_icd"                          "cl_khr_gl_sharing"                    "cl_nv_compiler_options"
    [10] "cl_nv_device_attribute_query"        "cl_nv_pragma_unroll"                  "cl_nv_d3d10_sharing"
    [13] "cl_khr_d3d10_sharing"                "cl_nv_d3d11_sharing"                  "cl_nv_copy_opts"

    $double_support
    [1] TRUE

    > detectPlatforms()
    [1] 1
    > detectGPUs()
    [1] 1

MJ
From: 이정우 <muo...@gm...> - 2017-06-27 07:21:46

Hello.

I'm Jungwoo Lee from South Korea and I'm interested in parallel computing for sparse matrices.

I wonder: is there a scalar operation or a binary operation for compressed matrices? E.g. elementwise scalar addition, scalar multiplication, addition of two sparse matrices.

Sincerely,
Junwoo Lee.
From: Chris M. <chr...@us...> - 2017-04-27 21:13:06

Hi Karl,

This looks great! Thank you very much for this effort. I will attempt to implement around this next week.

Cheers,
Chris

On 25 April 2017 at 09:25, Karl Rupp <ru...@iu...> wrote:
> Hi Chris,
>
> the copy-CTOR for compressed_matrix is now implemented:
> https://github.com/viennacl/viennacl-dev/commit/0d62d8e0fb9a3eefc37aa225b5eb7195256181c9
>
> You should get the desired behavior of just updating numerical values on
> the GPU with code similar to the following:
>
>     viennacl::context host_ctx(viennacl::MAIN_MEMORY);
>     viennacl::compressed_matrix<T> A(N, N, host_ctx);  // your 'host matrix'
>     /* fill A here */
>
>     viennacl::compressed_matrix<T> B(A);               // create copy of A
>     viennacl::context gpu_ctx(viennacl::CUDA_MEMORY);
>     B.switch_memory_context(gpu_ctx);                  // migrate B to CUDA memory
>
>     // write to B, starting at offset 0, copy 'nnz' elements,
>     // using host data from the nonzero floating point values of A
>     viennacl::backend::memory_write(B.handle(), 0, sizeof(T) * A.nnz(),
>                                     A.handle().ram_handle().get());
>
> Just repeat the last line every time you need to update the numerical
> values on the GPU.
>
> Please let me know how this turns out.
>
> Best regards,
> Karli
From: Karl R. <ru...@iu...> - 2017-04-25 15:26:09

Hi Chris,

the copy-CTOR for compressed_matrix is now implemented:
https://github.com/viennacl/viennacl-dev/commit/0d62d8e0fb9a3eefc37aa225b5eb7195256181c9

You should get the desired behavior of just updating numerical values on the GPU with code similar to the following:

    viennacl::context host_ctx(viennacl::MAIN_MEMORY);
    viennacl::compressed_matrix<T> A(N, N, host_ctx);  // your 'host matrix'
    /* fill A here */

    viennacl::compressed_matrix<T> B(A);               // create copy of A
    viennacl::context gpu_ctx(viennacl::CUDA_MEMORY);
    B.switch_memory_context(gpu_ctx);                  // migrate B to CUDA memory

    // write to B, starting at offset 0, copy 'nnz' elements,
    // using host data from the nonzero floating point values of A
    viennacl::backend::memory_write(B.handle(), 0, sizeof(T) * A.nnz(),
                                    A.handle().ram_handle().get());

Just repeat the last line every time you need to update the numerical values on the GPU.

Please let me know how this turns out.

Best regards,
Karli

On 04/21/2017 09:06 PM, Chris Marsh wrote:
> Karl,
>
> No problem, the copy-constructor sounds like a perfect solution. Thanks
> for doing this.
>
> The sparse matrix is approx 10^10 with about 1 million total non-zero
> elements.
>
> I should have been more clear, sorry. The 2.5 min includes a bunch of
> other routines that are being run for the timestep, so it is more than
> just the matrix solve. However, that 12 s is entirely attributable to
> the difference between STL and the copy and the operator() access.
> Also, running on a single laptop core instead of a cluster like it
> should be!
>
> Cheers,
> Chris
From: Chris M. <chr...@us...> - 2017-04-21 19:07:03

Karl,

No problem, the copy-constructor sounds like a perfect solution. Thanks for doing this.

> How big is your system?

The sparse matrix is approx 10^10 with about 1 million total non-zero elements.

> 2.5 min for 5 time steps sounds a lot to me.

I should have been more clear, sorry. The 2.5 min includes a bunch of other routines that are being run for the timestep, so it is more than just the matrix solve. However, that 12 s is entirely attributable to the difference between STL and the copy and the operator() access. Also, running on a single laptop core instead of a cluster like it should be!

> However, one still has to compare against the available column indices

Makes sense. In my case, I think I can just say I need the 3rd or 4th non-zero row item, as I "know" where things are. But that's a non-generic case.

Cheers,
Chris

On 21 April 2017 at 04:34, Karl Rupp <ru...@iu...> wrote:
> Hi Chris,
>
> please apologize my late reply.
From: Karl R. <ru...@iu...> - 2017-04-21 10:35:04
|
Hi Chris, please apologize my late reply. > This is a local search operation > > > Oh, that isn't at all what I expected. I assumed with the row, col > offset it could just index the CSR array directly? when you call operator(), you pass the row and column index. The row index jumps at the beginning of nonzeros for that row in the CSR array. However, one still has to compare against the available column indices to finally pick the correct entry (or create a new one...). Only for dense matrices you can locate the respective entry in the matrix directly. > By how much does your code slow down? > > > The "optimization"? Over 5 time steps or so it was 12 s slower, out > of a total of 2.5min or so. So enough that when I run it for 15000 > time steps it adds up! So it's 10 percent. How big is your system? 2.5min for 5 time steps sounds a lot to me. > Also, do you fill the CSR matrix by increasing row index, or is > your code filling rows at random? > > > I'm filling the CSR via operator(), and that is by increasing row > index. Ok, this should be acceptable in terms of performance. > However, when it is run in parallel with openmp, it will > effectively be random. In parallel you should really fill the CSR array directly (possibly with the exception of the first time step, where you build the sparsity pattern) > What are you trying to accomplish? > > > With a OpenMP backend, I want to avoid the copy from STL -> > compressed_matrix. So my idea is to pre-allocate A, a > compressed_matrix on the host, regardless of what backend I'm using > (instead of the STL variant). Then I want to either solve directly > using A, or I want to copy A to a GPU and solve it on the GPU if > configured. For the former, this is currently working well, barring > the operator() issues we are discussing above. The problem arises > with the 2nd case. I could do the context change, but once it's been > copied to the GPU I have to copy it *back* to take advantage of the > pre-allocated matrix. 
That is, I'd like to avoid any additional > memory allocations. I would like to just copy(A,gpu_A) when gpu is > available. However, there is no copy for compressed_matrix to > compressed_matrix. Thanks, that helps me with understanding the setting better. Let me add a copy-constructor for compressed_matrix for you, so you can avoid the unnecessary copy back to the host. Copying the numerical entries for a fixed sparsity pattern can be done efficiently; I'll send you a code snippet when I'm done with the copy-constructor. Best regards, Karli |
From: Chris M. <chr...@us...> - 2017-04-18 16:06:27
|
Hi Karl, I was wondering if you had any thoughts on how I should proceed with the copy? Cheers Chris Lewis On Wed, 12 Apr 2017 at 18:46 Chris Marsh <chr...@us...> wrote: Hi Karl, > > This is a local search operation > > > Oh, that isn't at all what I expected. I assumed with the row, col offset > it could just index the CSR array directly? > > By how much does your code slow down? > > > The "optimization"? Over 5 time steps or so it was 12 s slower, out of a > total of 2.5min or so. So enough that when I run it for 15000 time steps it > adds up! > > Also, do you fill the CSR matrix by increasing row index, or is your code >> filling rows at random? > > > I'm filling the CSR via operator(), and that is by increasing row index. > However, when it is run in parallel with openmp, it will effectively be > random. > > What are you trying to accomplish? > > > With a OpenMP backend, I want to avoid the copy from STL -> > compressed_matrix. So my idea is to pre-allocate A, a compressed_matrix on > the host, regardless of what backend I'm using (instead of the STL > variant). Then I want to either solve directly using A, or I want to copy A > to a GPU and solve it on the GPU if configured. For the former, this is > currently working well, barring the operator() issues we are discussing > above. The problem arises with the 2nd case. I could do the context > change, but once it's been copied to the GPU I have to copy it *back* to > take advantage of the pre-allocated matrix. That is, I'd like to avoid any > additional memory allocations. I would like to just copy(A,gpu_A) when gpu > is available. However, there is no copy for compressed_matrix to > comprssed_matrix. > > Cheers > Chris > > On 12 April 2017 at 04:19, Karl Rupp <ru...@iu...> wrote: > >> Hi Chris, >> >> >> I'm an earth scientist so the way this code works is I have many time >>> steps (e.g., 1hr) of observation data (e.g., wind) that I use for >>> solving, amongst many things, a transport equation. 
You can imagine it >>> as a tight loop over the observations where the inside (build the FVM) >>> is done many, many times. Therefore the copy from STL to >>> compressed_matrix is showing up in my profiling (using Intel VTune). The >>> lack of performance increase is in the construction of the linear >>> system; everything else has remained constant. >>> >>> I'm surprised operator(), and by extension entry_proxy, is that much >>> slower. Where is it incurring the overhead? >>> >> >> With each call to operator(), it needs to look up the respective entry in >> the system matrix. This is a local search operation, hence takes much more >> time than 'just working on the CSR arrays directly'. At this point you >> really pay for the convenience of operator(), and I see no way of >> completely avoiding those costs. >> >> By how much does your code slow down? I see some room for optimizing the >> existing implementation for the host-based backend. Also, do you fill the >> CSR matrix by increasing row index, or is your code filling rows at random? >> >> >> >> As a point of clarification, compressed_matrix when created like >>> A(viennacl::context::context(viennacl::MAIN_MEMORY)) >>> really does exist on the host, correct? ALL of the internal code calls >>> it gpu_matrix... >>> >> >> Yes, it really creates the buffers on the host. The internal use of >> 'gpu_matrix' is a historic relic from a time when ViennaCL only supported >> OpenCL. >> >> >> Lastly, I've run into a bit of a problem. There is no copy for >>> compressed_matrix (host) -> compressed_matrix (gpu). >>> Am I missing something? >>> >> >> What are you trying to accomplish? If you just want to shift your data >> over to CUDA or OpenCL or from CUDA/OpenCL back to the host, use >> A.switch_memory_context(new_ctx). >> >> Best regards, >> Karli >> > > |
From: Chris M. <chr...@us...> - 2017-04-13 00:47:24
|
Hi Karl, This is a local search operation Oh, that isn't at all what I expected. I assumed with the row, col offset it could just index the CSR array directly? By how much does your code slow down? The "optimization"? Over 5 time steps or so it was 12 s slower, out of a total of 2.5min or so. So enough that when I run it for 15000 time steps it adds up! Also, do you fill the CSR matrix by increasing row index, or is your code > filling rows at random? I'm filling the CSR via operator(), and that is by increasing row index. However, when it is run in parallel with OpenMP, it will effectively be random. What are you trying to accomplish? With an OpenMP backend, I want to avoid the copy from STL -> compressed_matrix. So my idea is to pre-allocate A, a compressed_matrix on the host, regardless of what backend I'm using (instead of the STL variant). Then I want to either solve directly using A, or I want to copy A to a GPU and solve it on the GPU if configured. For the former, this is currently working well, barring the operator() issues we are discussing above. The problem arises with the 2nd case. I could do the context change, but once it's been copied to the GPU I have to copy it *back* to take advantage of the pre-allocated matrix. That is, I'd like to avoid any additional memory allocations. I would like to just copy(A,gpu_A) when gpu is available. However, there is no copy for compressed_matrix to compressed_matrix. Cheers Chris On 12 April 2017 at 04:19, Karl Rupp <ru...@iu...> wrote: > Hi Chris, > > > I'm an earth scientist so the way this code works is I have many time >> steps (e.g., 1hr) of observation data (e.g., wind) that I use for >> solving, amongst many things, a transport equation. You can imagine it >> as a tight loop over the observations where the inside (build the FVM) >> is done many, many times. Therefore the copy from STL to >> compressed_matrix is showing up in my profiling (using Intel VTune). 
The >> lack of performance increase is in the construction of the linear >> system; everything else has remained constant. >> >> I'm surprised operator(), and by extension entry_proxy, is that much >> slower. Where is it incurring the overhead? >> > > With each call to operator(), it needs to look up the respective entry in > the system matrix. This is a local search operation, hence takes much more > time than 'just working on the CSR arrays directly'. At this point you > really pay for the convenience of operator(), and I see no way of > completely avoiding those costs. > > By how much does your code slow down? I see some room for optimizing the > existing implementation for the host-based backend. Also, do you fill the > CSR matrix by increasing row index, or is your code filling rows at random? > > > > As a point of clarification, compressed_matrix when created like >> A(viennacl::context::context(viennacl::MAIN_MEMORY)) >> really does exist on the host, correct? ALL of the internal code calls >> it gpu_matrix... >> > > Yes, it really creates the buffers on the host. The internal use of > 'gpu_matrix' is a historic relic from a time when ViennaCL only supported > OpenCL. > > > Lastly, I've run into a bit of a problem. There is no copy for >> compressed_matrix (host) -> compressed_matrix (gpu). >> Am I missing something? >> > > What are you trying to accomplish? If you just want to shift your data > over to CUDA or OpenCL or from CUDA/OpenCL back to the host, use > A.switch_memory_context(new_ctx). > > Best regards, > Karli > |
From: Karl R. <ru...@iu...> - 2017-04-12 10:20:01
|
Hi Chris, > I'm an earth scientist so the way this code works is I have many time > steps (e.g., 1hr) of observation data (e.g., wind) that I use for > solving, amongst many things, a transport equation. You can imagine it > as a tight loop over the observations where the inside (build the FVM) > is done many, many times. Therefore the copy from STL to > compressed_matrix is showing up in my profiling (using Intel VTune). The > lack of performance increase is in the construction of the linear > system; everything else has remained constant. > > I'm surprised operator(), and by extension entry_proxy, is that much > slower. Where is it incurring the overhead? With each call to operator(), it needs to look up the respective entry in the system matrix. This is a local search operation, hence takes much more time than 'just working on the CSR arrays directly'. At this point you really pay for the convenience of operator(), and I see no way of completely avoiding those costs. By how much does your code slow down? I see some room for optimizing the existing implementation for the host-based backend. Also, do you fill the CSR matrix by increasing row index, or is your code filling rows at random? > As a point of clarification, compressed_matrix when created like > A(viennacl::context::context(viennacl::MAIN_MEMORY)) > really does exist on the host, correct? ALL of the internal code calls > it gpu_matrix... Yes, it really creates the buffers on the host. The internal use of 'gpu_matrix' is a historic relic from a time when ViennaCL only supported OpenCL. > Lastly, I've run into a bit of a problem. There is no copy for > compressed_matrix (host) -> compressed_matrix (gpu). > Am I missing something? What are you trying to accomplish? If you just want to shift your data over to CUDA or OpenCL or from CUDA/OpenCL back to the host, use A.switch_memory_context(new_ctx). Best regards, Karli |
From: Chris M. <chr...@us...> - 2017-04-11 15:57:50
|
Thanks for the detailed reply. I'm an earth scientist so the way this code works is I have many time steps (e.g., 1hr) of observation data (e.g., wind) that I use for solving, amongst many things, a transport equation. You can imagine it as a tight loop over the observations where the inside (build the FVM) is done many, many times. Therefore the copy from STL to compressed_matrix is showing up in my profiling (using Intel VTune). The lack of performance increase is in the construction of the linear system; everything else has remained constant. I'm surprised operator(), and by extension entry_proxy, is that much slower. Where is it incurring the overhead? As a point of clarification, compressed_matrix when created like A(viennacl::context::context(viennacl::MAIN_MEMORY)) really does exist on the host, correct? ALL of the internal code calls it gpu_matrix... Lastly, I've run into a bit of a problem. There is no copy for compressed_matrix (host) -> compressed_matrix (gpu). Am I missing something? Cheers Chris On 11 April 2017 at 02:29, Karl Rupp <ru...@iu...> wrote: > Hi Chris, > > > Ok, this seemed to work very well. I can then modify the element >> internal vector to zero out the matrix for my finite volume >> implementation, &c and preserve the sparsity information so-as to use >> operator() quickly. >> > > it's still best to avoid operator() if you aim for maximum performance, > but instead work on the CSR arrays directly. Chances are, however, that > more time in already spent on other parts of your finite volume > application, in which case case there's no need for further optimizing this > part. > > > The whole point of doing this was to avoid 2 sets of copies from main >> memory STL format to main memory compressed_matrix format when using >> OpenMP. >> > > Yes, that's definitely the right way to do. > > > However I'm not seeing any performance increase, and rather I am >> seeing a performance decrease! >> >> Is this to be expected? 
>> > > In which part do you see the performance decrease? If it's in the > assembly, then work on the CSR arrays directly. Or are you referring to > other parts, e.g. sparse matrix-vector products? > > Best regards, > Karli > > > > On 7 April 2017 at 10:12, Chris Marsh <chr...@us... >> <mailto:chr...@us...>> wrote: >> >> Hi, >> >> Right, it's the sparsity pattern that you have no way of knowing a >> priori during allocation. The parallel insert is then of course an >> issue without the 2 passes... >> I have to build a new A and b many, many times (during some >> timestepping) so 2 passes is probably not much faster than what I'm >> getting with copy. The sparsity pattern will stay constant. If I >> initialize the sparsity, then operator() should work, correct? And >> make my parallel code faster, i.e., not require 2 passes. >> >> Following this further: if I use a std::map< ... > sparse >> representation, and copy it to a compressed_matrix, it should set up >> the sparse structure for me. Then, I can use operator() without slow >> down, and access in parallel as the sparsity will be correctly >> setup. Reasonable approach for host only? For GPU, I obviously will >> still need to copy. But this approach, if it works, should also >> reduce code duplication..... >> >> (I'm trying to avoid learning CSR at the moment, have a time crunch!) >> >> Cheers >> Chris >> >> >> On 7 April 2017 at 00:21, Karl Rupp <ru...@iu... >> <mailto:ru...@iu...>> wrote: >> >> Hey, >> >> On 04/06/2017 11:48 PM, Chris Marsh wrote: >> >> Unless you are changing only a few entries, this is >> likely to be too >> slow. >> >> Big time :) >> >> Ok, so even though it is pre allocated for the right number >> of nnz >> values, operator() still incurs the cost? Must admit that is >> not what >> I'd have expected. >> >> >> Well, this is a sparse matrix. 
Since operator() deals with a >> single entry, there is no way this could be fast (note that CSR >> has requirements on entries from the same row being located >> consecutively in memory) >> >> >> When I obtain those CSR buffers, they will be the correct >> size, and I >> should be able to insert into them in parallel, correct? >> >> >> Yes, exactly. >> You may need to populate the matrix in two passes: The first >> determines the sparsity pattern, the second writes the actual >> numerical values. >> >> Best regards, >> Karli >> >> >> >> On 6 April 2017 at 13:13, Karl Rupp <ru...@iu... >> <mailto:ru...@iu...> >> <mailto:ru...@iu... >> >> <mailto:ru...@iu...>>> wrote: >> >> Hi! >> >> >> >> On 04/06/2017 06:44 PM, Chris Marsh wrote: >> >> Hi, >> >> I know the number of non-zero entries for a sparse >> matrix so I >> am trying >> to pre-allocate it with >> >> viennacl::compressed_matrix<vcl_scalar_type> >> vl_C(row, col, nnz); >> >> >> At this point your matrix is still empty (i.e. no >> nonzeros). It only >> preallocated an array to hold up to 'nnz' entries. >> >> >> and access it with vl_C.operator(). >> >> >> Unless you are changing only a few entries, this is >> likely to be too >> slow. >> >> >> I am using host only memory context, with ViennaCL >> 1.7.1 from >> homebrew. >> >> How should I proceed with this? 
>> >> >> To fill the CSR format efficiently, have a look here: >> >> https://sourceforge.net/p/viennacl/discussion/1143678/thread >> /325a937c/?limit=25#d6f0 >> <https://sourceforge.net/p/viennacl/discussion/1143678/threa >> d/325a937c/?limit=25#d6f0> >> >> <https://sourceforge.net/p/viennacl/discussion/1143678/threa >> d/325a937c/?limit=25#d6f0 >> <https://sourceforge.net/p/viennacl/discussion/1143678/threa >> d/325a937c/?limit=25#d6f0>> >> >> For host-based memory, an example of how to get pointers >> to the >> three CSR arrays is here: >> >> https://github.com/viennacl/viennacl-dev/blob/master/viennac >> l/linalg/host_based/sparse_matrix_operations.hpp#L115 >> <https://github.com/viennacl/viennacl-dev/blob/master/vienna >> cl/linalg/host_based/sparse_matrix_operations.hpp#L115 >> <https://github.com/viennacl/viennacl-dev/blob/master/vienna >> cl/linalg/host_based/sparse_matrix_operations.hpp#L115>> >> >> Best regards, >> Karli >> >> >> >> >> |
From: Karl R. <ru...@iu...> - 2017-04-11 08:29:11
|
Hi Chris, > Ok, this seemed to work very well. I can then modify the element > internal vector to zero out the matrix for my finite volume > implementation, &c and preserve the sparsity information so as to use > operator() quickly. it's still best to avoid operator() if you aim for maximum performance, but instead work on the CSR arrays directly. Chances are, however, that more time is already spent on other parts of your finite volume application, in which case there's no need for further optimizing this part. > The whole point of doing this was to avoid 2 sets of copies from main > memory STL format to main memory compressed_matrix format when using > OpenMP. Yes, that's definitely the right way to do it. > However I'm not seeing any performance increase, and rather I am > seeing a performance decrease! > > Is this to be expected? In which part do you see the performance decrease? If it's in the assembly, then work on the CSR arrays directly. Or are you referring to other parts, e.g. sparse matrix-vector products? Best regards, Karli > On 7 April 2017 at 10:12, Chris Marsh <chr...@us... > <mailto:chr...@us...>> wrote: > > Hi, > > Right, it's the sparsity pattern that you have no way of knowing a > priori during allocation. The parallel insert is then of course an > issue without the 2 passes... > I have to build a new A and b many, many times (during some > timestepping) so 2 passes is probably not much faster than what I'm > getting with copy. The sparsity pattern will stay constant. If I > initialize the sparsity, then operator() should work, correct? And > make my parallel code faster, i.e., not require 2 passes. > > Following this further: if I use a std::map< ... > sparse > representation, and copy it to a compressed_matrix, it should set up > the sparse structure for me. Then, I can use operator() without slow > down, and access in parallel as the sparsity will be correctly > setup. Reasonable approach for host only? 
For GPU, I obviously will > still need to copy. But this approach, if it works, should also > reduce code duplication..... > > (I'm trying to avoid learning CSR at the moment, have a time crunch!) > > Cheers > Chris > > > On 7 April 2017 at 00:21, Karl Rupp <ru...@iu... > <mailto:ru...@iu...>> wrote: > > Hey, > > On 04/06/2017 11:48 PM, Chris Marsh wrote: > > Unless you are changing only a few entries, this is > likely to be too > slow. > > Big time :) > > Ok, so even though it is pre allocated for the right number > of nnz > values, operator() still incurs the cost? Must admit that is > not what > I'd have expected. > > > Well, this is a sparse matrix. Since operator() deals with a > single entry, there is no way this could be fast (note that CSR > has requirements on entries from the same row being located > consecutively in memory) > > > When I obtain those CSR buffers, they will be the correct > size, and I > should be able to insert into them in parallel, correct? > > > Yes, exactly. > You may need to populate the matrix in two passes: The first > determines the sparsity pattern, the second writes the actual > numerical values. > > Best regards, > Karli > > > > On 6 April 2017 at 13:13, Karl Rupp <ru...@iu... > <mailto:ru...@iu...> > <mailto:ru...@iu... > <mailto:ru...@iu...>>> wrote: > > Hi! > > > > On 04/06/2017 06:44 PM, Chris Marsh wrote: > > Hi, > > I know the number of non-zero entries for a sparse > matrix so I > am trying > to pre-allocate it with > > viennacl::compressed_matrix<vcl_scalar_type> > vl_C(row, col, nnz); > > > At this point your matrix is still empty (i.e. no > nonzeros). It only > preallocated an array to hold up to 'nnz' entries. > > > and access it with vl_C.operator(). > > > Unless you are changing only a few entries, this is > likely to be too > slow. > > > I am using host only memory context, with ViennaCL > 1.7.1 from > homebrew. > > How should I proceed with this? 
> > > To fill the CSR format efficiently, have a look here: > > https://sourceforge.net/p/viennacl/discussion/1143678/thread/325a937c/?limit=25#d6f0 > <https://sourceforge.net/p/viennacl/discussion/1143678/thread/325a937c/?limit=25#d6f0> > > <https://sourceforge.net/p/viennacl/discussion/1143678/thread/325a937c/?limit=25#d6f0 > <https://sourceforge.net/p/viennacl/discussion/1143678/thread/325a937c/?limit=25#d6f0>> > > For host-based memory, an example of how to get pointers > to the > three CSR arrays is here: > > https://github.com/viennacl/viennacl-dev/blob/master/viennacl/linalg/host_based/sparse_matrix_operations.hpp#L115 > <https://github.com/viennacl/viennacl-dev/blob/master/viennacl/linalg/host_based/sparse_matrix_operations.hpp#L115> > > <https://github.com/viennacl/viennacl-dev/blob/master/viennacl/linalg/host_based/sparse_matrix_operations.hpp#L115 > <https://github.com/viennacl/viennacl-dev/blob/master/viennacl/linalg/host_based/sparse_matrix_operations.hpp#L115>> > > Best regards, > Karli > > > > |
From: Chris M. <chr...@us...> - 2017-04-10 20:49:09
|
Ok, this seemed to work very well. I can then modify the element internal vector to zero out the matrix for my finite volume implementation, &c and preserve the sparsity information so-as to use operator() quickly. The whole point of doing this was to avoid 2 sets of copies from main memory STL format to main memory compressed_matrix format when using OpenMP. However I'm not seeing any performance increase, and rather I am seeing a performance decrease! Is this to be expected? Cheers Chris On 7 April 2017 at 10:12, Chris Marsh <chr...@us...> wrote: > Hi, > > Right, it's the sparsity pattern that you have no way of knowing a priori > during allocation. The parallel insert is then of course an issue without > the 2 passes... > I have to build a new A and b many, many times (during some timestepping) > so 2 passes is probably not much faster than what I'm getting with copy. > The sparsity pattern will stay constant. If I initialize the sparsity, then > operator() should work, correct? And make my parallel code faster, i.e., > not require 2 passes. > > Following this further: if I use a std::map< ... > sparse representation, > and copy it to a compressed_matrix, it should set up the sparse structure > for me. Then, I can use operator() without slow down, and access in > parallel as the sparsity will be correctly setup. Reasonable approach for > host only? For GPU, I obviously will still need to copy. But this approach, > if it works, should also reduce code duplication..... > > (I'm trying to avoid learning CSR at the moment, have a time crunch!) > > Cheers > Chris > > > On 7 April 2017 at 00:21, Karl Rupp <ru...@iu...> wrote: > >> Hey, >> >> On 04/06/2017 11:48 PM, Chris Marsh wrote: >> >>> Unless you are changing only a few entries, this is likely to be too >>> slow. >>> >>> Big time :) >>> >>> Ok, so even though it is pre allocated for the right number of nnz >>> values, operator() still incurs the cost? Must admit that is not what >>> I'd have expected. 
>>> >> >> Well, this is a sparse matrix. Since operator() deals with a single >> entry, there is no way this could be fast (note that CSR has requirements >> on entries from the same row being located consecutively in memory) >> >> >> When I obtain those CSR buffers, they will be the correct size, and I >>> should be able to insert into them in parallel, correct? >>> >> >> Yes, exactly. >> You may need to populate the matrix in two passes: The first determines >> the sparsity pattern, the second writes the actual numerical values. >> >> Best regards, >> Karli >> >> >> >> On 6 April 2017 at 13:13, Karl Rupp <ru...@iu... >>> <mailto:ru...@iu...>> wrote: >>> >>> Hi! >>> >>> >>> >>> On 04/06/2017 06:44 PM, Chris Marsh wrote: >>> >>> Hi, >>> >>> I know the number of non-zero entries for a sparse matrix so I >>> am trying >>> to pre-allocate it with >>> >>> viennacl::compressed_matrix<vcl_scalar_type> vl_C(row, col, >>> nnz); >>> >>> >>> At this point your matrix is still empty (i.e. no nonzeros). It only >>> preallocated an array to hold up to 'nnz' entries. >>> >>> >>> and access it with vl_C.operator(). >>> >>> >>> Unless you are changing only a few entries, this is likely to be too >>> slow. >>> >>> >>> I am using host only memory context, with ViennaCL 1.7.1 from >>> homebrew. >>> >>> How should I proceed with this? >>> >>> >>> To fill the CSR format efficiencly, have a look here: >>> https://sourceforge.net/p/viennacl/discussion/1143678/thread >>> /325a937c/?limit=25#d6f0 >>> <https://sourceforge.net/p/viennacl/discussion/1143678/threa >>> d/325a937c/?limit=25#d6f0> >>> >>> For host-based memory, an example of how to get pointers to the >>> three CSR arrays is here: >>> https://github.com/viennacl/viennacl-dev/blob/master/viennac >>> l/linalg/host_based/sparse_matrix_operations.hpp#L115 >>> <https://github.com/viennacl/viennacl-dev/blob/master/vienna >>> cl/linalg/host_based/sparse_matrix_operations.hpp#L115> >>> >>> Best regards, >>> Karli >>> >>> >>> > |
From: Chris M. <chr...@us...> - 2017-04-07 16:13:12
|
Hi, Right, it's the sparsity pattern that you have no way of knowing a priori during allocation. The parallel insert is then of course an issue without the 2 passes... I have to build a new A and b many, many times (during some timestepping) so 2 passes is probably not much faster than what I'm getting with copy. The sparsity pattern will stay constant. If I initialize the sparsity, then operator() should work, correct? And make my parallel code faster, i.e., not require 2 passes. Following this further: if I use a std::map< ... > sparse representation, and copy it to a compressed_matrix, it should set up the sparse structure for me. Then, I can use operator() without slow down, and access in parallel as the sparsity will be correctly setup. Reasonable approach for host only? For GPU, I obviously will still need to copy. But this approach, if it works, should also reduce code duplication..... (I'm trying to avoid learning CSR at the moment, have a time crunch!) Cheers Chris On 7 April 2017 at 00:21, Karl Rupp <ru...@iu...> wrote: > Hey, > > On 04/06/2017 11:48 PM, Chris Marsh wrote: > >> Unless you are changing only a few entries, this is likely to be too >> slow. >> >> Big time :) >> >> Ok, so even though it is pre allocated for the right number of nnz >> values, operator() still incurs the cost? Must admit that is not what >> I'd have expected. >> > > Well, this is a sparse matrix. Since operator() deals with a single entry, > there is no way this could be fast (note that CSR has requirements on > entries from the same row being located consecutively in memory) > > > When I obtain those CSR buffers, they will be the correct size, and I >> should be able to insert into them in parallel, correct? >> > > Yes, exactly. > You may need to populate the matrix in two passes: The first determines > the sparsity pattern, the second writes the actual numerical values. > > Best regards, > Karli > > > > On 6 April 2017 at 13:13, Karl Rupp <ru...@iu... 
>> <mailto:ru...@iu...>> wrote: >> >> Hi! >> >> >> >> On 04/06/2017 06:44 PM, Chris Marsh wrote: >> >> Hi, >> >> I know the number of non-zero entries for a sparse matrix so I >> am trying >> to pre-allocate it with >> >> viennacl::compressed_matrix<vcl_scalar_type> vl_C(row, col, >> nnz); >> >> >> At this point your matrix is still empty (i.e. no nonzeros). It only >> preallocated an array to hold up to 'nnz' entries. >> >> >> and access it with vl_C.operator(). >> >> >> Unless you are changing only a few entries, this is likely to be too >> slow. >> >> >> I am using host only memory context, with ViennaCL 1.7.1 from >> homebrew. >> >> How should I proceed with this? >> >> >> To fill the CSR format efficiently, have a look here: >> https://sourceforge.net/p/viennacl/discussion/1143678/thread >> /325a937c/?limit=25#d6f0 >> <https://sourceforge.net/p/viennacl/discussion/1143678/threa >> d/325a937c/?limit=25#d6f0> >> >> For host-based memory, an example of how to get pointers to the >> three CSR arrays is here: >> https://github.com/viennacl/viennacl-dev/blob/master/viennac >> l/linalg/host_based/sparse_matrix_operations.hpp#L115 >> <https://github.com/viennacl/viennacl-dev/blob/master/vienna >> cl/linalg/host_based/sparse_matrix_operations.hpp#L115> >> >> Best regards, >> Karli >> >> >> |
From: Karl R. <ru...@iu...> - 2017-04-07 06:21:38
|
Hey, On 04/06/2017 11:48 PM, Chris Marsh wrote: > Unless you are changing only a few entries, this is likely to be too > slow. > > Big time :) > > Ok, so even though it is pre allocated for the right number of nnz > values, operator() still incurs the cost? Must admit that is not what > I'd have expected. Well, this is a sparse matrix. Since operator() deals with a single entry, there is no way this could be fast (note that CSR has requirements on entries from the same row being located consecutively in memory) > When I obtain those CSR buffers, they will be the correct size, and I > should be able to insert into them in parallel, correct? Yes, exactly. You may need to populate the matrix in two passes: The first determines the sparsity pattern, the second writes the actual numerical values. Best regards, Karli > On 6 April 2017 at 13:13, Karl Rupp <ru...@iu... > <mailto:ru...@iu...>> wrote: > > Hi! > > > > On 04/06/2017 06:44 PM, Chris Marsh wrote: > > Hi, > > I know the number of non-zero entries for a sparse matrix so I > am trying > to pre-allocate it with > > viennacl::compressed_matrix<vcl_scalar_type> vl_C(row, col, nnz); > > > At this point your matrix is still empty (i.e. no nonzeros). It only > preallocated an array to hold up to 'nnz' entries. > > > and access it with vl_C.operator(). > > > Unless you are changing only a few entries, this is likely to be too > slow. > > > I am using host only memory context, with ViennaCL 1.7.1 from > homebrew. > > How should I proceed with this? 
> > > To fill the CSR format efficiently, have a look here: > https://sourceforge.net/p/viennacl/discussion/1143678/thread/325a937c/?limit=25#d6f0 > > For host-based memory, an example of how to get pointers to the > three CSR arrays is here: > https://github.com/viennacl/viennacl-dev/blob/master/viennacl/linalg/host_based/sparse_matrix_operations.hpp#L115 > > Best regards, > Karli > > |
From: Chris M. <chr...@us...> - 2017-04-06 21:49:25
|
> > Unless you are changing only a few entries, this is likely to be too slow. Big time :) Ok, so even though it is pre-allocated for the right number of nnz values, operator() still incurs the cost? Must admit that is not what I'd have expected. When I obtain those CSR buffers, they will be the correct size, and I should be able to insert into them in parallel, correct? On 6 April 2017 at 13:13, Karl Rupp <ru...@iu...> wrote: > Hi! > > > > On 04/06/2017 06:44 PM, Chris Marsh wrote: > >> Hi, >> >> I know the number of non-zero entries for a sparse matrix so I am trying >> to pre-allocate it with >> >> viennacl::compressed_matrix<vcl_scalar_type> vl_C(row, col, nnz); >> > > At this point your matrix is still empty (i.e. no nonzeros). It only > preallocates an array to hold up to 'nnz' entries. > > > and access it with vl_C.operator(). >> > > Unless you are changing only a few entries, this is likely to be too slow. > > > I am using a host-only memory context, with ViennaCL 1.7.1 from homebrew. >> >> How should I proceed with this? >> > > To fill the CSR format efficiently, have a look here: > https://sourceforge.net/p/viennacl/discussion/1143678/thread/325a937c/?limit=25#d6f0 > > For host-based memory, an example of how to get pointers to the three CSR > arrays is here: > https://github.com/viennacl/viennacl-dev/blob/master/viennacl/linalg/host_based/sparse_matrix_operations.hpp#L115 > > Best regards, > Karli > > |
From: Karl R. <ru...@iu...> - 2017-04-06 19:16:06
|
Hi! On 04/06/2017 06:44 PM, Chris Marsh wrote: > Hi, > > I know the number of non-zero entries for a sparse matrix so I am trying > to pre-allocate it with > > viennacl::compressed_matrix<vcl_scalar_type> vl_C(row, col, nnz); At this point your matrix is still empty (i.e. no nonzeros). It only preallocates an array to hold up to 'nnz' entries. > and access it with vl_C.operator(). Unless you are changing only a few entries, this is likely to be too slow. > I am using a host-only memory context, with ViennaCL 1.7.1 from homebrew. > > How should I proceed with this? To fill the CSR format efficiently, have a look here: https://sourceforge.net/p/viennacl/discussion/1143678/thread/325a937c/?limit=25#d6f0 For host-based memory, an example of how to get pointers to the three CSR arrays is here: https://github.com/viennacl/viennacl-dev/blob/master/viennacl/linalg/host_based/sparse_matrix_operations.hpp#L115 Best regards, Karli |
From: Chris M. <chr...@us...> - 2017-04-06 17:00:06
|
Hi, I know the number of non-zero entries for a sparse matrix, so I am trying to pre-allocate it with viennacl::compressed_matrix<vcl_scalar_type> vl_C(row, col, nnz); and access it with vl_C.operator(). However, after stepping through the VCL code on the call to operator(), it appears that it is not finding the index and thus falls back to the very slow re-allocation. See here: https://github.com/viennacl/viennacl-dev/blob/master/viennacl/compressed_matrix.hpp#L1014 I am using a host-only memory context, with ViennaCL 1.7.1 from homebrew. How should I proceed with this? Cheers Chris |