ViennaCL / Discussion / General Discussion: Cannot run tutorial examples on AMD HD4000 series

Mr Dan - 2014-05-14

Hi,

I am trying to setup the tutorial shipped with V1.5.1. When trying to run the "Blas3" example on the CPU, everything goes fine, but when switching to my GPU (ATI Radeon HD 4350) I get errors at runtime. For switching I employ the commands:

viennacl::ocl::set_context_platform_index(0,1);
viennacl::ocl::switch_context(0);

which successfully makes ViennaCL select the ATI card. The console output looks as follows:

--- Computing matrix-matrix product using ublas ---
- Execution time: 0.15302

--- Computing matrix-matrix product on each available compute device using Vienn
aCL ---
- Device Name: ATI RV710
Build Scalar: Err = -11 Status = -2
Log: Error: Requested compile size is bigger than the required workgroup size of
32¬♣ elements
Error: Creating kernel kernel_0_0 failed!

Sources: #if defined(cl_khr_fp64)

pragma OPENCL EXTENSION cl_khr_fp64: enable

elif defined(cl_amd_fp64)

pragma OPENCL EXTENSION cl_amd_fp64: enable

endif

attribute((reqd_work_group_size(8,8,1)))
kernel void kernel_0_0(
unsigned int M,unsigned int N,unsigned int K,global float arg0,global float
* arg1,global float arg2,float arg3,float arg4
)
{
float res0_0 = (float)(0) ;
float res0_1 = (float)(0) ;
float res0_2 = (float)(0) ;
float res0_3 = (float)(0) ;
float res1_0 = (float)(0) ;
float res1_1 = (float)(0) ;
float res1_2 = (float)(0) ;
float res1_3 = (float)(0) ;
float res2_0 = (float)(0) ;
float res2_1 = (float)(0) ;
float res2_2 = (float)(0) ;
float res2_3 = (float)(0) ;
float res3_0 = (float)(0) ;
float res3_1 = (float)(0) ;
float res3_2 = (float)(0) ;
float res3_3 = (float)(0) ;
local float lhs_buf[1056];
global float global_lhs_ptr = arg1 + (get_group_id(0)32)K;
global float rhs_ptr_0 = arg2 + (get_local_id(1)4 + get_group_id(1)32
+ 0)K;
global float rhs_ptr_1 = arg2 + (get_local_id(1)4 + get_group_id(1)32
+ 1)K;
global float rhs_ptr_2 = arg2 + (get_local_id(1)4 + get_group_id(1)32
+ 2)K;
global float rhs_ptr_3 = arg2 + (get_local_id(1)4 + get_group_id(1)32
+ 3)K;
for(unsigned int bl=0 ; bl<K/32 ; ++bl){
barrier(CLK_LOCAL_MEM_FENCE);
{
float val;
val = (global_lhs_ptr + (get_local_id(1) + 0) + K(get_local_id(0) + 0)
);
lhs_buf[(get_local_id(0) + 0)33 + (get_local_id(1) + 0)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 0) + K(get_local_id(0) + 8)
);
lhs_buf[(get_local_id(0) + 8)33 + (get_local_id(1) + 0)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 0) + K(get_local_id(0) + 16
));
lhs_buf[(get_local_id(0) + 16)33 + (get_local_id(1) + 0)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 0) + K(get_local_id(0) + 24
));
lhs_buf[(get_local_id(0) + 24)33 + (get_local_id(1) + 0)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 8) + K(get_local_id(0) + 0)
);
lhs_buf[(get_local_id(0) + 0)33 + (get_local_id(1) + 8)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 8) + K(get_local_id(0) + 8)
);
lhs_buf[(get_local_id(0) + 8)33 + (get_local_id(1) + 8)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 8) + K(get_local_id(0) + 16
));
lhs_buf[(get_local_id(0) + 16)33 + (get_local_id(1) + 8)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 8) + K(get_local_id(0) + 24
));
lhs_buf[(get_local_id(0) + 24)33 + (get_local_id(1) + 8)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 16) + K(get_local_id(0) + 0
));
lhs_buf[(get_local_id(0) + 0)33 + (get_local_id(1) + 16)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 16) + K(get_local_id(0) + 8
));
lhs_buf[(get_local_id(0) + 8)33 + (get_local_id(1) + 16)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 16) + K(get_local_id(0) + 1
6));
lhs_buf[(get_local_id(0) + 16)33 + (get_local_id(1) + 16)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 16) + K(get_local_id(0) + 2
4));
lhs_buf[(get_local_id(0) + 24)33 + (get_local_id(1) + 16)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 24) + K(get_local_id(0) + 0
));
lhs_buf[(get_local_id(0) + 0)33 + (get_local_id(1) + 24)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 24) + K(get_local_id(0) + 8
));
lhs_buf[(get_local_id(0) + 8)33 + (get_local_id(1) + 24)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 24) + K(get_local_id(0) + 1
6));
lhs_buf[(get_local_id(0) + 16)33 + (get_local_id(1) + 24)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 24) + K(get_local_id(0) + 2
4));
lhs_buf[(get_local_id(0) + 24)33 + (get_local_id(1) + 24)1] = val;
}
barrier(CLK_LOCAL_MEM_FENCE);
local float lhs_ptr_0 = lhs_buf + (get_local_id(0)4+0)*33;
local float lhs_ptr_1 = lhs_buf + (get_local_id(0)4+1)33;
local float lhs_ptr_2 = lhs_buf + (get_local_id(0)4+2)*33;
local float lhs_ptr_3 = lhs_buf + (get_local_id(0)4+3)33;
for(unsigned int bs=0 ; bs < 8 ; ++bs){
float val_rhs_0_0 = * rhs_ptr_0++;
float val_rhs_0_1 = * rhs_ptr_1++;
float val_rhs_0_2 = * rhs_ptr_2++;
float val_rhs_0_3 = * rhs_ptr_3++;
float val_rhs_1_0 = * rhs_ptr_0++;
float val_rhs_1_1 = * rhs_ptr_1++;
float val_rhs_1_2 = * rhs_ptr_2++;
float val_rhs_1_3 = * rhs_ptr_3++;
float val_rhs_2_0 = * rhs_ptr_0++;
float val_rhs_2_1 = * rhs_ptr_1++;
float val_rhs_2_2 = * rhs_ptr_2++;
float val_rhs_2_3 = * rhs_ptr_3++;
float val_rhs_3_0 = * rhs_ptr_0++;
float val_rhs_3_1 = * rhs_ptr_1++;
float val_rhs_3_2 = * rhs_ptr_2++;
float val_rhs_3_3 = * rhs_ptr_3++;
float val_lhs_0_0 = * lhs_ptr_0++;
float val_lhs_1_0 = * lhs_ptr_1++;
float val_lhs_2_0 = * lhs_ptr_2++;
float val_lhs_3_0 = * lhs_ptr_3++;
float val_lhs_0_1 = * lhs_ptr_0++;
float val_lhs_1_1 = * lhs_ptr_1++;
float val_lhs_2_1 = * lhs_ptr_2++;
float val_lhs_3_1 = * lhs_ptr_3++;
float val_lhs_0_2 = * lhs_ptr_0++;
float val_lhs_1_2 = * lhs_ptr_1++;
float val_lhs_2_2 = * lhs_ptr_2++;
float val_lhs_3_2 = * lhs_ptr_3++;
float val_lhs_0_3 = * lhs_ptr_0++;
float val_lhs_1_3 = * lhs_ptr_1++;
float val_lhs_2_3 = * lhs_ptr_2++;
float val_lhs_3_3 = * lhs_ptr_3++;
res0_0+=val_lhs_0_0val_rhs_0_0;
res1_0+=val_lhs_1_0val_rhs_0_0;
res2_0+=val_lhs_2_0val_rhs_0_0;
res3_0+=val_lhs_3_0val_rhs_0_0;
res0_0+=val_lhs_0_1val_rhs_1_0;
res1_0+=val_lhs_1_1val_rhs_1_0;
res2_0+=val_lhs_2_1val_rhs_1_0;
res3_0+=val_lhs_3_1val_rhs_1_0;
res0_0+=val_lhs_0_2val_rhs_2_0;
res1_0+=val_lhs_1_2val_rhs_2_0;
res2_0+=val_lhs_2_2val_rhs_2_0;
res3_0+=val_lhs_3_2val_rhs_2_0;
res0_0+=val_lhs_0_3val_rhs_3_0;
res1_0+=val_lhs_1_3val_rhs_3_0;
res2_0+=val_lhs_2_3val_rhs_3_0;
res3_0+=val_lhs_3_3val_rhs_3_0;
res0_1+=val_lhs_0_0val_rhs_0_1;
res1_1+=val_lhs_1_0val_rhs_0_1;
res2_1+=val_lhs_2_0val_rhs_0_1;
res3_1+=val_lhs_3_0val_rhs_0_1;
res0_1+=val_lhs_0_1val_rhs_1_1;
res1_1+=val_lhs_1_1val_rhs_1_1;
res2_1+=val_lhs_2_1val_rhs_1_1;
res3_1+=val_lhs_3_1val_rhs_1_1;
res0_1+=val_lhs_0_2val_rhs_2_1;
res1_1+=val_lhs_1_2val_rhs_2_1;
res2_1+=val_lhs_2_2val_rhs_2_1;
res3_1+=val_lhs_3_2val_rhs_2_1;
res0_1+=val_lhs_0_3val_rhs_3_1;
res1_1+=val_lhs_1_3val_rhs_3_1;
res2_1+=val_lhs_2_3val_rhs_3_1;
res3_1+=val_lhs_3_3val_rhs_3_1;
res0_2+=val_lhs_0_0val_rhs_0_2;
res1_2+=val_lhs_1_0val_rhs_0_2;
res2_2+=val_lhs_2_0val_rhs_0_2;
res3_2+=val_lhs_3_0val_rhs_0_2;
res0_2+=val_lhs_0_1val_rhs_1_2;
res1_2+=val_lhs_1_1val_rhs_1_2;
res2_2+=val_lhs_2_1val_rhs_1_2;
res3_2+=val_lhs_3_1val_rhs_1_2;
res0_2+=val_lhs_0_2val_rhs_2_2;
res1_2+=val_lhs_1_2val_rhs_2_2;
res2_2+=val_lhs_2_2val_rhs_2_2;
res3_2+=val_lhs_3_2val_rhs_2_2;
res0_2+=val_lhs_0_3val_rhs_3_2;
res1_2+=val_lhs_1_3val_rhs_3_2;
res2_2+=val_lhs_2_3val_rhs_3_2;
res3_2+=val_lhs_3_3val_rhs_3_2;
res0_3+=val_lhs_0_0val_rhs_0_3;
res1_3+=val_lhs_1_0val_rhs_0_3;
res2_3+=val_lhs_2_0val_rhs_0_3;
res3_3+=val_lhs_3_0val_rhs_0_3;
res0_3+=val_lhs_0_1val_rhs_1_3;
res1_3+=val_lhs_1_1val_rhs_1_3;
res2_3+=val_lhs_2_1val_rhs_1_3;
res3_3+=val_lhs_3_1val_rhs_1_3;
res0_3+=val_lhs_0_2val_rhs_2_3;
res1_3+=val_lhs_1_2val_rhs_2_3;
res2_3+=val_lhs_2_2val_rhs_2_3;
res3_3+=val_lhs_3_2val_rhs_2_3;
res0_3+=val_lhs_0_3val_rhs_3_3;
res1_3+=val_lhs_1_3val_rhs_3_3;
res2_3+=val_lhs_2_3val_rhs_3_3;
res3_3+=val_lhs_3_3val_rhs_3_3;
}
global_lhs_ptr += 32;
}
(arg0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+0)]=(((res0_0)arg3)+(ar
g0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+0)]arg4)));
(arg0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+1)]=(((res0_1)arg3)+(ar
g0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+1)]arg4)));
(arg0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+2)]=(((res0_2)arg3)+(ar
g0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+2)]arg4)));
(arg0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+3)]=(((res0_3)arg3)+(ar
g0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+3)]arg4)));
(arg0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+0)]=(((res1_0)arg3)+(ar
g0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+0)]arg4)));
(arg0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+1)]=(((res1_1)arg3)+(ar
g0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+1)]arg4)));
(arg0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+2)]=(((res1_2)arg3)+(ar
g0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+2)]arg4)));
(arg0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+3)]=(((res1_3)arg3)+(ar
g0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+3)]arg4)));
(arg0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+0)]=(((res2_0)arg3)+(ar
g0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+0)]arg4)));
(arg0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+1)]=(((res2_1)arg3)+(ar
g0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+1)]arg4)));
(arg0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+2)]=(((res2_2)arg3)+(ar
g0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+2)]arg4)));
(arg0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+3)]=(((res2_3)arg3)+(ar
g0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+3)]arg4)));
(arg0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+0)]=(((res3_0)arg3)+(ar
g0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+0)]arg4)));
(arg0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+1)]=(((res3_1)arg3)+(ar
g0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+1)]arg4)));
(arg0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+2)]=(((res3_2)arg3)+(ar
g0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+2)]arg4)));
(arg0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+3)]=(((res3_3)arg3)+(ar
g0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+3)]*arg4)));
}

Can anybody help?

Thanks,

Daniel

Last edit: Mr Dan 2014-05-14

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Karl Rupp - 2014-05-14

Hi Daniel,

could you please send us the output of examples/tutorial/viennacl-info? Your GPU is fairly old and low-end, so I suspect that it e.g. has insufficient local memory.

Thanks and best regards,
Karli

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Mr Dan - 2014-05-15
  
  Hi Karli,
  
  thanks for the quick reply!
  
  The viennacl-info.log contains the following information:
  
  Der Buildvorgang wurde am 15.05.2014 09:50:50 gestartet.
  1>Projekt "F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\viennacl-info.vcxproj" auf Knoten "4", Rebuild Ziel(e).
  1>CustomBuild:
  Building Custom Rule F:/user/DRp/ViennaCL-1.5.1/examples/tutorial/CMakeLists.txt
  CMake does not need to re-run because F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\CMakeFiles\generate.stamp is up-to-date.
  ClCompile:
  C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64\CL.exe /c /IF:\user\DRp\dependencies\boost_1_55_0 /I"F:\user\DRp\dependencies\MTL-all-4.0.9540-Linux\usr\include" /I"F:\user\DRp\ViennaCL-1.5.1" /I"C:\Program Files (x86)\AMD APP\include" /I"F:\user\DRp\dependencies\eigen-eigen-6b38706d90a9" /I"F:\user\DRp\ViennaCL-1.5.1\libviennacl\include" /Zi /nologo /W3 /WX- /Od /Ob0 /D WIN32 /D _WINDOWS /D VIENNACL_WITH_OPENMP /D _DEBUG /D VIENNACL_WITH_OPENCL /D "CMAKE_INTDIR=\"Debug\"" /D _MBCS /Gm- /EHsc /RTC1 /MDd /GS /fp:precise /Zc:wchar_t /Zc:forScope /GR /openmp /Fo"viennacl-info.dir\Debug\" /Fd"viennacl-info.dir\Debug\vc110.pdb" /Gd /TP /wd4996 /errorReport:prompt "........\examples\tutorial\viennacl-info.cpp"
  viennacl-info.cpp
  Link:
  C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64\link.exe /ERRORREPORT:PROMPT /OUT:"F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\Debug\viennacl-info.exe" /INCREMENTAL /NOLOGO /LIBPATH:"F:/user/DRp/dependencies/boost_1_55_0/lib64-msvc-11.0" /LIBPATH:"F:/user/DRp/dependencies/boost_1_55_0/lib64-msvc-11.0/Debug" kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib "C:\Program Files (x86)\AMD APP\lib\x86_64\OpenCL.lib" /MANIFEST /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /manifest:embed /DEBUG /PDB:"F:/user/DRp/ViennaCL-1.5.1/build64/AMD/examples/tutorial/Debug/viennacl-info.pdb" /SUBSYSTEM:CONSOLE /TLBID:1 /DYNAMICBASE /NXCOMPAT /IMPLIB:"F:/user/DRp/ViennaCL-1.5.1/build64/AMD/examples/tutorial/Debug/viennacl-info.lib" /MACHINE:X64 /machine:x64 /debug "viennacl-info.dir\Debug\viennacl-info.obj"
  viennacl-info.vcxproj -> F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\Debug\viennacl-info.exe
  1>Die Erstellung des Projekts "F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\viennacl-info.vcxproj" ist abgeschlossen, Rebuild Ziel(e).
  
  Build erfolgreich.
  
  Verstrichene Zeit 00:00:01.97
  
  However, I am not sure if this is exactly the information you are looking for. Do you need the complete folder content?
  
  Regarding my card, I am little bit confused. I agree that it is a fairly old one but the manual indicates compatibility (in case of single precision). So I suppose someone at some place had better luck with this GPU.
  
  Thanks again and best regards,
  
  Daniel
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Karl Rupp - 2014-05-15

Thanks, Dan. please provide the output when running viennacl-info, not the compiler messages. This gives us all the valuable hints about local memory sizes, etc.

The error you observed is in the matrix-matrix multiplication routine, which is particularly sensitive to the underlying hardware. Most of the other kernels presumably run fine. Anyway, we want ViennaCL to work on as much hardware out there as possible, so any help on getting this to work is appreciated.

Thanks and best regards,
Karli

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Mr Dan - 2014-05-15
  
  Hi Karl,
  
  sorry for the misinterpretation. Running viennacl-info creates the following output:
  
  =========================================
  
  Platform Information
  
  =========================================
  
  Vendor and version: Intel(R) Corporation: OpenCL 1.2
  
  ViennaCL uses this OpenCL platform by default.
  
  Available Devices:
  
  Address Bits: 64
  Available: 1
  Compiler Available: 1
  Endian Little: 1
  Error Correction Support: 0
  Execution Capabilities: CL_EXEC_KERNEL CL_EXEC_NATIVE_KERNEL
  Extensions: cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_a
  tomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_k
  hr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl
  _ext_device_fission cl_intel_exec_by_local_thread cl_khr_gl_sharing cl_intel_dx9
  _media_sharing cl_khr_dx9_media_sharing cl_khr_d3d11_sharing
  Global Mem Cache Size: 262144 Bytes
  Global Mem Cache Type: CL_READ_WRITE_CACHE
  Global Mem Cacheline Size: 64 Bytes
  Global Mem Size: 8585207808 Bytes
  Host Unified Memory: 1
  Image Support: 1
  Image2D Max Height: 16384
  Image2D Max Width: 16384
  Image3D Max Depth: 2048
  Image3D Max Height: 2048
  Image3D Max Width: 2048
  Local Mem Size: 32768 Bytes
  Local Mem Type: CL_GLOBAL
  Max Clock Frequency: 3200 MHz
  Max Compute Units: 4
  Max Constant Args: 480
  Max Constant Buffer Size: 131072 Bytes
  Max Mem Alloc Size: 2146301952 Bytes
  Max Parameter Size: 3840 Bytes
  Max Read Image Args: 480
  Max Samplers: 480
  Max Work Group Size: 1024
  Max Work Item Dimensions: 3
  Max Work Item Sizes: 1024 1024 1024
  Max Write Image Args: 480
  Mem Base Addr Align: 1024
  Min Data Type Align Size: 128 Bytes
  Name: Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz
  Native Vector Width char: 16
  Native Vector Width short: 8
  Native Vector Width int: 4
  Native Vector Width long: 2
  Native Vector Width float: 4
  Native Vector Width double: 2
  Native Vector Width half: 0
  OpenCL C Version: OpenCL C 1.2
  Platform: 00000000006E1800
  Preferred Vector Width char: 1
  Preferred Vector Width short: 1
  Preferred Vector Width int: 1
  Preferred Vector Width long: 1
  Preferred Vector Width float: 1
  Preferred Vector Width double: 1
  Preferred Vector Width half: 0
  Profile: FULL_PROFILE
  Profiling Timer Resolution: 320 ns
  Queue Properties: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE CL_QUEUE_P
  ROFILING_ENABLE
  Single FP Config: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
  
  Type: CPU
  Vendor: Intel(R) Corporation
  Vendor ID: 32902
  Version: OpenCL 1.2 (Build 80752)
  Driver Version: 3.0.1.15216
  
  =========================================
  
  Platform Information
  
  =========================================
  
  Vendor and version: Advanced Micro Devices, Inc.: OpenCL 1.2 AMD-APP (937.2)
  
  Available Devices:
  
  Address Bits: 32
  Available: 1
  Compiler Available: 1
  Endian Little: 1
  Error Correction Support: 0
  Execution Capabilities: CL_EXEC_KERNEL
  Extensions: cl_khr_gl_sharing cl_amd_device_attribute_query c
  l_khr_d3d10_sharing
  Global Mem Cache Size: 0 Bytes
  Global Mem Cache Type: CL_NONE
  Global Mem Cacheline Size: 0 Bytes
  Global Mem Size: 536870912 Bytes
  Host Unified Memory: 0
  Image Support: 0
  Image2D Max Height: 0
  Image2D Max Width: 0
  Image3D Max Depth: 0
  Image3D Max Height: 0
  Image3D Max Width: 0
  Local Mem Size: 16384 Bytes
  Local Mem Type: CL_GLOBAL
  Max Clock Frequency: 600 MHz
  Max Compute Units: 2
  Max Constant Args: 8
  Max Constant Buffer Size: 65536 Bytes
  Max Mem Alloc Size: 134217728 Bytes
  Max Parameter Size: 1024 Bytes
  Max Read Image Args: 0
  Max Samplers: 0
  Max Work Group Size: 128
  Max Work Item Dimensions: 3
  Max Work Item Sizes: 128 128 128
  Max Write Image Args: 0
  Mem Base Addr Align: 2048
  Min Data Type Align Size: 128 Bytes
  Name: ATI RV710
  Native Vector Width char: 16
  Native Vector Width short: 8
  Native Vector Width int: 4
  Native Vector Width long: 2
  Native Vector Width float: 4
  Native Vector Width double: 0
  Native Vector Width half: 0
  OpenCL C Version: OpenCL C 1.0
  Platform: 000007FEEEF92A08
  Preferred Vector Width char: 16
  Preferred Vector Width short: 8
  Preferred Vector Width int: 4
  Preferred Vector Width long: 2
  Preferred Vector Width float: 4
  Preferred Vector Width double: 0
  Preferred Vector Width half: 0
  Profile: FULL_PROFILE
  Profiling Timer Resolution: 1 ns
  Queue Properties: CL_QUEUE_PROFILING_ENABLE
  Single FP Config: CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_
  TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
  Type: GPU
  Vendor: Advanced Micro Devices, Inc.
  Vendor ID: 4098
  Version: OpenCL 1.0 AMD-APP (937.2)
  Driver Version: CAL 1.4.1734
  
  Address Bits: 64
  Available: 1
  Compiler Available: 1
  Endian Little: 1
  Error Correction Support: 0
  Execution Capabilities: CL_EXEC_KERNEL CL_EXEC_NATIVE_KERNEL
  Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_
  atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_
  khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended
  _atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission c
  l_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_p
  opcnt cl_khr_d3d10_sharing
  Global Mem Cache Size: 32768 Bytes
  Global Mem Cache Type: CL_READ_WRITE_CACHE
  Global Mem Cacheline Size: 64 Bytes
  Global Mem Size: 8585207808 Bytes
  Host Unified Memory: 1
  Image Support: 1
  Image2D Max Height: 8192
  Image2D Max Width: 8192
  Image3D Max Depth: 2048
  Image3D Max Height: 2048
  Image3D Max Width: 2048
  Local Mem Size: 32768 Bytes
  Local Mem Type: CL_GLOBAL
  Max Clock Frequency: 3197 MHz
  Max Compute Units: 4
  Max Constant Args: 8
  Max Constant Buffer Size: 65536 Bytes
  Max Mem Alloc Size: 2147483648 Bytes
  Max Parameter Size: 4096 Bytes
  Max Read Image Args: 128
  Max Samplers: 16
  Max Work Group Size: 1024
  Max Work Item Dimensions: 3
  Max Work Item Sizes: 1024 1024 1024
  Max Write Image Args: 8
  Mem Base Addr Align: 1024
  Min Data Type Align Size: 128 Bytes
  Name: Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz
  Native Vector Width char: 16
  Native Vector Width short: 8
  Native Vector Width int: 4
  Native Vector Width long: 2
  Native Vector Width float: 4
  Native Vector Width double: 0
  Native Vector Width half: 0
  OpenCL C Version: OpenCL C 1.2
  Platform: 000007FEEEF92A08
  Preferred Vector Width char: 16
  Preferred Vector Width short: 8
  Preferred Vector Width int: 4
  Preferred Vector Width long: 2
  Preferred Vector Width float: 4
  Preferred Vector Width double: 0
  Preferred Vector Width half: 0
  Profile: FULL_PROFILE
  Profiling Timer Resolution: 320 ns
  Queue Properties: CL_QUEUE_PROFILING_ENABLE
  Single FP Config: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
  CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
  Type: CPU
  Vendor: GenuineIntel
  Vendor ID: 4098
  Version: OpenCL 1.2 AMD-APP (937.2)
  Driver Version: 2.0 (sse2)
  
  Please provide some feedback about this issue, in order to properly select a suitable card in the future.
  
  Thanks,
  
  Daniel
  
  Last edit: Mr Dan 2014-05-15
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philippe Tillet - 2014-05-15

Hi Daniel,

Thanks for the feedback!
The default GEMM kernel requires less than 16kiB of local memory, and comforms with the OpenCL standards. However, it seems that in your case the kernel doesn't JIT-compile, and according to the log message we cannot require work group sizes of 8 x 8 x 1. It seems like your GPU reports max_work_group_size = 128, though! Perhaps the value reported is incorrect.
Could you go to viennacl/generator/autotuning/profiles.hpp, line 108-111, 119 - 122, and replace :

1,8,32,8,4,4,4,1,0

bye

1,4,32,4,4,4,4,1,0 ?

This will set a different profile for the GEMM kernels.

Thanks!

Last edit: Philippe Tillet 2014-05-15

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Mr Dan - 2014-05-15
  
  Hi Philippe,
  
  thanks for this hack, it does the trick!
  
  So, "Blas3" is running on the GPU. I will now examine further examples, hopefully with success!
  
  Thanks again and best Regards,
  
  Daniel
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Karl Rupp - 2014-05-15

Thanks, Daniel! So apparently this is indeed a resource issue. Philippe, let's discuss this on our IRC meeting on Saturday.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philippe Tillet - 2014-05-15

If the SDKs don't report the right value for the maximum work-group size, we'll have a tough time :P

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Mr Dan - 2014-05-20
  
  Hi Karli/ Philippe,
  
  it seems that this old ATI card is giving ViennaCL a hard time: When running the "custom-kernel" example, the program crashes again with similar output as above in "blas3":
  
  Unhandled exception at at 0x000007FEFDC5940D in custom-kernels.exe: Microsoft C++ exception: viennacl::ocl::invalid_work_group_size at memory location 0x00000000001DF180.
  
  vec1: [10](0,1,2,3,4,5,6,7,8,9) vec2: [10](10,9,8,7,6,5,4,3,2,1)
  
  vec1 . vec2: 10
  vec1 / vec2: 10
  ViennaCL: FATAL ERROR: Kernel start failed for 'norm'.
  ViennaCL: Smaller work sizes could not solve the problem.
  
  If you are interested, I can offer further investigation of the issue (with assistance). As for me I'll rather stick with a more recent graphics card.
  
  Thanks and Best Regards,
  
  Daniel
  
  Last edit: Mr Dan 2014-05-20
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Karl Rupp - 2014-05-20

Hi Daniel,

okay, the GPU is indeed very low on resources, so the default work group size of 128 is too big for anything but the most trivial kernels. The 'fix' is again to adjust yet another default work group size, which will hopefully be addressed 'automatically' in the course of populating our device database for the next release. If you need a fix earlier, please let us know.

Best regards,
Karli

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Cannot run tutorial examples on AMD HD4000 series

Linear algebra and solver library using CUDA, OpenCL, and OpenMP

Forums

Help

Cannot run tutorial examples on AMD HD4000 series

pragma OPENCL EXTENSION cl_khr_fp64: enable

elif defined(cl_amd_fp64)

pragma OPENCL EXTENSION cl_amd_fp64: enable

endif

=========================================

Platform Information

=========================================

Vendor and version: Intel(R) Corporation: OpenCL 1.2

ViennaCL uses this OpenCL platform by default.

Available Devices:

=========================================

Platform Information

=========================================

Vendor and version: Advanced Micro Devices, Inc.: OpenCL 1.2 AMD-APP (937.2)

Available Devices:

Cannot run tutorial examples on AMD HD4000 series

Linear algebra and solver library using CUDA, OpenCL, and OpenMP

Forums

Help

Cannot run tutorial examples on AMD HD4000 series document.SUBSCRIPTION_OPTIONS = { "thing": "topic", "subscribed": false, "url": "subscribe", "icon": { "css": "fa fa-envelope-o" } };

pragma OPENCL EXTENSION cl_khr_fp64: enable

elif defined(cl_amd_fp64)

pragma OPENCL EXTENSION cl_amd_fp64: enable

endif

=========================================

Platform Information

=========================================

Vendor and version: Intel(R) Corporation: OpenCL 1.2

ViennaCL uses this OpenCL platform by default.

Available Devices:

=========================================

Platform Information

=========================================

Vendor and version: Advanced Micro Devices, Inc.: OpenCL 1.2 AMD-APP (937.2)

Available Devices:

Cannot run tutorial examples on AMD HD4000 series