I am trying to setup the tutorial shipped with V1.5.1. When trying to run the "Blas3" example on the CPU, everything goes fine, but when switching to my GPU (ATI Radeon HD 4350) I get errors at runtime. For switching I employ the commands:
--- Computing matrix-matrix product on each available compute device using Vienn
aCL ---
- Device Name: ATI RV710
Build Scalar: Err = -11 Status = -2
Log: Error: Requested compile size is bigger than the required workgroup size of
32¬♣ elements
Error: Creating kernel kernel_0_0 failed!
could you please send us the output of examples/tutorial/viennacl-info? Your GPU is fairly old and low-end, so I suspect that it e.g. has insufficient local memory.
Thanks and best regards,
Karli
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
The viennacl-info.log contains the following information:
Der Buildvorgang wurde am 15.05.2014 09:50:50 gestartet.
1>Projekt "F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\viennacl-info.vcxproj" auf Knoten "4", Rebuild Ziel(e).
1>CustomBuild:
Building Custom Rule F:/user/DRp/ViennaCL-1.5.1/examples/tutorial/CMakeLists.txt
CMake does not need to re-run because F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\CMakeFiles\generate.stamp is up-to-date.
ClCompile:
C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64\CL.exe /c /IF:\user\DRp\dependencies\boost_1_55_0 /I"F:\user\DRp\dependencies\MTL-all-4.0.9540-Linux\usr\include" /I"F:\user\DRp\ViennaCL-1.5.1" /I"C:\Program Files (x86)\AMD APP\include" /I"F:\user\DRp\dependencies\eigen-eigen-6b38706d90a9" /I"F:\user\DRp\ViennaCL-1.5.1\libviennacl\include" /Zi /nologo /W3 /WX- /Od /Ob0 /D WIN32 /D _WINDOWS /D VIENNACL_WITH_OPENMP /D _DEBUG /D VIENNACL_WITH_OPENCL /D "CMAKE_INTDIR=\"Debug\"" /D _MBCS /Gm- /EHsc /RTC1 /MDd /GS /fp:precise /Zc:wchar_t /Zc:forScope /GR /openmp /Fo"viennacl-info.dir\Debug\" /Fd"viennacl-info.dir\Debug\vc110.pdb" /Gd /TP /wd4996 /errorReport:prompt "........\examples\tutorial\viennacl-info.cpp"
viennacl-info.cpp
Link:
C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64\link.exe /ERRORREPORT:PROMPT /OUT:"F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\Debug\viennacl-info.exe" /INCREMENTAL /NOLOGO /LIBPATH:"F:/user/DRp/dependencies/boost_1_55_0/lib64-msvc-11.0" /LIBPATH:"F:/user/DRp/dependencies/boost_1_55_0/lib64-msvc-11.0/Debug" kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib "C:\Program Files (x86)\AMD APP\lib\x86_64\OpenCL.lib" /MANIFEST /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /manifest:embed /DEBUG /PDB:"F:/user/DRp/ViennaCL-1.5.1/build64/AMD/examples/tutorial/Debug/viennacl-info.pdb" /SUBSYSTEM:CONSOLE /TLBID:1 /DYNAMICBASE /NXCOMPAT /IMPLIB:"F:/user/DRp/ViennaCL-1.5.1/build64/AMD/examples/tutorial/Debug/viennacl-info.lib" /MACHINE:X64 /machine:x64 /debug "viennacl-info.dir\Debug\viennacl-info.obj"
viennacl-info.vcxproj -> F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\Debug\viennacl-info.exe
1>Die Erstellung des Projekts "F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\viennacl-info.vcxproj" ist abgeschlossen, Rebuild Ziel(e).
Build erfolgreich.
Verstrichene Zeit 00:00:01.97
However, I am not sure if this is exactly the information you are looking for. Do you need the complete folder content?
Regarding my card, I am little bit confused. I agree that it is a fairly old one but the manual indicates compatibility (in case of single precision). So I suppose someone at some place had better luck with this GPU.
Thanks again and best regards,
Daniel
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Thanks, Dan. please provide the output when running viennacl-info, not the compiler messages. This gives us all the valuable hints about local memory sizes, etc.
The error you observed is in the matrix-matrix multiplication routine, which is particularly sensitive to the underlying hardware. Most of the other kernels presumably run fine. Anyway, we want ViennaCL to work on as much hardware out there as possible, so any help on getting this to work is appreciated.
Thanks and best regards,
Karli
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Thanks for the feedback!
The default GEMM kernel requires less than 16kiB of local memory, and comforms with the OpenCL standards. However, it seems that in your case the kernel doesn't JIT-compile, and according to the log message we cannot require work group sizes of 8 x 8 x 1. It seems like your GPU reports max_work_group_size = 128, though! Perhaps the value reported is incorrect.
Could you go to viennacl/generator/autotuning/profiles.hpp, line 108-111, 119 - 122, and replace :
1,8,32,8,4,4,4,1,0
bye
1,4,32,4,4,4,4,1,0 ?
This will set a different profile for the GEMM kernels.
Thanks!
Last edit: Philippe Tillet 2014-05-15
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
it seems that this old ATI card is giving ViennaCL a hard time: When running the "custom-kernel" example, the program crashes again with similar output as above in "blas3":
Unhandled exception at at 0x000007FEFDC5940D in custom-kernels.exe: Microsoft C++ exception: viennacl::ocl::invalid_work_group_size at memory location 0x00000000001DF180.
okay, the GPU is indeed very low on resources, so the default work group size of 128 is too big for anything but the most trivial kernels. The 'fix' is again to adjust yet another default work group size, which will hopefully be addressed 'automatically' in the course of populating our device database for the next release. If you need a fix earlier, please let us know.
Best regards,
Karli
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Hi,
I am trying to setup the tutorial shipped with V1.5.1. When trying to run the "Blas3" example on the CPU, everything goes fine, but when switching to my GPU (ATI Radeon HD 4350) I get errors at runtime. For switching I employ the commands:
viennacl::ocl::set_context_platform_index(0,1);
viennacl::ocl::switch_context(0);
which successfully makes ViennaCL select the ATI card. The console output looks as follows:
--- Computing matrix-matrix product using ublas ---
- Execution time: 0.15302
--- Computing matrix-matrix product on each available compute device using Vienn
aCL ---
- Device Name: ATI RV710
Build Scalar: Err = -11 Status = -2
Log: Error: Requested compile size is bigger than the required workgroup size of
32¬♣ elements
Error: Creating kernel kernel_0_0 failed!
Sources: #if defined(cl_khr_fp64)
pragma OPENCL EXTENSION cl_khr_fp64: enable
elif defined(cl_amd_fp64)
pragma OPENCL EXTENSION cl_amd_fp64: enable
endif
attribute((reqd_work_group_size(8,8,1)))
kernel void kernel_0_0(
unsigned int M,unsigned int N,unsigned int K,global float arg0,global float
* arg1,global float arg2,float arg3,float arg4
)
{
float res0_0 = (float)(0) ;
float res0_1 = (float)(0) ;
float res0_2 = (float)(0) ;
float res0_3 = (float)(0) ;
float res1_0 = (float)(0) ;
float res1_1 = (float)(0) ;
float res1_2 = (float)(0) ;
float res1_3 = (float)(0) ;
float res2_0 = (float)(0) ;
float res2_1 = (float)(0) ;
float res2_2 = (float)(0) ;
float res2_3 = (float)(0) ;
float res3_0 = (float)(0) ;
float res3_1 = (float)(0) ;
float res3_2 = (float)(0) ;
float res3_3 = (float)(0) ;
local float lhs_buf[1056];
global float global_lhs_ptr = arg1 + (get_group_id(0)32)K;
global float rhs_ptr_0 = arg2 + (get_local_id(1)4 + get_group_id(1)32
+ 0)K;
global float rhs_ptr_1 = arg2 + (get_local_id(1)4 + get_group_id(1)32
+ 1)K;
global float rhs_ptr_2 = arg2 + (get_local_id(1)4 + get_group_id(1)32
+ 2)K;
global float rhs_ptr_3 = arg2 + (get_local_id(1)4 + get_group_id(1)32
+ 3)K;
for(unsigned int bl=0 ; bl<K/32 ; ++bl){
barrier(CLK_LOCAL_MEM_FENCE);
{
float val;
val = (global_lhs_ptr + (get_local_id(1) + 0) + K(get_local_id(0) + 0)
);
lhs_buf[(get_local_id(0) + 0)33 + (get_local_id(1) + 0)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 0) + K(get_local_id(0) + 8)
);
lhs_buf[(get_local_id(0) + 8)33 + (get_local_id(1) + 0)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 0) + K(get_local_id(0) + 16
));
lhs_buf[(get_local_id(0) + 16)33 + (get_local_id(1) + 0)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 0) + K(get_local_id(0) + 24
));
lhs_buf[(get_local_id(0) + 24)33 + (get_local_id(1) + 0)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 8) + K(get_local_id(0) + 0)
);
lhs_buf[(get_local_id(0) + 0)33 + (get_local_id(1) + 8)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 8) + K(get_local_id(0) + 8)
);
lhs_buf[(get_local_id(0) + 8)33 + (get_local_id(1) + 8)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 8) + K(get_local_id(0) + 16
));
lhs_buf[(get_local_id(0) + 16)33 + (get_local_id(1) + 8)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 8) + K(get_local_id(0) + 24
));
lhs_buf[(get_local_id(0) + 24)33 + (get_local_id(1) + 8)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 16) + K(get_local_id(0) + 0
));
lhs_buf[(get_local_id(0) + 0)33 + (get_local_id(1) + 16)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 16) + K(get_local_id(0) + 8
));
lhs_buf[(get_local_id(0) + 8)33 + (get_local_id(1) + 16)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 16) + K(get_local_id(0) + 1
6));
lhs_buf[(get_local_id(0) + 16)33 + (get_local_id(1) + 16)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 16) + K(get_local_id(0) + 2
4));
lhs_buf[(get_local_id(0) + 24)33 + (get_local_id(1) + 16)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 24) + K(get_local_id(0) + 0
));
lhs_buf[(get_local_id(0) + 0)33 + (get_local_id(1) + 24)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 24) + K(get_local_id(0) + 8
));
lhs_buf[(get_local_id(0) + 8)33 + (get_local_id(1) + 24)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 24) + K(get_local_id(0) + 1
6));
lhs_buf[(get_local_id(0) + 16)33 + (get_local_id(1) + 24)1] = val;
val = (global_lhs_ptr + (get_local_id(1) + 24) + K(get_local_id(0) + 2
4));
lhs_buf[(get_local_id(0) + 24)33 + (get_local_id(1) + 24)1] = val;
}
barrier(CLK_LOCAL_MEM_FENCE);
local float lhs_ptr_0 = lhs_buf + (get_local_id(0)4+0)*33;
local float lhs_ptr_1 = lhs_buf + (get_local_id(0)4+1)33;
local float lhs_ptr_2 = lhs_buf + (get_local_id(0)4+2)*33;
local float lhs_ptr_3 = lhs_buf + (get_local_id(0)4+3)33;
for(unsigned int bs=0 ; bs < 8 ; ++bs){
float val_rhs_0_0 = * rhs_ptr_0++;
float val_rhs_0_1 = * rhs_ptr_1++;
float val_rhs_0_2 = * rhs_ptr_2++;
float val_rhs_0_3 = * rhs_ptr_3++;
float val_rhs_1_0 = * rhs_ptr_0++;
float val_rhs_1_1 = * rhs_ptr_1++;
float val_rhs_1_2 = * rhs_ptr_2++;
float val_rhs_1_3 = * rhs_ptr_3++;
float val_rhs_2_0 = * rhs_ptr_0++;
float val_rhs_2_1 = * rhs_ptr_1++;
float val_rhs_2_2 = * rhs_ptr_2++;
float val_rhs_2_3 = * rhs_ptr_3++;
float val_rhs_3_0 = * rhs_ptr_0++;
float val_rhs_3_1 = * rhs_ptr_1++;
float val_rhs_3_2 = * rhs_ptr_2++;
float val_rhs_3_3 = * rhs_ptr_3++;
float val_lhs_0_0 = * lhs_ptr_0++;
float val_lhs_1_0 = * lhs_ptr_1++;
float val_lhs_2_0 = * lhs_ptr_2++;
float val_lhs_3_0 = * lhs_ptr_3++;
float val_lhs_0_1 = * lhs_ptr_0++;
float val_lhs_1_1 = * lhs_ptr_1++;
float val_lhs_2_1 = * lhs_ptr_2++;
float val_lhs_3_1 = * lhs_ptr_3++;
float val_lhs_0_2 = * lhs_ptr_0++;
float val_lhs_1_2 = * lhs_ptr_1++;
float val_lhs_2_2 = * lhs_ptr_2++;
float val_lhs_3_2 = * lhs_ptr_3++;
float val_lhs_0_3 = * lhs_ptr_0++;
float val_lhs_1_3 = * lhs_ptr_1++;
float val_lhs_2_3 = * lhs_ptr_2++;
float val_lhs_3_3 = * lhs_ptr_3++;
res0_0+=val_lhs_0_0val_rhs_0_0;
res1_0+=val_lhs_1_0val_rhs_0_0;
res2_0+=val_lhs_2_0val_rhs_0_0;
res3_0+=val_lhs_3_0val_rhs_0_0;
res0_0+=val_lhs_0_1val_rhs_1_0;
res1_0+=val_lhs_1_1val_rhs_1_0;
res2_0+=val_lhs_2_1val_rhs_1_0;
res3_0+=val_lhs_3_1val_rhs_1_0;
res0_0+=val_lhs_0_2val_rhs_2_0;
res1_0+=val_lhs_1_2val_rhs_2_0;
res2_0+=val_lhs_2_2val_rhs_2_0;
res3_0+=val_lhs_3_2val_rhs_2_0;
res0_0+=val_lhs_0_3val_rhs_3_0;
res1_0+=val_lhs_1_3val_rhs_3_0;
res2_0+=val_lhs_2_3val_rhs_3_0;
res3_0+=val_lhs_3_3val_rhs_3_0;
res0_1+=val_lhs_0_0val_rhs_0_1;
res1_1+=val_lhs_1_0val_rhs_0_1;
res2_1+=val_lhs_2_0val_rhs_0_1;
res3_1+=val_lhs_3_0val_rhs_0_1;
res0_1+=val_lhs_0_1val_rhs_1_1;
res1_1+=val_lhs_1_1val_rhs_1_1;
res2_1+=val_lhs_2_1val_rhs_1_1;
res3_1+=val_lhs_3_1val_rhs_1_1;
res0_1+=val_lhs_0_2val_rhs_2_1;
res1_1+=val_lhs_1_2val_rhs_2_1;
res2_1+=val_lhs_2_2val_rhs_2_1;
res3_1+=val_lhs_3_2val_rhs_2_1;
res0_1+=val_lhs_0_3val_rhs_3_1;
res1_1+=val_lhs_1_3val_rhs_3_1;
res2_1+=val_lhs_2_3val_rhs_3_1;
res3_1+=val_lhs_3_3val_rhs_3_1;
res0_2+=val_lhs_0_0val_rhs_0_2;
res1_2+=val_lhs_1_0val_rhs_0_2;
res2_2+=val_lhs_2_0val_rhs_0_2;
res3_2+=val_lhs_3_0val_rhs_0_2;
res0_2+=val_lhs_0_1val_rhs_1_2;
res1_2+=val_lhs_1_1val_rhs_1_2;
res2_2+=val_lhs_2_1val_rhs_1_2;
res3_2+=val_lhs_3_1val_rhs_1_2;
res0_2+=val_lhs_0_2val_rhs_2_2;
res1_2+=val_lhs_1_2val_rhs_2_2;
res2_2+=val_lhs_2_2val_rhs_2_2;
res3_2+=val_lhs_3_2val_rhs_2_2;
res0_2+=val_lhs_0_3val_rhs_3_2;
res1_2+=val_lhs_1_3val_rhs_3_2;
res2_2+=val_lhs_2_3val_rhs_3_2;
res3_2+=val_lhs_3_3val_rhs_3_2;
res0_3+=val_lhs_0_0val_rhs_0_3;
res1_3+=val_lhs_1_0val_rhs_0_3;
res2_3+=val_lhs_2_0val_rhs_0_3;
res3_3+=val_lhs_3_0val_rhs_0_3;
res0_3+=val_lhs_0_1val_rhs_1_3;
res1_3+=val_lhs_1_1val_rhs_1_3;
res2_3+=val_lhs_2_1val_rhs_1_3;
res3_3+=val_lhs_3_1val_rhs_1_3;
res0_3+=val_lhs_0_2val_rhs_2_3;
res1_3+=val_lhs_1_2val_rhs_2_3;
res2_3+=val_lhs_2_2val_rhs_2_3;
res3_3+=val_lhs_3_2val_rhs_2_3;
res0_3+=val_lhs_0_3val_rhs_3_3;
res1_3+=val_lhs_1_3val_rhs_3_3;
res2_3+=val_lhs_2_3val_rhs_3_3;
res3_3+=val_lhs_3_3val_rhs_3_3;
}
global_lhs_ptr += 32;
}
(arg0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+0)]=(((res0_0)arg3)+(ar
g0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+0)]arg4)));
(arg0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+1)]=(((res0_1)arg3)+(ar
g0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+1)]arg4)));
(arg0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+2)]=(((res0_2)arg3)+(ar
g0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+2)]arg4)));
(arg0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+3)]=(((res0_3)arg3)+(ar
g0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+3)]arg4)));
(arg0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+0)]=(((res1_0)arg3)+(ar
g0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+0)]arg4)));
(arg0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+1)]=(((res1_1)arg3)+(ar
g0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+1)]arg4)));
(arg0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+2)]=(((res1_2)arg3)+(ar
g0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+2)]arg4)));
(arg0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+3)]=(((res1_3)arg3)+(ar
g0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+3)]arg4)));
(arg0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+0)]=(((res2_0)arg3)+(ar
g0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+0)]arg4)));
(arg0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+1)]=(((res2_1)arg3)+(ar
g0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+1)]arg4)));
(arg0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+2)]=(((res2_2)arg3)+(ar
g0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+2)]arg4)));
(arg0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+3)]=(((res2_3)arg3)+(ar
g0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+3)]arg4)));
(arg0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+0)]=(((res3_0)arg3)+(ar
g0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+0)]arg4)));
(arg0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+1)]=(((res3_1)arg3)+(ar
g0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+1)]arg4)));
(arg0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+2)]=(((res3_2)arg3)+(ar
g0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+2)]arg4)));
(arg0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+3)]=(((res3_3)arg3)+(ar
g0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+3)]*arg4)));
}
Can anybody help?
Thanks,
Daniel
Last edit: Mr Dan 2014-05-14
Hi Daniel,
could you please send us the output of examples/tutorial/viennacl-info? Your GPU is fairly old and low-end, so I suspect that it e.g. has insufficient local memory.
Thanks and best regards,
Karli
Hi Karli,
thanks for the quick reply!
The viennacl-info.log contains the following information:
Der Buildvorgang wurde am 15.05.2014 09:50:50 gestartet.
1>Projekt "F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\viennacl-info.vcxproj" auf Knoten "4", Rebuild Ziel(e).
1>CustomBuild:
Building Custom Rule F:/user/DRp/ViennaCL-1.5.1/examples/tutorial/CMakeLists.txt
CMake does not need to re-run because F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\CMakeFiles\generate.stamp is up-to-date.
ClCompile:
C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64\CL.exe /c /IF:\user\DRp\dependencies\boost_1_55_0 /I"F:\user\DRp\dependencies\MTL-all-4.0.9540-Linux\usr\include" /I"F:\user\DRp\ViennaCL-1.5.1" /I"C:\Program Files (x86)\AMD APP\include" /I"F:\user\DRp\dependencies\eigen-eigen-6b38706d90a9" /I"F:\user\DRp\ViennaCL-1.5.1\libviennacl\include" /Zi /nologo /W3 /WX- /Od /Ob0 /D WIN32 /D _WINDOWS /D VIENNACL_WITH_OPENMP /D _DEBUG /D VIENNACL_WITH_OPENCL /D "CMAKE_INTDIR=\"Debug\"" /D _MBCS /Gm- /EHsc /RTC1 /MDd /GS /fp:precise /Zc:wchar_t /Zc:forScope /GR /openmp /Fo"viennacl-info.dir\Debug\" /Fd"viennacl-info.dir\Debug\vc110.pdb" /Gd /TP /wd4996 /errorReport:prompt "........\examples\tutorial\viennacl-info.cpp"
viennacl-info.cpp
Link:
C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64\link.exe /ERRORREPORT:PROMPT /OUT:"F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\Debug\viennacl-info.exe" /INCREMENTAL /NOLOGO /LIBPATH:"F:/user/DRp/dependencies/boost_1_55_0/lib64-msvc-11.0" /LIBPATH:"F:/user/DRp/dependencies/boost_1_55_0/lib64-msvc-11.0/Debug" kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib "C:\Program Files (x86)\AMD APP\lib\x86_64\OpenCL.lib" /MANIFEST /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /manifest:embed /DEBUG /PDB:"F:/user/DRp/ViennaCL-1.5.1/build64/AMD/examples/tutorial/Debug/viennacl-info.pdb" /SUBSYSTEM:CONSOLE /TLBID:1 /DYNAMICBASE /NXCOMPAT /IMPLIB:"F:/user/DRp/ViennaCL-1.5.1/build64/AMD/examples/tutorial/Debug/viennacl-info.lib" /MACHINE:X64 /machine:x64 /debug "viennacl-info.dir\Debug\viennacl-info.obj"
viennacl-info.vcxproj -> F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\Debug\viennacl-info.exe
1>Die Erstellung des Projekts "F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\viennacl-info.vcxproj" ist abgeschlossen, Rebuild Ziel(e).
Build erfolgreich.
Verstrichene Zeit 00:00:01.97
However, I am not sure if this is exactly the information you are looking for. Do you need the complete folder content?
Regarding my card, I am little bit confused. I agree that it is a fairly old one but the manual indicates compatibility (in case of single precision). So I suppose someone at some place had better luck with this GPU.
Thanks again and best regards,
Daniel
Thanks, Dan. please provide the output when running viennacl-info, not the compiler messages. This gives us all the valuable hints about local memory sizes, etc.
The error you observed is in the matrix-matrix multiplication routine, which is particularly sensitive to the underlying hardware. Most of the other kernels presumably run fine. Anyway, we want ViennaCL to work on as much hardware out there as possible, so any help on getting this to work is appreciated.
Thanks and best regards,
Karli
Hi Karl,
sorry for the misinterpretation. Running viennacl-info creates the following output:
=========================================
Platform Information
=========================================
Vendor and version: Intel(R) Corporation: OpenCL 1.2
ViennaCL uses this OpenCL platform by default.
Available Devices:
Address Bits: 64
Available: 1
Compiler Available: 1
Endian Little: 1
Error Correction Support: 0
Execution Capabilities: CL_EXEC_KERNEL CL_EXEC_NATIVE_KERNEL
Extensions: cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_a
tomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_k
hr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl
_ext_device_fission cl_intel_exec_by_local_thread cl_khr_gl_sharing cl_intel_dx9
_media_sharing cl_khr_dx9_media_sharing cl_khr_d3d11_sharing
Global Mem Cache Size: 262144 Bytes
Global Mem Cache Type: CL_READ_WRITE_CACHE
Global Mem Cacheline Size: 64 Bytes
Global Mem Size: 8585207808 Bytes
Host Unified Memory: 1
Image Support: 1
Image2D Max Height: 16384
Image2D Max Width: 16384
Image3D Max Depth: 2048
Image3D Max Height: 2048
Image3D Max Width: 2048
Local Mem Size: 32768 Bytes
Local Mem Type: CL_GLOBAL
Max Clock Frequency: 3200 MHz
Max Compute Units: 4
Max Constant Args: 480
Max Constant Buffer Size: 131072 Bytes
Max Mem Alloc Size: 2146301952 Bytes
Max Parameter Size: 3840 Bytes
Max Read Image Args: 480
Max Samplers: 480
Max Work Group Size: 1024
Max Work Item Dimensions: 3
Max Work Item Sizes: 1024 1024 1024
Max Write Image Args: 480
Mem Base Addr Align: 1024
Min Data Type Align Size: 128 Bytes
Name: Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz
Native Vector Width char: 16
Native Vector Width short: 8
Native Vector Width int: 4
Native Vector Width long: 2
Native Vector Width float: 4
Native Vector Width double: 2
Native Vector Width half: 0
OpenCL C Version: OpenCL C 1.2
Platform: 00000000006E1800
Preferred Vector Width char: 1
Preferred Vector Width short: 1
Preferred Vector Width int: 1
Preferred Vector Width long: 1
Preferred Vector Width float: 1
Preferred Vector Width double: 1
Preferred Vector Width half: 0
Profile: FULL_PROFILE
Profiling Timer Resolution: 320 ns
Queue Properties: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE CL_QUEUE_P
ROFILING_ENABLE
Single FP Config: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
Type: CPU
Vendor: Intel(R) Corporation
Vendor ID: 32902
Version: OpenCL 1.2 (Build 80752)
Driver Version: 3.0.1.15216
=========================================
Platform Information
=========================================
Vendor and version: Advanced Micro Devices, Inc.: OpenCL 1.2 AMD-APP (937.2)
Available Devices:
Address Bits: 32
Available: 1
Compiler Available: 1
Endian Little: 1
Error Correction Support: 0
Execution Capabilities: CL_EXEC_KERNEL
Extensions: cl_khr_gl_sharing cl_amd_device_attribute_query c
l_khr_d3d10_sharing
Global Mem Cache Size: 0 Bytes
Global Mem Cache Type: CL_NONE
Global Mem Cacheline Size: 0 Bytes
Global Mem Size: 536870912 Bytes
Host Unified Memory: 0
Image Support: 0
Image2D Max Height: 0
Image2D Max Width: 0
Image3D Max Depth: 0
Image3D Max Height: 0
Image3D Max Width: 0
Local Mem Size: 16384 Bytes
Local Mem Type: CL_GLOBAL
Max Clock Frequency: 600 MHz
Max Compute Units: 2
Max Constant Args: 8
Max Constant Buffer Size: 65536 Bytes
Max Mem Alloc Size: 134217728 Bytes
Max Parameter Size: 1024 Bytes
Max Read Image Args: 0
Max Samplers: 0
Max Work Group Size: 128
Max Work Item Dimensions: 3
Max Work Item Sizes: 128 128 128
Max Write Image Args: 0
Mem Base Addr Align: 2048
Min Data Type Align Size: 128 Bytes
Name: ATI RV710
Native Vector Width char: 16
Native Vector Width short: 8
Native Vector Width int: 4
Native Vector Width long: 2
Native Vector Width float: 4
Native Vector Width double: 0
Native Vector Width half: 0
OpenCL C Version: OpenCL C 1.0
Platform: 000007FEEEF92A08
Preferred Vector Width char: 16
Preferred Vector Width short: 8
Preferred Vector Width int: 4
Preferred Vector Width long: 2
Preferred Vector Width float: 4
Preferred Vector Width double: 0
Preferred Vector Width half: 0
Profile: FULL_PROFILE
Profiling Timer Resolution: 1 ns
Queue Properties: CL_QUEUE_PROFILING_ENABLE
Single FP Config: CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_
TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
Type: GPU
Vendor: Advanced Micro Devices, Inc.
Vendor ID: 4098
Version: OpenCL 1.0 AMD-APP (937.2)
Driver Version: CAL 1.4.1734
Address Bits: 64
Available: 1
Compiler Available: 1
Endian Little: 1
Error Correction Support: 0
Execution Capabilities: CL_EXEC_KERNEL CL_EXEC_NATIVE_KERNEL
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_
atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_
khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended
_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission c
l_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_p
opcnt cl_khr_d3d10_sharing
Global Mem Cache Size: 32768 Bytes
Global Mem Cache Type: CL_READ_WRITE_CACHE
Global Mem Cacheline Size: 64 Bytes
Global Mem Size: 8585207808 Bytes
Host Unified Memory: 1
Image Support: 1
Image2D Max Height: 8192
Image2D Max Width: 8192
Image3D Max Depth: 2048
Image3D Max Height: 2048
Image3D Max Width: 2048
Local Mem Size: 32768 Bytes
Local Mem Type: CL_GLOBAL
Max Clock Frequency: 3197 MHz
Max Compute Units: 4
Max Constant Args: 8
Max Constant Buffer Size: 65536 Bytes
Max Mem Alloc Size: 2147483648 Bytes
Max Parameter Size: 4096 Bytes
Max Read Image Args: 128
Max Samplers: 16
Max Work Group Size: 1024
Max Work Item Dimensions: 3
Max Work Item Sizes: 1024 1024 1024
Max Write Image Args: 8
Mem Base Addr Align: 1024
Min Data Type Align Size: 128 Bytes
Name: Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz
Native Vector Width char: 16
Native Vector Width short: 8
Native Vector Width int: 4
Native Vector Width long: 2
Native Vector Width float: 4
Native Vector Width double: 0
Native Vector Width half: 0
OpenCL C Version: OpenCL C 1.2
Platform: 000007FEEEF92A08
Preferred Vector Width char: 16
Preferred Vector Width short: 8
Preferred Vector Width int: 4
Preferred Vector Width long: 2
Preferred Vector Width float: 4
Preferred Vector Width double: 0
Preferred Vector Width half: 0
Profile: FULL_PROFILE
Profiling Timer Resolution: 320 ns
Queue Properties: CL_QUEUE_PROFILING_ENABLE
Single FP Config: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
Type: CPU
Vendor: GenuineIntel
Vendor ID: 4098
Version: OpenCL 1.2 AMD-APP (937.2)
Driver Version: 2.0 (sse2)
Please provide some feedback about this issue, in order to properly select a suitable card in the future.
Thanks,
Daniel
Last edit: Mr Dan 2014-05-15
Hi Daniel,
Thanks for the feedback!
The default GEMM kernel requires less than 16kiB of local memory, and comforms with the OpenCL standards. However, it seems that in your case the kernel doesn't JIT-compile, and according to the log message we cannot require work group sizes of 8 x 8 x 1. It seems like your GPU reports max_work_group_size = 128, though! Perhaps the value reported is incorrect.
Could you go to viennacl/generator/autotuning/profiles.hpp, line 108-111, 119 - 122, and replace :
1,8,32,8,4,4,4,1,0
bye
1,4,32,4,4,4,4,1,0 ?
This will set a different profile for the GEMM kernels.
Thanks!
Last edit: Philippe Tillet 2014-05-15
Hi Philippe,
thanks for this hack, it does the trick!
So, "Blas3" is running on the GPU. I will now examine further examples, hopefully with success!
Thanks again and best Regards,
Daniel
Thanks, Daniel! So apparently this is indeed a resource issue. Philippe, let's discuss this on our IRC meeting on Saturday.
If the SDKs don't report the right value for the maximum work-group size, we'll have a tough time :P
Hi Karli/ Philippe,
it seems that this old ATI card is giving ViennaCL a hard time: When running the "custom-kernel" example, the program crashes again with similar output as above in "blas3":
Unhandled exception at at 0x000007FEFDC5940D in custom-kernels.exe: Microsoft C++ exception: viennacl::ocl::invalid_work_group_size at memory location 0x00000000001DF180.
vec1 . vec2: 10
vec1 / vec2: 10
ViennaCL: FATAL ERROR: Kernel start failed for 'norm'.
ViennaCL: Smaller work sizes could not solve the problem.
If you are interested, I can offer further investigation of the issue (with assistance). As for me I'll rather stick with a more recent graphics card.
Thanks and Best Regards,
Daniel
Last edit: Mr Dan 2014-05-20
Hi Daniel,
okay, the GPU is indeed very low on resources, so the default work group size of 128 is too big for anything but the most trivial kernels. The 'fix' is again to adjust yet another default work group size, which will hopefully be addressed 'automatically' in the course of populating our device database for the next release. If you need a fix earlier, please let us know.
Best regards,
Karli