Menu

Cannot run tutorial examples on AMD HD4000 series

Mr Dan
2014-05-14
2014-05-20
  • Mr Dan

    Mr Dan - 2014-05-14

    Hi,

    I am trying to setup the tutorial shipped with V1.5.1. When trying to run the "Blas3" example on the CPU, everything goes fine, but when switching to my GPU (ATI Radeon HD 4350) I get errors at runtime. For switching I employ the commands:

    viennacl::ocl::set_context_platform_index(0,1);
    viennacl::ocl::switch_context(0);

    which successfully makes ViennaCL select the ATI card. The console output looks as follows:

    --- Computing matrix-matrix product using ublas ---
    - Execution time: 0.15302

    --- Computing matrix-matrix product on each available compute device using Vienn
    aCL ---
    - Device Name: ATI RV710
    Build Scalar: Err = -11 Status = -2
    Log: Error: Requested compile size is bigger than the required workgroup size of
    32¬♣ elements
    Error: Creating kernel kernel_0_0 failed!

    Sources: #if defined(cl_khr_fp64)

    pragma OPENCL EXTENSION cl_khr_fp64: enable

    elif defined(cl_amd_fp64)

    pragma OPENCL EXTENSION cl_amd_fp64: enable

    endif

    attribute((reqd_work_group_size(8,8,1)))
    kernel void kernel_0_0(
    unsigned int M,unsigned int N,unsigned int K,
    global float arg0,global float
    * arg1,
    global float
    arg2,float arg3,float arg4
    )
    {
    float res0_0 = (float)(0) ;
    float res0_1 = (float)(0) ;
    float res0_2 = (float)(0) ;
    float res0_3 = (float)(0) ;
    float res1_0 = (float)(0) ;
    float res1_1 = (float)(0) ;
    float res1_2 = (float)(0) ;
    float res1_3 = (float)(0) ;
    float res2_0 = (float)(0) ;
    float res2_1 = (float)(0) ;
    float res2_2 = (float)(0) ;
    float res2_3 = (float)(0) ;
    float res3_0 = (float)(0) ;
    float res3_1 = (float)(0) ;
    float res3_2 = (float)(0) ;
    float res3_3 = (float)(0) ;
    local float lhs_buf[1056];
    global float global_lhs_ptr = arg1 + (get_group_id(0)32)K;
    global float rhs_ptr_0 = arg2 + (get_local_id(1)4 + get_group_id(1)32
    + 0)
    K;
    global float
    rhs_ptr_1 = arg2 + (get_local_id(1)4 + get_group_id(1)32
    + 1)K;
    global float rhs_ptr_2 = arg2 + (get_local_id(1)4 + get_group_id(1)32
    + 2)
    K;
    global float
    rhs_ptr_3 = arg2 + (get_local_id(1)4 + get_group_id(1)32
    + 3)K;
    for(unsigned int bl=0 ; bl<K/32 ; ++bl){
    barrier(CLK_LOCAL_MEM_FENCE);
    {
    float val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 0) + K(get_local_id(0) + 0)
    );
    lhs_buf[(get_local_id(0) + 0)33 + (get_local_id(1) + 0)1] = val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 0) + K(get_local_id(0) + 8)
    );
    lhs_buf[(get_local_id(0) + 8)33 + (get_local_id(1) + 0)1] = val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 0) + K(get_local_id(0) + 16
    ));
    lhs_buf[(get_local_id(0) + 16)33 + (get_local_id(1) + 0)1] = val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 0) + K(get_local_id(0) + 24
    ));
    lhs_buf[(get_local_id(0) + 24)33 + (get_local_id(1) + 0)1] = val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 8) + K(get_local_id(0) + 0)
    );
    lhs_buf[(get_local_id(0) + 0)33 + (get_local_id(1) + 8)1] = val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 8) + K(get_local_id(0) + 8)
    );
    lhs_buf[(get_local_id(0) + 8)33 + (get_local_id(1) + 8)1] = val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 8) + K(get_local_id(0) + 16
    ));
    lhs_buf[(get_local_id(0) + 16)33 + (get_local_id(1) + 8)1] = val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 8) + K(get_local_id(0) + 24
    ));
    lhs_buf[(get_local_id(0) + 24)33 + (get_local_id(1) + 8)1] = val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 16) + K(get_local_id(0) + 0
    ));
    lhs_buf[(get_local_id(0) + 0)33 + (get_local_id(1) + 16)1] = val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 16) + K(get_local_id(0) + 8
    ));
    lhs_buf[(get_local_id(0) + 8)33 + (get_local_id(1) + 16)1] = val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 16) + K(get_local_id(0) + 1
    6));
    lhs_buf[(get_local_id(0) + 16)33 + (get_local_id(1) + 16)1] = val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 16) + K(get_local_id(0) + 2
    4));
    lhs_buf[(get_local_id(0) + 24)33 + (get_local_id(1) + 16)1] = val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 24) + K(get_local_id(0) + 0
    ));
    lhs_buf[(get_local_id(0) + 0)33 + (get_local_id(1) + 24)1] = val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 24) + K(get_local_id(0) + 8
    ));
    lhs_buf[(get_local_id(0) + 8)33 + (get_local_id(1) + 24)1] = val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 24) + K(get_local_id(0) + 1
    6));
    lhs_buf[(get_local_id(0) + 16)33 + (get_local_id(1) + 24)1] = val;
    val =
    (global_lhs_ptr + (get_local_id(1) + 24) + K(get_local_id(0) + 2
    4));
    lhs_buf[(get_local_id(0) + 24)33 + (get_local_id(1) + 24)1] = val;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    local float lhs_ptr_0 = lhs_buf + (get_local_id(0)4+0)*33;
    local float
    lhs_ptr_1 = lhs_buf + (get_local_id(0)4+1)33;
    local float lhs_ptr_2 = lhs_buf + (get_local_id(0)4+2)*33;
    local float lhs_ptr_3 = lhs_buf + (get_local_id(0)4+3)33;
    for(unsigned int bs=0 ; bs < 8 ; ++bs){
    float val_rhs_0_0 = * rhs_ptr_0++;
    float val_rhs_0_1 = * rhs_ptr_1++;
    float val_rhs_0_2 = * rhs_ptr_2++;
    float val_rhs_0_3 = * rhs_ptr_3++;
    float val_rhs_1_0 = * rhs_ptr_0++;
    float val_rhs_1_1 = * rhs_ptr_1++;
    float val_rhs_1_2 = * rhs_ptr_2++;
    float val_rhs_1_3 = * rhs_ptr_3++;
    float val_rhs_2_0 = * rhs_ptr_0++;
    float val_rhs_2_1 = * rhs_ptr_1++;
    float val_rhs_2_2 = * rhs_ptr_2++;
    float val_rhs_2_3 = * rhs_ptr_3++;
    float val_rhs_3_0 = * rhs_ptr_0++;
    float val_rhs_3_1 = * rhs_ptr_1++;
    float val_rhs_3_2 = * rhs_ptr_2++;
    float val_rhs_3_3 = * rhs_ptr_3++;
    float val_lhs_0_0 = * lhs_ptr_0++;
    float val_lhs_1_0 = * lhs_ptr_1++;
    float val_lhs_2_0 = * lhs_ptr_2++;
    float val_lhs_3_0 = * lhs_ptr_3++;
    float val_lhs_0_1 = * lhs_ptr_0++;
    float val_lhs_1_1 = * lhs_ptr_1++;
    float val_lhs_2_1 = * lhs_ptr_2++;
    float val_lhs_3_1 = * lhs_ptr_3++;
    float val_lhs_0_2 = * lhs_ptr_0++;
    float val_lhs_1_2 = * lhs_ptr_1++;
    float val_lhs_2_2 = * lhs_ptr_2++;
    float val_lhs_3_2 = * lhs_ptr_3++;
    float val_lhs_0_3 = * lhs_ptr_0++;
    float val_lhs_1_3 = * lhs_ptr_1++;
    float val_lhs_2_3 = * lhs_ptr_2++;
    float val_lhs_3_3 = * lhs_ptr_3++;
    res0_0+=val_lhs_0_0
    val_rhs_0_0;
    res1_0+=val_lhs_1_0val_rhs_0_0;
    res2_0+=val_lhs_2_0
    val_rhs_0_0;
    res3_0+=val_lhs_3_0val_rhs_0_0;
    res0_0+=val_lhs_0_1
    val_rhs_1_0;
    res1_0+=val_lhs_1_1val_rhs_1_0;
    res2_0+=val_lhs_2_1
    val_rhs_1_0;
    res3_0+=val_lhs_3_1val_rhs_1_0;
    res0_0+=val_lhs_0_2
    val_rhs_2_0;
    res1_0+=val_lhs_1_2val_rhs_2_0;
    res2_0+=val_lhs_2_2
    val_rhs_2_0;
    res3_0+=val_lhs_3_2val_rhs_2_0;
    res0_0+=val_lhs_0_3
    val_rhs_3_0;
    res1_0+=val_lhs_1_3val_rhs_3_0;
    res2_0+=val_lhs_2_3
    val_rhs_3_0;
    res3_0+=val_lhs_3_3val_rhs_3_0;
    res0_1+=val_lhs_0_0
    val_rhs_0_1;
    res1_1+=val_lhs_1_0val_rhs_0_1;
    res2_1+=val_lhs_2_0
    val_rhs_0_1;
    res3_1+=val_lhs_3_0val_rhs_0_1;
    res0_1+=val_lhs_0_1
    val_rhs_1_1;
    res1_1+=val_lhs_1_1val_rhs_1_1;
    res2_1+=val_lhs_2_1
    val_rhs_1_1;
    res3_1+=val_lhs_3_1val_rhs_1_1;
    res0_1+=val_lhs_0_2
    val_rhs_2_1;
    res1_1+=val_lhs_1_2val_rhs_2_1;
    res2_1+=val_lhs_2_2
    val_rhs_2_1;
    res3_1+=val_lhs_3_2val_rhs_2_1;
    res0_1+=val_lhs_0_3
    val_rhs_3_1;
    res1_1+=val_lhs_1_3val_rhs_3_1;
    res2_1+=val_lhs_2_3
    val_rhs_3_1;
    res3_1+=val_lhs_3_3val_rhs_3_1;
    res0_2+=val_lhs_0_0
    val_rhs_0_2;
    res1_2+=val_lhs_1_0val_rhs_0_2;
    res2_2+=val_lhs_2_0
    val_rhs_0_2;
    res3_2+=val_lhs_3_0val_rhs_0_2;
    res0_2+=val_lhs_0_1
    val_rhs_1_2;
    res1_2+=val_lhs_1_1val_rhs_1_2;
    res2_2+=val_lhs_2_1
    val_rhs_1_2;
    res3_2+=val_lhs_3_1val_rhs_1_2;
    res0_2+=val_lhs_0_2
    val_rhs_2_2;
    res1_2+=val_lhs_1_2val_rhs_2_2;
    res2_2+=val_lhs_2_2
    val_rhs_2_2;
    res3_2+=val_lhs_3_2val_rhs_2_2;
    res0_2+=val_lhs_0_3
    val_rhs_3_2;
    res1_2+=val_lhs_1_3val_rhs_3_2;
    res2_2+=val_lhs_2_3
    val_rhs_3_2;
    res3_2+=val_lhs_3_3val_rhs_3_2;
    res0_3+=val_lhs_0_0
    val_rhs_0_3;
    res1_3+=val_lhs_1_0val_rhs_0_3;
    res2_3+=val_lhs_2_0
    val_rhs_0_3;
    res3_3+=val_lhs_3_0val_rhs_0_3;
    res0_3+=val_lhs_0_1
    val_rhs_1_3;
    res1_3+=val_lhs_1_1val_rhs_1_3;
    res2_3+=val_lhs_2_1
    val_rhs_1_3;
    res3_3+=val_lhs_3_1val_rhs_1_3;
    res0_3+=val_lhs_0_2
    val_rhs_2_3;
    res1_3+=val_lhs_1_2val_rhs_2_3;
    res2_3+=val_lhs_2_2
    val_rhs_2_3;
    res3_3+=val_lhs_3_2val_rhs_2_3;
    res0_3+=val_lhs_0_3
    val_rhs_3_3;
    res1_3+=val_lhs_1_3val_rhs_3_3;
    res2_3+=val_lhs_2_3
    val_rhs_3_3;
    res3_3+=val_lhs_3_3val_rhs_3_3;
    }
    global_lhs_ptr += 32;
    }
    (arg0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+0)]=(((res0_0)
    arg3)+(ar
    g0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+0)]arg4)));
    (arg0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+1)]=(((res0_1)
    arg3)+(ar
    g0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+1)]arg4)));
    (arg0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+2)]=(((res0_2)
    arg3)+(ar
    g0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+2)]arg4)));
    (arg0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+3)]=(((res0_3)
    arg3)+(ar
    g0[(get_global_id(0)4+0)N+ (get_global_id(1)*4+3)]arg4)));
    (arg0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+0)]=(((res1_0)
    arg3)+(ar
    g0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+0)]arg4)));
    (arg0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+1)]=(((res1_1)
    arg3)+(ar
    g0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+1)]arg4)));
    (arg0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+2)]=(((res1_2)
    arg3)+(ar
    g0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+2)]arg4)));
    (arg0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+3)]=(((res1_3)
    arg3)+(ar
    g0[(get_global_id(0)4+1)N+ (get_global_id(1)*4+3)]arg4)));
    (arg0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+0)]=(((res2_0)
    arg3)+(ar
    g0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+0)]arg4)));
    (arg0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+1)]=(((res2_1)
    arg3)+(ar
    g0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+1)]arg4)));
    (arg0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+2)]=(((res2_2)
    arg3)+(ar
    g0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+2)]arg4)));
    (arg0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+3)]=(((res2_3)
    arg3)+(ar
    g0[(get_global_id(0)4+2)N+ (get_global_id(1)*4+3)]arg4)));
    (arg0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+0)]=(((res3_0)
    arg3)+(ar
    g0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+0)]arg4)));
    (arg0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+1)]=(((res3_1)
    arg3)+(ar
    g0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+1)]arg4)));
    (arg0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+2)]=(((res3_2)
    arg3)+(ar
    g0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+2)]arg4)));
    (arg0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+3)]=(((res3_3)
    arg3)+(ar
    g0[(get_global_id(0)4+3)N+ (get_global_id(1)*4+3)]*arg4)));
    }

    Can anybody help?

    Thanks,

    Daniel

     

    Last edit: Mr Dan 2014-05-14
  • Karl Rupp

    Karl Rupp - 2014-05-14

    Hi Daniel,

    could you please send us the output of examples/tutorial/viennacl-info? Your GPU is fairly old and low-end, so I suspect that it e.g. has insufficient local memory.

    Thanks and best regards,
    Karli

     
    • Mr Dan

      Mr Dan - 2014-05-15

      Hi Karli,

      thanks for the quick reply!

      The viennacl-info.log contains the following information:


      Der Buildvorgang wurde am 15.05.2014 09:50:50 gestartet.
      1>Projekt "F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\viennacl-info.vcxproj" auf Knoten "4", Rebuild Ziel(e).
      1>CustomBuild:
      Building Custom Rule F:/user/DRp/ViennaCL-1.5.1/examples/tutorial/CMakeLists.txt
      CMake does not need to re-run because F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\CMakeFiles\generate.stamp is up-to-date.
      ClCompile:
      C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64\CL.exe /c /IF:\user\DRp\dependencies\boost_1_55_0 /I"F:\user\DRp\dependencies\MTL-all-4.0.9540-Linux\usr\include" /I"F:\user\DRp\ViennaCL-1.5.1" /I"C:\Program Files (x86)\AMD APP\include" /I"F:\user\DRp\dependencies\eigen-eigen-6b38706d90a9" /I"F:\user\DRp\ViennaCL-1.5.1\libviennacl\include" /Zi /nologo /W3 /WX- /Od /Ob0 /D WIN32 /D _WINDOWS /D VIENNACL_WITH_OPENMP /D _DEBUG /D VIENNACL_WITH_OPENCL /D "CMAKE_INTDIR=\"Debug\"" /D _MBCS /Gm- /EHsc /RTC1 /MDd /GS /fp:precise /Zc:wchar_t /Zc:forScope /GR /openmp /Fo"viennacl-info.dir\Debug\" /Fd"viennacl-info.dir\Debug\vc110.pdb" /Gd /TP /wd4996 /errorReport:prompt "........\examples\tutorial\viennacl-info.cpp"
      viennacl-info.cpp
      Link:
      C:\Program Files (x86)\Microsoft Visual Studio 11.0\VC\bin\x86_amd64\link.exe /ERRORREPORT:PROMPT /OUT:"F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\Debug\viennacl-info.exe" /INCREMENTAL /NOLOGO /LIBPATH:"F:/user/DRp/dependencies/boost_1_55_0/lib64-msvc-11.0" /LIBPATH:"F:/user/DRp/dependencies/boost_1_55_0/lib64-msvc-11.0/Debug" kernel32.lib user32.lib gdi32.lib winspool.lib shell32.lib ole32.lib oleaut32.lib uuid.lib comdlg32.lib advapi32.lib "C:\Program Files (x86)\AMD APP\lib\x86_64\OpenCL.lib" /MANIFEST /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /manifest:embed /DEBUG /PDB:"F:/user/DRp/ViennaCL-1.5.1/build64/AMD/examples/tutorial/Debug/viennacl-info.pdb" /SUBSYSTEM:CONSOLE /TLBID:1 /DYNAMICBASE /NXCOMPAT /IMPLIB:"F:/user/DRp/ViennaCL-1.5.1/build64/AMD/examples/tutorial/Debug/viennacl-info.lib" /MACHINE:X64 /machine:x64 /debug "viennacl-info.dir\Debug\viennacl-info.obj"
      viennacl-info.vcxproj -> F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\Debug\viennacl-info.exe
      1>Die Erstellung des Projekts "F:\user\DRp\ViennaCL-1.5.1\build64\AMD\examples\tutorial\viennacl-info.vcxproj" ist abgeschlossen, Rebuild Ziel(e).

      Build erfolgreich.

      Verstrichene Zeit 00:00:01.97


      However, I am not sure if this is exactly the information you are looking for. Do you need the complete folder content?

      Regarding my card, I am little bit confused. I agree that it is a fairly old one but the manual indicates compatibility (in case of single precision). So I suppose someone at some place had better luck with this GPU.

      Thanks again and best regards,

      Daniel

       
  • Karl Rupp

    Karl Rupp - 2014-05-15

    Thanks, Dan. please provide the output when running viennacl-info, not the compiler messages. This gives us all the valuable hints about local memory sizes, etc.

    The error you observed is in the matrix-matrix multiplication routine, which is particularly sensitive to the underlying hardware. Most of the other kernels presumably run fine. Anyway, we want ViennaCL to work on as much hardware out there as possible, so any help on getting this to work is appreciated.

    Thanks and best regards,
    Karli

     
    • Mr Dan

      Mr Dan - 2014-05-15

      Hi Karl,

      sorry for the misinterpretation. Running viennacl-info creates the following output:

      =========================================

      Platform Information

      =========================================

      Vendor and version: Intel(R) Corporation: OpenCL 1.2

      ViennaCL uses this OpenCL platform by default.

      Available Devices:


      Address Bits: 64
      Available: 1
      Compiler Available: 1
      Endian Little: 1
      Error Correction Support: 0
      Execution Capabilities: CL_EXEC_KERNEL CL_EXEC_NATIVE_KERNEL
      Extensions: cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_a
      tomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_k
      hr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl
      _ext_device_fission cl_intel_exec_by_local_thread cl_khr_gl_sharing cl_intel_dx9
      _media_sharing cl_khr_dx9_media_sharing cl_khr_d3d11_sharing
      Global Mem Cache Size: 262144 Bytes
      Global Mem Cache Type: CL_READ_WRITE_CACHE
      Global Mem Cacheline Size: 64 Bytes
      Global Mem Size: 8585207808 Bytes
      Host Unified Memory: 1
      Image Support: 1
      Image2D Max Height: 16384
      Image2D Max Width: 16384
      Image3D Max Depth: 2048
      Image3D Max Height: 2048
      Image3D Max Width: 2048
      Local Mem Size: 32768 Bytes
      Local Mem Type: CL_GLOBAL
      Max Clock Frequency: 3200 MHz
      Max Compute Units: 4
      Max Constant Args: 480
      Max Constant Buffer Size: 131072 Bytes
      Max Mem Alloc Size: 2146301952 Bytes
      Max Parameter Size: 3840 Bytes
      Max Read Image Args: 480
      Max Samplers: 480
      Max Work Group Size: 1024
      Max Work Item Dimensions: 3
      Max Work Item Sizes: 1024 1024 1024
      Max Write Image Args: 480
      Mem Base Addr Align: 1024
      Min Data Type Align Size: 128 Bytes
      Name: Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz
      Native Vector Width char: 16
      Native Vector Width short: 8
      Native Vector Width int: 4
      Native Vector Width long: 2
      Native Vector Width float: 4
      Native Vector Width double: 2
      Native Vector Width half: 0
      OpenCL C Version: OpenCL C 1.2
      Platform: 00000000006E1800
      Preferred Vector Width char: 1
      Preferred Vector Width short: 1
      Preferred Vector Width int: 1
      Preferred Vector Width long: 1
      Preferred Vector Width float: 1
      Preferred Vector Width double: 1
      Preferred Vector Width half: 0
      Profile: FULL_PROFILE
      Profiling Timer Resolution: 320 ns
      Queue Properties: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE CL_QUEUE_P
      ROFILING_ENABLE
      Single FP Config: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST

      Type: CPU
      Vendor: Intel(R) Corporation
      Vendor ID: 32902
      Version: OpenCL 1.2 (Build 80752)
      Driver Version: 3.0.1.15216


      =========================================

      Platform Information

      =========================================

      Vendor and version: Advanced Micro Devices, Inc.: OpenCL 1.2 AMD-APP (937.2)

      Available Devices:


      Address Bits: 32
      Available: 1
      Compiler Available: 1
      Endian Little: 1
      Error Correction Support: 0
      Execution Capabilities: CL_EXEC_KERNEL
      Extensions: cl_khr_gl_sharing cl_amd_device_attribute_query c
      l_khr_d3d10_sharing
      Global Mem Cache Size: 0 Bytes
      Global Mem Cache Type: CL_NONE
      Global Mem Cacheline Size: 0 Bytes
      Global Mem Size: 536870912 Bytes
      Host Unified Memory: 0
      Image Support: 0
      Image2D Max Height: 0
      Image2D Max Width: 0
      Image3D Max Depth: 0
      Image3D Max Height: 0
      Image3D Max Width: 0
      Local Mem Size: 16384 Bytes
      Local Mem Type: CL_GLOBAL
      Max Clock Frequency: 600 MHz
      Max Compute Units: 2
      Max Constant Args: 8
      Max Constant Buffer Size: 65536 Bytes
      Max Mem Alloc Size: 134217728 Bytes
      Max Parameter Size: 1024 Bytes
      Max Read Image Args: 0
      Max Samplers: 0
      Max Work Group Size: 128
      Max Work Item Dimensions: 3
      Max Work Item Sizes: 128 128 128
      Max Write Image Args: 0
      Mem Base Addr Align: 2048
      Min Data Type Align Size: 128 Bytes
      Name: ATI RV710
      Native Vector Width char: 16
      Native Vector Width short: 8
      Native Vector Width int: 4
      Native Vector Width long: 2
      Native Vector Width float: 4
      Native Vector Width double: 0
      Native Vector Width half: 0
      OpenCL C Version: OpenCL C 1.0
      Platform: 000007FEEEF92A08
      Preferred Vector Width char: 16
      Preferred Vector Width short: 8
      Preferred Vector Width int: 4
      Preferred Vector Width long: 2
      Preferred Vector Width float: 4
      Preferred Vector Width double: 0
      Preferred Vector Width half: 0
      Profile: FULL_PROFILE
      Profiling Timer Resolution: 1 ns
      Queue Properties: CL_QUEUE_PROFILING_ENABLE
      Single FP Config: CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_
      TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
      Type: GPU
      Vendor: Advanced Micro Devices, Inc.
      Vendor ID: 4098
      Version: OpenCL 1.0 AMD-APP (937.2)
      Driver Version: CAL 1.4.1734



      Address Bits: 64
      Available: 1
      Compiler Available: 1
      Endian Little: 1
      Error Correction Support: 0
      Execution Capabilities: CL_EXEC_KERNEL CL_EXEC_NATIVE_KERNEL
      Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_
      atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_
      khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended
      _atomics cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission c
      l_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_p
      opcnt cl_khr_d3d10_sharing
      Global Mem Cache Size: 32768 Bytes
      Global Mem Cache Type: CL_READ_WRITE_CACHE
      Global Mem Cacheline Size: 64 Bytes
      Global Mem Size: 8585207808 Bytes
      Host Unified Memory: 1
      Image Support: 1
      Image2D Max Height: 8192
      Image2D Max Width: 8192
      Image3D Max Depth: 2048
      Image3D Max Height: 2048
      Image3D Max Width: 2048
      Local Mem Size: 32768 Bytes
      Local Mem Type: CL_GLOBAL
      Max Clock Frequency: 3197 MHz
      Max Compute Units: 4
      Max Constant Args: 8
      Max Constant Buffer Size: 65536 Bytes
      Max Mem Alloc Size: 2147483648 Bytes
      Max Parameter Size: 4096 Bytes
      Max Read Image Args: 128
      Max Samplers: 16
      Max Work Group Size: 1024
      Max Work Item Dimensions: 3
      Max Work Item Sizes: 1024 1024 1024
      Max Write Image Args: 8
      Mem Base Addr Align: 1024
      Min Data Type Align Size: 128 Bytes
      Name: Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz
      Native Vector Width char: 16
      Native Vector Width short: 8
      Native Vector Width int: 4
      Native Vector Width long: 2
      Native Vector Width float: 4
      Native Vector Width double: 0
      Native Vector Width half: 0
      OpenCL C Version: OpenCL C 1.2
      Platform: 000007FEEEF92A08
      Preferred Vector Width char: 16
      Preferred Vector Width short: 8
      Preferred Vector Width int: 4
      Preferred Vector Width long: 2
      Preferred Vector Width float: 4
      Preferred Vector Width double: 0
      Preferred Vector Width half: 0
      Profile: FULL_PROFILE
      Profiling Timer Resolution: 320 ns
      Queue Properties: CL_QUEUE_PROFILING_ENABLE
      Single FP Config: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST
      CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
      Type: CPU
      Vendor: GenuineIntel
      Vendor ID: 4098
      Version: OpenCL 1.2 AMD-APP (937.2)
      Driver Version: 2.0 (sse2)


      Please provide some feedback about this issue, in order to properly select a suitable card in the future.

      Thanks,

      Daniel

       

      Last edit: Mr Dan 2014-05-15
  • Philippe Tillet

    Philippe Tillet - 2014-05-15

    Hi Daniel,

    Thanks for the feedback!
    The default GEMM kernel requires less than 16kiB of local memory, and comforms with the OpenCL standards. However, it seems that in your case the kernel doesn't JIT-compile, and according to the log message we cannot require work group sizes of 8 x 8 x 1. It seems like your GPU reports max_work_group_size = 128, though! Perhaps the value reported is incorrect.
    Could you go to viennacl/generator/autotuning/profiles.hpp, line 108-111, 119 - 122, and replace :

    1,8,32,8,4,4,4,1,0

    bye

    1,4,32,4,4,4,4,1,0 ?

    This will set a different profile for the GEMM kernels.

    Thanks!

     

    Last edit: Philippe Tillet 2014-05-15
    • Mr Dan

      Mr Dan - 2014-05-15

      Hi Philippe,

      thanks for this hack, it does the trick!

      So, "Blas3" is running on the GPU. I will now examine further examples, hopefully with success!

      Thanks again and best Regards,

      Daniel

       
  • Karl Rupp

    Karl Rupp - 2014-05-15

    Thanks, Daniel! So apparently this is indeed a resource issue. Philippe, let's discuss this on our IRC meeting on Saturday.

     
  • Philippe Tillet

    Philippe Tillet - 2014-05-15

    If the SDKs don't report the right value for the maximum work-group size, we'll have a tough time :P

     
    • Mr Dan

      Mr Dan - 2014-05-20

      Hi Karli/ Philippe,

      it seems that this old ATI card is giving ViennaCL a hard time: When running the "custom-kernel" example, the program crashes again with similar output as above in "blas3":

      Unhandled exception at at 0x000007FEFDC5940D in custom-kernels.exe: Microsoft C++ exception: viennacl::ocl::invalid_work_group_size at memory location 0x00000000001DF180.

          vec1: [10](0,1,2,3,4,5,6,7,8,9)
          vec2: [10](10,9,8,7,6,5,4,3,2,1)
      

      vec1 . vec2: 10
      vec1 /
      vec2: 10
      ViennaCL: FATAL ERROR: Kernel start failed for 'norm'.
      ViennaCL: Smaller work sizes could not solve the problem.

      If you are interested, I can offer further investigation of the issue (with assistance). As for me I'll rather stick with a more recent graphics card.

      Thanks and Best Regards,

      Daniel

       

      Last edit: Mr Dan 2014-05-20
  • Karl Rupp

    Karl Rupp - 2014-05-20

    Hi Daniel,

    okay, the GPU is indeed very low on resources, so the default work group size of 128 is too big for anything but the most trivial kernels. The 'fix' is again to adjust yet another default work group size, which will hopefully be addressed 'automatically' in the course of populating our device database for the next release. If you need a fix earlier, please let us know.

    Best regards,
    Karli

     

Log in to post a comment.