From: ujoimro <uj...@gm...> - 2012-07-26 17:37:52
Hello Again,

I found the source of the error. It was the fact that the algorithm calculates C = alpha * A * B + beta * C instead of C = alpha * A * B. I have set beta to 0 in all my tests; however, I did not initialize C to zero. If C contains NaN, then 0 * NaN + float is still NaN, so NaNs propagate up to the norm.

I have re-run all the tests with the addition deactivated in the kernels, and the results were correct both for prod_AA and prod16_AA. Sorry for the inconvenience.

Cordially,

Laszlo
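In other words, beta = 0 does not mask garbage already sitting in C. A minimal sketch of the effect (the variable names are illustrative, not kernel code):

    #include <iostream>
    #include <limits>

    int main()
    {
        // an uninitialized C entry may happen to hold NaN
        double c = std::numeric_limits<double>::quiet_NaN();
        double alpha_ab = 42.0;   // a finite alpha * A * B contribution
        double beta = 0.0;
        // IEEE 754: 0 * NaN is NaN and NaN + 42.0 is NaN,
        // so beta == 0 does not suppress the stale value of C
        std::cout << beta * c + alpha_ab << std::endl;   // prints nan
        return 0;
    }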
On 2012-07-26 18:19:23 Karl Rupp wrote:
> Hi,
>
> alright, then I'll start searching for the issue... Thanks for the
> clarification.
>
> Best regards,
> Karli
>
> On 07/26/2012 06:15 PM, ujoimro wrote:
> > The error is tied to prod_AA. The kernel was cherry-picked from ViennaCL
> > and inserted into a different framework. The prod_AA kernel was called
> > directly.
> >
> > On 2012-07-26 18:08:52 Karl Rupp wrote:
> >> Hi Laszlo,
> >>
> >> thanks for the quick check. So, prod16_AA definitely has a flaw. Can you
> >> deduce from your previous checks that prod_AA has a flaw as well, or
> >> could the errors also be due to the use of prod16_AA for some special
> >> matrix dimensions?
> >>
> >> Best regards,
> >> Karli
> >>
> >> On 07/26/2012 05:59 PM, ujoimro wrote:
> >>> Hello Karli,
> >>>
> >>> I have bad news. I have carried out the experiment with the following
> >>> setups (in each case I repeated the multiplication 10000 times per test,
> >>> for several tests):
> >>>
> >>> Experiment 1) Create 64x64 identity matrices: 0% error.
> >>> Experiment 2) Create 64n x 64n identity matrices (square matrices)
> >>> for n in (1..7): 0% error.
> >>> Experiment 3) Create random 64n x 64n matrices (random n, random data)
> >>> for n in (1..7): 10-16% error.
> >>>
> >>> In the cases with an error, the result contains NaN. In the cases with
> >>> no error, the result is close to that of OpenCV in norm(cvMat - vclMat).
> >>>
> >>> Cordially,
> >>>
> >>> Laszlo
> >>>
> >>> On 2012-07-26 16:02:02 Karl Rupp wrote:
> >>>> Hi Laszlo,
> >>>>
> >>>> thanks for the extensive testing. I agree that this is a strong
> >>>> indication for a concurrency issue.
> >>>>
> >>>> May I ask you for one favor:
> >>>> We currently use two matrix multiplication kernels for non-transposed
> >>>> products: prod_AA and prod16_AA. The former works for arbitrary matrix
> >>>> sizes, while the latter is faster but works for row and column
> >>>> multiples of 64 only. Could you please check with your testing
> >>>> framework whether the issue shows up with prod16_AA as well
> >>>> (n, m, o need to be multiples of 64)?
> >>>> Details on how to set work sizes can be found in
> >>>> viennacl/linalg/matrix_operations.hpp, line ~562.
> >>>>
> >>>> Thanks a lot again and best regards,
> >>>> Karli
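The multiples-of-64 constraint is easy to check up front. The helper below is only a paraphrase of the rule stated here, not the actual ViennaCL dispatch code (for that, see viennacl/linalg/matrix_operations.hpp around line 562):

    #include <cstddef>

    // For C = A * B with A of size n x m and B of size m x o:
    // prod16_AA requires all three dimensions to be multiples of 64,
    // while prod_AA handles arbitrary sizes.
    bool can_use_prod16(std::size_t n, std::size_t m, std::size_t o)
    {
        return (n % 64 == 0) && (m % 64 == 0) && (o % 64 == 0);
    }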
> >>>> On 07/26/2012 03:28 PM, ujoimro wrote:
> >>>>> Dear Karl,
> >>>>>
> >>>>> I am sorry to revisit the question. I have made very extensive testing
> >>>>> of this issue, and now I am pretty sure that the kernel is buggy. I
> >>>>> have done the following tests:
> >>>>>
> >>>>> 1) Choose random numbers n, m, o between 1 and 500.
> >>>>> 2) Generate random matrices of sizes n x m and m x o.
> >>>>> 3) Call the matrix product function and compare the results with
> >>>>> OpenCV (note the V).
> >>>>>
> >>>>> The multiplication gives good results in 95%-97% of the cases. This
> >>>>> means that the results are close to those of OpenCV (note the V).
> >>>>> In the rest of the cases the norm of the matrix is NaN.
> >>>>>
> >>>>> After this I determined that the kernel in question is prod_AA. I then
> >>>>> extracted this kernel and inserted it on its own into the framework I
> >>>>> am developing. The kernel still manifests the same side effects. I
> >>>>> have tried the same code on two NVidia GPUs (a GeForce and a mobile
> >>>>> Quadro) and an AMD APU.
> >>>>>
> >>>>> I think this is a strong indication for a concurrency bug.
> >>>>>
> >>>>> Cordially,
> >>>>>
> >>>>> Laszlo Marak
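For reference, a self-contained sketch of this kind of randomized cross-check (an illustration under assumptions, not the original test harness: it presumes OpenCV with CV_64F support and a double-capable OpenCL device, and it uses viennacl::copy() rather than fast_copy()):

    #include <cstdlib>
    #include <ctime>
    #include <iostream>
    #include <vector>
    #include <opencv2/opencv.hpp>
    #include <viennacl/matrix.hpp>
    #include <viennacl/linalg/prod.hpp>

    int main()
    {
        std::srand( std::time(0) );
        const int n = 1 + std::rand() % 500;   // random sizes in 1..500
        const int m = 1 + std::rand() % 500;
        const int o = 1 + std::rand() % 500;

        // reference product on the CPU via OpenCV
        cv::Mat A(n, m, CV_64F), B(m, o, CV_64F);
        cv::randu(A, 0.0, 1.0);
        cv::randu(B, 0.0, 1.0);
        cv::Mat C = A * B;

        // the same product on the GPU via ViennaCL
        std::vector< std::vector<double> > hostA(n, std::vector<double>(m));
        std::vector< std::vector<double> > hostB(m, std::vector<double>(o));
        for (int q = 0; q < n; ++q)
            for (int w = 0; w < m; ++w)
                hostA[q][w] = A.at<double>(q, w);
        for (int q = 0; q < m; ++q)
            for (int w = 0; w < o; ++w)
                hostB[q][w] = B.at<double>(q, w);

        viennacl::matrix<double> vA(n, m), vB(m, o), vC(n, o);
        viennacl::copy(hostA, vA);
        viennacl::copy(hostB, vB);
        vC = viennacl::linalg::prod(vA, vB);

        std::vector< std::vector<double> > hostC(n, std::vector<double>(o));
        viennacl::copy(vC, hostC);

        // accumulate the squared difference against the OpenCV result;
        // a NaN anywhere poisons this sum, which is how the bad runs show up
        double err = 0;
        for (int q = 0; q < n; ++q)
            for (int w = 0; w < o; ++w)
            {
                const double d = hostC[q][w] - C.at<double>(q, w);
                err += d * d;
            }
        std::cout << "squared error vs. OpenCV: " << err << std::endl;
        return 0;
    }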
> >>>>> On 2012-07-02 18:40:29 Karl Rupp wrote:
> >>>>>> Hi Laszlo,
> >>>>>>
> >>>>>> I've finally checked your code; there seems to be an access violation
> >>>>>> already during setup. When running the code with n larger than 300,
> >>>>>> I get
> >>>>>>
> >>>>>> /usr/include/boost/test/minimal.hpp(123): exception "memory access
> >>>>>> violation at address: 0x7fff8a11037c: no mapping at fault address"
> >>>>>> caught in function: 'int main(int, char**)'
> >>>>>>
> >>>>>> Maybe this is the reason why you get the wrong results later on
> >>>>>> (copying from invalid memory). Maybe you can reproduce the error on
> >>>>>> your machine by undefining NDEBUG.
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Karli
> >>>>>>
> >>>>>> On 06/26/2012 11:00 AM, ujoimro wrote:
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> Thank you for your consideration.
> >>>>>>>
> >>>>>>> GPU: GeForce GTX 580 and AMD A8-3820 APU with Radeon(tm) HD Graphics
> >>>>>>> ViennaCL version: 1.3.0
> >>>>>>> Operating system: OpenSuSE 12.1, 64 bit
> >>>>>>>
> >>>>>>> The matrix size varies, but at 400x400 it usually happens for both
> >>>>>>> doubles and floats; at 300x300 it usually happens for doubles.
> >>>>>>>
> >>>>>>> I have tried copy with std::vector< std::vector<float> > and
> >>>>>>> std::vector< std::vector<double> >, and the symptoms appear
> >>>>>>> regardless.
> >>>>>>>
> >>>>>>> Laszlo
> >>>>>>>
> >>>>>>> The minimal code is here:
> >>>>>>>
> >>>>>>> // UjoImro, 2012
> >>>>>>> // RealEyes
> >>>>>>>
> >>>>>>> #include <ctime>    // std::time (was missing)
> >>>>>>> #include <fstream>
> >>>>>>> #include <iostream>
> >>>>>>> #include <stdint.h> // int64_t (was missing)
> >>>>>>> #include <boost/random.hpp>
> >>>>>>> #include <tbb/tick_count.h>
> >>>>>>> #include <opencv2/opencv.hpp>
> >>>>>>> #include <viennacl/matrix.hpp>
> >>>>>>> #include <boost/test/minimal.hpp>
> >>>>>>> #include <boost/preprocessor.hpp>
> >>>>>>> #include <viennacl/linalg/prod.hpp>
> >>>>>>>
> >>>>>>> typedef double data_t;
> >>>>>>> typedef int64_t index_t;
> >>>>>>>
> >>>>>>> #define PRINT(var) std::cout << BOOST_PP_STRINGIZE(var) << " = " \
> >>>>>>>     << var << std::endl;
> >>>>>>>
> >>>>>>> #define MATPRINT(var, rows, cols)                                \
> >>>>>>>     std::cout << BOOST_PP_STRINGIZE(var) << " = " << std::endl; \
> >>>>>>>     std::cout << "[ ";                                          \
> >>>>>>>     for ( index_t q=0; q < rows; ++q )                          \
> >>>>>>>     {                                                           \
> >>>>>>>         std::cout << "[ ";                                      \
> >>>>>>>         for ( index_t w=0; w < cols; ++w )                      \
> >>>>>>>         {                                                       \
> >>>>>>>             std::cout << (var)[ cols * q + w ] << ", ";         \
> >>>>>>>         }                                                       \
> >>>>>>>         std::cout << " ]" << std::endl;                         \
> >>>>>>>     }                                                           \
> >>>>>>>     std::cout << "]" << std::endl;
> >>>>>>>
> >>>>>>> struct minus
> >>>>>>> {
> >>>>>>>     template <class T0>
> >>>>>>>     T0 operator()( const T0 & a, const T0 & b ) const
> >>>>>>>     {
> >>>>>>>         return a - b;
> >>>>>>>     }
> >>>>>>> }; // struct minus
> >>>>>>>
> >>>>>>> // elementwise res[q] = op(op0[q], op1[q])
> >>>>>>> template <class T0, class T1, class T2, class T3, class T4>
> >>>>>>> void map( T0 op, T1 res, T2 op0, T3 op1, T4 size )
> >>>>>>> {
> >>>>>>>     for ( index_t q=0; q < size; ++q )
> >>>>>>>     {
> >>>>>>>         res[q] = op(op0[q], op1[q]);
> >>>>>>>     }
> >>>>>>> }
> >>>>>>>
> >>>>>>> // squared Frobenius norm
> >>>>>>> template <class T0, class T1>
> >>>>>>> data_t norm( T0 * mat, T1 size )
> >>>>>>> {
> >>>>>>>     data_t result = 0;
> >>>>>>>     for ( index_t q=0; q<size; q++ )
> >>>>>>>     {
> >>>>>>>         result += mat[q]*mat[q];
> >>>>>>>     }
> >>>>>>>     return result;
> >>>>>>> }
> >>>>>>>
> >>>>>>> namespace gel {
> >>>>>>>
> >>>>>>> template <class T0, class T1, class T2>
> >>>>>>> void cpu2gpu( T0 * src, T1 size, T2 & dest )
> >>>>>>> {
> >>>>>>>     viennacl::fast_copy( src, src + size, dest );
> >>>>>>> } // cpu2gpu
> >>>>>>>
> >>>>>>> template <class T0, class T1>
> >>>>>>> void gpu2cpu( const T0 & src, T1 * dest )
> >>>>>>> {
> >>>>>>>     viennacl::fast_copy( src, dest );
> >>>>>>> } // gpu2cpu
> >>>>>>>
> >>>>>>> } // namespace gel
> >>>>>>>
> >>>>>>> unsigned int good_seed()
> >>>>>>> {
> >>>>>>>     unsigned int random_seed, random_seed_a, random_seed_b;
> >>>>>>>     std::ifstream file( "/dev/urandom",
> >>>>>>>                         std::ios::binary | std::ios::in );
> >>>>>>>     if (file.is_open())
> >>>>>>>     {
> >>>>>>>         file.read( reinterpret_cast<char*>(&random_seed_a),
> >>>>>>>                    sizeof(random_seed_a) );
> >>>>>>>         if (file.fail())
> >>>>>>>             throw std::ios_base::failure(
> >>>>>>>                 "I could not obtain enough entropy.");
> >>>>>>>     } // end if
> >>>>>>>     else
> >>>>>>>     {
> >>>>>>>         throw std::ios_base::failure(
> >>>>>>>             "I could not open the random seed file.");
> >>>>>>>     }
> >>>>>>>     random_seed_b = std::time(0);
> >>>>>>>     random_seed = random_seed_a xor random_seed_b;
> >>>>>>>     return random_seed;
> >>>>>>> } // end good_seed()
> >>>>>>>
> >>>>>>> int test_main( int argc, char* argv[] )
> >>>>>>> {
> >>>>>>>     const int n=300;
> >>>>>>>     tbb::tick_count start;
> >>>>>>>     tbb::tick_count end;
> >>>>>>>     PRINT(0);
> >>>>>>>     PRINT(viennacl::ocl::current_device().info());
> >>>>>>>     PRINT(viennacl::ocl::current_device().double_support());
> >>>>>>>     PRINT(1);
> >>>>>>>
> >>>>>>>     // setting up the opencl device
> >>>>>>>     viennacl::ocl::set_context_device_type(0,
> >>>>>>>         viennacl::ocl::gpu_tag());
> >>>>>>>     //viennacl::ocl::current_context().build_options(
> >>>>>>>     //    "-cl-mad-enable -cl-fast-relaxed-math");
> >>>>>>>     //uncomment for additional optimizations
> >>>>>>>
> >>>>>>>     std::vector<viennacl::ocl::device> devices =
> >>>>>>>         viennacl::ocl::current_context().devices();
> >>>>>>>     PRINT(devices.size());
> >>>>>>>     for (size_t i=0; i<devices.size(); ++i)
> >>>>>>>     {
> >>>>>>>         viennacl::ocl::current_context().switch_device(devices[i]);
> >>>>>>>         std::cout << " - Device Name: "
> >>>>>>>                   << viennacl::ocl::current_device().name()
> >>>>>>>                   << std::endl;
> >>>>>>>     }
> >>>>>>>     // selecting the first device (parametrize later)
> >>>>>>>     viennacl::ocl::current_context().switch_device(devices[0]);
> >>>>>>>     //viennacl::ocl::get_queue().finish();
> >>>>>>>
> >>>>>>>     std::cout << "matrixmult" << std::endl;
> >>>>>>>
> >>>>>>>     // note: each of these arrays is ~720 kB on the stack for n = 300
> >>>>>>>     data_t matA[n*n];
> >>>>>>>     data_t matB[n*n];
> >>>>>>>
> >>>>>>>     std::cout << "filling up the matrices with random data"
> >>>>>>>               << std::endl;
> >>>>>>>     // filling up the matrices (identity matrices here,
> >>>>>>>     // despite the message above)
> >>>>>>>     for ( index_t q=0; q<n; ++q )
> >>>>>>>         for ( index_t w=0; w<n; ++w )
> >>>>>>>         {
> >>>>>>>             matA[ q*n + w ] = (q==w) ? 1 : 0;
> >>>>>>>             matB[ q*n + w ] = (q==w) ? 1 : 0;
> >>>>>>>         }
> >>>>>>>
> >>>>>>>     std::cout << "multiplication" << std::endl;
> >>>>>>>
> >>>>>>>     // note: never initialized (cf. the resolution at the top)
> >>>>>>>     data_t matC[n*n];
> >>>>>>>
> >>>>>>>     std::cout << "COPY the matrices to the GPU memory" << std::endl;
> >>>>>>>
> >>>>>>>     viennacl::matrix<data_t> clMatrixA(n, n);
> >>>>>>>     viennacl::matrix<data_t> clMatrixB(n, n);
> >>>>>>>     viennacl::matrix<data_t> clMatrixC(n, n);
> >>>>>>>
> >>>>>>>     gel::cpu2gpu( matA, n*n, clMatrixA );
> >>>>>>>     gel::cpu2gpu( matB, n*n, clMatrixB );
> >>>>>>>
> >>>>>>>     data_t diffMat2[n*n];
> >>>>>>>     map( minus(), diffMat2, matA, matB, n*n );
> >>>>>>>     PRINT( norm(diffMat2, n*n) );
> >>>>>>>
> >>>>>>>     PRINT("reading back the matrices from the memory");
> >>>>>>>
> >>>>>>>     data_t checkA[n*n];
> >>>>>>>     data_t checkB[n*n];
> >>>>>>>     data_t checkC[n*n];
> >>>>>>>
> >>>>>>>     gel::gpu2cpu( clMatrixA, checkA );
> >>>>>>>     gel::gpu2cpu( clMatrixB, checkB );
> >>>>>>>
> >>>>>>>     data_t diffA[n*n];
> >>>>>>>     data_t diffB[n*n];
> >>>>>>>
> >>>>>>>     map( minus(), diffA, matA, checkA, n*n );
> >>>>>>>     map( minus(), diffB, matB, checkB, n*n );
> >>>>>>>
> >>>>>>>     PRINT( norm( diffA, n*n ) );
> >>>>>>>     PRINT( norm( diffB, n*n ) );
> >>>>>>>
> >>>>>>>     start = tbb::tick_count::now();
> >>>>>>>     clMatrixC = viennacl::linalg::prod( clMatrixA, clMatrixB );
> >>>>>>>     end = tbb::tick_count::now();
> >>>>>>>     gel::gpu2cpu( clMatrixC, checkC );
> >>>>>>>
> >>>>>>>     data_t diffMat[n*n];
> >>>>>>>     // note: this diff reads the uninitialized matC
> >>>>>>>     map( minus(), diffMat, checkC, matC, n*n );
> >>>>>>>
> >>>>>>>     PRINT( norm( checkC, n*n ) );
> >>>>>>>
> >>>>>>>     MATPRINT( checkC, n, n );
> >>>>>>>
> >>>>>>>     std::cout << "GPU matrix multiplication time = "
> >>>>>>>               << (end - start).seconds() << "s" << std::endl;
> >>>>>>>
> >>>>>>>     std::cout << "finished running" << std::endl;
> >>>>>>>     return EXIT_SUCCESS;
> >>>>>>> }
> >>>>>>>
> >>>>>>> // LuM end of file
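One remark on the listing above: with n = 300 and data_t = double, each data_t[n*n] array takes roughly 720 kB of stack, and about ten of them are declared, so the total approaches a typical 8 MB stack limit. Whether this is the source of the access violation Karli reported is only a guess, but a heap-backed variant sidesteps the question entirely. A sketch:

    #include <vector>

    typedef double data_t;

    void allocate_matrices()
    {
        const int n = 300;
        // heap storage instead of data_t matA[n*n] on the stack
        std::vector<data_t> matA(n * n);
        std::vector<data_t> matB(n * n);
        // zero-initialized, which also avoids the beta * C NaN issue
        std::vector<data_t> matC(n * n, 0);
        // pass &matA[0] wherever the original code expects a raw data_t*
    }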
> >>>>>>> On 2012-06-25 22:17:32 Karl Rupp wrote:
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> thanks for the notification.
> >>>>>>>>
> >>>>>>>> Could you please provide in addition the following details:
> >>>>>>>> * The GPU you're using
> >>>>>>>> * ViennaCL version (1.3.0?)
> >>>>>>>> * Operating system (32/64 bit)
> >>>>>>>> * At which matrix sizes do you observe the effect?
> >>>>>>>> * Does the issue remain if you use viennacl::copy() instead of
> >>>>>>>>   viennacl::fast_copy()?
> >>>>>>>>
> >>>>>>>> Thanks and best regards,
> >>>>>>>> Karli
> >>>>>>>>
> >>>>>>>> On 06/25/2012 05:43 PM, ujoimro wrote:
> >>>>>>>>> Dear ViennaCL Developers,
> >>>>>>>>>
> >>>>>>>>> I would like to use ViennaCL to accelerate matrix multiplications.
> >>>>>>>>> Above a certain matrix size, NaN values appear in the result of
> >>>>>>>>> the multiplication. The rest of the values are correct. Here is a
> >>>>>>>>> minimal example: