From: Karel V. <ve...@gm...> - 2013-09-10 17:27:45
Hi Tony, Arnab, Dan and others,

the idea is interesting, but first let's have a look at the profiling numbers coming from the 12th iteration of the Switchboard DNN recipe on a GTX680, a topology with 6 hidden sigmoid layers of 2048 neurons each and roughly 9000 outputs:

AddColSumMat                  239.611s
AddMat                       1341.95s
AddMatMat                   20580s
AddRowSumMat                  927.644s
AddVec                        937.368s
AddVecToRows                  164.433s
CuMatrix::CopyFromMatD2D      194.821s
CuMatrix::CopyFromMatH2D       11.0674s
CuMatrix::CopyToMatD2H          0.113835s
CuMatrix::SetZero              51.3638s
CuStlVector::CopyFromVecH2D     3.81734s
CuStlVector::CopyToVecD2H       7.35736s
CuStlVector::SetZero            4.45224s
CuVector::CopyFromVecH2D        0.000355005s
CuVector::CopyToVecD2H          5.94601s
CuVector::SetZero             109.982s
DiffSigmoid                   197.664s
DiffXent                        2.89855s
DivRowsVec                     80.7666s
FindRowMaxId                  177.468s
MulColsVec                      5.66732s
Randomize                       6.10032s
Set                            21.3223s
Sigmoid                       267.29s
Softmax                       733.461s
Splice                          7.73193s

The total amount of time was 25380s, and the CUBLAS matrix multiplication (AddMatMat) is 81% of it. The idea is suitable for the simple hidden-layer activations (sigmoids), where we would save a 2x access to global GPU memory; on the other hand, Sigmoid corresponds to only about 1% of the run-time. Based on these stats, and on the assumption that CUBLAS is written optimally, we can say that the extra memory access for the activation functions is not an issue. Maybe in the case of smaller nets the numbers would be different, but those also have faster training times.

Is this argumentation convincing? :)

Best regards,
Karel

On 09/05/13 18:06, Tony Robinson wrote:
> On 09/05/2013 04:18 PM, Arnab Ghoshal wrote:
>> On Thu, Sep 5, 2013 at 4:54 PM, Tony Robinson <to...@ca...> wrote:
>>> I guess we can ask the question in the other way: does anyone have any
>>> profile information to share? That is, what GPU utilisation does Kaldi
>>> achieve? Clearly if it's currently getting over (say) 50% then there
>>> is no point in thinking about this any more.
>> I don't think it is possible to look up the compute utilization of
>> the GTX cards, or at least I haven't figured out how to.
> If you run the NVIDIA Visual Profiler (nvvp), which is available as part
> of the CUDA toolkit download
> (https://developer.nvidia.com/nvidia-visual-profiler), you can get the
> compute utilization and much else besides. All you need do is create a
> new session with the binary and the relevant arguments (ensuring that your
> binary will only run for a short amount of time, e.g. ~3 secs) and then
> generate a timeline for your program. Once you have a timeline you can
> use the 'guided analysis' to measure different metrics
> (http://docs.nvidia.com/cuda/profiler-users-guide/index.html#analysis-view),
> including compute utilization.
>
> Tony
> (who cheated - I'm not a GPU guru - I had to ask a colleague to write
> the above paragraph for me)
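P.S. For anyone who wants to check the percentage claims in my argument, here is a minimal Python sketch; the two timing values and the total are copied verbatim from the profile dump above, and the dictionary layout is just for illustration (it is not a Kaldi API):

```python
# Sanity-check the shares quoted above, using the per-kernel wall-clock
# times from the profile dump (remaining kernels omitted; only the
# reported total matters for the percentages).
profile_s = {
    "AddMatMat": 20580.0,  # CUBLAS matrix multiplication
    "Sigmoid": 267.29,     # forward sigmoid activation
}
total_s = 25380.0  # total run-time reported in the profile

gemm_share = profile_s["AddMatMat"] / total_s   # ~0.81
sigmoid_share = profile_s["Sigmoid"] / total_s  # ~0.011

print(f"AddMatMat: {100 * gemm_share:.0f}% of run-time")
print(f"Sigmoid:   {100 * sigmoid_share:.1f}% of run-time")

# Even a perfect fusion of the sigmoid into the GEMM epilogue could at
# best eliminate the Sigmoid kernel's time (plus its extra global-memory
# read/write), so the ceiling on the overall speed-up is on the order
# of 1-2%.
```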