From: Karel V. <ve...@gm...> - 2013-09-10 17:27:45
Hi Tony, Arnab, Dan and others,

the idea is interesting, but first let's have a look at the profiling numbers coming from the 12th iteration of the Switchboard DNN recipe on a GTX680, a topology with 6 hidden sigmoid layers of 2048 neurons each and roughly 9000 outputs:

AddColSumMat                  239.611s
AddMat                       1341.95s
AddMatMat                   20580s
AddRowSumMat                  927.644s
AddVec                        937.368s
AddVecToRows                  164.433s
CuMatrix::CopyFromMatD2D      194.821s
CuMatrix::CopyFromMatH2D       11.0674s
CuMatrix::CopyToMatD2H          0.113835s
CuMatrix::SetZero              51.3638s
CuStlVector::CopyFromVecH2D     3.81734s
CuStlVector::CopyToVecD2H       7.35736s
CuStlVector::SetZero            4.45224s
CuVector::CopyFromVecH2D        0.000355005s
CuVector::CopyToVecD2H          5.94601s
CuVector::SetZero             109.982s
DiffSigmoid                   197.664s
DiffXent                        2.89855s
DivRowsVec                     80.7666s
FindRowMaxId                  177.468s
MulColsVec                      5.66732s
Randomize                       6.10032s
Set                            21.3223s
Sigmoid                       267.29s
Softmax                       733.461s
Splice                          7.73193s

The total amount of time was 25380s, and the CUBLAS matrix multiplication (AddMatMat) is 81% of it. The idea is suitable for the simple hidden-layer activations (sigmoids), where we would save a 2x access to global GPU memory; on the other hand, Sigmoid corresponds to only about 1% of the run-time. Based on these stats, and on the assumption that CUBLAS is written optimally, we can say that the extra memory access for the activation functions is not an issue. Maybe in the case of smaller nets the numbers would be different, but those also have faster training times.

Is this argumentation convincing? :)

Best regards,
Karel

On 09/05/13 18:06, Tony Robinson wrote:
> On 09/05/2013 04:18 PM, Arnab Ghoshal wrote:
>> On Thu, Sep 5, 2013 at 4:54 PM, Tony Robinson <to...@ca...> wrote:
>>> I guess we can ask the question in the other way: does anyone have any
>>> profile information to share? That is, what GPU utilisation does Kaldi
>>> achieve? Clearly if it's currently getting over (say) 50% then there
>>> is no point in thinking about this any more.
>> I don't think it is possible to look up the compute utilization of
>> the GTX cards, or at least I haven't figured out how to.
> If you run the NVIDIA Visual Profiler (nvvp), which is available as part
> of the CUDA toolkit download
> (https://developer.nvidia.com/nvidia-visual-profiler), you can get the
> compute utilization and much else besides. All you need do is create a
> new session with the binary and the relevant arguments (ensuring that your
> binary will only run for a short amount of time, e.g. ~3 secs) and then
> generate a timeline for your program. Once you have a timeline you can
> use the 'guided analysis' to measure different metrics
> (http://docs.nvidia.com/cuda/profiler-users-guide/index.html#analysis-view),
> including compute utilization.
>
> Tony
> (who cheated - I'm not a GPU guru - I had to ask a colleague to write
> the above paragraph for me)
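P.S. For anyone who wants to check the percentage claims in my argument, here is a minimal Python sketch; the two timing values and the total are copied verbatim from the profile dump above, and the dictionary layout is just for illustration (it is not a Kaldi API):

```python
# Sanity-check the shares quoted above, using the per-kernel wall-clock
# times from the profile dump (remaining kernels omitted; only the
# reported total matters for the percentages).
profile_s = {
    "AddMatMat": 20580.0,  # CUBLAS matrix multiplication
    "Sigmoid": 267.29,     # forward sigmoid activation
}
total_s = 25380.0  # total run-time reported in the profile

gemm_share = profile_s["AddMatMat"] / total_s   # ~0.81
sigmoid_share = profile_s["Sigmoid"] / total_s  # ~0.011

print(f"AddMatMat: {100 * gemm_share:.0f}% of run-time")
print(f"Sigmoid:   {100 * sigmoid_share:.1f}% of run-time")

# Even a perfect fusion of the sigmoid into the GEMM epilogue could at
# best eliminate the Sigmoid kernel's time (plus its extra global-memory
# read/write), so the ceiling on the overall speed-up is on the order
# of 1-2%.
```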