From: Daniel P. <dp...@gm...> - 2013-09-05 14:41:19
So what you're concerned about is the fact that the results of the sgemm CUDA
kernel are written out to GPU memory before being read in again to do the
nonlinearity? I'm not too concerned about this, as it seems to me that the
matrix multiply will be much slower than the softmax (it's O(n^3) rather than
O(n^2)), so the small penalty from doing them separately does not matter
relative to the possibly large performance gain from the faster matrix
multiply. BTW, all this stuff is in GPU memory; cublasSgemm works with inputs
and outputs on the GPU board. Also, the way the nnet software in Kaldi is
currently written, the softmax is in any case bound to be a separate operation
from the matrix multiply.

Dan

On Thu, Sep 5, 2013 at 10:37 AM, Tony Robinson <to...@ca...> wrote:
> Hi Dan,
>
> Ah yes, I found this call in cudamatrix/cu-matrix.cc now - thanks.
>
> I think it's an open question whether CUBLAS is the right way to go or not.
>
> For: As you say, cublasSgemm() is very optimised. The example code that
> comes with CUDA 5.5 gets 1.4 TFLOPS from a GTX TITAN and 2 x 1.0 TFLOPS
> from a GTX 690 - impressive stuff.
>
> Against: CUBLAS doesn't do what we want for NN implementations. There is a
> high latency overhead in writing out the results of cublasSgemm() and then
> reading them in again to do a trivial sigmoid/ReLU non-linearity (or a
> softmax, or the indirection you need for sparse inputs or outputs). You can
> mask this to some degree with streams, but the overhead is still there.
>
> Ideally we'd have access to the CUBLAS source code and would be able to add
> the non-linearity in just before writing out, so it would come for free. My
> feeling right now is that it could well be better to use a slower matrix
> multiply that is modifiable, just to avoid the extra write and read.
>
>
> Tony
>
>
> On 09/05/2013 03:06 PM, Daniel Povey wrote:
>>
>> For matrix multiplication we just call CUBLAS
>> (cublasDgemm/cublasSgemm), because we imagine it will be more highly
>> optimized than anything we can code.
>> BTW, the latest on the cudamatrix stuff is in the sandbox in
>> ^/sandbox/dan2. This is being actively developed right now.
>>
>> Dan
>>
>>
>> On Thu, Sep 5, 2013 at 7:50 AM, Tony Robinson <to...@ca...>
>> wrote:
>>>
>>> Karel et al,
>>>
>>> I've spent a long time thinking about how to efficiently implement NNs
>>> on GPUs (including taking the Coursera and Udacity courses).
>>>
>>> As I understand it, GPUs aren't all that good at the simple view of a
>>> NN, which is outputVector = sigmoid(inputVector * weightMatrix), as they
>>> have to read the entire weight matrix just to compute one output.
>>> However, we often use minibatches, so instead of doing vector-matrix
>>> operations we can group all the input vectors in a minibatch into a
>>> matrix and run matrix-matrix operations. That is, do outputVector[t] =
>>> sigmoid(inputVector[t] * weightMatrix) for the whole minibatch in one
>>> go, and so substantially reduce the memory bandwidth required.
>>>
>>> Having got somewhat disillusioned with the CUBLAS calls, I've poked
>>> around kaldi/src/cudamatrix and I find cuda_mul_elements(),
>>> cuda_mul_cols_vec() and cuda_mul_rows_vec(), but no cuda_mul_mat().
>>>
>>> Have I got this right in that Kaldi doesn't use GPU matrix-matrix
>>> operations? If so, is there a theoretical reason why not?
>>>
>>>
>>> Tony
>>>
>>> --
>>> Dr A J Robinson, Founder and Director of Cantab Research Limited.
>>> St Johns Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK.
>>> Company reg no 05697423 (England and Wales), VAT reg no 925606030.
>
>
> --
> Dr A J Robinson, Founder and Director of Cantab Research Limited.
> St Johns Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK.
> Company reg no 05697423 (England and Wales), VAT reg no 925606030.
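
For concreteness, here is a minimal sketch of the two-pass pattern under
discussion: one cublasSgemm() call over a whole minibatch, followed by a
separate elementwise kernel that re-reads the GEMM output from GPU memory to
apply the sigmoid. This is illustrative only, not Kaldi code; the names
(forward_layer, sigmoid_kernel, d_in, d_w, d_out, in_dim, out_dim, batch) are
made up for the example, and everything stays in device memory as Dan notes.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Elementwise sigmoid applied in a second pass over the GEMM output.
__global__ void sigmoid_kernel(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = 1.0f / (1.0f + expf(-x[i]));
}

// d_in:  in_dim  x batch   (column-major; one minibatch frame per column)
// d_w:   out_dim x in_dim  (column-major weight matrix)
// d_out: out_dim x batch   (column-major result)
// All three pointers are device memory; nothing round-trips to the host.
void forward_layer(cublasHandle_t handle, const float *d_in, const float *d_w,
                   float *d_out, int in_dim, int out_dim, int batch) {
  const float alpha = 1.0f, beta = 0.0f;
  // One matrix-matrix multiply for the whole minibatch: d_out = d_w * d_in.
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              out_dim, batch, in_dim,
              &alpha, d_w, out_dim, d_in, in_dim, &beta, d_out, out_dim);
  // Separate pass: the GEMM result was written to global memory and is now
  // read back in by this kernel; this is the extra write-then-read that a
  // fused nonlinearity would avoid.
  int n = out_dim * batch;
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  sigmoid_kernel<<<blocks, threads>>>(d_out, n);
}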