From: Tony R. <to...@ca...> - 2013-09-05 14:37:38
Hi Dan,

Ah yes, I found this call in cudamatrix/cu-matrix.cc now - thanks.

I think it's an open question whether CUBLAS is the right way to go or not.

For: as you say, cublasSgemm() is very optimised. The example code that comes
with CUDA 5.5 gets 1.4 TFLOPS from a GTX TITAN and 2 x 1.0 TFLOPS from a
GTX 690 - impressive stuff.

Against: CUBLAS doesn't do what we want for NN implementations. There is a
high latency overhead in writing out the results of cublasSgemm() and then
reading them in again to do a trivial sigmoid/ReLU non-linearity (or softmax,
or the indirection you need for sparse inputs or outputs). You can mask this
to some degree with streams, but the overhead is still there. Ideally we'd
have access to the CUBLAS source code and would be able to apply the
non-linearity just before writing out, so it would come for free. My feeling
right now is that it could well be better to use a slower but modifiable
matrix multiply, just to avoid the extra write and read.

Tony

On 09/05/2013 03:06 PM, Daniel Povey wrote:
> For matrix multiplication we just call CUBLAS
> (cublasDgemm/cublasSgemm), because we imagine it will be more highly
> optimized than anything we can code.
> BTW, the latest on the cudamatrix stuff is in the sandbox in
> ^/sandbox/dan2. This is being actively developed right now.
>
> Dan
>
>
> On Thu, Sep 5, 2013 at 7:50 AM, Tony Robinson <to...@ca...> wrote:
>> Karel et al,
>>
>> I've spent a long time thinking about how to efficiently implement NNs
>> on GPUs (including taking the Coursera and Udacity courses).
>>
>> As I understand it, GPUs aren't all that good at the simple view of a NN,
>> which is outputVector = sigmoid(inputVector * weightMatrix), as they have
>> to read the entire weight matrix just to compute one output. However,
>> we often use minibatches, so instead of doing vector-matrix operations we
>> can group all the input vectors in a minibatch into a matrix and run
>> matrix-matrix operations. That is, do outputVector[t] =
>> sigmoid(inputVector[t] * weightMatrix) all in one go and so
>> substantially reduce memory bandwidth.
>>
>> Having got somewhat disillusioned with the CUBLAS calls, I've poked
>> around kaldi/src/cudamatrix and I find cuda_mul_elements(),
>> cuda_mul_cols_vec() and cuda_mul_rows_vec(), but no cuda_mul_mat().
>>
>> Have I got this right in that Kaldi doesn't use GPU matrix-matrix
>> operations? If so, is there a theoretical reason why not?
>>
>>
>> Tony

--
Dr A J Robinson, Founder and Director of Cantab Research Limited.
St Johns Innovation Centre, Cowley Road, Cambridge, CB4 0WS, UK.
Company reg no 05697423 (England and Wales), VAT reg no 925606030.
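
To make the two-pass overhead concrete, here is a minimal sketch of the
minibatch forward pass Y = sigmoid(X * W) done the CUBLAS way: one
cublasSgemm() call, then a separate elementwise kernel that must read back
and rewrite everything the GEMM just wrote. The function and kernel names
(forward_layer, sigmoid_inplace) are illustrative, not Kaldi's actual API;
it assumes row-major matrices already on the device.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Pass 2: elementwise sigmoid that re-reads what cublasSgemm wrote out.
    __global__ void sigmoid_inplace(float *y, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) y[i] = 1.0f / (1.0f + __expf(-y[i]));
    }

    // Forward pass for one minibatch: Y = sigmoid(X * W).
    // X is T x n_in, W is n_in x n_out, Y is T x n_out, all row-major.
    // CUBLAS is column-major, so we compute Y^T = W^T * X^T by swapping
    // the operand order; row-major storage then comes out right.
    void forward_layer(cublasHandle_t handle, const float *d_X,
                       const float *d_W, float *d_Y,
                       int T, int n_in, int n_out) {
      const float alpha = 1.0f, beta = 0.0f;
      // Pass 1: the GEMM writes T * n_out floats to global memory...
      cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  n_out, T, n_in,
                  &alpha, d_W, n_out, d_X, n_in,
                  &beta, d_Y, n_out);
      // ...which the sigmoid kernel must then read back and rewrite.
      int n = T * n_out;
      sigmoid_inplace<<<(n + 255) / 256, 256>>>(d_Y, n);
    }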
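
And a sketch of the alternative Tony argues for: a slower but modifiable
multiply where the non-linearity is applied just before the write-out, so
the second global-memory pass disappears. This is a plain shared-memory
tiled kernel written for illustration only - it is not from Kaldi and would
not match cublasSgemm() for raw GEMM throughput.

    #define TILE 16

    __device__ __forceinline__ float sigmoidf(float x) {
      return 1.0f / (1.0f + __expf(-x));
    }

    // C = sigmoid(A * B); A is MxK, B is KxN, C is MxN, all row-major.
    // Launch with dim3 grid((N+TILE-1)/TILE, (M+TILE-1)/TILE),
    //             dim3 block(TILE, TILE).
    __global__ void sgemm_sigmoid(const float *A, const float *B, float *C,
                                  int M, int N, int K) {
      __shared__ float As[TILE][TILE];
      __shared__ float Bs[TILE][TILE];
      int row = blockIdx.y * TILE + threadIdx.y;
      int col = blockIdx.x * TILE + threadIdx.x;
      float acc = 0.0f;
      for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        // Stage one tile of A and B in shared memory, zero-padding edges.
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();
        for (int i = 0; i < TILE; ++i)
          acc += As[threadIdx.y][i] * Bs[i][threadIdx.x];
        __syncthreads();
      }
      // The non-linearity comes "for free": applied in-register,
      // just before the single write to global memory.
      if (row < M && col < N)
        C[row * N + col] = sigmoidf(acc);
    }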