A lot of things may be parallelized when "executing" a bpnn: to begin with, the N multiplications and the N-1 additions inside each neuron's dot product. Notice furthermore that, most of the time, all the neurons in the same layer just do the same thing. And finally, all the networks in an ensemble method like bagging or boosting do the same thing, although each with a different set of weights.
Should we or shouldn't we explicitly parallelize the N multiplications and the N-1 additions of the dot product?
We are better off leaving grains of that size, i.e. single-instruction grains, to automatic instruction-level parallelism, either through VLIW (compile time), SIMD (compile time) or superscalar out-of-order execution (run time). It should be easy for an OpenCL compiler to automatically optimize a vector dot product for parallel execution on an array of SIMD units and/or for VLIW execution, for example. But I absolutely do not know whether they actually do (AMD, Intel or Nvidia); I guess some testing is necessary to determine this.
Anyway, I am very reluctant to explicitly break down a dot product into smaller parallel operations like multiplications and additions. Essentially, a dot product is a reduction, and therefore it needs a lot of synchronization between the operations if they are to be executed in parallel.
Of course, I can see that a dot product may be very efficiently parallelized if we do not need complex synchronization. For example:
But we are not studying these two alternatives here. We are studying the possibility of using OpenCL (AMD, Intel or Nvidia) to speed up neural network computations, which involve a lot of dot products. And explicitly parallelizing a dot product in OpenCL requires too much synchronization, plus extra indexing. It shouldn't work well.
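Just to make that cost concrete, here is a minimal sketch of what an explicitly parallel dot product typically looks like in OpenCL: a classic local-memory tree reduction. Names are illustrative and the work-group size is assumed to be a power of two. Every barrier below is exactly the kind of per-step synchronization, plus extra indexing, that we would be paying for every single neuron:

```c
// One work-group cooperatively computes ONE dot product of length n
// (n assumed <= work-group size here, for brevity). Note the barriers:
// each halving step of the reduction must wait for the previous one.
__kernel void dot_reduce(__global const float *a,
                         __global const float *b,
                         __global float *result,
                         __local float *scratch,
                         const int n)
{
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    scratch[lid] = (lid < n) ? a[lid] * b[lid] : 0.0f;  // parallel multiplications
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int offset = lsz / 2; offset > 0; offset /= 2) {  // tree of additions
        if (lid < offset)
            scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);                      // synchronize every step
    }

    if (lid == 0)
        result[get_group_id(0)] = scratch[0];
}
```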
I've just found a very interesting paper about parallelizing the dot product, and BLAS in general, using OpenCL/CUDA, an FPGA, plain naive threaded C on the CPU, and the Intel Math Kernel Library:
http://research.microsoft.com/pubs/130834/isvlsi_final.pdf
And indeed, the OpenCL/CUDA version just doesn't scale well on this problem.
Furthermore, OpenCL offers a built-in dot product on two vectors of 1, 2, 3 or 4 floats (1 is weird though: it resolves to a single multiplication). I guess that most implementations should be able to do the four multiplications at once without having to synchronize for the final 3 additions, which should also occur at once (through VLIW, I guess). If so, that's pretty cool: it would take the time of about 2 instructions to execute 7 operations, including the 4 costly multiplications...
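If the built-in does pay off, a long dot product could lean on it by walking the vectors four floats at a time. A sketch, assuming the length is a multiple of 4 and the buffers can be read as float4 (names are illustrative):

```c
// Accumulate a long dot product in chunks of 4 using the built-in dot().
// Assumes n4 = n / 4 and that the buffers are laid out as float4.
__kernel void dot_by_float4(__global const float4 *w,   // weights, viewed as float4
                            __global const float4 *x,   // inputs,  viewed as float4
                            __global float *out,
                            const int n4)                // number of float4 chunks
{
    int gid = get_global_id(0);
    const __global float4 *wrow = w + gid * n4;

    float acc = 0.0f;
    for (int i = 0; i < n4; ++i)
        acc += dot(wrow[i], x[i]);   // 4 multiplications + 3 additions per call
    out[gid] = acc;
}
```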
So, what should we take as our smallest unit of computation, i.e. as an OpenCL work-item? Since we should not have a grain smaller than the dot product, it seems useless to separate the dot product from the activation function in a context involving many neurons, many layers and many networks.
It is interesting to note here that "many networks" may mean:
1. many copies of the same network (same weights) evaluated on different input vectors, or
2. many different networks (different weights, as in an ensemble) evaluated on the same input vector.
Items 1 and 2 are computationally equivalent: one of the two vectors in the dot product changes, and that's all. What is not equivalent is the loading of weights/neuron values into memory. The weights are something more permanent (excluding the learning phase, of course) than the input data.
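In OpenCL terms, that difference in permanence could translate into how the buffers are created and refreshed. A minimal host-side sketch (function and variable names are mine, error checking omitted):

```c
#include <CL/cl.h>

/* Illustrative helper: weights are uploaded once and kept on the device,
   reused across evaluations. */
static cl_mem upload_weights_once(cl_context ctx, const float *host_weights,
                                  size_t n_weights, cl_int *err)
{
    /* Read-only for the device, copied from host memory a single time. */
    return clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                          n_weights * sizeof(float), (void *)host_weights, err);
}

/* The input buffer, by contrast, is rewritten before every forward pass. */
static void refresh_inputs(cl_command_queue queue, cl_mem input_buf,
                           const float *host_inputs, size_t n_inputs)
{
    /* Only this small transfer is repeated for each new input vector. */
    clEnqueueWriteBuffer(queue, input_buf, CL_FALSE, 0,
                         n_inputs * sizeof(float), host_inputs,
                         0, NULL, NULL);
}
```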
So our smallest work-item execution unit will be the computation of a neuron's activation level. And since an OpenCL kernel is the implementation of a single work-item, our OpenCL kernel will be the computation of a single neuron. We can re-use the same OpenCL kernel for all the neurons having the same structure, i.e. the same previous layer, the same number of inputs and the same activation function.
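As a rough sketch (the buffer layout, the names and the sigmoid activation are my assumptions, and the bias term is omitted for brevity), such a per-neuron kernel could look like this:

```c
/* One work-item = one neuron: a dot product over the previous layer's
   activations, followed by the activation function. */
__kernel void neuron_forward(__global const float *weights,   /* [n_neurons][n_prev], one row per neuron */
                             __global const float *prev_act,  /* activations of the previous layer */
                             __global float *act,             /* activations of this layer */
                             const int n_prev)
{
    int neuron = get_global_id(0);
    const __global float *w = weights + neuron * n_prev;

    float sum = 0.0f;
    for (int i = 0; i < n_prev; ++i)
        sum += w[i] * prev_act[i];

    act[neuron] = 1.0f / (1.0f + exp(-sum));   /* sigmoid, as an example */
}
```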
Since our smallest unit of execution is the dot-product coupled with the activation function, it makes sense to wait for the activation value of all the neurons of the previous layer before beginning to compute the activation value of any given neuron. Notice that this synchronization requirement is the same for all the neurons of the same layer.
Following the previous considerations on synchronization, neurons may execute in parallel sets, i.e. the neurons of a same layer. We want to avoid if-branching logic as much as possible. By restricting the neurons of a given layer to the same structure (same previous layer, same activation function) and providing one kernel definition per layer, the kernel avoids any branching logic and can be well optimized (loop unrolling and less index computation, for example). A work-group will then be all the neurons of a given layer, each layer having its own kernel definition.
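One way to get that per-layer specialization without any branching in the kernel would be to bake the layer's dimensions in at build time, e.g. by passing a preprocessor define such as "-D N_PREV=64" in the clBuildProgram options for that layer's program (a sketch; N_PREV and the kernel name are made up here). With the loop bound known at compile time, the compiler is free to unroll the loop and simplify the indexing:

```c
/* Built once per layer, with e.g. "-D N_PREV=64" in the clBuildProgram options,
   so the dot-product loop bound is a compile-time constant. */
__kernel void layer_forward(__global const float *weights,   /* [n_neurons][N_PREV] */
                            __global const float *prev_act,  /* activations of the previous layer */
                            __global float *act)             /* activations of this layer */
{
    int neuron = get_global_id(0);
    const __global float *w = weights + neuron * N_PREV;

    float sum = 0.0f;
    for (int i = 0; i < N_PREV; ++i)       /* constant bound: unrollable */
        sum += w[i] * prev_act[i];

    act[neuron] = 1.0f / (1.0f + exp(-sum));
}
```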
There will be one NDRange enqueue instruction per layer. The NDRange will have only one dimension, and its global index will go from 0 to Nb_neurons-1 for that layer. The completion OpenCL event of the previous layer's enqueue will be a prerequisite for the execution of this NDRange.
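Host-side, the per-layer launches and their ordering could then look roughly like this (a sketch with illustrative names; kernel arguments are assumed to have been set already, and error checking is omitted):

```c
#include <CL/cl.h>

/* Launch the layers one after the other, each launch waiting on the
   completion event of the previous layer's launch. */
static void forward_pass(cl_command_queue queue,
                         cl_kernel *layer_kernels,   /* one specialized kernel per layer */
                         const size_t *layer_sizes,  /* number of neurons per layer */
                         int n_layers)
{
    cl_event prev_done = NULL;

    for (int layer = 1; layer < n_layers; ++layer) {
        cl_event this_done;
        /* 1-D NDRange: one work-item per neuron of this layer. */
        clEnqueueNDRangeKernel(queue, layer_kernels[layer],
                               1, NULL, &layer_sizes[layer], NULL,
                               prev_done ? 1 : 0,
                               prev_done ? &prev_done : NULL,
                               &this_done);
        if (prev_done)
            clReleaseEvent(prev_done);
        prev_done = this_done;
    }

    if (prev_done) {
        clWaitForEvents(1, &prev_done);   /* wait for the output layer to finish */
        clReleaseEvent(prev_done);
    }
}
```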
From the previous section, the kernel definition is straightforward. It takes a minimal number of arguments and completely avoids branching logic.
The kernel's arguments shall be conceptually limited to:
For the weights:
For the neurons' activation values:
In the context of multiple networks with the same structure (ensemble methods like bagging or boosting):
It is important that multiple networks can be executed in parallel. We chose to limit the parallelism inside a single network to avoid synchronization, and therefore the layers have to be computed one after the other. There may not be enough work in a single layer to keep OpenCL devices busy.
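One possible way to feed the device more work per launch, under these assumptions, is to make the per-layer NDRange two-dimensional: dimension 0 indexes the neuron within the layer and dimension 1 indexes the network, so the same layer of every network in the ensemble is computed in a single enqueue (global size {n_neurons, n_networks}). The kernel below is an illustrative variant of the per-layer kernel, reusing the hypothetical N_PREV define:

```c
/* Dimension 0 = neuron within the layer, dimension 1 = network of the ensemble. */
__kernel void layer_forward_ensemble(__global const float *weights,   /* [n_networks][n_neurons][N_PREV] */
                                     __global const float *prev_act,  /* [n_networks][N_PREV] */
                                     __global float *act,             /* [n_networks][n_neurons] */
                                     const int n_neurons)
{
    int neuron  = get_global_id(0);
    int network = get_global_id(1);

    const __global float *w = weights + ((size_t)network * n_neurons + neuron) * N_PREV;
    const __global float *x = prev_act + (size_t)network * N_PREV;

    float sum = 0.0f;
    for (int i = 0; i < N_PREV; ++i)
        sum += w[i] * x[i];

    act[network * n_neurons + neuron] = 1.0f / (1.0f + exp(-sum));
}
```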