A lot of things may be parallelized when "executing" a bpnn: to begin with, the N multiplications and the N-1 additions inside each neuron's dot product. Notice furthermore that, most of the time, all the neurons in the same layer just do the same thing. And finally, all the networks in an ensemble method like bagging or boosting do the same thing, although each with a different set of weights.
Should we or shouldn't we explicitly parallelize the N multiplications and the N-1 additions of the dot product?
We are better off leaving grains of that size, i.e. single-instruction grains, to automatic instruction-level parallelism, either through VLIW (compile time), SIMD (compile time) or superscalar out-of-order execution (run time). It should be easy for an OpenCL compiler to automatically optimize a vector dot product for parallel execution on an array of SIMD units and/or for VLIW execution, for example. But I absolutely do not know whether they actually do (AMD, Intel or Nvidia); I guess some testing is necessary to determine this.
Anyway, I am very reluctant to explicitly break down a dot product into smaller parallel operations like multiplications and additions. Essentially, a dot product is a reduction, and therefore it needs a lot of synchronization between the operations if they are to be executed in parallel.
Of course, I can see that a dot product may be very efficiently parallelized if we do not need complex synchronization. For example:
But we are not studying these two alternatives here. We are studying the possibility of using OpenCL (AMD, Intel or Nvidia) to speed up neural network computations, which involve a lot of dot products. And explicitly parallelizing a dot product in OpenCL requires too much synchronization, plus extra indexing. It shouldn't work well.
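Just to make that cost concrete, here is a minimal sketch of what an explicitly parallel dot product typically looks like in OpenCL: a classic local-memory tree reduction. Names are illustrative and the work-group size is assumed to be a power of two. Every barrier below is exactly the kind of per-step synchronization, plus extra indexing, that we would be paying for every single neuron:

```c
// One work-group cooperatively computes ONE dot product of length n
// (n assumed <= work-group size here, for brevity). Note the barriers:
// each halving step of the reduction must wait for the previous one.
__kernel void dot_reduce(__global const float *a,
                         __global const float *b,
                         __global float *result,
                         __local float *scratch,
                         const int n)
{
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    scratch[lid] = (lid < n) ? a[lid] * b[lid] : 0.0f;  // parallel multiplications
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int offset = lsz / 2; offset > 0; offset /= 2) {  // tree of additions
        if (lid < offset)
            scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);                      // synchronize every step
    }

    if (lid == 0)
        result[get_group_id(0)] = scratch[0];
}
```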
I've just found a very interesting paper about parallelizing the dot product, and BLAS in general, using OpenCL/CUDA, an FPGA, plain naive threaded C on the CPU, and the Intel Math Kernel Library:
http://research.microsoft.com/pubs/130834/isvlsi_final.pdf
And indeed, the OpenCL/CUDA version just doesn't scale well on this problem.
Furthermore, OpenCL offers a built-in dot product on two vectors of 1, 2, 3 or 4 floats (1 is weird though: it resolves to a single multiplication). I guess that most implementations should be able to do the four multiplications at once without having to synchronize for the final 3 additions, which should also occur at once (through VLIW, I guess). If so, that's pretty cool: it would take the time of about 2 instructions to execute 7 operations, including the 4 costly multiplications...
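If the built-in does pay off, a long dot product could lean on it by walking the vectors four floats at a time. A sketch, assuming the length is a multiple of 4 and the buffers can be read as float4 (names are illustrative):

```c
// Accumulate a long dot product in chunks of 4 using the built-in dot().
// Assumes n4 = n / 4 and that the buffers are laid out as float4.
__kernel void dot_by_float4(__global const float4 *w,   // weights, viewed as float4
                            __global const float4 *x,   // inputs,  viewed as float4
                            __global float *out,
                            const int n4)                // number of float4 chunks
{
    int gid = get_global_id(0);
    const __global float4 *wrow = w + gid * n4;

    float acc = 0.0f;
    for (int i = 0; i < n4; ++i)
        acc += dot(wrow[i], x[i]);   // 4 multiplications + 3 additions per call
    out[gid] = acc;
}
```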
So, what should we take as our smallest unit of computation, i.e. as an OpenCL work-item? Since we should not have a grain smaller than the dot product, it seems useless to separate the dot product from the activation function in a context involving many neurons, many layers and many networks.
It is interesting to note here that "many networks" may mean:
1. many copies of the same network (same weights) evaluated on different input vectors, or
2. many different networks (different weights, as in an ensemble) evaluated on the same input vector.
Items 1 and 2 are computationally equivalent: one of the two vectors in the dot product changes, and that's all. What is not equivalent is the loading of weights/neuron values into memory. The weights are something more permanent (excluding the learning phase, of course) than the input data.
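In OpenCL terms, that difference in permanence could translate into how the buffers are created and refreshed. A minimal host-side sketch (function and variable names are mine, error checking omitted):

```c
#include <CL/cl.h>

/* Illustrative helper: weights are uploaded once and kept on the device,
   reused across evaluations. */
static cl_mem upload_weights_once(cl_context ctx, const float *host_weights,
                                  size_t n_weights, cl_int *err)
{
    /* Read-only for the device, copied from host memory a single time. */
    return clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                          n_weights * sizeof(float), (void *)host_weights, err);
}

/* The input buffer, by contrast, is rewritten before every forward pass. */
static void refresh_inputs(cl_command_queue queue, cl_mem input_buf,
                           const float *host_inputs, size_t n_inputs)
{
    /* Only this small transfer is repeated for each new input vector. */
    clEnqueueWriteBuffer(queue, input_buf, CL_FALSE, 0,
                         n_inputs * sizeof(float), host_inputs,
                         0, NULL, NULL);
}
```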
So our smallest work-item execution unit will be the computation of a neuron's activation level. And since an OpenCL kernel is the implementation of a single work-item, our OpenCL kernel will be the computation of a single neuron. We can re-use the same OpenCL kernel for all the neurons having the same structure, i.e. the same previous layer, the same number of inputs and the same activation function.
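As a rough sketch (the buffer layout, the names and the sigmoid activation are my assumptions, and the bias term is omitted for brevity), such a per-neuron kernel could look like this:

```c
/* One work-item = one neuron: a dot product over the previous layer's
   activations, followed by the activation function. */
__kernel void neuron_forward(__global const float *weights,   /* [n_neurons][n_prev], one row per neuron */
                             __global const float *prev_act,  /* activations of the previous layer */
                             __global float *act,             /* activations of this layer */
                             const int n_prev)
{
    int neuron = get_global_id(0);
    const __global float *w = weights + neuron * n_prev;

    float sum = 0.0f;
    for (int i = 0; i < n_prev; ++i)
        sum += w[i] * prev_act[i];

    act[neuron] = 1.0f / (1.0f + exp(-sum));   /* sigmoid, as an example */
}
```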
Since our smallest unit of execution is the dot-product coupled with the activation function, it makes sense to wait for the activation value of all the neurons of the previous layer before beginning to compute the activation value of any given neuron. Notice that this synchronization requirement is the same for all the neurons of the same layer.
Following the previous considerations on synchronization, neurons may execute in parallel sets, i.e. the neurons of a same layer. We want to avoid if-branching logic as much as possible. By restricting the neurons of a given layer to the same structure (same previous layer, same activation function) and providing one kernel definition per layer, the kernel avoids any branching logic and can be well optimized (loop unrolling and less index computation, for example). A work-group will then be all the neurons of a given layer, each layer having its own kernel definition.
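One way to get that per-layer specialization without any branching in the kernel would be to bake the layer's dimensions in at build time, e.g. by passing a preprocessor define such as "-D N_PREV=64" in the clBuildProgram options for that layer's program (a sketch; N_PREV and the kernel name are made up here). With the loop bound known at compile time, the compiler is free to unroll the loop and simplify the indexing:

```c
/* Built once per layer, with e.g. "-D N_PREV=64" in the clBuildProgram options,
   so the dot-product loop bound is a compile-time constant. */
__kernel void layer_forward(__global const float *weights,   /* [n_neurons][N_PREV] */
                            __global const float *prev_act,  /* activations of the previous layer */
                            __global float *act)             /* activations of this layer */
{
    int neuron = get_global_id(0);
    const __global float *w = weights + neuron * N_PREV;

    float sum = 0.0f;
    for (int i = 0; i < N_PREV; ++i)       /* constant bound: unrollable */
        sum += w[i] * prev_act[i];

    act[neuron] = 1.0f / (1.0f + exp(-sum));
}
```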
There will be one NDRange enqueue instruction per layer. The NDRange will have only one dimension, and its global index will go from 0 to Nb_neurons-1 for that layer. The completion OpenCL event of the previous layer's enqueue will be a prerequisite for the execution of this NDRange.
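Host-side, the per-layer launches and their ordering could then look roughly like this (a sketch with illustrative names; kernel arguments are assumed to have been set already, and error checking is omitted):

```c
#include <CL/cl.h>

/* Launch the layers one after the other, each launch waiting on the
   completion event of the previous layer's launch. */
static void forward_pass(cl_command_queue queue,
                         cl_kernel *layer_kernels,   /* one specialized kernel per layer */
                         const size_t *layer_sizes,  /* number of neurons per layer */
                         int n_layers)
{
    cl_event prev_done = NULL;

    for (int layer = 1; layer < n_layers; ++layer) {
        cl_event this_done;
        /* 1-D NDRange: one work-item per neuron of this layer. */
        clEnqueueNDRangeKernel(queue, layer_kernels[layer],
                               1, NULL, &layer_sizes[layer], NULL,
                               prev_done ? 1 : 0,
                               prev_done ? &prev_done : NULL,
                               &this_done);
        if (prev_done)
            clReleaseEvent(prev_done);
        prev_done = this_done;
    }

    if (prev_done) {
        clWaitForEvents(1, &prev_done);   /* wait for the output layer to finish */
        clReleaseEvent(prev_done);
    }
}
```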
From the previous section, the kernel definition is straightforward. It takes a minimal number of arguments and completely avoids branching logic.
The kernel's arguments shall be conceptually limited to:
For the weights:
For the neurons' activation values:
In the context of multiple networks with the same structure (ensemble methods like bagging or boosting):
It is important that multiple networks can be executed in parallel. We chose to limit the parallelism inside a single network to avoid synchronization, and therefore the layers have to be computed one after the other. There may not be enough work in a single layer to keep OpenCL devices busy.
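One possible way to feed the device more work per launch, under these assumptions, is to make the per-layer NDRange two-dimensional: dimension 0 indexes the neuron within the layer and dimension 1 indexes the network, so the same layer of every network in the ensemble is computed in a single enqueue (global size {n_neurons, n_networks}). The kernel below is an illustrative variant of the per-layer kernel, reusing the hypothetical N_PREV define:

```c
/* Dimension 0 = neuron within the layer, dimension 1 = network of the ensemble. */
__kernel void layer_forward_ensemble(__global const float *weights,   /* [n_networks][n_neurons][N_PREV] */
                                     __global const float *prev_act,  /* [n_networks][N_PREV] */
                                     __global float *act,             /* [n_networks][n_neurons] */
                                     const int n_neurons)
{
    int neuron  = get_global_id(0);
    int network = get_global_id(1);

    const __global float *w = weights + ((size_t)network * n_neurons + neuron) * N_PREV;
    const __global float *x = prev_act + (size_t)network * N_PREV;

    float sum = 0.0f;
    for (int i = 0; i < N_PREV; ++i)
        sum += w[i] * x[i];

    act[network * n_neurons + neuron] = 1.0f / (1.0f + exp(-sum));
}
```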