Performance slightly low
N-dimensional parallel functions package for Octave
Brought to you by:
ederag
As reported by "user1129812" in the first comment to
http://stackoverflow.com/a/25959122/3565696,
using "ndpar_arrayfun" reduces the running time to about 65% of that of "cellfun", yet it seems slightly slower than "parcellfun".
This is strange, as the speedups are usually higher than 3 on a quad-core machine (under Linux). What was the operating system?
Compared with "parcellfun" (reduction to 60%, see the 4th comment to http://stackoverflow.com/a/25945953/3565696): strange, since "ndpar" was actually intended to be slightly faster. Did the 60% reduction include everything, including the mat2cell and cell2mat lines?
Reproduced here, and even worse for my sample function. With such a fast F, the overhead of parallelization is too high. From another comment by "user1129812" it seems that fast functions were used there as well. To be confirmed.
I am "user1129812".
My operating system is "Ubuntu 12.04 amd64", and I am using Octave 3.8.1 on an "Intel i5-2500" quad-core desktop.
The running-time reduction to 60%/65% of that of "cellfun" does include everything, including the mat2cell and cell2mat lines (for "cellfun" and "parcellfun" only), plus several lines of matrix multiplication/addition (which I believe are not time-critical).
In fact, F = @(a,B) sum(bsxfun(@times, a, B),2)'; where B has several hundred rows and several tens of thousands of columns.
But I am not sure whether "F" is too fast for effective parallelization.
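For reference, with a being a 1×m row vector and B an n×m matrix, this F reduces to a single matrix product a*B'; a minimal sketch (sizes are arbitrary, for illustration only):

```octave
% F as used above: broadcast a over the rows of B, sum along dimension 2
F = @(a, B) sum(bsxfun(@times, a, B), 2)';

a = rand(1, 6);   % 1 x m row vector (arbitrary small sizes)
B = rand(4, 6);   % n x m matrix

% Element i of the output is dot(a, B(i,:)), so F(a, B) equals a*B'
assert(norm(F(a, B) - a*B') < 1e-12);
```

Since each call collapses to one vectorized matrix product, very little work is left per call, which fits the "F is too fast for effective parallelization" concern.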
Last edit: Lawrence Tsang 2014-09-23
With
k = 300;   # rows of A
m = 20000; # columns of A and B
n = 300;   # rows of B
F(A(1,:), B); takes about 40 ms, and the speedup is about 3.3 (parcellfun calculation) or 2.0 (ndpar), compared with a serial (for) version. There is too much bookkeeping in ndpar for such fast functions; I'll investigate that.
But your results are different. Increasing m and n too much increases memory usage (since several F executions run at once) and can ruin the parallelization advantage. Could that be the case?
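The 40 ms figure can be checked with a plain tic/toc timing; a sketch, using the sizes above and an arbitrary repetition count:

```octave
k = 300; m = 20000; n = 300;
A = rand(k, m);
B = rand(n, m);
F = @(a, B) sum(bsxfun(@times, a, B), 2)';

reps = 10;                       % arbitrary repetition count
tic;
for i = 1:reps
  r = F(A(1,:), B);
end
printf("one F call: %.1f ms\n", 1000 * toc / reps);
```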
I don't know how you define "speedup". Here are my cases for comparison.
k = 10;
m = 60000;
n = 785;
F = @(a,B) sum(bsxfun(@times, a, B),2)';
f = @(a) F(a,B);
%----------
switch (method_choice)
  case 1  # ndpar_arrayfun
    pkg load ndpar;
    nproc = 3;
    result = ndpar_arrayfun(nproc, F, A, B, "IdxDimensions", [1, 0], "CatDimensions", [1], "VerboseLevel", 0);
  case 2  # parcellfun (needs the cell array, not the matrix A)
    A_cell = mat2cell(A, ones(1, size(A,1)));
    pkg load parallel;
    nproc = 3;
    result_cell = parcellfun(nproc, f, A_cell, "UniformOutput", false, "VerboseLevel", 0);
    result = cell2mat(result_cell);
  case 3  # serial cellfun (likewise on A_cell)
    A_cell = mat2cell(A, ones(1, size(A,1)));
    result_cell = cellfun(f, A_cell, "UniformOutput", false);
    result = cell2mat(result_cell);
endswitch
%----------
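For reference, the serial (for) baseline used in the speedup comparison is not part of the snippet above; a plausible sketch would be:

```octave
% Hypothetical serial baseline: apply F to each row of A in a plain loop
result = zeros(size(A, 1), size(B, 1));
for i = 1:size(A, 1)
  result(i, :) = F(A(i, :), B);
end
```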
The "switch" statement is executed 100 times. The running times are roughly: cellfun 200, ndpar 130, parcellfun 120.
Hope it helps.
So your speedups (with respect to the serial version) are 200/130 ~ 1.5 and 200/120 ~ 1.7 for ndpar and parcellfun, respectively. These speedups are too low; memory usage might be the culprit. To ascertain that, you can run "free -h" on the command line. Comparing the results while parcellfun or cellfun is executing would be interesting.
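Memory snapshots can also be taken from inside Octave via system(); a sketch (Linux only, assumes "free" is available):

```octave
% Snapshot system memory right before and after the parallel call
printf("before:\n"); system("free -h");
result = ndpar_arrayfun(nproc, F, A, B, "IdxDimensions", [1, 0], "CatDimensions", [1], "VerboseLevel", 0);
printf("after:\n");  system("free -h");
```

Watching "free -h" from a second terminal while the workers are running shows the peak usage more directly.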