Performance slightly low
N-dimensional parallel functions package for Octave
Brought to you by:
ederag
As reported by "user1129812" in the first comment to
http://stackoverflow.com/a/25959122/3565696,
using "ndpar_arrayfun" reduces the running time to about 65% of that of "cellfun", yet it seems slightly slower than "parcellfun".
This is strange, as the speedups are usually higher than 3 on a quad-core machine (under Linux). What was the operating system?
Compared with "parcellfun" (reduction to 60%, see the 4th comment to http://stackoverflow.com/a/25945953/3565696): strange, since "ndpar" was actually intended to be slightly faster. Did the 60% reduction include everything, including the mat2cell and cell2mat lines?
Reproduced here, and even worse for my sample function. With such a fast F, the overhead of parallelization is too high. From another comment by "user1129812" it seems that fast functions were used there as well. To be confirmed.
I am "user1129812".
My operating system is "Ubuntu 12.04 amd64", and I am using Octave 3.8.1 on an "Intel i5-2500" quad-core desktop.
The running-time reduction to 60%/65% of that of "cellfun" does include everything, including the mat2cell and cell2mat lines (for "cellfun" and "parcellfun" only), plus several lines of matrix multiplication/addition (which I believe are not time-critical).
In fact, F = @(a,B) sum(bsxfun(@times, a, B),2)'; where B has several hundred rows and several tens of thousands of columns.
But I am not sure whether "F" is too fast for effective parallelization.
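For reference, with a being a 1×m row vector and B an n×m matrix, this F reduces to a single matrix product a*B'; a minimal sketch (sizes are arbitrary, for illustration only):

```octave
% F as used above: broadcast a over the rows of B, sum along dimension 2
F = @(a, B) sum(bsxfun(@times, a, B), 2)';

a = rand(1, 6);   % 1 x m row vector (arbitrary small sizes)
B = rand(4, 6);   % n x m matrix

% Element i of the output is dot(a, B(i,:)), so F(a, B) equals a*B'
assert(norm(F(a, B) - a*B') < 1e-12);
```

Since each call collapses to one vectorized matrix product, very little work is left per call, which fits the "F is too fast for effective parallelization" concern.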
Last edit: Lawrence Tsang 2014-09-23
With
k = 300;   # rows of A
m = 20000; # columns of A and B
n = 300;   # rows of B
F(A(1,:), B); takes about 40 ms, and the speedup is about 3.3 (parcellfun calculation) or 2.0 (ndpar), compared with a serial (for) version. There is too much bookkeeping in ndpar for such fast functions; I'll investigate that.
But your results are different. Increasing m and n too much increases memory usage (since several F executions run at once) and can ruin the parallelization advantage. Could that be the case?
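The 40 ms figure can be checked with a plain tic/toc timing; a sketch, using the sizes above and an arbitrary repetition count:

```octave
k = 300; m = 20000; n = 300;
A = rand(k, m);
B = rand(n, m);
F = @(a, B) sum(bsxfun(@times, a, B), 2)';

reps = 10;                       % arbitrary repetition count
tic;
for i = 1:reps
  r = F(A(1,:), B);
end
printf("one F call: %.1f ms\n", 1000 * toc / reps);
```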
I don't know how you define "speedup". Here are my cases for comparison.
k = 10;
m = 60000;
n = 785;
F = @(a,B) sum(bsxfun(@times, a, B),2)';
f = @(a) F(a,B);
%----------
switch (method_choice)
  case 1  # ndpar_arrayfun
    pkg load ndpar;
    nproc = 3;
    result = ndpar_arrayfun(nproc, F, A, B, "IdxDimensions", [1, 0], "CatDimensions", [1], "VerboseLevel", 0);
  case 2  # parcellfun (needs the cell array, not the matrix A)
    A_cell = mat2cell(A, ones(1, size(A,1)));
    pkg load parallel;
    nproc = 3;
    result_cell = parcellfun(nproc, f, A_cell, "UniformOutput", false, "VerboseLevel", 0);
    result = cell2mat(result_cell);
  case 3  # serial cellfun (likewise on A_cell)
    A_cell = mat2cell(A, ones(1, size(A,1)));
    result_cell = cellfun(f, A_cell, "UniformOutput", false);
    result = cell2mat(result_cell);
endswitch
%----------
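For reference, the serial (for) baseline used in the speedup comparison is not part of the snippet above; a plausible sketch would be:

```octave
% Hypothetical serial baseline: apply F to each row of A in a plain loop
result = zeros(size(A, 1), size(B, 1));
for i = 1:size(A, 1)
  result(i, :) = F(A(i, :), B);
end
```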
The "switch" statement is executed 100 times. The running times are roughly: cellfun 200, ndpar 130, parcellfun 120.
Hope it helps.
So your speedups (with respect to the serial version) are 200/130 ~ 1.5 and 200/120 ~ 1.7 for ndpar and parcellfun, respectively. These speedups are too low; memory usage might be the culprit. To ascertain that, you can run "free -h" on the command line. Comparing the results while parcellfun or cellfun is executing would be interesting.
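Memory snapshots can also be taken from inside Octave via system(); a sketch (Linux only, assumes "free" is available):

```octave
% Snapshot system memory right before and after the parallel call
printf("before:\n"); system("free -h");
result = ndpar_arrayfun(nproc, F, A, B, "IdxDimensions", [1, 0], "CatDimensions", [1], "VerboseLevel", 0);
printf("after:\n");  system("free -h");
```

Watching "free -h" from a second terminal while the workers are running shows the peak usage more directly.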