hi there,
for part of my master's project i'm profiling/optimising a large simulation environment that uses it++.
while profiling, i've noticed that a sizable portion of the execution time (maybe 5-10%) seems to be spent in various functions related to memory allocation stemming from the Vec/Mat/Array classes. looking in each of these, we have pseudo-code like the following:
alloc(n)
    if cursize != n
        free
        create_elements(n)
    cursize = n
the problem is that this function is called not just from the constructor and set_size, but also indirectly through many different paths, such as the overloaded operators.
internally within our environment, i've modified the code to be something like:
alloc(n)
    growsize = totalsize    // new variable totalsize holds the buffer capacity, which is >= cursize
    if growsize < n
        do growsize *= factor while growsize < n
        free
        create_elements(growsize)
    totalsize = growsize
    cursize = n
or in other words: when shrinking we don't free/reallocate, and when growing we grow in powers of a "factor", which should probably depend on the dimensionality of the object in question. the result is that a little wasted space is traded for a substantial time savings, which in our specific case is worthwhile. choosing factor so that factor^d*n < 2n, i.e. factor^d < 2, where d is the dimensionality of the object (e.g. factor = 1.4 for a 2-d Mat, since 1.4^2 = 1.96), guarantees that at most O(n) space is wasted; reallocation happens at most O(lg n) times, and in practice O(1) times.
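roughly, in C++ the modified allocator boils down to something like this (just a sketch: LazyBuffer and its members are stand-ins mirroring the pseudo-code above, not the actual it++ internals, and copying is omitted for brevity):

template <typename Num_T>
class LazyBuffer {
public:
    LazyBuffer() : data(0), cursize(0), totalsize(0) {}
    ~LazyBuffer() { delete[] data; }

    void alloc(int n) {
        int growsize = totalsize;          // current capacity, >= cursize
        if (growsize < n) {                // reallocate only when growing
            if (growsize == 0) growsize = 1;
            while (growsize < n) growsize *= factor;
            delete[] data;                 // free + create_elements(growsize)
            data = new Num_T[growsize];
            totalsize = growsize;
        }
        cursize = n;                       // shrinking only adjusts the logical size
    }

private:
    static const int factor = 2;   // growth factor, cf. the factor^d discussion above
    Num_T* data;
    int cursize;     // logical number of elements
    int totalsize;   // allocated capacity
};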
so the questions i pose to you are: would you be interested in patches that implement this? and if so, how would you like it implemented?
for the second question, i'd bet that there are environments where space is more important than time, so simply swapping the old behaviour out for the new wouldn't be viable. instead, i could see two optional routes:
- create a new class inherited from Vec/Mat/Array that exports public functions for controlling memory management in an object
- use partial templating to provide a way to declare this at compile time
of these the first is probably the easier, but the second puts all the work at compile time, which in most cases is where you'd want it anyway. both could be done while keeping backwards compatibility at the API level and/or in the default memory-management behaviour.
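as a rough sketch of the second route, assuming a policy parameter is acceptable (none of these names exist in it++ today):

// hypothetical allocation policies, chosen at compile time
struct ExactAlloc {             // current behaviour: capacity == requested size
    static int capacity(int n) { return n; }
};

struct GreedyAlloc {            // proposed behaviour: grow geometrically
    static int capacity(int n) {
        int c = 1;
        while (c < n) c *= 2;
        return c;
    }
};

// a Vec-like container taking the policy as an extra template parameter;
// defaulting to ExactAlloc keeps existing code source-compatible
template <typename Num_T, typename Alloc = ExactAlloc>
class Vec {
    // ... set_size(n) would allocate Alloc::capacity(n) elements ...
};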
another approach would be to implement some mid-level memory management system, so that instead of calling new/delete in create_elements, a pool of memory segments would be kept and re-used throughout the lifetime of the program. that's quite a bit more complicated, though.
questions/comments/thoughts/etc welcome,
sean
Sean,
I think the optimal solution you are looking for is Expression Templates. Using this technique in the Vector and Matrix classes would eliminate all the extra temporary variables.
Here are a few resources:
http://ubiety.uwaterloo.ca/~tveldhui/papers/Expression-Templates/exprtmpl.html
http://met.sourceforge.net/
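A stripped-down illustration of the technique (toy code, not the IT++ classes): b + c builds a lightweight proxy instead of a temporary Vector, and operator= evaluates the whole expression element-wise in one pass.

#include <vector>

// Sum represents l + r without evaluating it
template <typename L, typename R>
struct Sum {
    const L& l;
    const R& r;
    Sum(const L& l_, const R& r_) : l(l_), r(r_) {}
    double operator[](int i) const { return l[i] + r[i]; }
};

struct Vector {
    std::vector<double> v;
    explicit Vector(int n) : v(n) {}
    double  operator[](int i) const { return v[i]; }
    double& operator[](int i)       { return v[i]; }

    template <typename E>
    Vector& operator=(const E& e) {     // no intermediate Vector allocated
        for (int i = 0; i < (int)v.size(); ++i) v[i] = e[i];
        return *this;
    }
};

inline Sum<Vector, Vector> operator+(const Vector& a, const Vector& b) {
    return Sum<Vector, Vector>(a, b);
}

// a = b + c;   // one loop, no temporary (longer chains need more overloads)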
Hi Sean,
Your proposal is very interesting. Could you give some examples of how substantial a speed-up you obtained, e.g. for some of our tutorial examples?
As for the issue, please open a new Feature Request tracker item, so we won't forget about it. Of course patches are welcome, but against our SVN trunk sources (or possibly against the latest development version 3.99.x).
As for the implementation, the second proposal, which changes the behaviour at compile time, seems the most attractive. But at first you could provide some preliminary patches that simply replace the current memory management model, so we can test it on a separate branch.
Alternatively, we could go for configure-time switching between the two approaches (e.g. --enable-fast-mem-alloc), but this is not an elegant or flexible solution.
Thanks for your great ideas!
BR,
/ediap
hi adam,
currently the numbers for the kind of improvement we get from scaling are mixed in with the numbers from choosing the right overloaded operators (i.e. = vs the self-assignment forms like +=), but i'll see what i can do about separating the two. it could be that most of the gains i've been seeing were from the latter, in which case that would be a good argument for a full-blown memory manager instead of calling new/delete directly.
but for some current numbers:
i've been using the create_elements function as a rough metric, since this seems to be where the allocation actually occurs, though the total cost really also includes set_size + alloc + the constructors for each of the datatypes. initially we were seeing something like ~2 billion calls to this function, which by itself took ~3% of the total execution time of the program (the total time being ~6 hours). after making the management a little "lazier", plus the above-mentioned operator changes, this has been brought down to ~200 million calls and a ~1-2% decrease in total simulation time. however, this was with the scaling factor set to 2 even for multi-dimensional Array<Array<Mat<Num_T> > > type objects. i'm currently running a simulation with different scaling factors to see if tweaking them has a noticeable effect.
as far as providing patches against svn goes, the changes so far are pretty small, so it shouldn't be too hard to forward-port them to the devel version from our version, which is a little behind the current one. i can supply a rough proof-of-concept where alloc/set_size use hard-coded scaling factors, or if you'd prefer i can try my hand at templating it.
btw, do you know off-hand of a particular test case in your tests dir that makes extensive use of a dynamically growing/shrinking vec/mat/array? i could construct a rather contrived case, for(i=0;i<INT_MAX;i++) v.set_size(i);, but that's not quite realistic :) i figure i won't be allowed to post code from this simulation environment, so it'd be nice to have a publicly available metric for providing benchmark results.
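in the meantime, a slightly less contrived micro-benchmark for the resizing path might look something like this (sizes and iteration count picked out of thin air):

#include <itpp/itbase.h>

int main()
{
    itpp::vec v;
    // repeatedly resize to varying sizes, the pattern the lazy allocator speeds up
    for (int i = 0; i < 10000000; ++i) {
        v.set_size(1 + (i * 31) % 4096);   // sizes bounce around below 4096
    }
    return 0;
}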
Hi Sean,
A 1-2% decrease in total simulation time is not much in a typical application. I had hoped for an improvement of at least 5-10% to make it worth including at the cost of the wasted memory.
As for the patches, hard-coded scaling parameters are OK for preliminary tests, provided that they are chosen more or less carefully ;)
Anyway, I look forward to further tests from you.
I think good measures of typical usage of IT++ for scientific work are simulations of communication systems, e.g.:
http://itpp.sourceforge.net/devel/qpsk_simulation.html
http://itpp.sourceforge.net/devel/convcode.html
http://itpp.sourceforge.net/devel/ldpc_gen_codes.html
http://itpp.sourceforge.net/devel/ldpc_bersim_awgn.html
But I do not think there is much growing/shrinking of containers in such examples.
BR,
/ediap
hi adam,
sorry for the delay in getting back to you; i guess you could say i've been multitasking this with other things.
anyway, looking into this matter further, i've come to the conclusion that resizing/scaling these element arrays doesn't have a huge benefit. ah well, at least it was a nice exercise :)
however, it has led me to a slightly different focus within the memory management...
as i said earlier, memory allocation seems to be a significant portion of runtime (somewhere around 5%) in our environment, but it seems that the largest portion of this comes from what you might call "incidental" allocations, where the allocation results from either overloaded operators or entering/leaving a scope many times.
two examples:
(1)
Matrix<double> a(somesize), b(somesize);
// ... values get put into a and b
for(int i = 0; i < somethingbig; i++ ) { a = a + ( a * b ); }
(2)
void bar();   // defined below

void foo() {
    for (int i = 0; i < somethingbig; i++) { bar(); }
}

void bar() {
    Matrix<double> a(somesize), b(somesize);
    // ... stuff done with a and b
}
in case (1) you have not only a and b, but also a third matrix to hold the result of (a * b) and a fourth to hold the sum of a and that result. a smart coder who was fully aware of what was going on with the operators could partially circumvent this by using the self-assignment operators (or using other itpp functions where the result is stored in one of the passed-by-reference arguments, or adding another couple of for loops and the () operators to do the assignments manually), but i doubt most people pay that close attention. so the result is somethingbig*2 calls to the Matrix constructor/destructor, with the corresponding memory management calls for their data.
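for instance, the loop in (1) could at least use the self-assignment form, which drops the temporary for the sum and the copy back into a (the a * b product still needs one temporary per iteration):

for (int i = 0; i < somethingbig; i++) {
    a += a * b;     // in-place add: no fourth matrix, no assignment copy
}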
in case (2), however, there is no easy fix apart from moving the declarations for a and b "up" into the calling function (or storing them as static/class variables). again, the fully attentive programmer might figure this out, but i'd wager that more often such issues slip through. so again, you have somethingbig*2 allocations/deallocations.
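i.e. the hoisted version of (2), re-using the same hypothetical names, would be something like:

void bar(Matrix<double>& a, Matrix<double>& b) {
    // ... stuff done with a and b
}

void foo() {
    Matrix<double> a(somesize), b(somesize);    // allocated once, not somethingbig times
    for (int i = 0; i < somethingbig; i++) { bar(a, b); }
}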
assuming that most of the cases are like the above, we can observe a few trends:
- most allocations tend to be clustered around certain size values
- most allocations tend to be temporally clustered
- most allocations tend to be temporally independent of each other (that is, out of the somethingbig*2 allocations, never more than a handful of objects are needed simultaneously)
so my thought: why not implement some form of pool or cache for these chunks, so that allocated buffers are not immediately freed but instead stored somewhere convenient for later re-use?
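as a very rough sketch of what i have in mind (hypothetical names; a real version would need locking, trimming, and cleanup at shutdown):

#include <cstddef>
#include <map>
#include <vector>

// free-list cache keyed on chunk size: create_elements/destroy_elements
// would call get/put here instead of new[]/delete[]
class ChunkCache {
public:
    void* get(std::size_t bytes) {
        std::vector<void*>& list = free_[bytes];
        if (!list.empty()) {               // re-use a previously freed chunk
            void* p = list.back();
            list.pop_back();
            return p;
        }
        return ::operator new(bytes);      // fall back to the system allocator
    }

    void put(void* p, std::size_t bytes) {
        free_[bytes].push_back(p);         // cache the chunk for later re-use
    }

private:
    std::map<std::size_t, std::vector<void*> > free_;
};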
i've started some work on something along these lines, but i wonder what your thoughts are on the matter. is it worthwhile, or would it be better to instead spend the time training people to use the various workarounds mentioned above?
sean