From: Matthew L. <mat...@gm...> - 2009-03-20 16:01:05
On Mar 20, 2009, at 9:19 AM, Ian Scott wrote:

> Marcus Brubaker wrote:
>>>> Fair enough. In that case I would add methods to the base class
>>>> which support evaluating the density at multiple points, i.e.,
>>>> they take as input a matrix where each column (or row) is a
>>>> single element and they return a vector with the pdf evaluated
>>>> at each point. This would allow efficient PDF evaluation at
>>>> multiple points. The nice thing is that a default implementation
>>>> in terms of prob_density can be used to reduce overhead for
>>>> people implementing distributions.
>
> Matthew Leotta wrote:
>>> I have to admit, my experience in implementing various
>>> distributions is quite limited. It hadn't crossed my mind that it
>>> could be significantly more efficient to evaluate a distribution
>>> at many points in a single function call (rather than a function
>>> call for each point). Is your reasoning for this based on virtual
>>> function overhead and multiple evaluations of a normalization
>>> constant, or are there other factors? Could you give an example
>>> of a distribution that would benefit from this?
>
> Marcus Brubaker wrote:
>> All of the above, really. Virtual function overhead and
>> normalization constants can hurt. More generally, though, being
>> able to use SIMD instructions and preserving cache locality can be
>> a big win. Pretty much any distribution could be helped by this if
>> you're evaluating at more than a few points. E.g., for a
>> multivariate Gaussian, one matrix-matrix multiplication is
>> typically cheaper (due to SIMD instructions, cache locality, etc.)
>> than multiple matrix-vector multiplies.
>
> For anything other than a trivial PDF (possibly a 1D Epanechnikov
> quadratic kernel?), I would have thought that the virtual function
> call cost would be trivial.

The virtual functions do seem to be trivial with respect to calling
time.  However, the existence of any virtual function adds to the
size of the class, and this may not be trivial for memory allocation.
Consider a scalar Gaussian of type float.  It requires 8 bytes (4 for
each of the two floats, mean and variance).  When you use virtual
functions you need 12 bytes (4 more for the virtual function table
pointer).  That's 50% more, which is not trivial when you have an
array of a million distributions.  This is the main reason I added
vpdt and went back to the idea of a pure template library that is
wrapped by the polymorphic strategy tree in vpdl.  The lack of
virtual functions may also allow more inlining and optimization by
some compilers.

> Whilst the arguments for SIMD compatibility and cache locality are
> true, the problems with applying these arguments are that
>
> 1. They pretty much require vnl_matrix types for parameters and
> results, destroying the type idioms we use in VXL and C++ in
> general. In extremis you have Matlab's everything-is-a-matrix mess.
>
> 2. If you apply this argument everywhere you end up massively
> multiplying your API to deal with all cases where SIMD speedups
> might be needed - since your particular matrix folding will depend
> on the particular cache-locality characteristics of your data
> configuration and algorithms.
>
> 3. Reordering code for SIMD and cache locality really should be the
> domain of the optimising compiler - although I should admit to the
> endlessly-dashed optimism I have towards compiler performance.
>
> I would suggest one of the following options.
>
> 1. Not providing multi-sample methods in the general base-class
> vpdl API. If someone wants a specific case they can write their own
> private code tailored to their requirements. At best it could be
> added to either the relevant vpdt PDF class or a derived vpdl PDF
> class.
>
> 2. Providing a general multi-sample API that uses iterators. If the
> iterators point to compact memory, then you will get cache locality
> for free, and SIMD if you do some careful iterator-traits template
> wizardry. This way you maintain the integrity of the type system
> and limit the API size increase.
>
> I kind of like solution 2 because it also ties in with a good
> design for the builders. The use of iterators could merge the API
> for batch and online learning of PDF parameters. Currently vpdfl
> uses mbl_data_wrapper, which is vaguely equivalent to a pair of STL
> iterators. Replacing this with proper iterator pairs could make the
> code much more accessible to existing C++ programmers, and leverage
> the existing iterator-based algorithms.

I agree that 2 is better.  Each distribution should know how to
handle a single point, and external functions can iterate over
multiple points.  This is why I've added virtual functions to the
vpdl base distribution for the normalization constant and the
unnormalized density.  The biggest savings will probably come from
computing the normalization once and then applying it to each
unnormalized density as it is computed.
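Something along these lines is what I have in mind.  This is only a
sketch: the free function name and the accessors norm_const() and
unnormalized_density() are placeholders for whatever the new vpdl
virtual functions end up being called.

  // Evaluate the density of 'dist' at each sample in
  // [sample_begin, sample_end), writing the results through
  // 'density_out'.  The normalization constant is computed once and
  // applied to each unnormalized density, instead of being
  // recomputed for every sample.
  template <class dist_T, class in_iterator, class out_iterator>
  void vpdl_eval_densities(const dist_T& dist,
                           in_iterator sample_begin,
                           in_iterator sample_end,
                           out_iterator density_out)
  {
    const double norm = dist.norm_const();
    for (in_iterator itr = sample_begin; itr != sample_end; ++itr)
      *density_out++ = norm * dist.unnormalized_density(*itr);
  }

If the samples sit in contiguous memory this gives Ian's cache
locality for free, and a distribution that really benefits from a
SIMD formulation could still provide an overload of the free
function.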
> The only trick is to avoid the huge template instantiation sizes
> of using templated iterators with complicated algorithms. One way
> around this is to provide a polymorphic strategy tree of iterator
> types, and use this where there are no speed optimisations.

I'm not convinced that huge template instantiation will be an issue.
I think most of these functions are simple enough to be inlined in a
header file and only instantiated by the user as needed (i.e., not
instantiated in libvpdl for several possible iterator types).

Matt
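P.S.  The size overhead I mentioned above is easy to check with a
toy program.  The exact numbers depend on the platform and ABI; the
8 vs. 12 figures assume 4-byte pointers, while a 64-bit build gives
8 vs. 16 once the vtable pointer and alignment padding are counted.

  #include <iostream>

  // Plain struct: just the two floats, no vtable pointer.
  struct gauss_plain   { float mean; float var; };

  // Adding any virtual function forces a vtable pointer per object.
  struct gauss_virtual
  {
    virtual ~gauss_virtual() {}
    float mean; float var;
  };

  int main()
  {
    std::cout << sizeof(gauss_plain) << ' '      // 8
              << sizeof(gauss_virtual) << '\n';  // 12 (32-bit), 16 (64-bit)
    return 0;
  }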