From: Ian S. <ian...@gm...> - 2009-03-20 13:19:55
Marcus Brubaker wrote:
>>> Fair enough. In that case I would add methods to the base class
>>> which support evaluating the density at multiple points, i.e., they
>>> take as input a matrix where each column (or row) is a single
>>> element, and they return a vector with the PDF evaluated at each
>>> point. This would allow efficient PDF evaluation at multiple
>>> points. The nice thing is that a default implementation in terms of
>>> prob_density can be used to reduce overhead for people implementing
>>> distributions.

Matthew Leotta wrote:
>> I have to admit, my experience in implementing various distributions
>> is quite limited. It hadn't crossed my mind that it could be
>> significantly more efficient to evaluate a distribution at many
>> points in a single function call (rather than a function call for
>> each point). Is your reasoning for this based on virtual function
>> overhead and multiple evaluations of a normalization constant, or
>> are there other factors? Could you give an example of a distribution
>> that would benefit from this?

Marcus Brubaker wrote:
> All of the above really. Virtual function overhead and normalization
> constants can hurt. More generally, though, being able to use SIMD
> instructions and preserving cache locality can be a big win. Pretty
> much any distribution could be helped by this if you're evaluating at
> more than a few points. E.g., for a multivariate Gaussian, one
> matrix-matrix multiplication is typically cheaper (due to SIMD
> instructions, cache locality, etc.) than multiple matrix-vector
> multiplies.

For anything other than a trivial PDF (a 1D Epanechnikov quadratic
kernel, perhaps?), I would have thought that the virtual function call
cost would be negligible. Whilst the arguments for SIMD compatibility
and cache locality are valid, the problems with applying them are that:

1. They pretty much require vnl_matrix types for parameters and
results, destroying the type idioms we use in VXL and C++ in general.
In extremis you end up with Matlab's everything-is-a-matrix mess.

2. If you apply this argument everywhere, you end up massively
multiplying your API to deal with every case where SIMD speedups might
be needed - since the particular matrix folding you want will depend on
the cache-locality characteristics of your data configuration and
algorithms.

3. Reordering code for SIMD and cache locality really should be the
domain of the optimising compiler - although I should admit to the
endlessly-dashed optimism I have towards compiler performance.

I would suggest one of the following options.

1. Not providing multi-sample methods in the general base-class vpdl
API. If someone wants a specific case they can write their own private
code tailored to their requirements; at best it could be added to the
relevant vpdt PDF class, or a derived vpdl PDF class.

2. Providing a general multi-sample API that uses iterators. If the
iterators point to compact memory, then you get cache locality for
free, and SIMD too if you do some careful iterator-traits template
wizardry. This way you maintain the integrity of the type system and
limit the growth of the API.

I kind of like solution 2 because it also ties in with a good design
for the builders. The use of iterators could merge the API for batch
and online learning of PDF parameters. Currently vpdfl uses
mbl_data_wrapper, which is vaguely equivalent to a pair of STL
iterators.
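To make solution 2 concrete, here is a rough sketch of the kind of
signature I have in mind (all names here are invented for illustration
- this is not the actual vpdl/vpdt API):

  #include <vector>

  // Stand-in for a vpdl-style distribution with a virtual
  // single-sample method; the real interface differs.
  class pdf_base
  {
  public:
    virtual ~pdf_base() {}
    virtual double prob_density(const std::vector<double>& pt) const = 0;
  };

  // Generic multi-sample evaluation over an iterator pair.  If the
  // iterators point into compact storage (a std::vector, a C array, a
  // row of a matrix) the traversal is cache friendly without the API
  // ever mentioning a matrix type.  A particular distribution can
  // overload this for its own type to hoist normalisation constants
  // out of the loop, or to use SIMD internally.
  template <class Pdf, class InIter, class OutIter>
  void prob_densities(const Pdf& pdf, InIter first, InIter last,
                      OutIter out)
  {
    for (; first != last; ++first, ++out)
      *out = pdf.prob_density(*first);
  }

A caller would then write something like

  std::vector<double> densities(samples.size());
  prob_densities(gauss, samples.begin(), samples.end(),
                 densities.begin());

and the same iterator-pair signature would serve equally well for a
builder's batch-learning method.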
Replacing mbl_data_wrapper with proper iterator pairs could make the
code much more accessible to existing C++ programmers, and leverage the
existing iterator-based algorithms. The only trick is to avoid the huge
template instantiation sizes that come from using templated iterators
with complicated algorithms. One way around this is to provide a
polymorphic strategy tree of iterator types, and to use that wherever
there are no speed optimisations at stake.
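By a polymorphic strategy I mean something along these lines (again,
invented names, a sketch rather than real code):

  #include <vector>

  // A type-erased sample iterator: one abstract interface behind
  // which any concrete iterator type can hide, so a complicated
  // algorithm is compiled once rather than once per iterator type.
  class sample_iterator
  {
  public:
    virtual ~sample_iterator() {}
    virtual bool done() const = 0;
    virtual void next() = 0;
    virtual const std::vector<double>& current() const = 0;
  };

  // Thin templated adapter wrapping any STL-style iterator pair whose
  // value type is std::vector<double>.
  template <class InIter>
  class sample_iterator_adapter : public sample_iterator
  {
  public:
    sample_iterator_adapter(InIter first, InIter last)
      : it_(first), end_(last) {}
    virtual bool done() const { return it_ == end_; }
    virtual void next() { ++it_; }
    virtual const std::vector<double>& current() const { return *it_; }
  private:
    InIter it_, end_;
  };

Only the trivial adapter is templated; the heavyweight algorithm (a
builder's estimation step, say) becomes an ordinary non-template
function taking a sample_iterator&, and so is instantiated exactly
once. The cost is a virtual call per sample, which is why I would
reserve this path for the cases where there are no speed optimisations
to be had.

Ian.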