From: Matthew L. <mat...@gm...> - 2009-03-20 17:21:25
|
On Mar 20, 2009, at 12:51 PM, Ian Scott wrote: > Matthew Leotta wrote: >> On Mar 20, 2009, at 9:19 AM, Ian Scott wrote: >>> Marcus Brubaker wrote: >>>>>> Fair enough. In that case I would add methods to the base >>>>>> class which support evaluating the density at multiple >>>>>> points. I.E., they take as input a matrix where each column >>>>>> (or row) is a single element and they return a vector with the >>>>>> pdf evaluated at each point. This would allow efficient PDF >>>>>> evaluation at multiple points. The nice thing is that a >>>>>> default implementation in terms of prob_density can be used to >>>>>> reduce overhead for people implementing distributions. >>>>>> >>> >>> Matthew Leotta wrote: >>>>> I have to admit, my experience in implementing various >>>>> distributions is quite limited. It hadn't crossed my mind that >>>>> it could be significantly more efficient to evaluate a >>>>> distribution at many points in a single function call (rather >>>>> than a function call for each point). Is your reasoning for >>>>> this based on virtual function overhead and multiple >>>>> evaluations of a normalization constant, or are there other >>>>> factors? Could you give an example of a distribution that >>>>> would benefit from this? >>> Marcus Brubaker wrote: >>>> All of the above really. Virtual function overhead and >>>> normalization constants can hurt. More generally though being >>>> able to use SIMD instructions and preserving cache locality can >>>> be a big win. Pretty much any distribution could be helped by >>>> this if you're evaluating at more than a few points. E.G., for >>>> a multivariate gaussian, one matrix-matrix multiplication is >>>> typically cheaper (due to SIMD instructions, cache locality, >>>> etc) than multiple matrix vector multiplies. >>> For anything other than a trivial PDF (possibly a 1D Epanechnikov >>> quadratic kernel?), I would have thought that the virtual >>> function call cost would be trivial. >> The virtual functions do seem to be trivial with respect to >> calling time. However, the existence of any virtual functions >> adds to the size of the class, and this may not be trivial for >> memory allocation. Consider a scalar Gaussian of type float. It >> requires 8 bytes (4 for each of 2 floats for mean and variance). >> When you use virtual functions you need 12 bytes (4 more for the >> virtual function table pointer). That's 50% more and is not >> trivial when you have an array of a million distributions. >> This is the main reason I added vpdt and went back to the idea of >> a pure template library that is wrapped by the polymorphic >> strategy tree in vpdl. The lack of virtual functions may also >> allow more inlining and optimizations by some compilers. > > Oh - absolutely. I was only considering the vpdl strategy tree case, > where you have already agreed to the VFPT pointer cost. > >>> Whilst the arguments for SIMD compatibility and cache-locality is >>> true, the problems with applying these arguments are that >>> >>> 1. They pretty-much require vnl_matrix types for parameters and >>> results, destroying the type idioms we use in VXL and C++ in >>> general. In extremis you have Matlab's everything-is-a-matrix mess. >>> >>> 2. If you apply this argument everywhere you end up massively >>> multiplying your API to deal with all cases where SIMD speedups >>> might be needed - since your particular matrix folding will >>> depend on the particular cache-locality characteristics of your >>> data configuration and algorithms. >>> >>> 3. Reordering code for SIMD and cache locality really should be >>> the domain of the optimising compiler - although I should admit >>> to the endlessly-dashed optimism I have towards compiler >>> performance. >>> >>> I would suggest one of the following options. >>> >>> 1. Not providing multi-sample methods in the general base-class >>> vpdl API. If someone wants a specific case they can either write >>> their own private code tailored to their requirements. At best it >>> could be added to either the relevant vpdt PDF class, or derived >>> vpdl PDF class >>> >>> 2. Providing a general multi-sample API that uses iterators. If >>> the iterators point to compact memory, then you will get cache >>> locality for free and SIMD if you do some careful iterator-traits >>> template wizardry. This way you maintain the integrity of the >>> type system and limit the API size increase. >>> >>> I kind of like solution 2 because it also ties in with a good >>> design for the builders. The use of iterators could merge the API >>> for batch and online learning of PDF parameters. Currently vpdfl >>> uses mbl_data_wrapper which vaguely equivalent to a pair of STL >>> iterators. Replacing this with proper iterator pairs could make >>> the code much more accessible to existing C++ programmers, and >>> leverage the existing iterator based algorithms. >> I agree that 2 is better. Each distribution should know how to >> handle a single point and external functions can iterate over >> multiple points. This is why I've added virtual functions to the >> vpdl base distribution for normalization constant and unnormalized >> density. The biggest savings will probably come from computing >> normalization once and then applying it to each unnormalized >> density as they are computed. >>> The only trick is to avoid the huge template instantiation sizes >>> of using templated iterators with complicated algorithms. One >>> way around this it to proved a polymorphic strategy tree of >>> iterator types, and use this where there are no speed >>> optimisations. >>> >> I'm not convinced that huge template instantiation will be an >> issue. I think most of these functions are simple enough to be >> inlined in a header file and only instantiated by the user as >> needed (i.e. not in libvpdl for several possible iterator type). > > Well I guess it depends on what you think is too big - Some of the > PDF building functions in vpdfl are well over 100 lines. It wouldn't > be hard to think of a MAP-estimating PDF learning algorithm that > used cross-validation, higher-order statistical biases and/or > complexity control and that had 1000 LoC. I was actually only referring to the original concern that Marcus had about efficiently evaluating probabilities at a collections of points. These types of functions should be generally be under 10 LoC. I don't think builder functions should be treated the same way. However, it might be possible to have light-weight templated functions that wrap an arbitrary iterator into a polymorphic iterator and then call a builder implementation that uses the polymorphic iterator base class. This would allow the user to use any type of iterator and hide the fact that a special polymorphic iterator type is required. I haven't thought about this enough to know exactly how it might work. > > I know it isn't part of your brief, but the pattern used in vpdl may > well set the standard for future libraries that need to train > something based on lots of examples. vpdfl certainly set the > standard for a classifier API (see mul/clsfy) where some of my > private support vector machine training algorithms run into the > thousands LoC. It also set the standard for Manchester's private AAM > libraries, which have building algorithms running easily into the > 10^5 LoC. > > The additional problem with these other training algorithms is that > they often have multiple inputs - e.g. weights (which maybe used by > vpdl), or classification targets. The number of possible template > types grow polynomially. > > Still - I guess they could all be inlined - at the risk of a large > compilation time cost. In any case, any such template variation > should be limited to the templated build or density functions - it > would be disastrous if the iterator types affected the type of the > vpdl_distribution or vpdl_builder classes. We would have to put > every possible type combination and derivative into the vsl's > polymorphic loader scheme. I certainly agree. These templates for iterators should have no effect on the types used in distributions. Matt |