From: Matthew Leotta <matt.leotta@gm...>  2009-01-15 19:48:16

Peter, Ian, and anyone else interested,

In an attempt to keep things moving on this combined probability
library, I'd like to make a proposal on how to integrate bsta, pdf1d,
and vpdfl. I have a few general questions to be resolved first. If we
can come to an agreement on the general way to proceed, then I'll take
some time to make a more detailed class specification.

I've reviewed all three libraries in more detail, and I'm even more
convinced that we can't completely merge them and still meet the
requirements of both Brown and Manchester. However, I think we could
adopt the strategy pattern of vpdfl as the primary design and then put
the generic programming design of bsta in a subdirectory and make it a
pure template library. The template distributions could mirror a subset
of the strategy pattern distributions and have wrapper classes to use
template distributions in the strategy pattern framework.

For example (naming conventions subject to change):

  // uses virtual functions
  template <class T>
  class gaussian : public base_distribution<T> { /* ... */ };

  // uses virtual functions
  template <class T>
  class gaussian_ref : public gaussian<T> { /* ... */ };

and in the template library

  // no virtual functions
  template <class T, unsigned N>
  class gaussian_fixed { /* ... */ };

The gaussian_fixed class is unrelated to base_distribution. Its data is
represented in terms of vnl_vector_fixed and vnl_matrix_fixed, while the
gaussian_ref class contains vnl_vector_ref and vnl_matrix_ref with the
same data representation. So the fixed-size data can be used in the
strategy pattern framework for functions that are not speed critical.
The generic template part of the library could be used on its own, but
would only contain the basic data structures and algorithms that need to
be optimized for speed or memory layout.

What I think we would need:

1) A common set of naming conventions and data representations for
   distributions found in both designs.
   Examples:
   a) Both bsta and vpdfl have axis-aligned and full covariance
      Gaussians (with different names); vpdfl also has a truncated
      principal components Gaussian, while bsta also has a spherically
      symmetric Gaussian.
   b) vpdfl currently stores a Gaussian covariance matrix in eigenvalue
      decomposition, while bsta stores the original matrix and caches
      its inverse.

2) The strategy pattern code (vpdfl) should be templated over numeric
   type (float or double).

3) Acceptance that in the scalar case (if num_dimensions is 1 at run
   time) some scalar values will have to be represented as a vnl_vector
   of size 1 and a vnl_matrix of size 1x1.

I am most worried about number 3. This is why pdf1d exists. Can pdf1d
be a special case of vpdfl? If someone is interested in working only in
1d, will they be put off by the need to use 1d vectors and matrices?
bsta solves this with template specialization that substitutes
vnl_vector_fixed<double,N> with double when N==1.

Other things I would like to see in the strategy pattern:

1) It would be nice if the distribution were more than just a density
   (pdf). I would also like to see cumulative calculations (cdf).
   Sometimes density is not enough and you want to know the actual
   probability integrated over some area. If we can evaluate the cdf,
   then we can at least get the probability in an axis-aligned bounding
   box (by evaluating the cdf at the box corners). Some parts of bsta
   support this.

2) It would be nice if we could work out recursive estimation for the
   "builder" classes. It's nice that the mbl_data_wrapper doesn't
   require that you have all the data in memory, but it does require
   that you use all the data before you get the estimated distribution.
   If a builder supports recursive estimation, it would be nice if you
   could feed the data points one at a time and stop at any point to
   get the distribution so far.

3) It might be good to clean up the vpdfl interface a bit.
   Some functions on vpdfl_pdf_base seem a little too application- or
   distribution-specific for the general case. n_peaks() and peak(int)
   seem to make sense for mixtures of Gaussians with well-separated
   components, but they might be misleading (and difficult to compute
   accurately) in the general case. Others, like nearest_plausible,
   seem a bit application specific and maybe should be nonmember
   functions to reduce clutter.

Thoughts?

Matt

On Jan 12, 2009, at 2:07 PM, Peter Vanroose wrote:

> OK, I see your point.
> I don't know what your time frame is; maybe we'll have time to first
> arrange some kind of "common brain storm" forum; I'm definitely
> interested in joining in (although I've not used any of the three
> libraries yet; but as a mathematician/statistician I'm certainly
> interested in the topic).
> Then someone could write out new class specs (any takers?), again
> with a few days' time for others to suggest adaptations/additions/...
> Finally, we could divide work to do the actual "move", i.e.,
> implement the new classes with the implementations from the three
> existing libraries.
>
> If we find (say) 3 or 4 people having time to do this, in a
> reasonably short time span, this should be doable.
> Your second path (keeping the libraries "as is") is easier, but
> would maybe not fulfill the requirement of "make sure others are
> using bsta".
>
> -- Peter.
>
>
>> I think there are a couple of different ways of moving forward with
>> this statistical library.
>>
>> One is to bring both the template library (bsta) and the strategy
>> patterned libraries (pdf1d and vpdfl) into one library. If we take
>> this route then I think your strategy is a good one. However I'm
>> not sure who is going to write out the clean design. I think we
>> would need input from Brown and Manchester (and anyone else
>> interested) to meet everyone's needs. We don't have all the
>> interested parties sitting together.
>> I suppose this could be done over email, but someone would need to
>> take the lead and propose an initial design. I'd be willing to help
>> with fitting the bsta code into the design, but I don't think I
>> could take the lead on this as I'm currently trying to write my PhD
>> thesis.
>>
>> Another option is to consider keeping the libraries separate. bsta
>> is really a generic template library in the style of STL. After
>> cleaning out the old deprecated code you would find that the only
>> compiled files are the .cxx files in the Templates subdirectory
>> used to instantiate with some of the more common types. All of the
>> data structures and algorithms are templated so that the code can
>> be optimized at compile time for whichever data types and
>> dimensionality are chosen. This is a bit more extreme than in vnl
>> where the vnl_vector_fixed is almost always wrapped as a vnl_vector
>> for use in algorithms. I think there may be some merit to keeping
>> bsta as a pure template library. It might be useful to have a
>> strategy patterned library wrap some of the templated distribution
>> classes, but I don't see how the templated algorithms can make much
>> use of strategy patterned distribution classes. Thoughts?
>>
>> My ulterior motive here is to document the design and use of some
>> libraries I've written and used in my thesis. I'm already doing
>> this with vidl2 and trying to see if I can do the same with bsta.
>> The stipulations set forth by my advisor (Joe Mundy) are
>> 1) The library is one I have designed (at least the initial
>>    framework)
>> 2) The library is used in my thesis work.
>> 3) I am to write a VXL book chapter to document the code and
>>    include a copy as an appendix in my thesis.
>> 4) The library must be used by others and promoted to core.
>>
>> The "promoted to core" part is the most difficult since there has
>> been nothing added to core in a long long time.
>> I am willing to clean up, rename, move, and thoroughly document
>> bsta on my own if it is accepted into core. I don't think I have
>> the time to redesign it to merge with pdf1d and vpdfl while
>> finishing my thesis. In that case, I would probably have to cut it
>> from my thesis and put off writing a book chapter until some
>> unforeseen future date.
>>
>>
>> Matt
>>
>>
>> On Jan 8, 2009, at 5:00 AM, Peter Vanroose wrote:
>>
>>> Interesting discussion!
>>>
>>> I believe we should (1) write out a "clean" design, essentially
>>> from scratch, indeed including several "points of view" (in the
>>> style of vnl_vector vs. vnl_vector_fixed), but fairly complete
>>> (i.e., including functionality which is currently either not used
>>> or not fully implemented).
>>> Then (2) gradually fill this framework with implementations from
>>> bsta, pdf1d and vpdfl (and possibly other places).
>>> Next (3) replace implementations in bsta, pdf1d and vpdfl by
>>> (inline) calls to the new library.
>>> And finally (4) gradually replace (in client code) all use of the
>>> then "old" libraries by directly accessing the new core library.
>>>
>>> We've had experience and good results with a similar approach,
>>> when we converted TargetJr into vxl. It took us then 6 days with
>>> 10 people (sitting together in Oxford) to have a complete and
>>> working set of core libraries; the first 2 days were mainly spent
>>> to just write out the design (and discuss choices to be made).
>>> For a new statistics library, I guess we'll need less than 50
>>> man-hours to do a similar thing (which corresponds to steps 1 and
>>> 2 above).
>>>
>>> Thoughts?
>>>
>>> -- Peter.
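
As a rough illustration of the wrapper idea in the proposal at the top
of this thread, here is a minimal sketch of a non-virtual, fixed-size
Gaussian that collapses to plain scalars when N==1, plus a thin adapter
that exposes it through a virtual interface. Everything here is an
assumption made for illustration: std::array and std::vector stand in
for vnl_vector_fixed and vnl_vector so the sketch compiles without VXL,
only the axis-aligned case is shown, and the names vector_type,
element, log_density, and gaussian_wrapper are hypothetical and not
part of bsta or vpdfl.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical trait: storage type for dimension N, collapsing to a
// plain scalar when N == 1.
template <class T, unsigned N>
struct vector_type { using type = std::array<T, N>; };

template <class T>
struct vector_type<T, 1> { using type = T; };

// Element access that works for both the array and scalar forms.
template <class T, std::size_t N>
inline T element(const std::array<T, N>& v, unsigned i) { return v[i]; }
inline float element(float v, unsigned) { return v; }
inline double element(double v, unsigned) { return v; }

// Template library side: axis-aligned Gaussian, no virtual functions.
template <class T, unsigned N>
class gaussian_fixed
{
 public:
  using vector = typename vector_type<T, N>::type;

  gaussian_fixed(const vector& mean, const vector& var)
    : mean_(mean), var_(var) {}

  // Log of the probability density at x (diagonal covariance).
  T log_density(const vector& x) const
  {
    const T two_pi = T(6.283185307179586);
    T sum = T(0);
    for (unsigned i = 0; i < N; ++i) {
      const T d = element(x, i) - element(mean_, i);
      const T v = element(var_, i);
      sum += d * d / v + std::log(two_pi * v);
    }
    return T(-0.5) * sum;
  }

 private:
  vector mean_;
  vector var_;
};

// Strategy pattern side: run-time dimension, virtual interface.
template <class T>
class base_distribution
{
 public:
  virtual ~base_distribution() {}
  virtual unsigned dimension() const = 0;
  virtual T log_density(const std::vector<T>& x) const = 0;
};

// Adapter that lets a gaussian_fixed be used through base_distribution.
template <class T, unsigned N>
class gaussian_wrapper : public base_distribution<T>
{
 public:
  using fixed_vector = typename gaussian_fixed<T, N>::vector;

  explicit gaussian_wrapper(const gaussian_fixed<T, N>& g) : g_(g) {}

  unsigned dimension() const override { return N; }

  T log_density(const std::vector<T>& x) const override
  {
    assert(x.size() == N);
    // The second argument only selects the overload (array or scalar).
    return g_.log_density(to_fixed(x, fixed_vector()));
  }

 private:
  static std::array<T, N> to_fixed(const std::vector<T>& x,
                                   const std::array<T, N>&)
  {
    std::array<T, N> a;
    for (unsigned i = 0; i < N; ++i) a[i] = x[i];
    return a;
  }
  static T to_fixed(const std::vector<T>& x, T) { return x[0]; }

  gaussian_fixed<T, N> g_;
};
```

With this layout, speed-critical code can call
gaussian_fixed<double,1>::log_density with a plain double, while
general-purpose code holds a base_distribution<double> reference and
pays for one copy into the fixed-size type per call. A real design
would more likely avoid that copy by sharing the data, as the
gaussian_ref idea suggests.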