
#14 Faster variance calculations

open-fixed
None
5
2008-10-07
2007-11-21
No

I just figured out that...

dividing all elements of a vector by a constant and then normalizing the vector

...is the same as...

normalizing the vector and then dividing the answer (scalar) by the constant

:-D

I want to change the code that calculates the variance in AssociatedPairDataSetBuilder to return a scalar instead of a Vector that has to be normalized. I also want to normalize the diffSquare Vector before dividing it by the set's size. This should be faster, because we save one division per element. It will also simplify code in other classes, because we no longer have to call .norm() on the returned Vectors.

Alternatively, just add another method that returns a scalar and keep the one that returns a Vector.

I'm putting this on the tracker so that I don't forget about it.

Discussion

  • Anonymous

    Anonymous - 2007-11-21
    • assigned_to: nobody --> heatzync
     
  • Gary Pampara

    Gary Pampara - 2007-11-21

    Logged In: YES
    user_id=975307
    Originator: NO

    Where are we doing this calculation? I have some ideas to refactor these operations into a standard set of utilities.

     
  • Gary Pampara

    Gary Pampara - 2007-11-21

    Logged In: YES
    user_id=975307
    Originator: NO

    Yup :)

    I have been planning a generic StatisticsUtil class of sorts for some time now. Now should be a good time for us to add something like this.

     
  • Anonymous

    Anonymous - 2007-11-22

    Logged In: YES
    user_id=1079274
    Originator: YES

    A generic StatisticsUtil class would be a most excellent idea. I've also been toying with the idea... It will reduce the amount of code in CILib dramatically, because means, deviations and variances are calculated all over the show. We would just have to define a standard way of representing a set of Types, i.e. will we use ArrayList<Type> or will we create a new CILib-specific class? This is so that everyone who wants to create sets of anything will use the CILib standard, because then it will work with the StatisticsUtil class. We will also need methods for scalars as well as vectors, because they are handled differently, i.e.:

    double getMean(CILibSetStandardClass<Numeric> set); //and
    Vector getMean(CILibSetStandardClass<Vector> set);

    Currently, I have getMean() and getVariance() in AssociatedPairDataSetBuilder. Those methods return the mean and variance of the "this" dataset respectively. Both of them make use of the getSetMean() and getSetVariance() methods, also in AssociatedPairDataSetBuilder.
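
A minimal sketch of what such a utility could look like, offered only as an illustration. Everything in it is an assumption: the class name StatsSketch, the method names, and the use of List<Double> and List<double[]> as stand-ins for the CILib set and Vector types. It uses the population variance (dividing by the set's size), as described in the original submission. The scalar and vector methods get distinct names because two overloads that differ only in the type argument of the same generic parameter have the same erasure, which current Java compilers reject.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the CILib StatisticsUtil class.
final class StatsSketch {

    private StatsSketch() { }

    // Arithmetic mean of a non-empty list of scalars.
    static double meanOfScalars(List<Double> set) {
        requireNonEmpty(set);
        double sum = 0.0;
        for (double x : set) {
            sum += x;
        }
        return sum / set.size();
    }

    // Population variance: sum of squared deviations divided by the set's size.
    static double varianceOfScalars(List<Double> set) {
        double mean = meanOfScalars(set);
        double sumSq = 0.0;
        for (double x : set) {
            sumSq += (x - mean) * (x - mean);
        }
        return sumSq / set.size();
    }

    // Element-wise mean of a non-empty list of vectors (assumed equal length).
    static double[] meanOfVectors(List<double[]> set) {
        requireNonEmpty(set);
        double[] mean = new double[set.get(0).length];
        for (double[] v : set) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += v[i];
            }
        }
        for (int i = 0; i < mean.length; i++) {
            mean[i] /= set.size();
        }
        return mean;
    }

    private static void requireNonEmpty(List<?> set) {
        if (set.isEmpty()) {
            throw new IllegalArgumentException("set must not be empty");
        }
    }

    public static void main(String[] args) {
        System.out.println(meanOfScalars(Arrays.asList(1.0, 3.0)));
        System.out.println(Arrays.toString(meanOfVectors(
                Arrays.asList(new double[] {1.0, 2.0}, new double[] {3.0, 6.0}))));
    }
}
```

Keeping the methods static and free of any dataset type is one way to keep the statistics code as a small building block that other classes can call.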

     
  • Anonymous

    Anonymous - 2007-11-22

    Logged In: YES
    user_id=1079274
    Originator: YES

    We should also keep the "new dataset stuff" in mind, since some statistics will be performed on it.

     
  • Gary Pampara

    Gary Pampara - 2007-11-22

    Logged In: YES
    user_id=975307
    Originator: NO

    Small steps. The new dataset stuff can use it but I don't think building the stats stuff into it is a good idea. The stats stuff will be the simplest building blocks from which other actions can benefit. I have a class I wrote for some functions with this code in it. I'm quickly gonna find it

     
  • Anonymous

    Anonymous - 2007-12-07
    • status: open --> pending-later
     
  • Anonymous

    Anonymous - 2007-12-07

    Logged In: YES
    user_id=1079274
    Originator: YES

    This tracker item depends on the StatsUtil class that still needs to be implemented... I'm therefore changing the status to "pending".

     
  • SourceForge Robot

    Logged In: YES
    user_id=1312539
    Originator: NO

    This Tracker item was closed automatically by the system. It was
    previously set to a Pending status, and the original submitter
    did not respond within 14 days (the time period specified by
    the administrator of this Tracker).

     
  • SourceForge Robot

    • status: pending-later --> closed-later
     
  • Gary Pampara

    Gary Pampara - 2007-12-23
    • status: closed-later --> open-later
     
  • Gary Pampara

    Gary Pampara - 2008-02-27

    Logged In: YES
    user_id=975307
    Originator: NO

    The initial StatsUtils has been committed. This will be refactored as needed.

     
  • Gary Pampara

    Gary Pampara - 2008-02-27
    • assigned_to: heatzync --> gpampara
    • status: open-later --> closed-fixed
     
  • Anonymous

    Anonymous - 2008-05-22
    • status: closed-fixed --> open-fixed
     
  • Anonymous

    Anonymous - 2008-05-22

    Logged In: YES
    user_id=1079274
    Originator: YES

    I removed the methods that calculated the mean and variance of a set/cluster/collection from AssociatedPairDataSetBuilder and moved them to the StatUtils class. Also added unit tests. The AssociatedPairDataSetBuilder caches and keeps track of its own mean and variance though.

    This change occurred in revision 751.

     
  • Gary Pampara

    Gary Pampara - 2008-10-07
    • status: open-fixed --> closed-fixed
     
  • Andrich

    Andrich - 2008-10-07

    I might misunderstand your statement. But if a vector is normalized then the answer is still a vector (of unit length) and not a scalar. So dividing an unnormalized vector by a scalar is not the same as dividing the equivalent normalized vector by a scalar (the vectors will have the same direction, but different magnitudes).

     
  • Anonymous

    Anonymous - 2008-10-07

    Sorry, I clearly used the wrong terms. What I meant was:

    dividing all elements of a vector by a scalar and then taking the norm of the
    vector (which is a scalar)

    ...is the same as...

    taking the norm of the vector (which is a scalar) and dividing it by the scalar
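
The restated claim is the homogeneity property of a norm, ||v / c|| = ||v|| / |c|. A small sketch to check it numerically follows. The class NormScalingSketch and its methods are hypothetical and use plain double arrays instead of the CILib Vector class. The absolute value matters only if the constant can be negative; for a set's size it is always positive.

```java
// Hypothetical helper, not part of CILib: plain arrays stand in for Vector.
final class NormScalingSketch {

    private NormScalingSketch() { }

    // Euclidean norm: sqrt(x_1^2 + ... + x_n^2).
    static double norm(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    // Divide every element by c first, then take the norm (n divisions).
    static double divideThenNorm(double[] v, double c) {
        double[] scaled = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            scaled[i] = v[i] / c;
        }
        return norm(scaled);
    }

    // Take the norm first, then divide the resulting scalar once (1 division).
    static double normThenDivide(double[] v, double c) {
        return norm(v) / Math.abs(c);
    }

    public static void main(String[] args) {
        double[] diffSquare = {3.0, 4.0};
        double size = 2.0;
        System.out.println(divideThenNorm(diffSquare, size)); // prints 2.5
        System.out.println(normThenDivide(diffSquare, size)); // prints 2.5
    }
}
```

In general the two results can differ in the last few bits because of floating-point rounding, so comparisons should use a tolerance.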

     
  • Anonymous

    Anonymous - 2008-10-07
    • status: closed-fixed --> open-fixed
     
  • Gary Pampara

    Gary Pampara - 2008-10-07

    There is definitely some confusion here. A normal vector is a vector of unit length that is orthogonal to a specific point (be it on a surface or whatever).

    Theuns, can you provide the math for this?

     
  • Andrich

    Andrich - 2008-10-07

    Hi

    I think I might have figured out what Theuns has in mind. To calculate the norm of a vector (not to normalize it) is to assign a non-negative scalar to the vector, which is its size in a vector space. Usually the norm is the Euclidean length, but it can be different if a different norm calculation is used (Manhattan etc.).

    If this is the case, the term used is incorrect: it should be norm(), which returns a scalar, and then what Theuns said is true. Normalization is the process of finding a vector of unit length parallel to the original vector (u = v / norm(v)), which is still a useful method to have, since it is a common operation in vector algebra.
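
To make the distinction concrete, here is a short sketch. The class NormKindsSketch and its method names are made up for illustration and work on plain double arrays, not on the CILib Vector class. The two norm methods return a scalar, while normalized returns a new vector of unit Euclidean length.

```java
// Hypothetical illustration of "norm" versus "normalize"; not CILib code.
final class NormKindsSketch {

    private NormKindsSketch() { }

    // Euclidean (L2) norm: square root of the sum of squares.
    static double euclideanNorm(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    // Manhattan (L1) norm: sum of absolute values.
    static double manhattanNorm(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += Math.abs(x);
        }
        return sum;
    }

    // Normalization: u = v / ||v||. A zero vector has no direction,
    // so this sketch simply returns a copy of it unchanged.
    static double[] normalized(double[] v) {
        double norm = euclideanNorm(v);
        double[] u = v.clone();
        if (norm == 0.0) {
            return u;
        }
        for (int i = 0; i < u.length; i++) {
            u[i] /= norm;
        }
        return u;
    }

    public static void main(String[] args) {
        double[] v = {3.0, 4.0};
        System.out.println(euclideanNorm(v));             // a scalar
        System.out.println(manhattanNorm(v));             // a different scalar
        System.out.println(euclideanNorm(normalized(v))); // approximately 1
    }
}
```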

     
  • Gary Pampara

    Gary Pampara - 2008-10-07

    That's currently the implementation in the Vector class. Ie: ||x|| = sqrt(x_1^2 + ... + x_n^2).

    I'm really confused about what all this is really about now... Is something missing?

     
  • Andrich

    Andrich - 2008-10-07

    I had a look at the class now. As is, the implementation is correct; it is indeed the vector's norm that is calculated. I think Theuns used the word "normalized" incorrectly in the Tracker submission. Currently CIlib cannot normalize a vector. If you want to add that, here is some code (not tested):

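    void normalize() {
        double norm = this.norm();

        // Nothing to do for a unit vector; a zero vector has no direction.
        if (norm == 1 || norm == 0) {
            return;
        }

        // Scale every component so that the vector has unit length.
        for (int i = 0; i < this.size(); i++) {
            this.setReal(i, this.getReal(i) / norm);
        }
    }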

    Sorry if I caused unnecessary confusion; it just didn't make sense to me that the result of normalization is a scalar.

     
  • Gary Pampara

    Gary Pampara - 2008-10-07

    So you are wanting to create the unit vector of a vector?

     
  • Andrich

    Andrich - 2008-10-07

    That is normalization, if that is what is required.

     
