#656 Extend stats command: skewness and kurtosis + standard errors of moments

None
closed-accepted
nobody
stats (1)
5
2014-03-07
2014-03-05
No

The attached patch adds the calculation of the skewness, the kurtosis, and the
standard errors of the mean, stddev, skewness, and kurtosis to the stats command.
It also changes the variance calculation to the "corrected two-pass
algorithm", since according to Numerical Recipes (3rd ed.) the formula used before
"can magnify the roundoff error by a large factor and is generally unjustifiable in terms
of computing speed".
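
For readers unfamiliar with the corrected two-pass algorithm, here is a minimal Python sketch of the idea. It is not the code from the patch. The function names are made up for illustration, this version divides by n, and the one-pass function is only an assumed example of a roundoff-prone formula, not necessarily the one gnuplot used before.

```python
def variance_two_pass(data):
    """Corrected two-pass variance (this sketch divides by n)."""
    n = len(data)
    if n == 0:
        raise ValueError("need at least one data point")
    # Pass 1: the mean.
    mean = sum(data) / n
    # Pass 2: accumulate squared deviations and the plain deviations.
    # The plain sum would be exactly zero in exact arithmetic; its square
    # over n corrects for the roundoff error left in the computed mean.
    sum_sq = 0.0
    sum_dev = 0.0
    for x in data:
        d = x - mean
        sum_sq += d * d
        sum_dev += d
    return (sum_sq - sum_dev * sum_dev / n) / n


def variance_one_pass(data):
    """<x^2> - <x>^2 form, shown only for contrast (assumed example)."""
    n = len(data)
    mean = sum(data) / n
    return sum(x * x for x in data) / n - mean * mean


# Small spread on top of a large offset; the exact variance (dividing by n) is 22.5.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print(variance_two_pass(data))
print(variance_one_pass(data))  # subtracts two nearly equal numbers of order 1e18
```

Comparing the two printed values on data with a large offset shows why the subtraction of nearly equal large numbers in the one-pass form is risky.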

1 Attachments

Discussion

  • Bastian Märkisch

    This patch is greatly appreciated. It has two minor issues, though:

    • The alignment of the output of the new variables is different. As I don't see a sensible way to shorten the descriptions, all other output should probably be realigned. Any other idea?
    • In degenerate cases the variance/stddev might be zero. The attached update to your patch checks for (var != 0) to avoid division by zero (and corrects indentation). Should we suppress displaying the skewness and kurtosis in that case altogether?
     
    • Alexander Täschner

      > The alignment of the output of the new variables is different. As I don't see a
      > sensible way to shorten the descriptions, all other output should probably be
      > realigned. Any other idea?

      In the function two_column_output one could put the standard errors above the slope and intercept output, which is already unaligned. For the function sgl_column_output I'm
      not sure.

      > In degenerate cases the variance/stddev might be zero.

      While writing the patch I also thought about this case and decided to skip the test and
      risk an assignment of positive infinity, since I didn't know what to assign to STATS_skewness and STATS_kurtosis when this happens.

       
  • Bastian Märkisch

    This variant of the patch now re-indents the output of all results to match the new results. The maximum total line length is ~51, which is still acceptable.
    In cases where the calculation is undefined (variance == 0), the user variables are set to NaN and the printed value is "undefined".
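
A rough Python sketch of the zero-variance handling described above (not the patch itself). It assumes the skewness and kurtosis are the third and fourth central moments normalized by the variance computed with n, and that the kurtosis is not the excess kurtosis. The function names and the output format are hypothetical.

```python
import math


def skewness_kurtosis(data):
    """Return (skewness, kurtosis), or (NaN, NaN) when they are undefined."""
    n = len(data)
    if n == 0:
        return math.nan, math.nan
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # variance, dividing by n
    if m2 == 0.0:
        # Degenerate case: all values equal, so normalizing would divide by zero.
        return math.nan, math.nan
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m3 / m2 ** 1.5, m4 / (m2 * m2)


def format_moment(value):
    """Print NaN as the word 'undefined', any other value as a number."""
    return "undefined" if math.isnan(value) else f"{value:.4f}"


for sample in ([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 2.0, 2.0]):
    skew, kurt = skewness_kurtosis(sample)
    print(format_moment(skew), format_moment(kurt))
```

Returning NaN rather than infinity lets later expressions that use the values detect the undefined case explicitly.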

    There is a last issue which needs to be clarified. Your patch changes the definition of the variance from the population variance (n) to the sample variance (n-1). This then propagates to the kurtosis and skewness. While this makes a lot of sense in some cases, it is a change in behaviour. In any case this should be documented and consistent with other parts of gnuplot. See also comments on [bugs:#1118].
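
To make the n versus n-1 distinction concrete, the following Python sketch uses the standard library's `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n-1). The data values are invented, and the skewness normalization is an assumed definition used only to show how the choice propagates.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = statistics.fmean(data)

pop_var = statistics.pvariance(data)   # population variance, divides by n
samp_var = statistics.variance(data)   # sample variance, divides by n - 1

# Third central moment, normalized by each variance in turn.
m3 = sum((x - mean) ** 3 for x in data) / n
skew_pop = m3 / pop_var ** 1.5
skew_samp = m3 / samp_var ** 1.5

print(pop_var, samp_var)     # the two differ by the factor n / (n - 1)
print(skew_pop, skew_samp)   # the skewness shifts by ((n - 1) / n) ** 1.5
```

With this normalization the kurtosis would shift by ((n - 1) / n) ** 2, so a change in the variance definition changes every derived quantity.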

     

    Related

    Bugs: #1118

  • Alexander Täschner

    Thank you for the reindentation and the good solution for the variance == 0 case.

    > Your patch changes the definition of the variance from the population variance (n) to
    > the sample variance (n-1).

    I didn't notice this change when implementing the two-pass algorithm. Since I hope that
    everybody who uses this function will have a large enough data set, where the difference
    between n and n-1 does not matter, you can revert this change. If one uses the sample
    variance one should include a test for the (n - 1 == 0) case and handle it like you
    did in the (variance == 0) case.
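
The two points in this reply can be sketched in Python as follows: a guard for the (n - 1 == 0) case that returns NaN, and the ratio n / (n - 1) between the two variance definitions, which approaches 1 for large data sets. The function names are hypothetical and this is not code from the patch.

```python
import math


def sample_variance(data):
    """Variance dividing by n - 1; NaN when fewer than two points are given."""
    n = len(data)
    if n - 1 <= 0:
        # Same treatment as the variance == 0 case: report "undefined" as NaN.
        return math.nan
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def bessel_factor(n):
    """Ratio of the sample variance to the population variance, n / (n - 1)."""
    return n / (n - 1)


for n in (2, 10, 1000):
    print(n, bessel_factor(n))
```

For n = 1000 the two definitions differ by about 0.1 percent, while for n = 2 they differ by a factor of two.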

     
  • Bastian Märkisch

    • labels: --> stats
    • status: open --> closed-accepted
    • Group: -->
     
  • Bastian Märkisch

    Thanks. Now in CVS.

     
