
What kind of variance should var() return? #149

Closed
milancurcic opened this issue Feb 19, 2020 · 17 comments

@milancurcic
Member

Playing with a toy example today, I was surprised to see the result (I guess I should have read the specification first, LOL).

Program:

use stdlib_experimental_stats, only: mean, var
real :: a(5) = [1, 2, 3, 4, 5]
print *, var(a), mean((a - mean(a))**2)
end

I expected two identical numbers. Result:

   2.50000000       2.00000000    

Then I looked at the code for var() and saw that we compute the average by dividing the sum by N - 1 rather than N.
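
(The arithmetic checks out: mean(a) = 3 and sum((a - 3)**2) = 4 + 1 + 0 + 1 + 4 = 10, so 10 / (5 - 1) = 2.5 while 10 / 5 = 2.0.)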

Then I looked at this issue and the spec, and it's all good: the spec defines the variance such that we divide by N - 1. The code works as advertised.

But then I wondered why N - 1 and not N, did some Google searching, and found that there are all kinds of variances out there, and that this particular flavor is called the "sample variance", or, as described here, the best unbiased estimator of the population variance. Dividing by just N seems to be called simply the "variance".

How we define this affects not only the numerical result, but in some cases also the behavior of the program: the variance of a scalar is 0 when dividing by N, but NaN when dividing by N - 1. Some discussion here.

NumPy's np.var(), for example, defines it as the simple variance (divide by N), but you can optionally set a "delta degrees of freedom", so that np.var(x, ddof=1) corresponds to the unbiased sample variance.

I'm not a statistician but a wave physicist. I expect my variance to divide by N, and NumPy has served me well so far.

Question: Should we consider adding optional "delta degrees of freedom" to make both statisticians and physicists happy?
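
For illustration, a minimal sketch of what a NumPy-style interface could look like (var_ddof is a hypothetical name, not a proposal for the actual API):

    program demo
        implicit none
        real :: a(5) = [1., 2., 3., 4., 5.]
        print *, var_ddof(a), var_ddof(a, 1)  ! prints 2.0 and 2.5
    contains
        pure real function var_ddof(x, ddof)
            real, intent(in) :: x(:)
            integer, intent(in), optional :: ddof
            integer :: d
            d = 0
            if (present(ddof)) d = ddof
            ! divide the sum of squared deviations by n - ddof
            var_ddof = sum((x - sum(x) / size(x))**2) / (size(x) - d)
        end function var_ddof
    end program demo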

@certik
Member

certik commented Feb 19, 2020

I would do what NumPy does, and divide by N. Then provide an option, just like NumPy, to allow the N-1 convention.

@jvdp1
Member

jvdp1 commented Feb 19, 2020

This Wikipedia page gives a nice explanation of the difference between sample variance and population variance.

In Octave, var uses N - 1 by default (opt = 0) and N as an option (opt = 1), so the sum is divided by N - 1 + opt.

In NumPy's var, the sum is divided by N - opt, with opt = 0 as the default.

Both R and Julia use N - 1 as the default denominator.

Both approaches are fine for me, if clearly mentioned in the spec. N - 1 seems to be more often used as the default.

Question: Should we consider adding optional "delta degrees of freedom" to make both statisticians and physicists happy?

I think so too.

Regarding the API, I would propose

Syntax

result = var(array [, corrected [, mask]])
result = var(array, dim [, corrected [, mask]])

Arguments

corrected (optional): Shall be a logical scalar. If corrected is .true. (default value), the sum is scaled with n-1. If corrected is .false., then the sum is scaled with n.

I prefer a logical in this case since only two values make sense (0 and 1). Allowing an integer for this option would imply checking the passed value inside var (and possibly returning an error for invalid values) to be sure it is 0 or 1.
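
With this interface, calls would look like the following (a sketch, assuming the proposed signature):

    real :: a(5) = [1., 2., 3., 4., 5.]
    print *, var(a)                      ! divides by n - 1 (corrected defaults to .true.), giving 2.5
    print *, var(a, corrected=.false.)   ! divides by n, giving 2.0
    print *, var(a, dim=1, corrected=.false.)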

@milancurcic
Member Author

@jvdp1 Are you saying that scaling with N and N - 1 are the only two possibilities? Meaning, is there any person or method that would perhaps want to scale with N - 2? We don't need to cover all bases, but a logical would indeed remove some flexibility if somebody wanted a different kind of variance.

If yes, then I think a logical argument works well. "Corrected" is confusing to me as a non-statistician. Any alternative names?

I'm leaning slightly toward what @certik suggested.

@certik
Member

certik commented Feb 19, 2020

I didn't realize that NumPy was in the minority. Matlab's var also uses N - 1 by default (but it is configurable to use N as an option). As a physicist, I also find N more natural, but given that NumPy is the only one that defaults to it, while Matlab, Octave, R, and Julia all use N - 1, it seems we should use N - 1 as well for consistency.

@jvdp1
Member

jvdp1 commented Feb 19, 2020

@jvdp1 Are you saying that scaling with N and N - 1 are the only two possibilities? Meaning, is there any person or method that would perhaps want to scale with N - 2? We don't need to cover all bases, but a logical would indeed remove some flexibility if somebody wanted a different kind of variance.

The variance of a random variable X is defined as the expected value of the squared deviation from the mean, i.e., Var(X) = E[X**2] - (E[X])**2; hence the scaling with N.
The scaling with N - 1 is called Bessel's correction (hence the term corrected).
I have never read about values other than N and N - 1 for the variance (that doesn't mean they don't exist, but I don't see a justification for them).
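
(For reference, the reason the correction works: when the mean is estimated from the same sample, E[sum((x_i - xbar)**2)] = (N - 1) * sigma**2, so dividing by N - 1 rather than N gives an unbiased estimator of sigma**2.)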

If yes, then I think a logical argument works well. "Corrected" is confusing to me as a non-statistician. Any alternative names?

I used corrected, similar to Julia, but also because the N - 1 scaling is called Bessel's correction.

I'm leaning slightly toward what @certik suggested.

NumPy is the only one allowing values other than N or N - 1. Among those I checked (Octave, Matlab, Julia, SAS), N - 1 is the default and N is optional.

I have a preference for N - 1 as the default with N as the only other option, but I would be fine with the NumPy approach.

@certik
Member

certik commented Feb 19, 2020

I am fine with either also, but leaning towards N-1 as well, for consistency with other packages.

@leonfoks

The Bessel correction is necessary when the mean of the population is unknown and also needs to be estimated from the samples, which happens to be the case in most practical situations.

Since we are estimating the sample mean inside the function var, the denominator N-1 is the most appropriate for the variance.

@milancurcic
Member Author

There is a prevalent preference for N - 1 as the default. I agree with it.

As for the API, @jvdp1 suggested an optional logical argument, which I think is a nice interface. The only downside I can think of is that it requires an extra if-branch relative to a ddof integer argument like NumPy's, assuming no input validation (I don't think we need any).

For this reason I'm slightly in favor of an integer argument -- less code and potentially less overhead.

@jvdp1
Member

jvdp1 commented Feb 20, 2020

The only downside I can think of is that it requires an extra if-branch relative to a ddof integer argument like NumPy's, assuming no input validation (I don't think we need any).

I don't agree with this statement. If we want a robust function, we need some checks even with an integer argument (otherwise the denominator could become negative, which would lead to negative results for the variance). NumPy's var checks the validity of the integer argument (e.g., you can't give an argument larger than N).
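
For illustration, a hypothetical guard for an integer argument (the names are made up) could look like:

    ! reject degrees of freedom that would make the denominator zero or negative
    if (ddof < 0 .or. ddof >= size(x)) error stop "var: ddof must be in [0, size(x))"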

@milancurcic
Member Author

I think you disagree only with the second statement, that we don't need any input validation. You do need an extra if-branch if you pass a logical.

That is fine. If ensuring that the correction can only ever be 0 or 1 is important, then I'm in favor of the logical argument approach.

@certik
Member

certik commented Feb 21, 2020

@jvdp1 are you ok with a logical argument also?

If so, then we are all in agreement. We already agree to make N-1 the default.

@jvdp1
Member

jvdp1 commented Feb 21, 2020

@jvdp1 are you ok with a logical argument also?

If so, then we are all in agreement. We already agree to make N-1 the default.

I am ok with the logical argument.

As for the API, I propose the same as before:

Syntax

result = var(array [, corrected [, mask]])
result = var(array, dim [, corrected [, mask]])

Arguments

corrected (optional): Shall be a logical scalar. If corrected is .true. (default value), the sum is scaled with n-1. If corrected is .false., then the sum is scaled with n.

I proposed corrected because the n - 1 scaling is called Bessel's correction.
Any alternative names to corrected?

@milancurcic
Member Author

Excellent. I think this API is the way to go. After your and Leon's explanations, corrected is clear and intuitive. Of course, we will explain its meaning in the spec.

@jvdp1
Member

jvdp1 commented Feb 22, 2020

I thought a bit about the implementation. Which of the following would be best (efficiency/clarity)?

Option 1:

    logical, intent(in), optional :: corrected
    real(${k1}$) :: correction
    ...
    if (optval(corrected, .true.)) then
        correction = 1._${k1}$
    else
        correction = 0._${k1}$
    end if

Option 2:

    logical, intent(in), optional :: corrected
    real(${k1}$) :: correction
    ...
    correction = 1._${k1}$
    if (.not. optval(corrected, .true.)) correction = 0._${k1}$

Option 3:

    logical, intent(in), optional :: corrected
    real(${k1}$) :: correction
    ...
    correction = 1._${k1}$
    if (present(corrected)) then
        if (.not. corrected) correction = 0._${k1}$
    end if

Other approaches?

@milancurcic I can have a look at the implementation and submit a draft PR, if you want.

@milancurcic
Member Author

Options 1 and 2 look good to me, with a slight preference for option 2 because it's fewer lines of code.

I think option 3 is unnecessary now that we have optval to do exactly that.

We could also do option 4:

correction = merge(1, 0, optval(corrected, .true.))

@jvdp1
Member

jvdp1 commented Feb 22, 2020

We could also do option 4:

correction = merge(1, 0, optval(corrected, .true.))

That is even better, because correction is then not needed :)
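
That is, the factor can go straight into the denominator, along these lines (a sketch, assuming optval from stdlib):

    res = sum((x - mean(x))**2) / (size(x) - merge(1, 0, optval(corrected, .true.)))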

milancurcic added a commit that referenced this issue Feb 24, 2020

Addition of corrected in API of var (following #149)
@milancurcic
Member Author

Resolved by #151.
