
What kind of variance should var() return? #149

Closed
milancurcic opened this issue Feb 19, 2020 · 17 comments

@milancurcic
Member

Playing with a toy example today, I was surprised to see the result (I guess I should have read the specification first, LOL).

Program:

use stdlib_experimental_stats, only: mean, var
real :: a(5) = [1, 2, 3, 4, 5]
print *, var(a), mean((a - mean(a))**2)
end

I expected two identical numbers. Result:

   2.50000000       2.00000000    

Then I looked at the code for var() and saw that we compute the average by dividing the sum by N - 1 rather than N.
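
(The arithmetic checks out: mean(a) = 3 and sum((a - 3)**2) = 4 + 1 + 0 + 1 + 4 = 10, so 10 / (5 - 1) = 2.5 while 10 / 5 = 2.0.)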

Then I looked at this issue and the spec, and it's all good: the spec defines the variance such that we divide by N - 1. The code works as advertised.

But then I wondered why N - 1 and not N, did some Google searching, and found that there are all kinds of variances out there, and that this particular flavor is called the "sample variance", or, as described here, the best unbiased estimator of the population variance. Dividing by just N seems to be called simply the "variance".

How we define this affects not only the numerical result, but in some cases also the behavior of the program: the variance of a scalar is 0 when dividing by N, but NaN when dividing by N - 1. Some discussion here.

NumPy's np.var(), for example, defines it as the simple variance (divide by N), but you can optionally set a "delta degrees of freedom", so that np.var(x, ddof=1) corresponds to the unbiased sample variance.

I'm not a statistician but a wave physicist. I expect my variance to divide by N, and NumPy has served me well so far.

Question: Should we consider adding optional "delta degrees of freedom" to make both statisticians and physicists happy?
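
For illustration, a minimal sketch of what a NumPy-style interface could look like (var_ddof is a hypothetical name, not a proposal for the actual API):

    program demo
        implicit none
        real :: a(5) = [1., 2., 3., 4., 5.]
        print *, var_ddof(a), var_ddof(a, 1)  ! prints 2.0 and 2.5
    contains
        pure real function var_ddof(x, ddof)
            real, intent(in) :: x(:)
            integer, intent(in), optional :: ddof
            integer :: d
            d = 0
            if (present(ddof)) d = ddof
            ! divide the sum of squared deviations by n - ddof
            var_ddof = sum((x - sum(x) / size(x))**2) / (size(x) - d)
        end function var_ddof
    end program demo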

@certik
Member

certik commented Feb 19, 2020

I would do what NumPy does, and divide by N. Then provide an option, just like NumPy, to allow the N-1 convention.

@jvdp1
Member

jvdp1 commented Feb 19, 2020

This Wikipedia page gives a nice explanation of the difference between sample variance and population variance.

In Octave, var uses N - 1 by default (opt = 0) and N as an option (opt = 1), so the sum is divided by N - 1 + opt.

In NumPy's var, the sum is divided by N - opt, with opt = 0 as the default.

Both R and Julia use N - 1 as the default denominator.

Both approaches are fine for me, if clearly mentioned in the spec. N - 1 seems to be more often used as the default.

Question: Should we consider adding optional "delta degrees of freedom" to make both statisticians and physicists happy?

I think so too.

Regarding the API, I would propose

Syntax

result = var(array [, corrected [, mask]])
result = var(array, dim [, corrected [, mask]])

Arguments

corrected (optional): Shall be a logical scalar. If corrected is .true. (default value), the sum is scaled with n-1. If corrected is .false., then the sum is scaled with n.

I prefer a logical in this case since only two values make sense (0 and 1). Allowing an integer for this option would imply checking the passed value inside var (and possibly returning an error for invalid values) to be sure it is 0 or 1.
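
With this interface, calls would look like the following (a sketch, assuming the proposed signature):

    real :: a(5) = [1., 2., 3., 4., 5.]
    print *, var(a)                      ! divides by n - 1 (corrected defaults to .true.), giving 2.5
    print *, var(a, corrected=.false.)   ! divides by n, giving 2.0
    print *, var(a, dim=1, corrected=.false.)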

@milancurcic
Member Author

@jvdp1 Are you saying that scaling with N and N - 1 are the only two possibilities? Meaning, is there any person or method that would perhaps want to scale with N - 2? We don't need to cover all bases, but a logical would indeed remove some flexibility if somebody wanted a different kind of variance.

If yes, then I think a logical argument works well. "Corrected" is confusing to me as a non-statistician. Any alternative names?

I'm leaning slightly toward what @certik suggested.

@certik
Member

certik commented Feb 19, 2020

I didn't realize that NumPy was in the minority. Matlab's var also uses N - 1 by default (but it is configurable to use N as an option). As a physicist, I also find N more natural, but given that NumPy is the only one that defaults to it, while Matlab, Octave, R, and Julia all use N - 1, it seems we should use N - 1 as well for consistency.

@jvdp1
Member

jvdp1 commented Feb 19, 2020

@jvdp1 Are you saying that scaling with N and N - 1 are the only two possibilities? Meaning, is there any person or method that would perhaps want to scale with N - 2? We don't need to cover all bases, but a logical would indeed remove some flexibility if somebody wanted a different kind of variance.

The variance of a random variable X is defined as the expected value of the squared deviation from the mean, i.e., Var(X) = E[X**2] - (E[X])**2; hence the scaling with N.
The scaling with N - 1 is called Bessel's correction (hence the term corrected).
I have never read about values other than N and N - 1 for the variance (that doesn't mean they don't exist, but I don't see a justification for them).
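
(For reference, the reason the correction works: when the mean is estimated from the same sample, E[sum((x_i - xbar)**2)] = (N - 1) * sigma**2, so dividing by N - 1 rather than N gives an unbiased estimator of sigma**2.)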

If yes, then I think a logical argument works well. "Corrected" is confusing to me as a non-statistician. Any alternative names?

I used corrected, similar to Julia, but also because the N - 1 scaling is called Bessel's correction.

I'm leaning slightly toward what @certik suggested.

NumPy is the only one allowing values other than N or N - 1. Among those I checked (Octave, Matlab, Julia, SAS), N - 1 is the default and N is optional.

I have a preference for N - 1 as the default with N as the only other option, but I would be fine with the NumPy approach.

@certik
Member

certik commented Feb 19, 2020

I am fine with either also, but leaning towards N-1 as well, for consistency with other packages.

@leonfoks

The Bessel correction is necessary when the mean of the population is unknown and also needs to be estimated from the samples, which happens to be the case in most practical situations.

Since we are estimating the sample mean inside the function var, the denominator N-1 is the most appropriate for the variance.

@milancurcic
Member Author

There is a prevalent preference for N - 1 as the default. I agree with it.

As for the API, @jvdp1 suggested an optional logical argument, which I think is a nice interface. The only downside I can think of is that it requires an extra if-branch relative to a ddof integer argument like NumPy's, assuming no input validation (I don't think we need any).

For this reason I'm slightly in favor of an integer argument -- less code and potentially less overhead.

@jvdp1
Member

jvdp1 commented Feb 20, 2020

The only downside I can think of is that it requires an extra if-branch relative to a ddof integer argument like NumPy's, assuming no input validation (I don't think we need any).

I don't agree with this statement. If we want a robust function, we need some checks even with an integer argument (otherwise the denominator could become negative, which would lead to negative results for the variance). NumPy's var checks the validity of the integer argument (e.g., you can't give an argument larger than N).
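
For illustration, a hypothetical guard for an integer argument (the names are made up) could look like:

    ! reject degrees of freedom that would make the denominator zero or negative
    if (ddof < 0 .or. ddof >= size(x)) error stop "var: ddof must be in [0, size(x))"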

@milancurcic
Member Author

I think you disagree only with the second statement, that we don't need any input validation. You do need an extra if-branch if you pass a logical.

That is fine. If ensuring that the correction can only ever be 0 or 1 is important, then I'm in favor of the logical argument approach.

@certik
Member

certik commented Feb 21, 2020

@jvdp1 are you ok with a logical argument also?

If so, then we are all in agreement. We already agree to make N-1 the default.

@jvdp1
Member

jvdp1 commented Feb 21, 2020

@jvdp1 are you ok with a logical argument also?

If so, then we are all in agreement. We already agree to make N-1 the default.

I am ok with the logical argument.

As for the API, I propose the same as before:

Syntax

result = var(array [, corrected [, mask]])
result = var(array, dim [, corrected [, mask]])

Arguments

corrected (optional): Shall be a logical scalar. If corrected is .true. (default value), the sum is scaled with n-1. If corrected is .false., then the sum is scaled with n.

I proposed corrected because the n - 1 scaling is called Bessel's correction.
Any alternative names to corrected?

@milancurcic
Member Author

Excellent. I think this API is the way to go. After your and Leon's explanations, corrected is clear and intuitive. Of course, we will explain its meaning in the spec.

@jvdp1
Member

jvdp1 commented Feb 22, 2020

I thought a bit about the implementation. Which of the following would be best (efficiency/clarity)?

Option 1:

    logical, intent(in), optional :: corrected
    real(${k1}$) :: correction
    ...
    if (optval(corrected, .true.)) then
        correction = 1._${k1}$
    else
        correction = 0._${k1}$
    end if

Option 2:

    logical, intent(in), optional :: corrected
    real(${k1}$) :: correction
    ...
    correction = 1._${k1}$
    if (.not. optval(corrected, .true.)) correction = 0._${k1}$

Option 3:

    logical, intent(in), optional :: corrected
    real(${k1}$) :: correction
    ...
    correction = 1._${k1}$
    if (present(corrected)) then
        if (.not. corrected) correction = 0._${k1}$
    end if

Other approaches?

@milancurcic I can have a look at the implementation and submit a draft PR, if you want.

@milancurcic
Member Author

Options 1 and 2 look good to me, with a slight preference for option 2 because it's fewer lines of code.

I think option 3 is unnecessary now that we have optval to do exactly that.

We could also do option 4:

correction = merge(1, 0, optval(corrected, .true.))

@jvdp1
Member

jvdp1 commented Feb 22, 2020

We could also do option 4:

correction = merge(1, 0, optval(corrected, .true.))

That is even better, because correction is then not needed :)
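
That is, the factor can go straight into the denominator, along these lines (a sketch, assuming optval from stdlib):

    res = sum((x - mean(x))**2) / (size(x) - merge(1, 0, optval(corrected, .true.)))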

milancurcic added a commit that referenced this issue Feb 24, 2020

Addition of corrected in API of var (following #149)
@milancurcic
Member Author

Resolved by #151.
