What kind of variance should var() return? #149
Comments
I would do what NumPy does, and divide by N.
This Wikipedia page gives a nice explanation of the differences between sample variance and population variance. In Octave, var uses N - 1 by default. In NumPy's var, it is divided by N. Both R and Julia use N - 1. Both approaches are fine for me, if it is clearly mentioned in the spec.
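For reference, with sample mean $\bar{x}$, the two estimators under discussion are:

$$ s_N^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i-\bar{x}\right)^2 \quad\text{(divide by } N\text{; NumPy default)} $$

$$ s^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i-\bar{x}\right)^2 \quad\text{(Bessel-corrected; Octave/R/Julia default)} $$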
I think so too. Regarding the API, I would propose an optional logical argument:

Syntax

result = var(array [, corrected])

Arguments

corrected (optional): logical. If .true. (the default), the sum of squared deviations from the mean is divided by N - 1; if .false., it is divided by N.

I prefer a logical in this case, since it restricts the choice to the only two denominators that make sense here.
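Usage would then look something like this (illustrative only; stdlib_stats is the module name the function eventually landed in, used here just for concreteness):

program var_usage_demo
  use stdlib_stats, only: var   ! illustrative: module name from present-day stdlib
  implicit none
  real :: x(6) = [1., 2., 3., 4., 5., 6.]
  print *, var(x)                      ! default: divides by N - 1
  print *, var(x, corrected=.false.)   ! divides by N
end program var_usage_demo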
@jvdp1 Are you saying that scaling with N and N - 1 are the only two possibilities? Meaning, is there any person or method that would want to scale with, say, N - 2? We don't need to cover all possible bases, but a logical would indeed rule out some flexibility if somebody wanted a different kind of variance. If yes, then I think a logical argument works well. "Corrected" is confusing to me as a non-statistician. Any alternative names? I'm leaning slightly toward what @certik suggested.
I didn't realize that NumPy was in the minority. Matlab's var also does N-1 by default (but it's configurable to use N as an option). As a physicist, using N seems more natural to me also, but given that NumPy is the only one that does this, and Matlab, Octave, R, Julia all use N-1, it seems we should do N-1 also for consistency.
The variance of a random variable X is defined as the expected value of the squared deviation from the mean, i.e., Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2. Hence the scaling with N.
I used
NumPy is the only one allowing denominators other than N and N - 1 (through its ddof argument). I have a preference for N - 1.
I am fine with either also, but leaning towards N-1 as well, for consistency with other packages.
The Bessel correction is necessary when the mean of the population is unknown and also needs to be estimated from the samples, which happens to be the case in most practical situations. Since we are estimating the sample mean inside the function var, the denominator N - 1 is the most appropriate for the variance.
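The standard identity behind this (stated here for reference, for i.i.d. samples with population variance $\sigma^2$): when the sample mean $\bar{x}$ stands in for the unknown population mean, the divide-by-N estimator is biased low,

$$ \mathrm{E}\left[\frac{1}{N}\sum_{i=1}^{N}\left(x_i-\bar{x}\right)^2\right] = \frac{N-1}{N}\,\sigma^2 , $$

so dividing by N - 1 instead of N removes the bias.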
There is a prevalent preference for N - 1 as the default. I agree with it. As for the API, @jvdp1 suggested an optional logical argument, which I think is a nice interface. The only downside I can think of is that it requires an extra if-branch relative to an integer argument that could be used directly in the denominator. For this reason I'm slightly in favor of an integer argument -- less code and potentially less overhead.
I don't agree with this statement. If we want a robust function, we need some checks even with an integer argument (otherwise the denominator could become < 0, which would lead to negative results for the variance).
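A minimal sketch of the kind of check this would require, assuming a hypothetical integer argument ddof (not the API proposed above):

subroutine check_denominator(n, ddof)
  ! Guard against a non-positive denominator N - ddof, which would
  ! yield an undefined or negative "variance".
  integer, intent(in) :: n     ! number of samples
  integer, intent(in) :: ddof  ! hypothetical NumPy-style offset
  if (n - ddof <= 0) error stop "var: n - ddof must be positive"
end subroutine check_denominator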
I think you disagree only with the second statement, that we don't need any input validation. You do need an extra if-branch if you pass a logical. That is fine. If ensuring that the correction can only be 0 or 1 is important, then I'm in favor of the logical argument approach.
@jvdp1 are you ok with a logical argument also? If so, then we are all in agreement. We already agree to make N-1 the default. |
I am ok with the logical argument. For the API:

Syntax

result = var(array [, corrected])

Arguments

corrected (optional): logical. If .true. (the default), the denominator is N - 1; if .false., it is N.

I proposed the name corrected because it is the keyword that Julia's var uses for the same switch.
Excellent. I think this API is the way to go. After your and Leon's explanations, the name corrected makes sense to me now.
I thought a bit about the implementation. What would be the best (efficiency/clarity)?

Option 1:

logical, intent(in), optional :: corrected
real(${k1}$) :: correction
...
if (optval(corrected, .true.)) then
  correction = 1._${k1}$
else
  correction = 0._${k1}$
end if
...

Option 2:

logical, intent(in), optional :: corrected
real(${k1}$) :: correction
...
correction = 1._${k1}$
if (.not. optval(corrected, .true.)) correction = 0._${k1}$
...

Option 3:

logical, intent(in), optional :: corrected
real(${k1}$) :: correction
...
correction = 1._${k1}$
if (present(corrected)) then
  if (.not. corrected) correction = 0._${k1}$
end if
...

Other approaches? @milancurcic I can have a look at the implementation and submit a draft PR, if you want.
Both 1 and 2 look good to me, with a slight preference for 2 because of fewer lines of code. I think 3 is unnecessary now that we have optval. We could also do option 4:

correction = merge(1, 0, optval(corrected, .true.))
That is even better, because it is a single line and avoids the explicit branch.
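Putting it together, a sketch of how option 4 could sit in a simplified rank-1, double-precision version (a sketch only, assuming stdlib's optval; module names and the function name are illustrative, not the final stdlib code):

pure function var_sketch(x, corrected) result(res)
  use stdlib_kinds, only: dp
  use stdlib_optval, only: optval
  real(dp), intent(in) :: x(:)
  logical, intent(in), optional :: corrected
  real(dp) :: res, mean_x, correction
  integer :: n

  n = size(x)
  mean_x = sum(x) / n
  ! correction = 1 gives the Bessel-corrected (N - 1) denominator, 0 gives N.
  correction = merge(1._dp, 0._dp, optval(corrected, .true.))
  res = sum((x - mean_x)**2) / (real(n, dp) - correction)
end function var_sketch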
Addition of corrected in API of var (following #149)
Resolved by #151.
Playing with a toy example today, I was surprised to see the result (I guess I should have read the specification first, LOL).
Program:
I expected two identical numbers. Result:
Then I looked at the code for var() and saw that we're dividing the sum of squared deviations by (N - 1), rather than N.
Then I looked at this issue and the spec, and it's all good: it says that the variance is defined in such a way that we're dividing by (N - 1). The code works as advertised.
But then I wondered why N - 1 and not N, and did some Google searching and found that there are all kinds of variances out there and that this particular flavor (dividing by N - 1) is the corrected sample variance, or as described here the best unbiased estimator. Dividing by just N gives the population (biased) variance.
How we define this affects not only the numerical result but, in some cases, the behavior of the program: the divide-by-N variance of a single value is 0, while the divide-by-(N - 1) variance of a single value is NaN (0/0). Some discussion here.
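A minimal, hypothetical program (not the original toy example above) illustrating both the N vs N - 1 difference and the single-element case:

program variance_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: x(5) = [1._dp, 2._dp, 3._dp, 4._dp, 5._dp]
  real(dp) :: one(1) = [42._dp]

  print *, raw_var(x, 0)    ! divide by N      -> 2.0
  print *, raw_var(x, 1)    ! divide by N - 1  -> 2.5
  print *, raw_var(one, 0)  ! single value, divide by N      -> 0.0
  print *, raw_var(one, 1)  ! single value, divide by N - 1  -> NaN (0/0)

contains

  pure function raw_var(x, correction) result(res)
    real(dp), intent(in) :: x(:)
    integer, intent(in) :: correction
    real(dp) :: mean_x, res
    mean_x = sum(x) / size(x)
    res = sum((x - mean_x)**2) / real(size(x) - correction, dp)
  end function raw_var

end program variance_demo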
NumPy's np.var(), for example, computes the divide-by-N variance by default, but you can optionally set a "delta degrees of freedom", so that np.var(x, ddof=1) corresponds to the Bessel-corrected, divide-by-(N - 1) variance. I'm not a statistician but a wave physicist. I expect my variance to divide by N, and NumPy has served me well so far.
Question: Should we consider adding an optional "delta degrees of freedom" to make both statisticians and physicists happy?
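For comparison, a NumPy-style ddof interface in Fortran might look like the sketch below (purely illustrative; ddof, dp, and the function name are assumptions, and this is not what was ultimately adopted):

pure function var_ddof(x, ddof) result(res)
  integer, parameter :: dp = kind(1.0d0)
  real(dp), intent(in) :: x(:)
  integer, intent(in), optional :: ddof
  real(dp) :: res, mean_x
  integer :: ddof_

  ! NumPy-style "delta degrees of freedom": the denominator is N - ddof,
  ! with ddof = 0 by default (plain divide-by-N variance).
  ddof_ = 0
  if (present(ddof)) ddof_ = ddof

  mean_x = sum(x) / size(x)
  res = sum((x - mean_x)**2) / real(size(x) - ddof_, dp)
end function var_ddof

Called as var_ddof(x) for the divide-by-N result, or var_ddof(x, ddof=1) for the Bessel-corrected one.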