Proposal for descriptive statistics #113
Comments
Let's talk about the API of
function mean_sp_sp(vector) result(res)
real(sp), intent(in) :: vector(:)
real(sp) :: res
end function
I assume the user facing function will be just
Regarding the 2D version:
function mean_sp_sp(mat, dim) result(res)
real(sp), intent(in) :: mat(:,:)
integer, intent(in), optional :: dim !dim = 1 or dim = 2
real(sp), allocatable :: res(:)
end function
I would not use Regarding the Let's first talk about the intrinsic
So NumPy's Your |
For me, Matlab and Fortran
I agree for speed, though I think it would be very inconvenient. Should the API be something like this:
function mean_sp_sp(mat, n, dim) result(res)
real(sp), intent(in) :: mat(:,:)
integer, intent(in) :: n
integer, intent(in), optional :: dim !dim = 1 or dim = 2
real(sp) :: res(n) !explicit-shape result; an allocatable result cannot be declared with explicit bounds
end function |
Regarding using
I'm not sure what you mean by "includes" or "excludes", but in case of Fortran, the sum is performed along
real :: a(2,3,4) = 1
print *, shape(sum(a, 2))
end
outputs:
This behavior is both Fortrannic and intuitive. It's consistent with numpy, I don't know about Matlab. It seems to me as the only reasonable behavior. A notable difference between Fortran's stdlib's generic
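To make the shape rule concrete, here is a minimal sketch using only the intrinsic sum. The program name and the checks are illustrative additions, not code from the thread. Reducing along one dimension drops that dimension from the result shape.

```fortran
program sum_along_dim
    implicit none
    real :: a(2,3,4)

    a = 1.0

    ! Reducing along dimension d removes extent d from the shape.
    print *, shape(sum(a, 1))   ! extents of dimensions 2 and 3
    print *, shape(sum(a, 2))   ! extents of dimensions 1 and 3
    print *, shape(sum(a, 3))   ! extents of dimensions 1 and 2

    ! Self-checks: stop with an error if the shapes are not as expected.
    if (any(shape(sum(a, 1)) /= [3, 4])) error stop "dim=1: unexpected shape"
    if (any(shape(sum(a, 2)) /= [2, 4])) error stop "dim=2: unexpected shape"
    if (any(shape(sum(a, 3)) /= [2, 3])) error stop "dim=3: unexpected shape"
end program sum_along_dim
```

A mean that follows the same convention would return an array of rank n-1 whenever dim is given.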
A minor nit-pick about the API: I suggest that we don't insinuate a vector/matrix as input, as they're linear algebra-specific and imply rank in the name. Instead, I suggest we use a more general name, considering we should support any number of dimensions. For arrays I just like plain x:
function mean_sp_sp(x) result(res)
real(sp), intent(in) :: x(:)
real(sp) :: res
end function
For a 2-d array, passing the result size as an input is an unacceptable API in my opinion. Is there any other way we could do assumed-size result for reducing a 2-d (or multi-d) array? Another important point of discussion: Should a mean of an integer array be an integer or a real? |
I believe the mean of an integer array should be dp. The result can be converted if needed. Numpy returns a float from mean(int array) |
I would suggest to implement first the same behaviour as Fortran
Good to know.
I agree. Below is a possible workaround for 2D arrays (I just tried it; this implementation gives the same behaviour as Fortran
interface mean
module function mean_1_sp_sp(x) result(res)
real(sp), intent(in) :: x(:)
real(sp) ::res
end function
...
module function mean_2_all_sp_sp(x) result(res)
real(sp), intent(in) :: x(:,:)
real(sp) ::res
end function mean_2_all_sp_sp
...
module function mean_2_sp_sp(x, dim) result(res)
real(sp), intent(in) :: x(:,:)
integer, intent(in) :: dim
real(sp) :: res(size(x)/size(x, dim))
end function mean_2_sp_sp
...
end interface
IMHO, it should be always a real (for integer and real arrays). |
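As a hedged illustration of how the interface above could be backed by working code, the sketch below puts the three specifics in an ordinary module (not the submodule layout of the proposal) and implements them with the intrinsic sum. The module name mean_sketch and the function bodies are assumptions made for this example. The generic name resolves at compile time: the two rank-2 specifics differ by the non-optional dim argument.

```fortran
module mean_sketch
    use, intrinsic :: iso_fortran_env, only: sp => real32
    implicit none
    private
    public :: mean

    interface mean
        module procedure mean_1_sp_sp
        module procedure mean_2_all_sp_sp
        module procedure mean_2_sp_sp
    end interface mean

contains

    function mean_1_sp_sp(x) result(res)
        real(sp), intent(in) :: x(:)
        real(sp) :: res
        res = sum(x) / real(size(x), sp)
    end function mean_1_sp_sp

    function mean_2_all_sp_sp(x) result(res)
        real(sp), intent(in) :: x(:,:)
        real(sp) :: res
        res = sum(x) / real(size(x), sp)
    end function mean_2_all_sp_sp

    function mean_2_sp_sp(x, dim) result(res)
        real(sp), intent(in) :: x(:,:)
        integer, intent(in) :: dim
        ! Result size = total size divided by the extent being reduced.
        real(sp) :: res(size(x)/size(x, dim))
        res = sum(x, dim) / real(size(x, dim), sp)
    end function mean_2_sp_sp

end module mean_sketch

program demo_mean_sketch
    use, intrinsic :: iso_fortran_env, only: sp => real32
    use mean_sketch, only: mean
    implicit none
    real(sp) :: x(2,3)

    x = reshape([1., 2., 3., 4., 5., 6.], [2, 3])
    print *, mean(x)      ! mean of all elements
    print *, mean(x, 1)   ! one value per column (3 values)
    print *, mean(x, 2)   ! one value per row (2 values)

    if (size(mean(x, 1)) /= 3) error stop "dim=1: unexpected size"
    if (size(mean(x, 2)) /= 2) error stop "dim=2: unexpected size"
end program demo_mean_sketch
```

Note that this sketch does not validate dim; an out-of-range value makes size(x, dim) invalid already in the result declaration.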
Minor comment, I think this last iteration is better than using the optional argument dim. Let the interface handle the look up on function names. I believe optional arguments hit runtime? Thus leading to slow down? |
I don't think we can use |
@milancurcic this interface doesn't use
interface mean
module function mean_1_sp_sp(x) result(res)
real(sp), intent(in) :: x(:)
real(sp) ::res
end function
...
module function mean_2_all_sp_sp(x) result(res)
real(sp), intent(in) :: x(:,:)
real(sp) ::res
end function mean_2_all_sp_sp
...
module function mean_2_sp_sp(x, dim) result(res)
real(sp), intent(in) :: x(:,:)
integer, intent(in) :: dim
real(sp) :: res(size(x)/size(x, dim))
end function mean_2_sp_sp
...
end interface
This has the same behaviour as Fortran |
Sorry @milancurcic for the misunderstanding
Me neither. Any idea how it is implemented for Fortran If the community doesn't find a solution (I don't believe that :) ), should we first implement something for 1D and 2D arrays, and see later how to do it for >2D arrays (as for |
For integer arrays, the API could look like:
module function mean_1_int8_dp(x) result(res)
integer(int8), intent(in) :: x(:)
real(dp) :: res
end function mean_1_int8_dp
If the array is integer, the result will be always |
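A small sketch of why a real result matters for integer input. The function body is an assumption written for this example (the thread only gives the interface); it converts the elements to real(dp) before summing so that neither the accumulation nor the division happens in integer arithmetic.

```fortran
program int_mean_sketch
    use, intrinsic :: iso_fortran_env, only: int8, dp => real64
    implicit none
    integer(int8) :: v(4) = [1_int8, 2_int8, 3_int8, 4_int8]

    print *, mean_1_int8_dp(v)   ! real(dp) result
    print *, sum(v) / size(v)    ! integer division truncates

    if (abs(mean_1_int8_dp(v) - 2.5_dp) > 0.0_dp) error stop "unexpected mean"
    if (sum(v) / size(v) /= 2) error stop "unexpected integer quotient"

contains

    function mean_1_int8_dp(x) result(res)
        integer(int8), intent(in) :: x(:)
        real(dp) :: res
        ! Convert first: summing in int8 could overflow for longer arrays.
        res = sum(real(x, dp)) / real(size(x), dp)
    end function mean_1_int8_dp

end program int_mean_sketch
```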
@milancurcic @certik @leonfoks I found a workaround (using the Fortran
Function:
Issue: I tried to use pure functions. While it compiled well with the manual Makefile, CMake 3.16.1 does not like submodules + pure functions. Am I alone with this issue?
API for
interface mean
module function mean_1_dp_dp(x) result(res)
real(dp), intent(in) :: x(:)
real(dp) :: res
end function mean_1_dp_dp
module function mean_1_int8_dp(x) result(res)
integer(int8), intent(in) :: x(:)
real(dp) :: res
end function mean_1_int8_dp
module function mean_2_all_dp_dp(x) result(res)
real(dp), intent(in) :: x(:,:)
real(dp) :: res
end function mean_2_all_dp_dp
module function mean_2_all_int8_dp(x) result(res)
integer(int8), intent(in) :: x(:,:)
real(dp) :: res
end function mean_2_all_int8_dp
module function mean_2_dp_dp(x, dim) result(res)
real(dp), intent(in) :: x(:,:)
integer, intent(in) :: dim
real(dp) :: res(size(x)/size(x, dim))
end function mean_2_dp_dp
module function mean_2_int8_dp(x, dim) result(res)
integer(int8), intent(in) :: x(:,:)
integer, intent(in) :: dim
real(dp) :: res(size(x)/size(x, dim))
end function mean_2_int8_dp
module function mean_3_all_dp_dp(x) result(res)
real(dp), intent(in) :: x(:,:,:)
real(dp) :: res
end function mean_3_all_dp_dp
module function mean_3_all_int8_dp(x) result(res)
integer(int8), intent(in) :: x(:,:,:)
real(dp) :: res
end function mean_3_all_int8_dp
module function mean_3_dp_dp(x, dim) result(res)
real(dp), intent(in) :: x(:,:,:)
integer, intent(in) :: dim
real(dp) :: res( &
merge(size(x,1),size(x,2),mask = 1 < dim, &
merge(size(x,2),size(x,3),mask = 2 < dim )
end function mean_3_dp_dp
module function mean_3_int8_dp(x, dim) result(res)
integer(int8), intent(in) :: x(:,:,:)
integer, intent(in) :: dim
real(dp) :: res( &
merge(size(x,1),size(x,2),mask = 1 < dim, &
merge(size(x,2),size(x,3),mask = 2 < dim )
end function mean_3_int8_dp
end interface |
@jvdp1 Great! This is exactly the solution I was trying to find yesterday but couldn't figure out. Yes, it looks like it will expand nicely to as many dims as we need. We can do as many as 15, though I never worked with more than 5-d arrays. There's a typo in your interface (missing parentheses):
module function mean_3_dp_dp(x, dim) result(res)
real(dp), intent(in) :: x(:,:,:)
integer, intent(in) :: dim
real(dp) :: res(merge(size(x, 1), size(x, 2), mask = 1 < dim), &
merge(size(x, 2), size(x, 3), mask = 2 < dim))
end function mean_3_dp_dp |
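The merge-based result declaration can be exercised with a short, self-contained sketch. The module name mean3_sketch and the function body are assumptions added here; only the declaration of res is taken from the corrected interface above.

```fortran
module mean3_sketch
    use, intrinsic :: iso_fortran_env, only: dp => real64
    implicit none
contains
    function mean_3_dp_dp(x, dim) result(res)
        real(dp), intent(in) :: x(:,:,:)
        integer, intent(in) :: dim
        ! For each result dimension, skip the extent that is being reduced.
        real(dp) :: res(merge(size(x, 1), size(x, 2), mask = 1 < dim), &
                        merge(size(x, 2), size(x, 3), mask = 2 < dim))
        res = sum(x, dim) / real(size(x, dim), dp)
    end function mean_3_dp_dp
end module mean3_sketch

program demo_mean3
    use mean3_sketch
    implicit none
    real(dp) :: x(2,3,4)
    integer :: d

    x = 1.0_dp
    do d = 1, 3
        print *, "dim =", d, " shape =", shape(mean_3_dp_dp(x, d))
    end do

    if (any(shape(mean_3_dp_dp(x, 1)) /= [3, 4])) error stop "dim=1"
    if (any(shape(mean_3_dp_dp(x, 2)) /= [2, 4])) error stop "dim=2"
    if (any(shape(mean_3_dp_dp(x, 3)) /= [2, 3])) error stop "dim=3"
end program demo_mean3
```

The same pattern extends to higher ranks with one merge per result dimension, comparing k < dim for the k-th one.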
I am not too concerned about performance here. Arguments to Also, for stdlib I'd like to stress and re-iterate, ease of use and a nice API should take priority over performance. Let's worry about making a great API first, then if needed work on performance within the constraints of a great API design. |
I use
Sorry. Too fast copy-paste... |
Me neither. For such functions, I prefer functionality over efficiency. If I need efficiency, I would probably implement it myself anyway.
Currently the proposed API for |
I agree to try to stick to Fortran conventions. Regarding performance, I would say great API and great performance are equal --- we can sacrifice a little bit of one to get a lot of the other, on a case by case basis. We should try not to sacrifice a lot of either. |
I think everybody agrees on the following API of
result = mean(x)
result = mean(x, dim)
with
The same Fortran conventions should be used for other similar functions (median, variance, standard deviation, geometric mean,...). For performance, the current implementation might be good with @certik @milancurcic @leonfoks @ivan-pi @scivision |
Great work on this interface! My only concern is what will be the default behavior if the user passed
Could |
I just added a
call error_stop("ERROR (mean): wrong dimension")
The issue is that the functions cannot be
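One way such a check could look is sketched below. It uses the intrinsic error stop statement instead of stdlib's error_stop so that the example is self-contained; the module name, the select case structure, and the merge-based result size are assumptions for this illustration.

```fortran
module mean_dim_check
    use, intrinsic :: iso_fortran_env, only: dp => real64
    implicit none
contains
    function mean_2_dp_dp(x, dim) result(res)
        real(dp), intent(in) :: x(:,:)
        integer, intent(in) :: dim
        ! merge avoids evaluating size(x, dim) with an out-of-range dim here.
        real(dp) :: res(merge(size(x, 1), size(x, 2), mask = 1 < dim))

        select case (dim)
        case (1, 2)
            res = sum(x, dim) / real(size(x, dim), dp)
        case default
            error stop "ERROR (mean): wrong dimension"
        end select
    end function mean_2_dp_dp
end module mean_dim_check

program demo_dim_check
    use mean_dim_check
    implicit none
    real(dp) :: x(2,3)

    x = reshape([1._dp, 2._dp, 3._dp, 4._dp, 5._dp, 6._dp], [2, 3])
    print *, mean_2_dp_dp(x, 1)
    print *, mean_2_dp_dp(x, 2)
    ! mean_2_dp_dp(x, 3) would terminate the program with the message above.
    if (size(mean_2_dp_dp(x, 1)) /= 3) error stop "dim=1: unexpected size"
    if (size(mean_2_dp_dp(x, 2)) /= 2) error stop "dim=2: unexpected size"
end program demo_dim_check
```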
I think so (at least if auto-parallelisation or OpenMP is used). |
MEAN - mean of array elements
Description
Returns the mean of all the elements of ARRAY, or of the elements of ARRAY along dimension DIM.
Syntax
RESULT = mean(ARRAY)
RESULT = mean(ARRAY, DIM)
Arguments
ARRAY: Must be an array of type INTEGER, or REAL.
DIM (optional): Must be a scalar of type INTEGER with a value in the range from 1 to n, where n equals the rank of ARRAY.
Return value
If ARRAY is of type REAL, the result is of the same type as ARRAY.
If DIM is absent, a scalar with the mean of all elements in ARRAY is returned. Otherwise, an array of rank n-1, where n equals the rank of ARRAY, and a shape similar to that of ARRAY with dimension DIM dropped is returned.
Example
program test
use stdlib_experimental_stat, only: mean
implicit none
real :: x(1:6) = (/ 1., 2., 3., 4., 5., 6. /)
print *, mean(x) !returns 3.5
print *, mean( reshape(x, (/ 2, 3 /) )) !returns 3.5
print *, mean( reshape(x, (/ 2, 3 /) ), 1) !returns (/ 1.5, 3.5, 5.5 /)
end program
@certik @milancurcic @nncarlson Is such a specification document (in Markdown) desired alongside the module? If this API for |
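The values that mean should return for the data in this example can be cross-checked without stdlib, using only the intrinsics sum and size. This is a sketch for verification, not the stdlib implementation; the program name and tolerance are assumptions.

```fortran
program check_mean_values
    implicit none
    real :: x(6) = [1., 2., 3., 4., 5., 6.]
    real :: y(2,3)
    real, parameter :: tol = 1.0e-6

    y = reshape(x, [2, 3])

    print *, sum(x) / size(x)         ! mean of all six elements
    print *, sum(y) / size(y)         ! same data, rank 2
    print *, sum(y, 1) / size(y, 1)   ! mean along dimension 1

    if (abs(sum(x) / size(x) - 3.5) > tol) error stop "rank-1 mean"
    if (abs(sum(y) / size(y) - 3.5) > tol) error stop "rank-2 mean"
    if (any(abs(sum(y, 1) / size(y, 1) - [1.5, 3.5, 5.5]) > tol)) &
        error stop "mean along dim 1"
end program check_mean_values
```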
Thanks @jvdp1!
I'd say yes. Editorial nit-picks:
|
@jvdp1 thank you for starting this! I suggest we put it along side the module for now. Later on, I would like to have some automatic mechanism to parse these semantically (i.e., the tool would understand the sections as well as perhaps even the Fortran code) and produce nice online and pdf documentations. |
These are functions we use often (or even daily) in my field (i.e. quantitative genetics), and therefore we re-implement these functions quite often (or we switch to Octave/R/Julia to compute means, variances, regression coefficients, R2,...). |
Thanks @sblionel for your answer. I do not see a need to have these available as intrinsic procedures, but I do believe having them in a library such as this one can ease the experience for (new) Fortran users. An off-topic question: how does membership in national committees work? Browsing through the documents on the WG5 website, I had the feeling the national committee used to play an important role in driving Fortran development. In Alan Miller's Fortran Software (https://jblevins.org/mirror/amiller/) there are many statistical functions, indicating that Fortran was used in this field. I also have a copy of the book "Programming for the social sciences: Algorithms and Fortran77 Coding" (from 1986), which discusses simple statistical functions. The book "Introduction to Computational Economics Using Fortran" (https://www.ce-fortran.com/) also rolls its own versions of these functions. Today, I am sure the majority of programmers prefer interpreted languages (Python, Julia, Matlab/Octave, R) for such work, or even spreadsheet programs (Excel, GraphPad, Origin, SPSS, etc.). |
Re interpreted languages: I guess that is true for interactive use where
you want to explore the data, but if you run into large amounts of
information (say remote sensing images), a compiled language comes in quite
handy. I think a comprehensive set of statistical functions would be most
welcome. I have scanned Alan Miller's website for ideas myself :). And a
lot of his software is rather more advanced than a mere mean value or other
descriptive statistics.
Another great source of algorithms is the work of Michel Olagnon. A bit of
googling gave me this URL: http://www.fortran-2000.com/rank/ (and similar
ones). Michel used to be an active contributor to comp.lang.fortran.
Regards,
Arjen
|
Briefly, each National Body has its own rules for membership. Typically, you must live in that country or be employed by a company with offices in that country. Each country has its own rules for fees and intellectual property. For Fortran specifically, WG5 (ISO/IEC) delegates the development of the standard to the US NB (PL22.3 aka J3). WG5 sets the feature list and votes on drafts. Practically speaking, there are several non-J3 members who regularly participate in the development work. The only NBs that actively participate in WG5 are Canada, Germany, Japan, UK and US. |
Re interpreted languages: with both stdlib and LFortran fully developed in a few years, I expect the experience with Fortran can be very similar as with Julia or Python in terms of interactive usage. |
In testing my builds, I am having trouble compiling the statistical modules using Makefile.manual. On my machine the Makefile.manual is invoking gfortran, I suspect version 10.2. stdlib_stats_moment and stdlib_stats_var are taking forever to compile. I am not having this slowdown using CMake, which I believe also invokes gfortran. Using Makefile.manual, gfortran is also issuing a number of warnings that I suspect indicate problems for large arrays. Examples of the warnings are
stdlib_stats_moment.f90:26261:12:
Warning: Possible change of value in conversion from INTEGER(8) to REAL(4) at (1) [-Wconversion]
and
stdlib_stats_moment.f90:1312:12:
Warning: Possible change of value in conversion from INTEGER(8) to REAL(4) at (1) [-Wconversion]
I don't think there is any advantage to invoking count and size with kind=int64 if the results are assigned to a variable of kind int32. |
FWIW the command line for the Makefile.manual invocation of gfortran is gfortran -Wall -Wextra -Wimplicit-interface -fPIC -g -fcheck=all -c stdlib_stats_moment.f90 |
That looks to me like n is declared as a real, which is likely an error (although I haven't looked at the code). |
You’re right I misread it.
… On Sep 27, 2020, at 3:20 PM, Brad Richardson wrote:
That looks to me like n is declared as a real, which is likely an error (although I haven't looked at the code).
|
There are hundreds of functions to compile in both submodules. Limiting the RANK to 4 might reduce the compilation time. |
That makes sense. At some point, as low hanging fruit for somebody, I'd recommend putting in the explicit conversion just to declutter the warning messages at least. |
At the time of implementation, I recommended to let compiler use mixed-mode arithmetic and not do explicit conversion. I won't dig out the specific comment and thread, but I vaguely remember that this wasn't a universally preferred opinion and we just went with it. I personally don't appreciate some of gfortran's overly paranoid warnings about correct use of the language. In this specific example, it's okay--it could be useful to a user to know that there is implicit conversion happening. A more grave example is when gfortran warns you about trying to allocate a string on assignment (triggers But it's not only about the compiler. The code may be easier to understand with an explicit |
I personally (I think) prefer explicit conversions between arithmetic, it helps to prevent unintended loss of accuracy. |
@certik There's no loss of accuracy. Compiler will correctly promote the type as per language rules. This is purely about explicit and verbose vs. implicit and concise.
|
@milancurcic I don't think there is any warning in the example you posted:
integer :: a = 3
real :: b = 1.234, c
c = a * b
print *, c
end
Such usage is fine and indeed there is no loss of accuracy. However in this example:
integer :: a = 3, c
real :: b = 1.234
c = a * b
print *, c
end
You get a warning and I personally like this warning, because you lose accuracy, and it might not have been intended by me:
|
Okay, yours is a better example. But explicit cast still doesn't help you:
How would you in this example use explicit cast to get rid of the warning? |
Sorry, I don't think this is a good example either. The relevant example is one upthread:
where you can put that inside of a My point is that adding a |
The example I posted gets fixed by an explicit cast to an int (saying as a user to the compiler that I am explicitly trimming the real to an integer):
integer :: a = 3, c
real :: b = 1.234
c = int(a * b)
print *, c
end
In the example you posted:
The issue is that
integer(4) :: a = 3
real(4) :: b = 1.234
b = a
print *, b
end
but this produces a warning (and there is a possible loss of accuracy if the integer was large enough):
integer(8) :: a = 3
real(4) :: b = 1.234
b = a
print *, b
end
As a user, I personally like that, as it almost always (for me) means I made a mistake and didn't realize there is a loss of accuracy in the conversion. If I want it, I can always type it explicitly, then it is clear to both the (human) reader as well as the compiler what is intended. |
Good point, I didn't consider all the possibilities and especially the one of integer being too large. Jeremie explained here why |
It looks like the warning worked as expected: you didn't realize that there is a possible loss of accuracy. :) And the fix is to put an explicit cast to Update: however the compiler warning should have been worded differently: it should say that there is a possible loss of accuracy because the |
Thank you for the explanations. I will have a look and open a PR with explicit casts for these several warnings. |
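A minimal sketch of what such an explicit cast could look like, assuming a 64-bit element count n and a single-precision divisor as in the warnings quoted earlier in the thread. The variable names are illustrative and not taken from the stdlib sources. According to the reports above, gfortran with -Wall (which enables -Wconversion) warns on the implicit assignment; the explicit real() states the intent.

```fortran
program explicit_cast_sketch
    use, intrinsic :: iso_fortran_env, only: int64, sp => real32
    implicit none
    real(sp) :: x(5) = [1., 2., 3., 4., 5.]
    integer(int64) :: n
    real(sp) :: n_implicit, n_explicit

    n = size(x, kind=int64)

    n_implicit = n             ! implicit INTEGER(8) -> REAL(4) conversion
    n_explicit = real(n, sp)   ! same conversion, written out explicitly

    ! Both forms produce the same value; only the stated intent differs.
    if (n_implicit /= n_explicit) error stop "conversions differ"

    print *, sum(x) / n_explicit
end program explicit_cast_sketch
```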
Now, to be completely precise, you lose accuracy (in principle) any time you assign integer to real even of the same size. Here is an example:
integer(4) :: a = 1234567890
real(4) :: b = 1.234
b = a
print *, a
print *, b
end
Which does not warn, but prints:
So the last digit is now |
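The rounding can be made visible by converting the real back to a wide integer. This sketch assumes IEEE binary32 for real32 with round-to-nearest, where neighbouring representable values near 1.2e9 are 128 apart; the program name is an illustrative choice.

```fortran
program int_to_real_rounding
    use, intrinsic :: iso_fortran_env, only: int32, int64, real32
    implicit none
    integer(int32) :: a = 1234567890
    real(real32) :: b

    b = real(a, real32)

    print *, a
    print *, int(b, int64)                   ! nearest representable value
    print *, int(b, int64) - int(a, int64)   ! rounding error introduced

    ! 1234567890 is not a multiple of 128, so it cannot survive the round trip.
    if (int(b, int64) == int(a, int64)) error stop "no rounding observed"
    if (int(b, int64) /= 1234567936_int64) error stop "unexpected rounding"
end program int_to_real_rounding
```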
Well, this is only a fix for clarity of the code, right? If we really wanted to fix the possible loss of precision, shouldn't we use a
|
The loss of precision would appear at another stage, because the |
Yes. That we have thought about the issue and we "know what we are doing". That it is not an oversight. |
The conversion to real32 has a precision of 2**-24, and so has a round-off error of about 2**-25. It is rare to have a precision this high for statistical measurements: for a standard deviation of 0.1% it would require about 2**30 measurements, i.e., (1/(2**-25/2**-10))**2, but I suppose for some of the fundamental constants it would be important.
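One way to read this estimate, assuming it refers to the standard error of the mean, with a relative standard deviation of about $\sigma \approx 2^{-10}$ (roughly 0.1%) and a round-off error of $\varepsilon \approx 2^{-25}$: the statistical uncertainty of the mean only drops to the round-off level when

```latex
\[
\frac{\sigma}{\sqrt{N}} \approx \varepsilon
\quad\Longrightarrow\quad
N \approx \left(\frac{\sigma}{\varepsilon}\right)^{2}
  = \left(\frac{2^{-10}}{2^{-25}}\right)^{2}
  = \left(2^{15}\right)^{2}
  = 2^{30} \approx 1.07 \times 10^{9}.
\]
```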
|
Indeed. Such operations are mentioned by gfortran with the flag |
Overview
It would be nice to have a module in stdlib that provides functions for computing means, variances, medians, ... of vectors, and of rows (columns) of 2D-arrays (at least).
E.g.,
The same could be implemented for variance, median, ... So the API of all functions would be (almost) the same.
API
Let's discuss the API of only mean for a vector first, and then for an array.
For a vector:
For a 2D array:
If dim = 1, it returns the mean of each row (so res(1:size(mat,1))).
If dim = 2, it returns the mean of each column (so res(1:size(mat,2))).
Here (generated manually with fypp) is an example for mean in stdlib.
The same API could be used for variance, median, cumulative sum, geometric mean, ...
Should we support arrays of rank > 2? E.g., what would mean(mat(:,:,:,:), dim = 3) return?
Should we use functions or subroutines (and overload =)?
The result of the procedure would be of the same kind as the input, and (implicit) conversion would be performed by the user. Functions could then be used.
Alternatively:
For real arrays, procedures would return a result of the same kind, or of a lower kind, of the argument (e.g., a mean of a dp array would return the result in sp or dp). All computations inside the procedure would be performed in the same kind as the input array, and the result would be converted just before the function returns the result.
For integer arrays, procedures would return a result of a real kind (e.g., a mean of an int64 array would return the result in sp, dp, or qp). All computations inside the procedure would be performed in the same kind as the result.
Implementation
Probably most of us have some implementations. @leonfoks also has an implementation for 1D arrays on GitHub.
I would think about a module called stdlib_experimental_stat.f90 and multiple submodules (one per stat, e.g., stdlib_experimental_stat_mean.f90, which contains all functions related to that stat).
The first PR would contain only one stat, e.g. mean, to facilitate the discussion.
Currently in stdlib
mean (mean)
variance (var)
central moment (moment)
Possible additional functions
standard deviation (std)
median (median)
mode (mode)
Others
covariance (cov)
correlation (corr)
Other languages
Matlab
Numpy
Octave
R
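The module-plus-submodule layout proposed under Implementation could be sketched as follows. The names stat_sketch and stat_sketch_mean are hypothetical stand-ins for stdlib_experimental_stat and stdlib_experimental_stat_mean, both units are placed in one file for brevity, and only a single rank-1 specific is shown.

```fortran
! Parent module: declares the generic name and the interfaces only.
module stat_sketch
    use, intrinsic :: iso_fortran_env, only: dp => real64
    implicit none
    private
    public :: mean

    interface mean
        module function mean_1_dp_dp(x) result(res)
            real(dp), intent(in) :: x(:)
            real(dp) :: res
        end function mean_1_dp_dp
    end interface mean
end module stat_sketch

! Submodule: holds the implementations for one statistic.
submodule (stat_sketch) stat_sketch_mean
    implicit none
contains
    module function mean_1_dp_dp(x) result(res)
        real(dp), intent(in) :: x(:)
        real(dp) :: res
        res = sum(x) / real(size(x), dp)
    end function mean_1_dp_dp
end submodule stat_sketch_mean

program demo_stat_sketch
    use, intrinsic :: iso_fortran_env, only: dp => real64
    use stat_sketch, only: mean
    implicit none

    print *, mean([1.0_dp, 2.0_dp, 3.0_dp])
    if (abs(mean([1.0_dp, 2.0_dp, 3.0_dp]) - 2.0_dp) > 1.0e-12_dp) &
        error stop "unexpected mean"
end program demo_stat_sketch
```

With this split, changing an implementation in the submodule does not alter the parent module's interface, so code that uses the module need not be recompiled.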