Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for descriptive statistics #113

Open
jvdp1 opened this issue Jan 19, 2020 · 55 comments
Open

Proposal for descriptive statistics #113

jvdp1 opened this issue Jan 19, 2020 · 55 comments
Labels
implementation Implementation in experimental and submission of a PR specification Discussion and iteration over the API topic: mathematics linear algebra, sparse matrices, special functions, FFT, random numbers, statistics, ...

Comments

@jvdp1
Copy link
Member

jvdp1 commented Jan 19, 2020

Overview

It would be nice to have a module in stdlib that provides functions for computing means,variances, medians, ... of vectors, and of rows (columns) of 2D-arrays (at least).
E.g.,

real :: res
real, allocatable :: res1(:)
real :: vector(5), mat(6,5)
...
res = mean(vector)
res1 = mean(mat) !returns the mean of each row of mat
res1 = mean(mat, dim = 2) !returns the mean of each column of mat

The same could be implemented for variance, median, ... So the API of all functions would be (almost) the same.

API

Let 's discuss the API of only mean for a vector first, and then for an array.

For a vector:

function mean_sp_sp(vector) result(res)
    real(sp), intent(in) :: vector(:)
    real(sp) :: res
end function

For a 2D array:

function mean_sp_sp(mat, dim) result(res)
    real(sp), intent(in) :: mat(:,:)
    integer, intent(in), optional :: dim !dim = 1 or dim = 2 
    real(sp), allocatable :: res(:)
end function

If dim = 1, it returns the mean of each row (so res(1:size(mat,1))).
If dim = 2, it returns the mean of each column (so res(1:size(mat,2))).

Here (generated manually with fypp) is an example for mean in stdlib.

The same API could be used for variance, median, cumulative sum, geometric mean, ...

Should we support arrays of rank > 2? E.g., what would return mean(mat(:,:,:,:), dim =3)?

Should we use functions or subroutine (and overload =)?:

The result of the procedure would be of the same kind as the input, and (implicit) conversion would be performed by the user. Functions could then be used.

Alternatively:
For real arrays, procedures would return a result of the same kind, or of a lower kind, of the argument (e.g., a mean of a dp array would return the result in sp or dp). All computations inside the procedure would be performed in the same kind as the input array, and the result would be converted just before the function returns the result.
For integer arrays, procedures would return a result of a real kind (e.g., a mean of a int64 array would return the result in sp, dp, or qp). All computations inside the procedure would be performed in the same kind as the result.

Implementation

Probably most of us have some implementations. @leonfoks has also an implementation for 1D array on Github.
I would think about a module called stdlib_experimental_stat.f90 and multiple submodules (one per stat, e.g., stdlib_experimental_stat_mean.f90, that contains all functions related with that stat).
The first PR would contain only one stat, e.g. mean to facilitate the discussion.

Currently in stdlib

mean (mean)
variance (var)
central moment (moment)

Possible additional functions

standard deviation (std)
median (median)
mode (mode)

Others

covariance (cov)
correlation (corr)

Other languages

Matlab
Numpy
Octave
R

@jvdp1 jvdp1 added topic: mathematics linear algebra, sparse matrices, special functions, FFT, random numbers, statistics, ... idea Proposition of an idea and opening an issue to discuss it labels Jan 19, 2020
@certik
Copy link
Member

certik commented Jan 19, 2020

Let's talk about the API of mean, as you suggested. This looks good:

function mean_sp_sp(vector) result(res)
    real(sp), intent(in) :: vector(:)
    real(sp) :: res
end function

I assume the user facing function will be just mean.

Regarding the 2D version:

function mean_sp_sp(mat, dim) result(res)
    real(sp), intent(in) :: mat(:,:)
    integer, intent(in), optional :: dim !dim = 1 or dim = 2 
    real(sp), allocatable :: res(:)
end function

I would not use allocatable result. For speed reasons, the lowest level API should assume the user knows the length.

Regarding the dim parameter name, we have to be careful of its meaning.

Let's first talk about the intrinsic sum function that's already available and try to understand how that works:

  • Fortran sum: the dim is an integer that excludes that dimension from the sum
  • NumPy sum: the axis is either or an array of integers, in both cases it includes those dimensions in the sum
  • Matlab sum: the dim is an integer that includes the dimension in the sum.

So NumPy's axis as well as Matlab's dim seem to be equivalent. Fortran, on the other hand, chose the opposite convention of using the dim to exclude a dimension.

Your mean above uses a Matlab's convention for mean. NumPy seems to be using a different convention. Not sure which one to choose here.

@jvdp1
Copy link
Member Author

jvdp1 commented Jan 19, 2020

  • Fortran sum: the dim is an integer that excludes that dimension from the sum
  • NumPy sum: the axis is either or an array of integers, in both cases it includes those dimensions in the sum
  • Matlab sum: the dim is an integer that includes the dimension in the sum.

So NumPy's axis as well as Matlab's dim seem to be equivalent. Fortran, on the other hand, chose the opposite convention of using the dim to exclude a dimension.

Your mean above uses a Matlab's convention for mean. NumPy seems to be using a different convention. Not sure which one to choose here.

For me, Matlab and Fortran dim option for sum behave in the same way (e.g., sum(mat(1:4,1:2),dim=2) will return an array of dimention 4). However, the default behaviour is different between Fortran (returns the sum of the whole array) and Matlab (returns the sum for dim = 1). After checking, my proposition does not follow these rules of dim.
So I would propose to use the same behaviour as Fortran sum (with the same name dim).
It would be also in agreement with the function size and it behaviour with dim.

I would not use allocatable result. For speed reasons, the lowest level API should assume the user knows the length.

I agree for speed, while I think it would be very inconvenient. Should the API be something like that:

function mean_sp_sp(mat, n, dim) result(res)
    real(sp), intent(in) :: mat(:,:)
    integer, intent(in) :: n
    integer, intent(in), optional :: dim !dim = 1 or dim = 2 
    real(sp), allocatable :: res(n)
end function

@milancurcic
Copy link
Member

Regarding using allocatable as result, I think we can adopt a rule of thumb to not use allocatable unless necessary to make it work, but the priority should be ease of use. If assumed-size result requires more complex API, we should use an allocatable result.

Fortran sum: the dim is an integer that excludes that dimension from the sum

I'm not sure what you mean by "includes" or "excludes", but in case of Fortran, the sum is performed along dim:

real :: a(2,3,4) = 1
print *, shape(sum(a, 2))
end

outputs:

           2           4

This behavior is both Fortrannic and intuitive. It's consistent with numpy, I don't know about Matlab. It seems to me as the only reasonable behavior.

A notable difference between Fortran's sum(x, dim) and Python's np.sum(x, axis) (I don't know about Matlab) is that Fortran always reduces the rank by 1 (dim must be a scalar). With numpy, axis can be a list or a tuple to perform reduction along multiple axes.

stdlib's generic mean could accept both a scalar or a rank-1 array for dim:

  • If scalar, behaves the same way as for sum and others;
  • If array, it invokes the scalar version iteratively to reduce multiple dimensions in a single call; (however perhaps this precludes assumed-size result)

A minor nit-pick about the API, I suggest that we don't insinuate a vector/matrix as input, as they're linear algebra-specific and imply rank in the name. Instead, I suggest use a more general name considering we should support any number of dimensions. For arrays I just like plain x:

function mean_sp_sp(x) result(res)
    real(sp), intent(in) :: x(:)
    real(sp) :: res
end function

For a 2-d array, passing the result size as an input is an unacceptable API in my opinion. Is there any other way we could do assumed-size result for reducing a 2-d (or multi-d) array?

Another important point of discussion: Should a mean of an integer array be an integer or a real?

@leonfoks
Copy link

I believe the mean of an integer array should be dp. The result can be converted if needed. Numpy returns a float from mean(int array)

@jvdp1
Copy link
Member Author

jvdp1 commented Jan 19, 2020

A notable difference between Fortran's sum(x, dim) and Python's np.sum(x, axis) (I don't know about Matlab) is that Fortran always reduces the rank by 1 (dim must be a scalar). With numpy, axis can be a list or a tuple to perform reduction along multiple axes.

stdlib's generic mean could accept both a scalar or a rank-1 array for dim:

  • If scalar, behaves the same way as for sum and others;
  • If array, it invokes the scalar version iteratively to reduce multiple dimensions in a single call; (however perhaps this precludes assumed-size result)

I would suggest to implement first the same behaviour as Fortran sum. We can see how to implement the Python behaviour later.

A minor nit-pick about the API, I suggest that we don't insinuate a vector/matrix as input, as they're linear algebra-specific and imply rank in the name. Instead, I suggest use a more general name considering we should support any number of dimensions. For arrays I just like plain x:

function mean_sp_sp(x) result(res)
    real(sp), intent(in) :: x(:)
    real(sp) :: res
end function

Good to know.

For a 2-d array, passing the result size as an input is an unacceptable API in my opinion. Is there any other way we could do assumed-size result for reducing a 2-d (or multi-d) array?

I agree. Below is a possible workaround for 2D arrays (I just tried it; this implementation gives the same behaviour as Fortran sum):

interface mean
    module function mean_1_sp_sp(x) result(res)
        real(sp), intent(in) :: x(:)
        real(sp) ::res
    end function
...
    module function mean_2_all_sp_sp(x) result(res)
        real(sp), intent(in) :: x(:,:)
        real(sp) ::res
    end function mean_2_all_sp_sp
...
    module function mean_2_sp_sp(x, dim) result(res)
        real(sp), intent(in) :: x(:,:)
        integer, intent(in) :: dim
        real(sp) :: res(size(x)/size(x, dim))
    end function mean_2_sp_sp
...
end interface

Another important point of discussion: Should a mean of an integer array be an integer or a real?

IMHO, it should be always a real (for integer and real arrays).

@leonfoks
Copy link

Minor comment, I think this last iteration is better than using the optional argument dim. Let the interface handle the look up on function names. I believe optional arguments hit runtime? Thus leading to slow down?

@milancurcic
Copy link
Member

I don't think we can use optional because the result is a scalar if dim is provided and array otherwise, and we need to specify that at compile-time.

@jvdp1
Copy link
Member Author

jvdp1 commented Jan 19, 2020

@milancurcic this interface don't use optional (and neither allocatable; the result is a scalar if dim is NOT provided, and an array otherwise; the interface handle that):

interface mean
    module function mean_1_sp_sp(x) result(res)
        real(sp), intent(in) :: x(:)
        real(sp) ::res
    end function
...
    module function mean_2_all_sp_sp(x) result(res)
        real(sp), intent(in) :: x(:,:)
        real(sp) ::res
    end function mean_2_all_sp_sp
...
    module function mean_2_sp_sp(x, dim) result(res)
        real(sp), intent(in) :: x(:,:)
        integer, intent(in) :: dim
        real(sp) :: res(size(x)/size(x, dim))
    end function mean_2_sp_sp
...
end interface

This has the same behaviour as Fortran sum for 1D and 2D arrays

@milancurcic
Copy link
Member

@jvdp1 Yes, I was responding to @leonfoks regarding optional. I think your interface is the way to go.

I couldn't find a viable solution for 3 or higher dimensional input arrays though.

@jvdp1
Copy link
Member Author

jvdp1 commented Jan 19, 2020

@jvdp1 Yes, I was responding to @leonfoks regarding optional. I think your interface is the way to go.

Sorry @milancurcic for the misunderstanding

I couldn't find a viable solution for 3 or higher dimensional input arrays though.

Me neither. Any idea how it is implemented for Fortran sum?

If the community doesn't find a solution (I don't believe that :) ), should we first implement something for 1D and 2D arrays, and see later how to do it for >2D arrays (as for loadtxt and savetxt)?

@jvdp1
Copy link
Member Author

jvdp1 commented Jan 19, 2020

For integer arrays, the API could look like:

    module function mean_1_int8_dp(x) result(res)
        integer(int8), intent(in) :: x(:)
        real(dp) :: res
    end function mean_1_int8_dp

If the array is integer, the result will be always dp.

@jvdp1
Copy link
Member Author

jvdp1 commented Jan 20, 2020

@milancurcic @certik @leonfoks I found a workaround (using the Fortran merge; edit: simpler version than the initial one) for 3D arrays (without using a function with an allocatable array) when dim is always a scalar. I am not sure about the performance of this workaround, but it allows the function mean to have the same behaviour as the Fortran sum, and it could be easily extended to ranks >3.

Function: mean(x) or mean(x, dim) (for 1D, 2D, 3D arrays)
As it is now implemented here, it returns the mean of array elements of 1D, 2D, 3D arrays.
If dim is absent, a scalar with the mean of all elements in x is returned.
if dim is present, an array of rank n-1, where n equal the rank of x, and a shape similar to that of x with dimension dim droppped is returned. The returned array contains the mean of the elements of x along dimension dim.
The result is of the same type as the array for real arrays, or is dp for integer arrays.

Issue: I tried to use pure functions. While it compiled well with manual Makefile, CMake 3.16.1 does not like submodules + pure functions. Am I alone with this issue?

API for dp

interface mean
    module function mean_1_dp_dp(x) result(res)
        real(dp), intent(in) :: x(:)
        real(dp) :: res
    end function mean_1_dp_dp

    module function mean_1_int8_dp(x) result(res)
        integer(int8), intent(in) :: x(:)
        real(dp) :: res
    end function mean_1_int8_dp


    module function mean_2_all_dp_dp(x) result(res)
        real(dp), intent(in) :: x(:,:)
        real(dp) :: res
    end function mean_2_all_dp_dp

    module function mean_2_all_int8_dp(x) result(res)
        integer(int8), intent(in) :: x(:,:)
        real(dp) :: res
    end function mean_2_all_int8_dp

    module function mean_2_dp_dp(x, dim) result(res)
        real(dp), intent(in) :: x(:,:)
        integer, intent(in) :: dim
        real(dp) :: res(size(x)/size(x, dim))
    end function mean_2_dp_dp

    module function mean_2_int8_dp(x, dim) result(res)
        integer(int8), intent(in) :: x(:,:)
        integer, intent(in) :: dim
        real(dp) :: res(size(x)/size(x, dim))
    end function mean_2_int8_dp


module function mean_3_all_dp_dp(x) result(res)
    real(dp), intent(in) :: x(:,:,:)
    real(dp) :: res
end function mean_3_all_dp_dp

module function mean_3_all_int8_dp(x) result(res)
    integer(int8), intent(in) :: x(:,:,:)
    real(dp) :: res
end function mean_3_all_int8_dp

module function mean_3_dp_dp(x, dim) result(res)
    real(dp), intent(in) :: x(:,:,:)
    integer, intent(in) :: dim
    real(dp) :: res( &
                  merge(size(x,1),size(x,2),mask = 1 < dim, &
                  merge(size(x,2),size(x,3),mask = 2 < dim )
end function mean_3_dp_dp

module function mean_3_int8_dp(x, dim) result(res)
    integer(int8), intent(in) :: x(:,:,:)
    integer, intent(in) :: dim
    real(dp) :: res( &
                  merge(size(x,1),size(x,2),mask = 1 < dim, &
                  merge(size(x,2),size(x,3),mask = 2 < dim )
end function mean_3_int8_dp

end interface

@milancurcic
Copy link
Member

@jvdp1 Great! This is the exactly the solution I was trying to find yesterday but couldn't figure it out. Yes, it looks like it will expand nicely to as many dims as we need. We can do as many as 15, though I never worked with more than 5-d arrays. There's a typo in your interface (missing parentheses):

module function mean_3_dp_dp(x, dim) result(res)
    real(dp), intent(in) :: x(:,:,:)
    integer, intent(in) :: dim
    real(dp) :: res(merge(size(x, 1), size(x, 2), mask = 1 < dim), &
                    merge(size(x, 2), size(x, 3), mask = 2 < dim))
end function mean_3_dp_dp

@milancurcic
Copy link
Member

I am not too concerned about performance here. Arguments to merge are all scalar so this should reduce internally to if-branches at compile time.

Also, for stdlib I'd like to stress and re-iterate, easy of use and nice API should take priority over performance. Let's worry about making a great API first, then if needed work on performance within constraints of a great API design.

@jvdp1
Copy link
Member Author

jvdp1 commented Jan 20, 2020

Yes, it looks like it will expand nicely to as many dims as we need. We can do as many as 15, though I never worked with more than 5-d arrays.

I use fypp to generate the files. I will give a try to extend them to 15 dimensions.

There's a typo in your interface (missing parentheses):

Sorry. Too fast copy-paste...

@jvdp1
Copy link
Member Author

jvdp1 commented Jan 20, 2020

I am not too concerned about performance here. Arguments to merge are all scalar so this should reduce internally to if-branches at compile time.

Me neither. For such functions, I prefer the functionality than efficiency. If I need efficiency, I would probably implement it myself anyway.

Also, for stdlib I'd like to stress and re-iterate, easy of use and nice API should take priority over performance. Let's worry about making a great API first, then if needed work on performance within constraints of a great API design.

Currently the proposed API for mean is the same as Fortran sum, maxval, minval (if we ignore the mask argument).
Would it be possible to implement a mask in mean?

@certik
Copy link
Member

certik commented Jan 20, 2020

I agree to try to stick to Fortran conventions. Regarding performance, I would say great API and great performance are equal --- we can sacrifice a little bit of one to get a lot of the other, on a case by case basis. We should try not to sacrifice a lot of either.

@jvdp1
Copy link
Member Author

jvdp1 commented Jan 20, 2020

I think everybody agrees on the following API of mean which follows Fortran conventions, and similar to Fortran sum, minval, maxval,...

result = mean(x)
result = mean(x, dim)

with x being an array of type 'integer' and 'real', and 'dim' a scalar with a value in the range from 1 to n (with n <=15), where n equals the rank of x.

The same Fortran conventions should be used for other similar functions (median, variance, standard deviation, geometric mean,...).

For performance, the current implementation might be good with merge. I tried to use pure functions, but it seems there is an issue with CMake and submodules. do concurrent could then be used.

@certik @milancurcic @leonfoks @ivan-pi @scivision
Should I write a spec and submit a PR? Or do we need another iterate on the API?

@ivan-pi
Copy link
Member

ivan-pi commented Jan 20, 2020

Great work on this interface!

My only concern is what will be the default behavior if the user passed dim is larger than the rank of the array? Should the code simply fail?

do concurrent could then be used.

Could do concurrent modify the calculation sequence, meaning slightly different round-off errors upon each evaluation? (e.g. when compiled with the -qopenmp flag in Intel Fortran).

@jvdp1
Copy link
Member Author

jvdp1 commented Jan 20, 2020

My only concern is what will be the default behavior if the user passed dim is larger than the rank of the array? Should the code simply fail?

I just added a

call error_stop("ERROR (mean): wrong dimension")

The issue is that the functions cannot be pure anymore.

Could do concurrent modify the calculation sequence, meaning slightly different round-off errors upon each evaluation? (e.g. when compiled with the -qopenmp flag in Intel Fortran).

I think so (at least if auto-parallelisation or OpenMP is used).

@jvdp1
Copy link
Member Author

jvdp1 commented Jan 21, 2020

MEAN - mean of array elements

Description

Returns the mean of all the elements of ARRAY, or of the elements of ARRAY along dimension DIM.

Syntax

RESULT = mean(ARRAY)

RESULT = mean(ARRAY, DIM)

Arguments

ARRAY: Must be an array of type INTEGER, or REAL.

DIM (optional): Must be a scalar of type INTEGER with a value in the range from 1 to n, where n equals the rank of ARRAY.

Return value

If ARRAY is of type REAL, the result is of the same type as ARRAY.
If ARRAY is of type INTEGER, the result is of type as double precision.

If DIM is absent, a scalar with the mean of all elements in ARRAY is returned. Otherwise, an array of rank n-1, where n equals the rank of ARRAY, and a shape similar to that of ARRAY with dimension DIM dropped is returned.

Example

program test
    use stdlib_experimental_stat, only: mean
    implicit none
    real :: x(1:6) = (/ 1., 2., 3., 4., 5., 6. /)
    print *, mean(x)                            !returns 21.
    print *, mean( reshape(x, (/ 2, 3 /) ))     !returns 21.
    print *, mean( reshape(x, (/ 2, 3 /) ), 1)  !returns (/ 3., 7., 11. /) 
end program

@certik @milancurcic @nncarlson Is such a specification document (in Markdown) desired alongside the module?

If this API for mean is acceptable and also the document, I will create a PR for discussing its implementation (style, efficiency, ...). I implemented it up to 15 dimensions (with fypp).
When the implementation will be ok, the implementation of a mask should be easy.
This mean function may also be used as a reference for other descriptive statistics (e.g., variance).

@milancurcic
Copy link
Member

Thanks @jvdp1!

Is such a specification document (in Markdown) desired alongside the module?

I'd say yes.

Editorial nit-picks:

  • I recommend using monospace lowercase instead of uppercase, for example array and dim instead of ARRAY and DIM;
  • Use modern array constructor syntax, for example [2, 3] instead of (/2, 3/). This should go into style guide as well;
  • Regarding module name, I suggest stats instead of stat, which can be confused for "status".

@certik
Copy link
Member

certik commented Jan 21, 2020

@jvdp1 thank you for starting this!

I suggest we put it along side the module for now. Later on, I would like to have some automatic mechanism to parse these semantically (i.e., the tool would understand the sections as well as perhaps even the Fortran code) and produce nice online and pdf documentations.

@jvdp1 jvdp1 added specification Discussion and iteration over the API and removed idea Proposition of an idea and opening an issue to discuss it labels Jan 21, 2020
@jvdp1
Copy link
Member Author

jvdp1 commented Jul 28, 2020

I don't have an adequate background in statistics to know if people writing statistics applications in Fortran would find it useful to have these functions in the language rather than in a separate library, nor even if that is a sizeable body of users. Fortran's emphasis is on scientific and engineering programming, of which HPC is closely related. I could certainly see these in some sort of stdlib - what would be the advantage of making them intrinsics? Would programmers find that these simple interfaces are adequate for their needs, or are they likely to look for something with more options?

The committee members largely have engineering/science backgrounds, so we don't have a lot of experience with other application fields, and this complicates our ability to judge usefulness of features aimed at other fields. We're always looking for new members, especially if they add diversity to the committee background. Unfortunately, the structure of standards work, especially in the US (but also in some other countries) places financial burdens on those who wish to contribute. J3 (US) works around this by naming people as alternates, which costs nothing, but alternates can't vote if their principal is present.

It's certainly something worth thinking about, but my preference would be to have it be in stdlib first and see how well it is accepted.

These are functions we use often (or even daily) in my field (i.e. quantitative genetics), and therefore we re-implement these functions quite often (or we swicht to Octave/R/Julia to compute means, variances, regression coefficients, R2,...).
Having them as intrinsics could be useful, or at least in stdlib (e.g. as here, with many options).
For me, the main advantage of having them as instrinsics would be to not link a simple program to a librabry out of which I just need var or moment functions.

@ivan-pi
Copy link
Member

ivan-pi commented Jul 29, 2020

Thanks @sblionel for your answer. I do not see a need to have these available as intrinsic procedures, but I do believe having them in a library such as this one can ease the experience for (new) Fortran users. An off-topic question: how does membership in national committees work? Browsing through the documents on the WG5 website, I had the feeling the national committee used to play an important role in driving Fortran development.

In Alan Miller's Fortran Software there are many statistical functions, indicating that Fortran was used in this field. I also have a copy of the book "Programming for the social sciences: Algorithms and Fortran77 Coding" (from 1986), which discusses simple statistical functions. The book "Introduction to Computational Economics Using Fortran" also rolls its own versions of these functions.

Today, I am sure the majority of programmers prefer interpreted languages (Python, Julia, Matlab/Octave, R) for such work, or even spreadsheet programs (Excel, GraphPad, Origin, SPSS, etc.).

@arjenmarkus
Copy link
Member

arjenmarkus commented Jul 29, 2020 via email

@sblionel
Copy link
Member

An off-topic question: how does membership in national committees work? Browsing through the documents on the WG5 website, I had the feeling the national committee used to play an important role in driving Fortran development.

Briefly, each National Body has its own rules for membership. Typically, you must live in that country or be employed by a company with offices in that country. Each country has its own rules for fees and intellectual property. For Fortran specifically, WG5 (ISO/IEC) delegates the development of the standard to the US NB (PL22.3 aka J3). WG5 sets the feature list and votes on drafts. Practically speaking, there are several non-J3 members who regularly participate in the development work. The only NBs that actively participate in WG5 are Canada, Germany, Japan, UK and US.

@certik
Copy link
Member

certik commented Jul 29, 2020

Re interpreted languages: with both stdlib and LFortran fully developed in a few years, I expect the experience with Fortran can be very similar as with Julia or Python in terms of interactive usage.

@wclodius2
Copy link
Contributor

In testing my builds, I am having troubles compiling the statistical modules using Makefile.manual. On my machine the Makefile.manual is invoking gfortran, I suspect version 10.2. stdlib_stats_moment and stdlib_stats_var are taking forever to compile. I am not having this slowdown using Cmake, which I believe also invokes gfortran. Using Makefile.manual gfortran is also issuing a number of warnings that I suspect indicate problems for large arrays. Examples of the warnings are

stdlib_stats_moment.f90:26261:12:

     n = count(mask, kind = int64)
        1

Warning: Possible change of value in conversion from INTEGER(8) to REAL(4) at (1) [-Wconversion]

and

stdlib_stats_moment.f90:1312:12:

     n = size(x, kind = int64)
        1

Warning: Possible change of value in conversion from INTEGER(8) to REAL(4) at (1) [-Wconversion]

I don't think there is any advantage to invoking count and size with kind=int64 if the results are assigned to a variable of kind int32.

@wclodius2
Copy link
Contributor

FWIW the command line for the Makefile.manual invocation of gfortran is

gfortran -Wall -Wextra -Wimplicit-interface -fPIC -g -fcheck=all -c stdlib_stats_moment.f90

@everythingfunctional
Copy link
Member

That looks to me like n is declared as a real, which is likely an error (although I haven't looked at the code).

@wclodius2
Copy link
Contributor

wclodius2 commented Sep 27, 2020 via email

@jvdp1
Copy link
Member Author

jvdp1 commented Sep 27, 2020

That looks to me like n is declared as a real, which is likely an error (although I haven't looked at the code).

n is the denominator of (possibly) multiple operations (divisions) between reals (sp,dp, or qp). Therefore, storing n as real requires only 1 conversion.

In testing my builds, I am having troubles compiling the statistical modules using >Makefile.manual. On my machine the Makefile.manual is invoking gfortran, I suspect >version 10.2. stdlib_stats_moment and stdlib_stats_var are taking forever to compile.

There are hundreds of functions to compile in both submodules. Limilting the RANK to 4 might reduce the compilation time.

@everythingfunctional
Copy link
Member

n is the denominator of (possibly) multiple operations (divisions) between reals (sp,dp, or qp). Therefore, storing n as real requires only 1 conversion.

That makes sense. At some point, as low hanging fruit for somebody, I'd recommend putting in the explicit conversion just to declutter the warning messages at least.

@milancurcic
Copy link
Member

At some point, as low hanging fruit for somebody, I'd recommend putting in the explicit conversion just to declutter the warning messages at least.

At the time of implementation, I recommended to let compiler use mixed-mode arithmetic and not do explicit conversion. I won't dig out the specific comment and thread, but I vaguely remember that this wasn't a universally preferred opinion and we just went with it.

I personally don't appreciate some of gfortran's overly paranoid warnings about correct use of the language. In this specific example, it's okay--it could be useful to a user to know that there is implicit conversion happening. A more grave example is when gfortran warns you about trying to allocate a string on assignment (triggers -Wuninitialized). This is why I prefer carefully disabling some warnings rather than adding unnecessary explicit stuff to the code just to make the compiler happy.

But it's not only about the compiler. The code may be easier to understand with an explicit real() conversion. It's just that I am so used to mixed-mode arithmetic that redundant real() calls only add noise. But that's just me. If people prefer explicit, although unnecessary, type conversions in the code, I agree with the change.

@certik
Copy link
Member

certik commented Sep 28, 2020

I personally (I think) prefer explicit conversions between arithmetic, it helps to prevent unintended loss of accuracy.

@milancurcic
Copy link
Member

@certik There's no loss of accuracy. Compiler will correctly promote the type as per language rules. This is purely about explicit and verbose vs. implicit and concise.

$ cat mixed_mode1.f90 
integer :: a = 3
real :: b = 1.234, c
c = a * b
print *, c
end
$ cat mixed_mode2.f90 
integer :: a = 3
real :: b = 1.234, c
c = real(a) * b
print *, c
end
$ gfortran -S mixed_mode1.f90
$ gfortran -S mixed_mode2.f90
$ diff mixed_mode1.s mixed_mode2.s 
1c1
< 	.file	"mixed_mode1.f90"
---
> 	.file	"mixed_mode2.f90"
5c5
< 	.string	"mixed_mode1.f90"
---
> 	.string	"mixed_mode2.f90"

@certik
Copy link
Member

certik commented Sep 28, 2020

@milancurcic I don't think there is any warning in the example you posted:

integer :: a = 3
real :: b = 1.234, c
c = a * b
print *, c
end

Such usage is fine and indeed there is no loss of accuracy. However in this example:

integer :: a = 3, c
real :: b = 1.234
c = a * b
print *, c
end

You get a warning and I personally like this warning, because you lose accuracy, and it might not have been intended by me:

$ gfortran -Wall a.f90 
a.f90:3:4:

 c = a * b
    1
Warning: Possible change of value in conversion from REAL(4) to INTEGER(4) at (1) [-Wconversion]

@milancurcic
Copy link
Member

Okay, yours is a better example. But explicit cast still doesn't help you:

$ cat mixed_mode3.f90 
integer :: a = 3, c
real :: b = 1.234
c = real(a) * b
print *, c
end
$ gfortran -Wall mixed_mode3.f90 
mixed_mode3.f90:3:4:

 c = real(a) * b
    1
Warning: Possible change of value in conversion from REAL(4) to INTEGER(4) at (1) [-Wconversion]

How would you in this example use explicit cast to get rid of the warning?

@milancurcic
Copy link
Member

Sorry, I don't think this is a good example either.

The relevant example is one upthread:

     n = size(x, kind = int64)
        1
Warning: Possible change of value in conversion from INTEGER(8) to REAL(4) at (1) [-Wconversion]

where you can put that inside of a real() as Brad suggested.

My point is that adding a real() here won't do anything for precision, but only for readability.

@certik
Copy link
Member

certik commented Sep 28, 2020

The example I posted gets fixed by explicit cast to an int (saying as a user to the compiler that I am explicitly trimming the real to an integer):

integer :: a = 3, c
real :: b = 1.234
c = int(a * b)
print *, c
end

In the example you posted:

     n = size(x, kind = int64)

The issue is that n must be real(4) (btw, shouldn't that be an integer)? The warnings comes from the fact that the 64bit integer can't always fit into 32 bit real, and thus you lose accuracy. I.e., this works (no warning, no loss of accuracy):

integer(4) :: a = 3
real(4) :: b = 1.234
b = a
print *, b
end

but this produces a warning (and there is a possible loss of accuracy if the integer was large enough):

integer(8) :: a = 3
real(4) :: b = 1.234
b = a
print *, b
end

As a user, I personally like that, as it almost always (for me) means I made a mistake and didn't realize there is a loss of accuracy in the conversion. If I want it, I can always type it explicitly, then it is clear to both the (human) reader as well as the compiler what is intended.

@milancurcic
Copy link
Member

Good point, I didn't consider all the possibilities and especially the one of integer being too large. Jeremie explained here why n is real.

@certik
Copy link
Member

certik commented Sep 28, 2020

It looks like the warning worked as expected: you didn't realize that there is a possible loss of accuracy. :) And the fix is to put an explicit cast to real(4) in the code, then it is clear to everybody.

Update: however the compiler warning should have been worded differently: it should say that there is a possible loss of accuracy because the integer(8) might not always fit into real(4).

@jvdp1
Copy link
Member Author

jvdp1 commented Sep 28, 2020

It looks like the warning worked as expected: you didn't realize that there is a possible loss of accuracy. :) And the fix is to put an explicit cast to real(4) in the code, then it is clear to everybody.

Thank you for the explanations. I will have a look and open a PR with explicit casts for these several warnings.

@certik
Copy link
Member

certik commented Sep 28, 2020

Now, to be completely precise, you lose accuracy (in principle) any time you assign integer to real even of the same size. Here is an example:

integer(4) :: a = 1234567890
real(4) :: b = 1.234
b = a
print *, a
print *, b
end

Which does not warn, but prints:

  1234567890
   1.23456794E+09

So the last digit is now 4 instead of 0. But I assume this is such a common operation (32bit integer to 32bit real) that the compiler does not warn by default (you only lose "a little" of accuracy), and only warns if you assign a 64bit integer to 32bit real, as you might lose a lot (half the digits of the integer number in some cases I think).

@milancurcic
Copy link
Member

And the fix is to put an explicit cast to real(4) in the code, then it is clear to everybody.

Well, this is only a fix for clarity of the code, right?

If we really wanted to fix the possible loss of precision, shouldn't we use a real64 for n to accommodate very large arrays?

$ cat huge.f90 
use iso_fortran_env
print *, huge(1_int32), huge(1_int64)
print *, real(huge(1_int64), kind=real32)
print *, real(huge(1_int64), kind=real64)
end
$ gfortran -Wall huge.f90 && ./a.out 
huge.f90:3:14:

 print *, real(huge(1_int64), kind=real32)
              1
Warning: Change of value in conversion from ‘INTEGER(8)’ to ‘REAL(4)’ at (1) [-Wconversion]
huge.f90:4:14:

 print *, real(huge(1_int64), kind=real64)
              1
Warning: Change of value in conversion from ‘INTEGER(8)’ to ‘REAL(8)’ at (1) [-Wconversion]
  2147483647  9223372036854775807
   9.22337204E+18
   9.2233720368547758E+018

@jvdp1
Copy link
Member Author

jvdp1 commented Sep 28, 2020

If we really wanted to fix the possible loss of precision, shouldn't we use a real64 for n to accommodate very large arrays?

The loss of precision would appear at another stage, because the n is used as the denominator in the result of the function (that is real(int32) in this case), right?
The kind=int64 in the intrinsic size is used to avoid issues with arrays of size that does not fit in int32 (which can be easily reached, especially when multiple dimensions are used).

@certik
Copy link
Member

certik commented Sep 28, 2020

Well, this is only a fix for clarity of the code, right?

Yes. That we have thought about the issue and we "know what we are doing". That it is not an oversight.

@wclodius2
Copy link
Contributor

wclodius2 commented Sep 28, 2020 via email

@jvdp1
Copy link
Member Author

jvdp1 commented Sep 29, 2020

So the last digit is now 4 instead of 0. But I assume this is such a common operation (32bit integer to 32bit real) that the compiler does not warn by default (you only lose "a little" of accuracy),

Indeed. Such operations are mentioned by gfortran with the flag -Wconversion-extra (and there are many of them in stdlib )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
implementation Implementation in experimental and submission of a PR specification Discussion and iteration over the API topic: mathematics linear algebra, sparse matrices, special functions, FFT, random numbers, statistics, ...
Projects
None yet
Development

No branches or pull requests

9 participants