Initial round of probability distributions and statistical functions #240

Jim-215-Fisher · 2020-10-10T22:15:37Z

This is the first round of probability distributions and statistical functions. Anyone is welcome to review it. The following source files were added:

src/stats_distribution_rvs.fypp
src/stats_distribution.fypp
src/stats_distribution_implementation.fypp

The specs documentation was added:

doc/specs/stdlib_stats_distribution.md

The test program was added:

src/tests/stats/test_distribution.f90

The following compilation files were modified:

src/CMakeLists.txt
src/Makefile.manual
src/tests/stats/CMakeLists.txt
src/tests/stats/Makefile.manual
doc/specs/index.md

Initial implementation of probability distributions and statistical functions

Jim-215-Fisher · 2020-10-12T14:41:59Z

I realized that the Fortran intrinsic procedures `random_seed` and `random_number` may not use the same algorithm on different platforms and operating systems. I will update the random number generator.

TODO: modify the random_seed function to avoid using standard random_number function call
TODO: create new random number generator for random distribution.

milancurcic · 2020-10-12T15:06:10Z

Thanks a lot @Jim-215-Fisher! I will review it. On first look I only noticed two things:

`random_seed`, if imported as is, will make the intrinsic `random_seed` unavailable. I think we should rename it.
Minor style nitpick: let's not use any non-newline code separators like

!---------------------------------------------------------------------------------------------
!                   Random seed
!---------------------------------------------------------------------------------------------

as they don't add value. I know, this is not mentioned in the style guide yet. A good rule of thumb is to just mimic the style of existing modules.

More later.

Jim-215-Fisher · 2020-10-12T15:15:04Z

This `random_seed` overloads the intrinsic and takes two arguments, which distinguishes it from the intrinsic. I thought it might be easier for users to use the same name.
I will remove those separators.

milancurcic · 2020-10-12T15:22:15Z

> This `random_seed` overloads the intrinsic and takes two arguments, which distinguishes it from the intrinsic. I thought it might be easier for users to use the same name.

Ah, okay, I didn't realize. I think this is fine if it extends the intrinsic one while preserving the original behavior.

jvdp1 · 2020-10-12T15:56:18Z

Thank you @Jim-215-Fisher for this PR. I will try to review it this week.

14NGiestas · 2020-10-12T21:21:16Z

src/stdlib_stats_distribution.fypp

+!---------------------------------------------------------------------------------------------
+!                   Uniform Distribution
+!---------------------------------------------------------------------------------------------


Suggested change

!---------------------------------------------------------------------------------------------

! Uniform Distribution

!---------------------------------------------------------------------------------------------

As per @milancurcic, these separators add no value and are not in style. I will remove them in the next round.

Jim-215-Fisher · 2020-10-14T00:54:34Z

Just uploaded a modified version: 1) added 64-bit random integer generator code; 2) modified the random_seed procedure to use the new generator instead of random_number; 3) modified the uniform distribution random generator; 4) added an inversion transformation procedure for the binomial random variate generator. I noticed that both the BTRD and BTPE algorithms taken from Alan Miller failed the chi-squared test.
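As a rough illustration of what an inversion transformation for binomial variates looks like, here is a minimal Python sketch. It is not the code from this PR: the function name `binomial_inversion` and its `rng` parameter are hypothetical, and the sketch assumes `n * p` is small enough that `(1 - p) ** n` does not underflow.

```python
import random


def binomial_inversion(n, p, rng=random.random):
    """Draw one Binomial(n, p) variate by sequential inversion of the CDF.

    Hypothetical helper for illustration; `rng` must return a uniform
    float in [0, 1).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0:
        return 0
    if p == 1.0:
        return n
    q = 1.0 - p
    ratio = p / q
    pmf = q ** n  # P(X = 0)
    cdf = pmf
    u = rng()
    k = 0
    # Walk up the CDF until it first reaches the uniform draw.
    while u > cdf and k < n:
        # Recurrence: P(X = k + 1) = P(X = k) * (n - k) / (k + 1) * p / q
        pmf *= ratio * (n - k) / (k + 1)
        cdf += pmf
        k += 1
    return k
```

The expected number of loop iterations grows with `n * p`, which is why inversion is usually paired with a rejection method such as BTPE or BTRD for large `n * p`.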

Jim-215-Fisher · 2020-10-14T13:01:27Z

It looks like the Ubuntu platform does not like the submodule: gcc 7, 8, and 9 all show the same error, a buffer overflow.

ivan-pi

Could it be that your hexadecimal constants are overflowing the int64 kind?

With gfortran you might need the flag -fno-range-check in that case (see here: #214 (comment)).

Jim-215-Fisher · 2020-10-14T15:44:18Z

I noticed that issue. I have changed that part into regular negative decimal constants. Right now, the files stats_distribution_rvs and stats_distribution compile, but stats_distribution_implementation fails to compile on Ubuntu. The same file compiles on Windows and Mac.
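The conversion from a hexadecimal constant to a negative decimal can be sketched as follows. This is an illustrative Python sketch, not code from the PR: the helper name `as_signed_int64` is hypothetical, and the sample bit pattern is chosen only because its top bit is set; the thread does not say which constants were involved.

```python
INT64_MAX = 2**63 - 1


def as_signed_int64(bits):
    """Reinterpret a 64-bit unsigned bit pattern as a two's-complement value.

    Patterns with the top bit set exceed the largest signed 64-bit integer,
    which is what triggers a range error when they are written as
    hexadecimal literals for a signed 64-bit kind.
    """
    if not 0 <= bits < 2**64:
        raise ValueError("expected a 64-bit pattern")
    return bits - 2**64 if bits > INT64_MAX else bits


# Illustrative constant only (assumption, not taken from the PR).
pattern = 0x9E3779B97F4A7C15
signed = as_signed_int64(pattern)
print(pattern > INT64_MAX, signed < 0)
```

Writing the constant as the resulting negative decimal keeps the same bit pattern while staying inside the signed 64-bit range, so no range-check flag is needed.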

Jim-215-Fisher · 2020-10-14T22:39:01Z

This is likely a bug in gcc. See Bug 91773, "Buffer overflow for long module/submodule names".

jvdp1

Thank you for this PR. Here are some comments on the specs.

Since there are mainly 3 very similar (in terms of API) parts (uniform, norm, and binomial), I am wondering if it would not be easier to submit/review 3 different small PRs instead of this large one. What is your opinion @certik @milancurcic @ivan-pi ?
Meanwhile I will first focus on the uniform part.

jvdp1 · 2020-10-20T18:47:52Z

doc/specs/stdlib_stats_distribution.md

+
+The probability density function of the continuous uniform distribution.
+
+![equation](https://latex.codecogs.com/gif.latex?f(x)=\begin{cases}\frac{1}{scale}&loc\leqslant&space;x<loc&plus;scale\\\\0&x<loc,or,x>loc&plus;scale\end{cases})


FORD supports Latex: see here for more details.
Something like (I didn't check the formula):

Suggested change

![equation](https://latex.codecogs.com/gif.latex?f(x)=\begin{cases}\frac{1}{scale}&loc\leqslant&space;x<loc+scale\\\\0&x<loc,or,x>loc+scale\end{cases})

\( \frac{1}{scale}&loc\leqslant&space;x<loc+scale\\\\0&x<loc,or,x>loc+scale )\

> FORD supports Latex: see here for more details.
> Something like (I didn't check the formula):

I tried FORD's LaTeX support, but it does not seem to work for my formulae.

OK, thanks for trying. I will try it later too. IMO it is preferable not to rely on a third-party app.
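For reference, the density encoded in the image URL above can be written as a display equation along these lines. This is only a sketch of the intended markup: the `otherwise` branch is an assumption that replaces the original second case, which leaves the point `x = loc + scale` uncovered, and whether FORD renders it depends on its math configuration.

```latex
$$
f(x) =
\begin{cases}
  \dfrac{1}{scale} & loc \leqslant x < loc + scale \\
  0                & \text{otherwise}
\end{cases}
$$
```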

jvdp1 · 2020-10-20T18:50:34Z

doc/specs/stdlib_stats_distribution.md

+
+Experimental
+
+### Description


Would it be possible to return a multi-dimensional array (e.g., a rank-3 array), as in Matlab?

> Would it be possible to return a multi-dimensional array (e.g., a rank-3 array), as in Matlab?

Yes. The functions are elemental, except for the random number generators, so they should work on multi-dimensional arrays.

It would be nice if the random number generator functions ("_rvs") also supported arrays up to rank 7 or 15 (depending on the compiler). This could be easily generated with Fypp (maybe in a next PR?).

The random number generator function itself has no issue with array support; it is elemental. The problem is how to initiate the function call. Normally, one uses one set of parameters to get either one random number or an array of them. This is how the current implementation works. If we want array support up to rank 15, then we need a mask array as an optional argument to the function call. If people like the idea, we can certainly implement it in the next round.

I was thinking of extending the rank-1 function to multiple ranks, e.g. with multiple loops. Therefore, there is no need for optional arguments (there would be rank+1 different procedures). But, as you mentioned, we can think about that in a next round if desired.
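The idea of building higher-rank results on top of a rank-1 generator can be sketched as below. This is a Python illustration under stated assumptions, not the proposed Fortran API: `rvs_rank1`, `reshape`, and `rvs` are hypothetical names, the uniform generator stands in for any distribution, and the row-major nesting is a Python convenience (Fortran arrays are column-major).

```python
import random
from math import prod


def rvs_rank1(loc, scale, size, rng=random.random):
    """Hypothetical rank-1 generator: `size` uniform variates on [loc, loc + scale)."""
    return [loc + scale * rng() for _ in range(size)]


def reshape(flat, shape):
    """Fold a flat list into nested lists with the given shape (row-major)."""
    if len(shape) == 1:
        return list(flat)
    step = prod(shape[1:])
    return [
        reshape(flat[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    ]


def rvs(loc, scale, shape, rng=random.random):
    """Draw prod(shape) variates with the rank-1 generator, then reshape them."""
    flat = rvs_rank1(loc, scale, prod(shape), rng)
    return reshape(flat, shape)


sample = rvs(0.0, 2.0, (2, 3, 4))  # a rank-3 result from one set of parameters
```

One set of distribution parameters still drives the whole call, so no optional mask argument is needed; only the shape of the result changes.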

Jim-215-Fisher · 2020-10-20T20:29:48Z

> Thank you for this PR. Here are some comments on the specs.
>
> Since there are mainly 3 very similar (in terms of API) parts (uniform, norm, and binomial), I am wondering if it would not be easier to submit/review 3 different small PRs instead of this large one. What is your opinion @certik @milancurcic @ivan-pi ?
> Meanwhile I will first focus on the uniform part.

These three are just the initial part of whole statistical distributions, there are more distributions coming. All of them have similar API.

peteroupc · 2020-10-22T04:15:10Z

Just for your information you should use algorithms avoiding floating-point arithmetic when possible (or at least make such implementations an option). For example, there are algorithms of the exponential distribution that use comparisons only (von Neumann's method, for one) as well as exact samplers of the Binomial distribution (e.g., Farach-Colton and Tsai, "Exact Sublinear Binomial Sampling"; the "Internal DLA" paper by Bringmann and others). Even the Poisson distribution can be sampled without using the exponential function (e.g. "On Buffon Machines and Numbers"). Perhaps the most familiar example of a distribution of this kind is the discrete (and continuous) normal distribution sampler by Karney. See also my section on "Specific Non-Uniform Distributions" in "Randomization and Sampling Methods".

In fact, as the authors of "Exact Sublinear Binomial Sampling" found, BTPE can oversample the tail of a binomial distribution (or at least the GNU Scientific Library implementation of BTPE at the time of the paper can).

For a simpler description of the Bringmann algorithm, see my page "Miscellaneous Observations on Randomization".
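To make the "comparisons only" remark concrete, here is a minimal Python sketch of von Neumann's method for the unit-rate exponential distribution. The function name and the `rng` parameter are hypothetical. Note that this sketch still draws floating-point uniforms; it only avoids logarithms and other transcendental functions, whereas the exact samplers referenced above compare lazily generated digits instead.

```python
import random


def exponential_von_neumann(rng=random.random):
    """Draw one Exp(1) variate using uniform draws and comparisons.

    Each trial generates a strictly descending run u1 > u2 > ... and accepts
    u1 when the run length is odd, which happens with probability
    exp(-u1). The number of rejected trials supplies the integer part.
    """
    rejected = 0
    while True:
        first = rng()
        prev = first
        run_length = 1
        while True:
            nxt = rng()
            if nxt >= prev:
                break  # the descending run has ended
            prev = nxt
            run_length += 1
        if run_length % 2 == 1:
            return rejected + first
        rejected += 1
```

Conditional on `u1 = x`, the run has odd length with probability `1 - x + x^2/2! - ... = exp(-x)`, so the accepted fractional part has density proportional to `exp(-x)` on `[0, 1)` and the rejection count is geometric.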

Jim-215-Fisher · 2020-10-24T03:16:13Z

> Just for your information you should use algorithms avoiding floating-point arithmetic when possible (or at least make such implementations an option). For example, there are algorithms of the exponential distribution that use comparisons only (von Neumann's method, for one) as well as exact samplers of the Binomial distribution (e.g., Farach-Colton and Tsai, "Exact Sublinear Binomial Sampling"; the "Internal DLA" paper by Bringmann and others). Even the Poisson distribution can be sampled without using the exponential function (e.g. "On Buffon Machines and Numbers"). Perhaps the most familiar example of a distribution of this kind is the discrete (and continuous) normal distribution sampler by Karney. See also my section on "Specific Non-Uniform Distributions" in "Randomization and Sampling Methods".
>
> In fact, as the authors of "Exact Sublinear Binomial Sampling" found, BTPE can oversample the tail of a binomial distribution (or at least the GNU Scientific Library implementation of BTPE at the time of the paper can).

Thanks for the info. I will take a look at it.

Jim-215-Fisher · 2020-12-03T22:48:51Z

I am going to update for the second round. Because of major change in module structure, should I initiate a new PR or just continue in this old one?

jvdp1 · 2020-12-04T09:38:44Z

> I am going to update for the second round. Because of major change in module structure, should I initiate a new PR or just continue in this old one?

Do the new changes contain the same info? If yes, I suggest updating this PR.

Jim-215-Fisher · 2020-12-17T17:24:04Z

> > I am going to update for the second round. Because of major change in module structure, should I initiate a new PR or just continue in this old one?
>
> Do the new changes contain the same info? If yes, I suggest updating this PR.

Instead of one big module plus one submodule, I split the statistical distributions into individual modules, which will be easier to maintain. The new update will also contain more distributions and new function calls.

jvdp1 · 2020-12-17T19:47:45Z

> > > I am going to update for the second round. Because of major change in module structure, should I initiate a new PR or just continue in this old one?
> >
> > Do the new changes contain the same info? If yes, I suggest updating this PR.
>
> Instead of one big module plus one submodule, I split the statistical distributions into individual modules, which will be easier to maintain. The new update will also contain more distributions and new function calls.

Thank you for the details. Following your description, I would propose that you submit one PR per module, if possible. It would be easier to review than a very long PR.

Jim-215-Fisher added 5 commits October 10, 2020 11:59

Initial round

a89e084

Initial implementation of probability distributions and statistical functions

initial round

65192dd

initial round

16ba045

Update CMakeLists.txt

7c0d1ce

Add files via upload

634af73

milancurcic requested review from milancurcic, jvdp1 and ivan-pi October 12, 2020 14:58

14NGiestas reviewed Oct 12, 2020

Jim-215-Fisher added 2 commits October 13, 2020 20:32

Add files via upload

2b20aa7

Add files via upload

f618aed

Add files via upload

d2f6092

ivan-pi reviewed Oct 14, 2020

Jim-215-Fisher added 5 commits October 14, 2020 19:35

change hexadecimal into negative decimal

f7a294f

clean up for stdlib_stats_distribution_rvs.fypp

Shortened the submodule name

829bba1

Changed the implementation submodule name to `stdlib_stats_distribution_imp`

change in Makefile.manual, object name from stdlib_stats_distribution…

48cbc1b

…_implementation.o to stdlib_stats_distribution_imp.o

correct mistake on line 68.

e5239b1

Remove the mistake on continuation line for src

5c04b24

jvdp1 reviewed Oct 20, 2020

Update doc/specs/stdlib_stats_distribution.md

1298a08

Co-authored-by: Jeremie Vandenplas <[email protected]>

Jim-215-Fisher and others added 12 commits October 20, 2020 17:56

Update doc/specs/stdlib_stats_distribution.md

a6fb136

Co-authored-by: Jeremie Vandenplas <[email protected]>

Update doc/specs/stdlib_stats_distribution.md

ee02901

Co-authored-by: Jeremie Vandenplas <[email protected]>

Update doc/specs/stdlib_stats_distribution.md

b3d2186

Co-authored-by: Jeremie Vandenplas <[email protected]>

Test LaTex

27d65e8

Test LaTex

5bcbf44

Test LaTex $$

db9766b

Test LaTex \(

9f5751c

Test LaTex \(

56cf1b1

Add files via upload

389535f

no change on Latex part

18e927f

Update doc/specs/index.md

1d58799

Co-authored-by: Jeremie Vandenplas <[email protected]>

minor change

52981fb

jvdp1 reviewed Oct 21, 2020

Update src/stdlib_stats_distribution.fypp

694ad8c

Jim-215-Fisher closed this Dec 19, 2020

Jim-215-Fisher deleted the stats_distribution branch December 19, 2020 18:17

Jim-215-Fisher restored the stats_distribution branch December 20, 2020 05:10

Jim-215-Fisher mentioned this pull request Dec 29, 2020

Probability Distribution and Statistical Functions -- Beta Distribution Module #286

Draft

Beliavsky mentioned this pull request Jun 1, 2021

Which non-uniform random number generators should be in the standard? j3-fortran/fortran_proposals#210

Open

Jim-215-Fisher deleted the stats_distribution branch October 7, 2021 00:16
