[STDLIB_STATS] need to upgrade stdlib_stats codes about compilation efficiency #438

Closed
zoziha opened this issue Jun 16, 2021 · 7 comments
Labels
bug (Something isn't working), build: cmake (Issue with stdlib's CMake build files), documentation (Improvements or additions to documentation)

Comments

@zoziha
Contributor

zoziha commented Jun 16, 2021

Overview: Compilation time is too long.

While compiling, I found that stdlib_stats uses a lot of computer resources, especially RAM. This is related to the high array ranks supported in stdlib_stats; it greatly reduces the build efficiency of stdlib and increases its overall compilation time.

It took my computer (CPU: Intel i5-8250U) more than two hours to compile stdlib completely.
With RANK=15, the compiled size of stdlib reached 747 MB.

I took a quick look at the source code and think there may be a better approach than a polymorphic interface with such a large number of multi-dimensional array arguments.
(see high-dimensional matrix dimensions)
(see RANK)

My understanding is: this needs a rethink and should be more flexible.

In Fortran, the extent of a single dimension can in principle grow without limit, but the number of dimensions (the rank) has to be fixed explicitly by the user.
In the future, we will also build a large number of functions that use matrices. The current implementation of stdlib_stats is unreasonable, not adaptable, and needs to be improved (see stdlib_stats_moment.fypp).

stdlib_stats predefines procedures for each supported rank to form a polymorphic interface, and adds several conditional checks (see condition judgments) on the dimension being processed, which slows down compilation and increases the compilation load.

#281
#283

My solution is: Set up a matrix parser, or use a single-dimensional matrix algorithm.

If no interaction between the different dimensions is needed, we can achieve the same effect by providing only the one-dimensional (column vector) version and leaving the dimension-specific operations to the user, which would improve the versatility and flexibility of stdlib.
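A minimal sketch of the one-dimensional idea, written in Python rather than Fortran purely for illustration. The names `mean_1d` and `flatten` are hypothetical and not part of stdlib; the assumption is that the library would supply only the rank-1 routine and the caller would reshape or slice the data.

```python
from typing import Iterable, List, Sequence


def mean_1d(values: Sequence[float]) -> float:
    """The only routine the library would provide: mean of a rank-1 sequence."""
    return sum(values) / len(values)


def flatten(nested: Iterable) -> List[float]:
    """Caller-side helper: collapse nested lists of any depth into one list."""
    flat: List[float] = []
    for item in nested:
        if isinstance(item, (list, tuple)):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Whole-array mean: flatten first, then call the rank-1 routine.
total_mean = mean_1d(flatten(matrix))

# Mean along one dimension: the caller has to slice the data itself.
row_means = [mean_1d(row) for row in matrix]
column_means = [mean_1d(col) for col in zip(*matrix)]
```

The whole-array case stays simple, but reductions along a dimension move the slicing work to the caller, which is the trade-off of this approach.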

Or we could use the wiki solution in stdlib: set up a matrix parser and transform the array when necessary to meet the polymorphic needs of multi-dimensional arrays.

I have seen another library, and its solution is also good: muesli!


I don't know much more about stdlib_stats, so my idea may have limitations. However, I think the multi-dimensional array polymorphic interface in stdlib_stats needs to be improved.
I hope this starts a discussion. Thank you all! 😍

@zoziha zoziha added the bug (Something isn't working) label Jun 16, 2021
@zoziha zoziha changed the title from "[STDLIB_STATS] need to upgrade stdlib_stats codes" to "[STDLIB_STATS] need to upgrade stdlib_stats codes about compilation efficiency" Jun 16, 2021
@jvdp1
Member

jvdp1 commented Jun 16, 2021

Thank you for trying stdlib. This issue has often been raised in the past and is mentioned in the README (see the section Build with CMake). A solution is to limit the maximum rank to, e.g., 7.
I recognize that it could be highlighted more in the README.

The aim of stdlib_stats is to provide procedures related to descriptive statistics for arrays (e.g., for computing means, variances, standard deviations, and moments of array elements), similar to what is available in Matlab, Julia, ...

The API of the functions in stdlib_stats is the same as (or at least very similar to) that of the intrinsics, e.g., sum. As such, the procedures in stdlib_stats could be considered extensions of the intrinsic sum, IMO.

Due to the "complexity" of the API of sum (i.e., it supports arrays from rank 1 to 15 of all integer, real, and complex types, an argument dim, and a scalar or array mask), and due to the lack of generics in Fortran, the number of functions generated for a single generic procedure (e.g., mean) is quite large. fypp was quite helpful for generating all the needed functions.
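To give a rough sense of that scale, here is a back-of-the-envelope count in Python; it is not the actual fypp logic. The kind lists (four integer kinds, three real, three complex) and the four variants per type and rank (whole-array or along dim, each with or without an array mask) are assumptions made for illustration, and the real templates may generate a different set.

```python
# Rough estimate of how many specific procedures one generic name needs.
# The kind lists and the variant count are illustrative assumptions,
# not values read from the stdlib sources.
ASSUMED_KINDS = {
    "integer": ["int8", "int16", "int32", "int64"],
    "real": ["sp", "dp", "qp"],
    "complex": ["sp", "dp", "qp"],
}
ASSUMED_VARIANTS = 4  # whole array / along dim, each with / without mask


def count_specifics(max_rank: int) -> int:
    """Number of (type, kind, rank, variant) combinations up to max_rank."""
    n_type_kinds = sum(len(kinds) for kinds in ASSUMED_KINDS.values())
    return n_type_kinds * max_rank * ASSUMED_VARIANTS


if __name__ == "__main__":
    for max_rank in (4, 7, 15):
        print(f"max rank {max_rank:2d}: {count_specifics(max_rank)} specific procedures")
```

Under these assumptions the count grows linearly with the maximum rank, so going from 15 down to 4 cuts the number of specific procedures per generic name by a factor of about four. Compile time and memory use need not scale the same way.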

I don't think that a solution like the one proposed by muesli would be appropriate for this, because the aim was to provide procedures for Fortran arrays (but I may be wrong; at least that is how I find stdlib_stats useful for my daily work), and not for a derived type, e.g., one provided by stdlib. I am not sure I understand the two other solutions.

Anyway, I agree that compilation of this part can be an issue, and that it could get worse later with the introduction of new functions in stdlib_stats with an API similar to mean (I have at least 3 more in mind).

@zoziha
Contributor Author

zoziha commented Jun 16, 2021

Thanks, I understand. I don't know much about C/C++. Is it possible to implement generics through the interfaces between these languages and Fortran? As far as I know, Fortran comes with functions such as real(integer, kind) whose arguments are all integers, but which can return reals of different precisions; this is difficult to achieve with Fortran's existing syntax.
Is it possible that Fortran will support this kind of generic programming in some form in the future? 😁

@epagone

epagone commented Jun 16, 2021

Hi @zoziha, concerning your last question, you might want to have a look here 😉

@ghost

ghost commented Jun 21, 2021

I think -DCMAKE_MAXIMUM_RANK should be 4 by default, so that stdlib works out of the box for almost everybody. This is especially useful for new users who are not familiar with stdlib.

@jvdp1
Member

jvdp1 commented Jun 21, 2021

> I think -DCMAKE_MAXIMUM_RANK should be 4 by default, that means stdlib will work by default for almost everybody. This is specially useful for new users who are not familiar with stdlib.

This is indeed a good idea. I am for it. @awvwgk @milancurcic @ivan-pi what is your opinion about making -DCMAKE_MAXIMUM_RANK=4 the default value?

@awvwgk
Member

awvwgk commented Jun 21, 2021

I think that just because we can doesn't mean we have to compile with full rank support, especially since the stats modules get quite compilation-intensive for no good reason. I'm usually compiling with 4 anyway, sometimes with 7 if I make system-wide installations, but I have yet to exceed rank 4 in any actual application of stdlib. The CMake template for stdlib also reduces the max rank by default. A max rank of 4 sounds like a more sensible default.

I'm still looking forward to packaging stdlib once we start putting a version on it. There, a maximum rank higher than 4 might be much more relevant, because end users can't recompile if they depend on a binary distribution.

@awvwgk awvwgk added the build: cmake (Issue with stdlib's CMake build files) and documentation (Improvements or additions to documentation) labels Sep 18, 2021
@awvwgk
Member

awvwgk commented Sep 18, 2021

Resolved by changing the default maximum rank in the CMake build files
