Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failure with -fbounds-check #2657

Closed
orlitzky opened this issue Jun 11, 2020 · 9 comments
Closed

Test failure with -fbounds-check #2657

orlitzky opened this issue Jun 11, 2020 · 9 comments

Comments

@orlitzky
Copy link

The -fbounds-check or -fcheck=bounds flags to gfortran

Enable generation of run-time checks for array subscripts and against the declared minimum and maximum values. It also checks array indices for assumed and deferred shape arrays against the actual allocated bounds and ensures that all string lengths are equal for character array constructors without an explicit typespec.

When I build OpenBLAS with FFLAGS="-fbounds-check" FCFLAGS="-fbounds-check", I get a reproducible test failure (on amd64):

TEST 27/27 kernel_regress:skx_avx At line 211 of file dgesvd.f
Fortran runtime error: Actual string length is shorter than the declared one for dummy argument 'jobvt' (-4618395555133915136/1)
make[1]: *** [Makefile:45: run_test] Error 2
make[1]: Leaving directory '/var/tmp/portage/sci-libs/openblas-0.3.9/work/openblas-0.3.9/utest'
make: *** [Makefile:126: tests] Error 2

I'm not sure if it's a problem with the test suite or the file itself. A user reported this to us at https://bugs.gentoo.org/726474

gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Jun 11, 2020
One OpenBLAS test fails when you enable bounds checks in your fortran
compiler via e.g. either "-fbounds-check" or "-fcheck=bounds" with
gfortran. This was reported upstream at

  OpenMathLib/OpenBLAS#2657

but in the meantime the easiest thing to do for *our* users is to
filter out those flags when USE=test is set. Thanks to Bernd for
reporting the problem on bug 726474.

Closes: https://bugs.gentoo.org/726474
Package-Manager: Portage-2.3.99, Repoman-2.3.22
Signed-off-by: Michael Orlitzky <[email protected]>
@martin-frbg
Copy link
Collaborator

Not sure I understand this (unless it is a side effect of tne general Fortran/C interfacing problem from #2154), in particular why would it flag jobvt but not the identically declared jobu here ? FWIW valgrind does not complain. Which version of gfortran and what other build options do you use ? (One possibility could be that your FFLAGS override Makefile defaults, e.g. dropping the -fno-optimize-sibling-calls used to work around the interface issue)

@martin-frbg
Copy link
Collaborator

FFLAGS were overridden as intended by adding required flags in Makefile.system, but the make.inc produced for the lapack-netlib (Reference-LAPACK) subtree got these without the buildtime override. So LAPACK got built with your "-fbounds-check" alone (dropping both the mandatory -frecursive and the -fno-optimize-sibling-calls).
With that fixed however I now get a similar - and similarly spurious - bounds check failure for a zpotrs call in test 33 (after similar calls to potrf in the same file completed without issues), which
I can only get around by adding the hidden argument for the size of the character argument. Clearly this requires changing the signature of zpotrs in common_interface.h , which would break the established API. (Which is why #2154 is still open, as Reference-LAPACK has not adopted a solution so far and GCC has backtracked on requiring hidden length arguments for character(1))

@orlitzky
Copy link
Author

Did I mess up my testing somehow? If I understand you correctly, you're saying there are two different problems:

  1. By passing FFLAGS to the build, the important -frecursive and -fno-optimize-sibling-calls flags are accidentally clobbered in some places.
  2. There's another unrelated test failure due to the bounds checking.

While testing -- to ensure that it was really the bounds-checking flag that was causing the problem -- I ran the same exact command with and without the bounds-checking flags. This command (inspired by the user's report) failed before gentoo/gentoo@57e2be1:

root # FFLAGS="-O2 -pipe -march=native -fstack-protector-strong -fstack-check -fbounds-check" FCFLAGS="-O2 -pipe -march=native -fstack-protector-strong -fstack-check -fbounds-check" USE=test FEATURES=test emerge -v1 =sci-libs/openblas-0.3.9

After that commit, which only filters out the -fbounds-check flag, the same command works. If passing any FFLAGS clobbered the -frecursive and -fno-optimize-sibling-calls flags, wouldn't they still get clobbered in my example above? After filtering the bounds-checking flag, we still wind up with FFLAGS="-O2 -pipe -march=native -fstack-protector-strong -fstack-check".

@martin-frbg
Copy link
Collaborator

martin-frbg commented Jun 11, 2020

Yes. For (1) the problem lies with the way the Reference-LAPACK source is docked into the OpenBLAS build system - we still utilize their approach of creating a definitions file (make.inc) with compiler settings and its FFLAGS get overriden by yours. The effects of missing -frecursive and -fno-optimize-sibling-calls will not be immediately obvious, but both can lead to wrong results or strange crashes.
For item (2) I am now playing with a new #ifdef HIDDENARGS that adds a set of alternate definitions for the LAPACK functions where the additional length arguments are appended. Defining this in the affected tests should solve the -fbounds-check` problem while user code
would still be able to use the conventional interface (that only gfortran 8 & 9 took issue with)

@orlitzky
Copy link
Author

Thanks for clarifying. What's bothering me is, why does gentoo/gentoo@57e2be1 appear to fix the problem? All I've done is remove the -fbounds-check flag from the user's FFLAGS. That still leaves all the other FFLAGS, so I would expect -frecursive and -fno-optimize-sibling-calls to still be overridden. I can't say for sure that that's not the case (I'm not 100% sure which lines to look at in the build log), but the build failures do disappear after stripping only -fbounds-check from FFLAGS="-O2 -pipe -march=native -fstack-protector-strong -fstack-check -fbounds-check".

@martin-frbg
Copy link
Collaborator

It fixes the problem as the only visible problem leading to your build failure was in the bounds check. You would notice the missing -frecursive only in some specific LAPACK functions when called in user code, and the missing -fno-optimize-sibling-calls could "only" lead to rare and hard to reproduce crashes (stack corruptions) in user code when compiling with gfortran version 8.x or 9.x The fix for the unwanted override would be to change the "FFLAGS=" to "override FFLAGS=" in the "lapack_prebuild" section of the toplevel Makefile in OpenBLAS, or to add the two gfortran options to your own FFLAGS.
(Proper fix for the bounds-check fault - instead of just stripping the option - is harder than I hoped
as I cannot find a solution that does not affect most of the upstream LAPACKE code)

@orlitzky
Copy link
Author

Thanks again. I took your advice and added "override" to that line. I also forced a rebuild of OpenBLAS for all of our users, in case any had compiled it with FFLAGS set in the past.

@martin-frbg
Copy link
Collaborator

Added an entry to the FAQ in the wiki

@thesamesam
Copy link

Note that this seems to have been done in 2d33e12 for #3390.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants