OpenBLAS build fails due to segmentation fault on Power8 platform #2166

Closed
kavanabhat opened this issue Jun 19, 2019 · 6 comments · Fixed by #2167

@kavanabhat
Contributor

The OpenBLAS build fails for the POWER8 target in 64-bit mode (on both Red Hat big-endian and AIX). Details are below:

#make DEBUG=1 BINARY=64 TARGET=POWER8
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x3FFF95A692C3
#1 0x3FFF95A69E23
#2 0x3FFF95CB0477
#3 0x10078920 in dtrmm_ounucopy at trmm_uncopy_4.c:93
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./cblat2 < ./cblat2.dat
OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./zblat2 < ./zblat2.dat
/bin/sh: line 1: 54413 Segmentation fault (core dumped) OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 ./dblat3 < ./dblat3.dat
make[1]: *** [level3] Error 139
make[1]: *** Waiting for unfinished jobs....
rm -f ?BLAT2.SUMM
OMP_NUM_THREADS=2 ./sblat2 < ./sblat2.dat
OMP_NUM_THREADS=2 ./dblat2 < ./dblat2.dat
OMP_NUM_THREADS=2 ./cblat2 < ./cblat2.dat
OMP_NUM_THREADS=2 ./zblat2 < ./zblat2.dat
make[1]: Leaving directory `/home/kavana/OpenBLAS/test'
make: *** [tests] Error 2
[root@pokndd8 OpenBLAS]# cd test
[root@pokndd8 test]# gdb ./dblat3
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-100.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "ppc64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /home/kavana/OpenBLAS/test/dblat3...done.
(gdb) run < ./dblat3.dat
Starting program: /home/kavana/OpenBLAS/test/./dblat3 < ./dblat3.dat
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x3ffeb391f140 (LWP 54472)]
[New Thread 0x3ffeb311f140 (LWP 54473)]

.......

Program received signal SIGSEGV, Segmentation fault.
dtrmm_ounucopy (m=4294967297, n=12, a=0x3fffffff5ff0, lda=2, posX=0, posY=1, b=) at generic/trmm_uncopy_4.c:93
93 b[ 0] = data01;
(gdb) where
#0 dtrmm_ounucopy (m=4294967297, n=12, a=0x3fffffff5ff0, lda=2, posX=0, posY=1, b=) at generic/trmm_uncopy_4.c:93
#1 0x0000000010014f14 in dtrmm_RNUU (args=, range_m=, range_n=, sa=0x3ffeb3920000,
sb=0x3ffeb3cc0000, dummy=) at trmm_R.c:254
#2 0x000000001000fa64 in dtrmm_ (SIDE=, UPLO=, TRANS=, DIAG=,
M=, N=, alpha=, a=, ldA=0x3ffffffb3cd8, b=0x3ffffffe57e0, ldB=0x3ffffffb3cd4)
at trsm.c:381
#3 0x0000000010005bd8 in dchk3 (sname='DTRMM ', eps=2.2204460492503131e-16, thresh=16, nout=6, ntra=-1, trace=.FALSE., rewi=.FALSE.,
fatal=.FALSE., nidim=6, idim=..., nalf=3, alf=..., nmax=65, a=..., aa=..., as=..., b=..., bb=..., bs=..., ct=..., g=..., c=..., _sname=6)
at dblat3.f:1059
#4 0x000000001000e9a4 in dblat3 () at dblat3.f:292
#5 0x000000001000180c in main (argc=, argv=) at dblat3.f:355
#6 0x00003fffb7a06bec in generic_start_main (main=@0x100a0190: 0x100017e0, argc=, ubp_av=0x3ffffffff4f8,
auxvec=0x3ffffffff5e8, init=, rtld_fini=, stack_end=, fini=)
at ../csu/libc-start.c:274
#7 0x00003fffb7a06e14 in __libc_start_main (argc=, ubp_av=, ubp_ev=, auxvec=,
rtld_fini=, stinfo=, stack_on_entry=) at ../sysdeps/unix/sysv/linux/powerpc/libc-start.c:91
#8 0x0000000000000000 in ?? ()
(gdb)

@kavanabhat
Contributor Author

This issue is observed in the develop branch, but it seems to be fixed in the master branch. The segmentation fault is caused by stack corruption in kernel/power/dtrmm_kernel_16x4_power8.S. Closing this issue, as it is fixed in the master branch.
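
Two numbers in the logs above can be decoded mechanically, and doing so is consistent with the corruption described here. The sketch below is only a reading aid, not something taken from the OpenBLAS sources: the helper names `split_64` and `signal_from_exit_status` are made up for illustration. It splits the `m=4294967297` argument shown by gdb into 32-bit halves and maps make's `Error 139` to a signal number, assuming the usual shell convention of reporting 128 plus the signal number.

```python
import signal


def split_64(value):
    """Return (high 32 bits, low 32 bits) of a 64-bit unsigned value."""
    return (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF


def signal_from_exit_status(status):
    """Map a shell exit status above 128 to the signal that ended the process.

    Assumes the common shell convention: status = 128 + signal number.
    """
    if status <= 128:
        return None
    return signal.Signals(status - 128)


# Value of the `m` argument of dtrmm_ounucopy as printed by gdb.
m = 4294967297
high, low = split_64(m)
print(hex(m), high, low)                   # 0x100000001 1 1

# Exit status reported by make for the level3 test target.
print(signal_from_exit_status(139).name)   # SIGSEGV
```

The low half of `m` is 1 and the high half has a single stray bit set. For a test driver whose backtrace shows `nmax=65`, a row count above 2^32 is not plausible, so the value looks like a 64-bit argument whose upper half was overwritten. That interpretation is an inference from the log, not a statement made in this thread.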

@martin-frbg
Collaborator

martin-frbg commented Jun 19, 2019

The master branch is actually stale (and does not provide optimized assembly kernels for POWER8), but the POWER8 code in develop (and in the 0.3.6 release) is little-endian only. See #1997.

@kavanabhat
Contributor Author

Thanks for the information. The issue is not specific to endianness but to the application mode. When compiled in 64-bit mode, this bug will definitely corrupt the stack, even on little-endian systems. So I will reopen this.

@kavanabhat kavanabhat reopened this Jun 19, 2019
@martin-frbg
Collaborator

martin-frbg commented Jun 19, 2019

Then I assume this must have been caused by (or at least be related to) my PR #1317 from almost two years ago. I do wonder why it would have gone unnoticed for so long, as I would think 64-bit is what is typically used on that platform? @quickwritereader
(And I do need to correct what I wrote earlier: while the master branch is stale at approximately v0.2.20, it already had the POWER8 code written by wernsaar. Unfortunately, around two years ago it was noticed that this code contained several systematic bugs, including clobbering the VSX registers. An IBM developer submitted a partial fix, and I tried to apply "corresponding" changes to the remaining files back then, without access to actual hardware and with only a limited understanding of PPC assembly.)

@martin-frbg
Collaborator

Thanks for the fix. So it was essentially a stray line, somehow duplicated from the code I was editing. ☹️

martin-frbg added a commit that referenced this issue Jun 19, 2019

Fix DTRMMKERNEL register save for power8 64-bit mode (Fix for #2166)
@quickwritereader
Contributor

As far as I know, @martin-frbg, your fixes worked and I did not have a problem on little endian. Still, fixes are welcome. For POWER9 I was almost rewriting the entry, but just for LE.
