DTRMM doesn't give correct results #951

Closed
Sbte opened this issue Aug 29, 2016 · 29 comments

@Sbte
Contributor

Sbte commented Aug 29, 2016

DTRMM doesn't give correct results for me. I built with CMake on Intel Skylake, and managed to chase the calls down to

dtrmm_RNUN (args=0x7fffffffdb70, range_m=0x0, range_n=0x0, sa=0x7ffff3ffa000, sb=0x7ffff40fa000, 
    dummy=0) at OpenBLAS/driver/level3/trmm_R.c:65

I tested it on a random 10x10 matrix. I called it like this from a C program

    dtrmm_("Right", "U", "No transpose", "Non-unit", &n, &n, &one, A, &n, X, &n);

The residual when comparing to the correct result looks like this:

0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 -0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.000000 -0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 -0.685723 -0.880845 -0.714575 0.161275 0.000000 -0.000000 
0.000000 0.000000 0.000000 0.000000 -0.000000 0.000000 0.000000 -0.000000 0.000000 -0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 -0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 -0.533757 -0.751237 -0.555078 -0.153975 0.000000 0.000000
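
For anyone who wants to reproduce a residual like the one above without trusting any optimized kernel, here is a minimal sketch of a naive reference for this exact call shape (Right, Upper, No transpose, Non-unit, column-major). The names ref_dtrmm_rnun and max_abs_residual are made up for illustration and are not part of OpenBLAS; the idea is to run dtrmm_ on one copy of X, run the reference on another copy, and compare.

```c
#include <assert.h>
#include <stddef.h>

/* Naive reference for B := alpha * B * A (side = Right, uplo = Upper,
   no transpose, non-unit diagonal). A is n x n, B is m x n, both stored
   column-major. Only the upper triangle of A is read.
   Hypothetical helper, not an OpenBLAS function. */
static void ref_dtrmm_rnun(int m, int n, double alpha,
                           const double *a, int lda, double *b, int ldb)
{
    assert(m >= 0 && n >= 0 && lda >= n && ldb >= m);
    /* Column j of the result needs columns 0..j of the original B,
       so walk j downwards to update B in place. */
    for (int j = n - 1; j >= 0; --j) {
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int k = 0; k <= j; ++k)
                sum += b[i + (size_t)k * ldb] * a[k + (size_t)j * lda];
            b[i + (size_t)j * ldb] = alpha * sum;
        }
    }
}

/* Largest absolute entry of (computed - reference); both are m x n,
   column-major, with the same leading dimension ld. */
static double max_abs_residual(int m, int n, const double *computed,
                               const double *reference, int ld)
{
    double worst = 0.0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double d = computed[i + (size_t)j * ld] - reference[i + (size_t)j * ld];
            if (d < 0.0) d = -d;
            if (d > worst) worst = d;
        }
    return worst;
}
```

With random inputs a residual around rounding error is expected; entries of order 0.1 to 1, as in the rows above, point to a real bug rather than floating-point noise.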

Edit: Since the thread is quite long, I put a short summary here.

This seems to be caused by the kernel/x86_64/dgemm_kernel_4x8_haswell.S file that was committed in 9bd962f. On Haswell, this file is only used for the TRMM implementation when USE_TRMM is not set (see Makefile.L3). That same commit set USE_TRMM to 1 on Haswell in Makefile.L3, but the same was not done for the CMake build. This was fixed in bcfc298. What is left is that the TRMM implementation in kernel/x86_64/dgemm_kernel_4x8_haswell.S is still broken. That implementation is currently not used if everything is built correctly.

@martin-frbg
Collaborator

Which version comes with Ubuntu 16.04 then? 0.2.18, from what I found on the 'net? As far as I can see, the only non-trivial change in trmm_R.c happened just before the 0.2.18 release but should have affected "only" loop unrolling for the GEMM function(s). (Downgrading to 0.2.17 might tell.)
Apart from that, only changes to the cpu-specific optimized GEMM functions would appear to be able to cause this. As far as I know, Skylake is treated identically to Haswell internally, so perhaps you could try building OpenBLAS with TARGET=SANDYBRIDGE or even TARGET=NEHALEM to see if the error goes away?

@Sbte
Contributor Author

Sbte commented Aug 29, 2016

The version I built by myself used the latest version from git. If I build with the standard makefile and TARGET=SANDYBRIDGE everything works, because apparently the reference implementation is used? At least that is what my gdb trace looked like. The CMake build has never worked since it was introduced. I did a git bisect to confirm.

So I think in the latest Ubuntu, they also used CMake to build OpenBLAS, hence the error.

@martin-frbg
Collaborator

Not sure if that means that the cmake setup is to blame, or does building with plain "make" without specifying a target (which I guess would end up picking HASWELL) get you a working version as well?
The alternative explanation would be that there is some hidden flaw in the Haswell-specific implementation of "something" called by dtrmm, probably dgemm, i.e. kernel/x86_64/dgemm_kernel_4x8_haswell.S. If you have the time and ambition, you could try putting the equivalent Sandy Bridge file in its place by changing the DGEMMKERNEL line in kernel/x86_64/KERNEL.HASWELL to read dgemm_kernel_4x8_sandy.S like it is in KERNEL.SANDYBRIDGE.

@Sbte
Contributor Author

Sbte commented Aug 31, 2016

Just using the makefile and no arguments also works (and gives the correct results). I used SANDYBRIDGE earlier because while debugging I found that earlier versions require the target option. So it's not that.

@martin-frbg
Collaborator

That would (probably) mean that you get a different config.h and Makefile.conf when you use cmake ? (If so, the implications might go beyond just the Skylake architecture)

@martin-frbg
Collaborator

Could be that something like the changes to the default x86_64 KERNEL file choices for gemv functions from acdff55 would need propagating to cmake/kernel.cmake (but I am not sure I even understand how the cmake build mechanism works).

@brada4
Contributor

brada4 commented Sep 10, 2016

Can you compute the residuals with all-zero and all-one inputs? (i.e. 1+1 is easier to eyeball than rand()+rand())

@Sbte
Contributor Author

Sbte commented Sep 10, 2016

I'm quite busy for at least 10 more days. After that I will further investigate this bug.

@brada4
Contributor

brada4 commented Sep 10, 2016

It is hard to guess what is behind 0x7fff ... pointers (you can eXamine the address to get value)

@Sbte
Contributor Author

Sbte commented Oct 17, 2016

> That would (probably) mean that you get a different config.h and Makefile.conf when you use cmake ? (If so, the implications might go beyond just the Skylake architecture)

It's cmake, so I don't think these files are generated since it does out-of-source builds.

> Can you compute the residuals with all-zero and all-one inputs? (i.e. 1+1 is easier to eyeball than rand()+rand())

0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 -4.000000 -4.000000 -4.000000 -4.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 -4.000000 -4.000000 -4.000000 -4.000000 0.000000 0.000000 
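
The all-ones case has a closed-form answer, which makes the bad entries easy to spot. Assuming X is all ones, the upper triangle of A is all ones and alpha is 1 (the exact inputs are not spelled out above, so this is an assumption), entry (i, j) of X*A is just the number of nonzeros in column j of A, i.e. j + 1 with 0-based j. A minimal sketch of a check built on that, with a made-up helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Count entries of an m x n column-major result that differ from the
   closed-form all-ones answer (j + 1 for 0-based column j) by more
   than tol. Hypothetical helper for illustration only. */
static int count_all_ones_mismatches(int m, int n, const double *b,
                                     int ldb, double tol)
{
    assert(m >= 0 && n >= 0 && ldb >= m && tol >= 0.0);
    int bad = 0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double d = b[i + (size_t)j * ldb] - (double)(j + 1);
            if (d < 0.0) d = -d;
            if (d > tol) ++bad;
        }
    return bad;
}
```

Because every correct entry is a small integer, any nonzero count here is a genuine error and not a rounding effect.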

> It is hard to guess what is behind 0x7fff ... pointers (you can eXamine the address to get value)

(gdb) p *(double *)sa
$1 = 6.9533468655408825e-310
(gdb) p *(double *)sb
$2 = 6.9533469173474203e-310
(gdb) p *arg
$3 = {a = 0x63a130, b = 0x63a460, c = 0x0, d = 0x7ffff7560b2f <_IO_new_file_write+143>, alpha = 0x7ffff78ac620 <_IO_2_1_stdout_>, beta = 0x7fffffffdb38, m = 10, n = 10, k = 0, lda = 10, ldb = 10, 
  ldc = 140737346455072, ldd = 10, common = 0x4017b0 <_start>, nthreads = 2}
(gdb) p *(double *)arg.alpha
$4 = 2.0861575199901636e-314
(gdb) p *(double *)arg.beta
$5 = 1

Edit: Sorry this last part was with threading enabled. However, with 1 thread the results are the same, except that nthreads=1 and the arguments argument is named args instead of arg.

@Sbte
Contributor Author

Sbte commented Oct 17, 2016

Note that the tests also fail for me:

 ******* FATAL ERROR - COMPUTED RESULT IS LESS THAN HALF ACCURATE *******
           EXPECTED RESULT   COMPUTED RESULT
       1     -0.142857         -0.292707    
      THESE ARE THE RESULTS FOR COLUMN   5
 ******* DTRMM  FAILED ON CALL NUMBER:
    794: DTRMM ('L','U','N','U',  1, 31, 1.0, A,  2, B,  2)        .

@martin-frbg
Collaborator

Can you produce a call tree of OpenBLAS functions up to (or used by) the failing function? If the problem occurs only with cmake builds and cmake does not use any of the config.h mechanism of the pure "make" build system, I can only guess that some difference in default function selection through the defines in kernel.cmake must be the culprit. (Though all the obvious candidates from the relatively recent changes to x86_64/KERNEL appear to be overridden by optimized functions in KERNEL.HASWELL.)
Or another idea: could you run the offending code under valgrind or some other memory debugger to see if that flags any out-of-bounds accesses similar to #695, #783?

@Sbte
Contributor Author

Sbte commented Oct 17, 2016

valgrind doesn't give anything. Checking what functions are called gives something interesting. The working version calls this:

Breakpoint 32 at 0x43df41: file kernel/generic/trmm_lncopy_4.c, line 51.
int dtrmm_olnncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 33 at 0x43cff1: file kernel/generic/trmm_lncopy_4.c, line 51.
int dtrmm_olnucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 34 at 0x429130: file kernel/generic/trmm_lncopy_8.c, line 58.
int dtrmm_ilnncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 35 at 0x425969: file kernel/generic/trmm_lncopy_8.c, line 58.
int dtrmm_ilnucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 36 at 0x441d81: file kernel/generic/trmm_ltcopy_4.c, line 51.
int dtrmm_oltncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 37 at 0x440de8: file kernel/generic/trmm_ltcopy_4.c, line 51.
int dtrmm_oltucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 38 at 0x437622: file kernel/generic/trmm_ltcopy_8.c, line 58.
int dtrmm_iltncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 39 at 0x433dc0: file kernel/generic/trmm_ltcopy_8.c, line 58.
int dtrmm_iltucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 40 at 0x43c01e: file kernel/generic/trmm_uncopy_4.c, line 51.
int dtrmm_ounncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 41 at 0x43b083: file kernel/generic/trmm_uncopy_4.c, line 51.
int dtrmm_ounucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 42 at 0x421f4e: file kernel/generic/trmm_uncopy_8.c, line 58.
int dtrmm_iunncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 43 at 0x41e6f6: file kernel/generic/trmm_uncopy_8.c, line 58.
int dtrmm_iunucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 44 at 0x43fe41: file kernel/generic/trmm_utcopy_4.c, line 51.
int dtrmm_outncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 45 at 0x43eec3: file kernel/generic/trmm_utcopy_4.c, line 51.
int dtrmm_outucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 46 at 0x430357: file kernel/generic/trmm_utcopy_8.c, line 58.
int dtrmm_iutncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 47 at 0x42caf0: file kernel/generic/trmm_utcopy_8.c, line 58.
int dtrmm_iutucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 8 at 0x414025: file kernel/x86_64/../generic/gemm_ncopy_4.c, line 52.
int dgemm_oncopy(BLASLONG, BLASLONG, double *, BLASLONG, double *);
Breakpoint 9 at 0x40fd97: file kernel/x86_64/../generic/gemm_ncopy_8.c, line 68.
int dgemm_incopy(BLASLONG, BLASLONG, double *, BLASLONG, double *);
Breakpoint 10 at 0x41485a: file kernel/x86_64/../generic/gemm_tcopy_4.c, line 53.
int dgemm_otcopy(BLASLONG, BLASLONG, double *, BLASLONG, double *);
Breakpoint 11 at 0x4112e2: file kernel/x86_64/../generic/gemm_tcopy_8.c, line 69.
int dgemm_itcopy(BLASLONG, BLASLONG, double *, BLASLONG, double *);

and the non-working version calls this

Breakpoint 66 at 0x426450: file kernel/generic/trmm_lncopy_4.c, line 42.
int dtrmm_ilnncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 67 at 0x425c60: file kernel/generic/trmm_lncopy_4.c, line 42.
int dtrmm_ilnucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 68 at 0x428a00: file kernel/generic/trmm_lncopy_8.c, line 42.
int dtrmm_olnncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 69 at 0x426bb0: file kernel/generic/trmm_lncopy_8.c, line 42.
int dtrmm_olnucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 70 at 0x42fd70: file kernel/generic/trmm_ltcopy_4.c, line 42.
int dtrmm_iltncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 71 at 0x42f580: file kernel/generic/trmm_ltcopy_4.c, line 42.
int dtrmm_iltucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 72 at 0x432400: file kernel/generic/trmm_ltcopy_8.c, line 42.
int dtrmm_oltncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 73 at 0x430500: file kernel/generic/trmm_ltcopy_8.c, line 42.
int dtrmm_oltucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 74 at 0x4218f0: file kernel/generic/trmm_uncopy_4.c, line 42.
int dtrmm_iunncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 75 at 0x421190: file kernel/generic/trmm_uncopy_4.c, line 42.
int dtrmm_iunucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 76 at 0x423df0: file kernel/generic/trmm_uncopy_8.c, line 42.
int dtrmm_ounncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 77 at 0x421ff0: file kernel/generic/trmm_uncopy_8.c, line 42.
int dtrmm_ounucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 78 at 0x42b0f0: file kernel/generic/trmm_utcopy_4.c, line 42.
int dtrmm_iutncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 79 at 0x42a8d0: file kernel/generic/trmm_utcopy_4.c, line 42.
int dtrmm_iutucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 80 at 0x42d6b0: file kernel/generic/trmm_utcopy_8.c, line 42.
int dtrmm_outncopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 81 at 0x42b870: file kernel/generic/trmm_utcopy_8.c, line 42.
int dtrmm_outucopy(BLASLONG, BLASLONG, double *, BLASLONG, BLASLONG, BLASLONG, 
    double *);
Breakpoint 82 at 0x412810: file kernel/x86_64/../generic/gemm_ncopy_4.c, line 42.
int dgemm_incopy(BLASLONG, BLASLONG, double *, BLASLONG, double *);
Breakpoint 83 at 0x413030: file kernel/x86_64/../generic/gemm_ncopy_8.c, line 42.
int dgemm_oncopy(BLASLONG, BLASLONG, double *, BLASLONG, double *);
Breakpoint 84 at 0x412bb0: file kernel/x86_64/../generic/gemm_tcopy_4.c, line 56.
int dgemm_itcopy(BLASLONG, BLASLONG, double *, BLASLONG, double *);
Breakpoint 85 at 0x413b10: file kernel/x86_64/../generic/gemm_tcopy_8.c, line 76.
int dgemm_otcopy(BLASLONG, BLASLONG, double *, BLASLONG, double *);

Note that they are different. For instance when the working version calls itcopy8, the version that doesn't work calls otcopy8.

@Sbte
Contributor Author

Sbte commented Oct 17, 2016

Never mind that last comment. That difference was because I wasn't using haswell in the working version. Now they are both the same. I will look a bit further...

@Sbte
Contributor Author

Sbte commented Oct 17, 2016

So the actual difference between the working and non-working call is this:

Extra in working:

Breakpoint 2 at 0x40b573: file ../common_linux.h, line 87.
static int my_mbind(void *, unsigned long, int, unsigned long *, unsigned long, unsigned int);
Breakpoint 6 at 0x40c075: file ../common_x86_64.h, line 97.
static BLASULONG rpcc(void);
Breakpoint 5 at 0x40d246: file ../common_x86_64.h, line 127.
static void cpuid(int, int *, int *, int *, int *);
Breakpoint 1 at 0x40b5b4: common.h:blas_unlock. (2 locations)
static void blas_unlock(volatile BLASULONG *);
Breakpoint 7 at 0x401a91: file common_thread.h, line 138.
static int num_cpu_avail(int);
Breakpoint 4 at 0x40ad94: common_x86_64.h:blas_quickdivide. (2 locations)
static int blas_quickdivide(unsigned int, unsigned int);
Breakpoint 3 at 0x40b526: common_x86_64.h:blas_lock. (2 locations)
static void blas_lock(volatile BLASULONG *);
Breakpoint 59 at 0x40b5c6: file driver/others/memory.c, line 178.
int get_num_procs(void);
Breakpoint 12 at 0x419bf2: file kernel/x86_64/dtrmm_kernel_4x8_haswell.c, line 207.
int dtrmm_kernel_LN(BLASLONG, BLASLONG, BLASLONG, double, double *, double *, double *, BLASLONG, 
    BLASLONG);
Breakpoint 13 at 0x41c942: file kernel/x86_64/dtrmm_kernel_4x8_haswell.c, line 207.
int dtrmm_kernel_LT(BLASLONG, BLASLONG, BLASLONG, double, double *, double *, double *, BLASLONG, 
    BLASLONG);
Breakpoint 14 at 0x41f742: file kernel/x86_64/dtrmm_kernel_4x8_haswell.c, line 209.
int dtrmm_kernel_RN(BLASLONG, BLASLONG, BLASLONG, double, double *, double *, double *, BLASLONG, 
    BLASLONG);
Breakpoint 15 at 0x422512: file kernel/x86_64/dtrmm_kernel_4x8_haswell.c, line 209.
int dtrmm_kernel_RT(BLASLONG, BLASLONG, BLASLONG, double, double *, double *, double *, BLASLONG, 
    BLASLONG);
Breakpoint 16 at 0x4199e5: kernel/x86_64/dtrmm_kernel_4x8_haswell.c:dtrmm_kernel_4x8. (4 locations)
static void dtrmm_kernel_4x8(BLASLONG, double *, double *, double *, double *, double *, double *, 
    double *, double *, double *, double *, double *);

Extra in non-working:

Breakpoint 29 at 0x40a750: file driver/others/init.c, line 696.
int get_node(void);
Breakpoint 30 at 0x40a620: file driver/others/init.c, line 658.
int get_node_equal(void);
Breakpoint 31 at 0x40a610: file driver/others/init.c, line 655.
int get_num_nodes(void);
Breakpoint 32 at 0x40a600: file driver/others/init.c, line 654.
int get_num_procs(void);
Breakpoint 35 at 0x40a650: file driver/others/init.c, line 662.
int gotoblas_set_affinity(int);
Breakpoint 33 at 0x40a780: file driver/others/init.c, line 703.
void gotoblas_affinity_init(void);
Breakpoint 34 at 0x40b6d0: file driver/others/init.c, line 827.
void gotoblas_affinity_quit(void);
Breakpoint 48 at 0x401630: file driver/others/memory.c, line 1295.
static void _init_thread_memory(void *);
Breakpoint 49 at 0x409230: file driver/others/memory.c, line 1260.
static void _touch_memory(blas_arg_t *, BLASLONG *, BLASLONG *, void *, 
    void *, BLASLONG);

At least that is what I got when doing a diff between two runs with gdb and rbreak . (which sets a breakpoint on every function). If anyone has a better idea of how to get some useful logs then please let me know.

@martin-frbg
Collaborator

Not sure what to make of the supposedly extra calls to get_num_procs etc. Perhaps you want to check with ldd that both binaries link to the same system libraries (to exclude that one was built with OPENMP support and the other without, or similar).
I assume you get some calls to dtrmm_kernel_RN et al. from inside dtrmm_kernel_4x8_haswell.c in the non-working case as well, just not these "extra" ones, or is it running a completely different dtrmm implementation? (If this is not just the diff utility getting out of step.)

@Sbte
Contributor Author

Sbte commented Oct 17, 2016

They link to the same libraries.

Maybe it actually builds the haswell implementations (because they are there in my cmake build directory) but does not actually use them? What I did for the diff was first sort by file name, then by function name, and then take out all the addresses, breakpoint numbers and line numbers. So the lines in dtrmm_kernel_4x8_haswell.c are simply not there in the non-working case. It could also be that the cmake build and standard build forget to attach debug symbols for some files when built with debugging enabled.
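
To make the stripping step concrete, here is a rough sketch of pulling just the file path out of one line of the breakpoint listing, so that two listings can be sorted and diffed without the addresses, breakpoint numbers and line numbers getting in the way. The helper name breakpoint_file is hypothetical, and it only handles the ": file <path>, line <n>." shape seen in the listings above; lines such as "common.h:blas_unlock. (2 locations)" are reported as not matching.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Extract the file path from a line shaped like
   "Breakpoint 32 at 0x43df41: file kernel/generic/trmm_lncopy_4.c, line 51."
   Returns 1 and writes the path to out on success, 0 if the line has a
   different shape or the path does not fit into out. */
static int breakpoint_file(const char *line, char *out, size_t outsz)
{
    static const char marker[] = ": file ";
    assert(line != NULL && out != NULL && outsz > 0);

    const char *start = strstr(line, marker);
    if (start == NULL)
        return 0;
    start += sizeof marker - 1;           /* skip past ": file " */

    const char *end = strstr(start, ", line ");
    if (end == NULL)
        return 0;

    size_t len = (size_t)(end - start);
    if (len >= outsz)
        return 0;
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```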

@martin-frbg
Collaborator

I do not see dtrmm_kernel_4x8_haswell.c (or DTRMM as such) defined anywhere except in KERNEL.HASWELL. What I did hit upon was a USE_TRMM variable in kernel/CMakeLists.txt that got a special treatment for Haswell in 96f0bbe a.k.a. "fixed cmake bug on haswell".
Perhaps try "set(USE_TRMM true)" there unconditionally in case there is something funky with the STREQUAL checks? (Or conversely force it to "false" if you are sure that it is now "true", and see how the cmake build behaves.)

@Sbte
Contributor Author

Sbte commented Oct 17, 2016

Oh right, fixed it. Thanks a lot!

Anyhow, doesn't this still mean it's broken with the standard implementation? Or was it just broken because it expected this implementation and not the standard one?

@martin-frbg
Collaborator

martin-frbg commented Oct 17, 2016

That's one for the boss to answer - I am just a self-appointed janitor around here :-)
(But I suppose the cmake support is still considered to be somewhat experimental, e.g. the lack of an install target as noted in another recent issue. Maybe the "standard implementation" that got called accidentally in your case is simply not adequate for modern cpus, or only works if all dependent functions use the same level of (non-)precision.)

@Sbte
Contributor Author

Sbte commented Oct 17, 2016

Since it doesn't work with the precompiled version in Ubuntu 16.04 either, and that version does not use cmake, there's still another bug out there. I'll try to fix it tomorrow. It should be easier to find now that I know that I probably need a different build target to be able to bisect it.

@martin-frbg
Collaborator

martin-frbg commented Oct 17, 2016

What's the precompiled version in Ubuntu 16 based on then ? If it works in your copy of git develop, chances would seem to be that it is something that got fixed since whatever release they picked ?
(Replying to myself, 0.2.18 appears to be the latest and its buildlog does not show anything immediately obvious. Similarly the fix for #786 appears to be the only remotely relevant change on the develop branch since April. Obviously the Ubuntu build will be DYNAMIC_ARCH=1 but as far as I know this should not matter at runtime)

@xianyi
Collaborator

xianyi commented Oct 18, 2016

@martin-frbg Thank you for the fix. I am not a CMake expert, so the code is not elegant.

@Sbte
Contributor Author

Sbte commented Oct 18, 2016

Some debugging so far:

In the Debian changelog, it says that

  • debian/rules: remove TARGET=GENERIC flag when building dynamic arch binary.
    This flag creates a compilation failure and seems no longer needed.

Which happened somewhere after 14.04 (Ubuntu version where it worked). This compilation failure is caused by a85c278 as far as I can tell. The reason is that the TARGET is GENERIC, but it also tries to build other targets which don't have a trmm kernel.

Anyhow, trying to reproduce whatever they did when building the Ubuntu version did not give me any wrong answers yet. Maybe I need to build it on a different system...

@martin-frbg
Collaborator

Not sure what you are comparing here. Wouldn't the OpenBLAS libraries in Ubuntu 14 and 16 be based on entirely different releases (or even snapshots of the develop branch)? Or are you comparing the exact same release built on/for two Ubuntu releases (in which case the gcc and glibc versions might play a role IF the build options are truly identical)?
(And as far as I can tell "TARGET=GENERIC" was never a supported choice even with Kazushige Goto's original GotoBLAS, so this looks more like a misunderstanding on the part of some Debian maintainer that went unnoticed until GENERIC was added as an internal option.)

@Sbte
Contributor Author

Sbte commented Oct 18, 2016

They are different releases but they are built in a different way as well. For me, if I build things in the standard way everything works, so I'm trying to find out what the Ubuntu maintainer does differently from me when building the packages. For the 14.04 version, one difference was apparently that he was building with TARGET=GENERIC. But I'm trying to find more differences.

Anyhow, from this compilation failure, I found that in Makefile.L3, TARGET is never changed during a DYNAMIC_ARCH build even though it says TARGET = $(TARGET_CORE) at the top of the file. So now I'm wondering if that might be a problem. For now I just changed it to CORE since that always contains the right thing.

I'm also trying to build without USE_TRMM to see if that causes trouble.

Edit: Bingo! Not using USE_TRMM gives me this:

0.000000 0.000000 0.000000 0.000000 -4.000000 -4.000000 -4.000000 -4.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 -4.000000 -4.000000 -4.000000 -4.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 -4.000000 -4.000000 -4.000000 -4.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 -4.000000 -4.000000 -4.000000 -4.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 -4.000000 -4.000000 -4.000000 -4.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 -4.000000 -4.000000 -4.000000 -4.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 -4.000000 -4.000000 -4.000000 -4.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 -4.000000 -4.000000 -4.000000 -4.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 -4.000000 -4.000000 -4.000000 -4.000000 0.000000 0.000000 
0.000000 0.000000 0.000000 0.000000 -4.000000 -4.000000 -4.000000 -4.000000 0.000000 0.000000 

@martin-frbg
Collaborator

Hmm. I am fairly certain that I saw the DYNAMIC_ARCH build looping through ...PRESCOTT, ...HASWELL etc. variants in the Ubuntu 16.04 build log on their server. Possibly the TARGET as such is used only for the built-in tests in this case (where it would make sense to do it for the current build cpu only, if at all)? And wouldn't having to unset USE_TRMM now contradict yesterday's conclusion unless the dynamic-arch code (suddenly?) fails to identify a Haswell-generation cpu at runtime?

@Sbte
Contributor Author

Sbte commented Oct 18, 2016

DYNAMIC_ARCH loops just fine, but in the makefile itself, where the checks are performed, I found that CORE, LIBCORE and TARGET_CORE were set correctly (by Makefile_kernel.conf, I believe) but TARGET was not. Maybe I did something wrong there. But since CORE is used in Makefile.L3 as well, maybe it's safer to use CORE (like in some other if statements).

Yesterday, we found that USE_TRMM was not set correctly for HASWELL. By changing the cmake file I made sure the TRMM kernel was used. So the problem seems to occur whenever this is not used. I now reproduced that with a normal makefile by unsetting it. In ubuntu 14.04 (v0.2.8), the USE_TRMM related if statements are not there, so at that point apparently it worked. So I'm bisecting at the moment to find where it went wrong (while removing all USE_TRMM=1 statements from the makefiles). I expect it to be done in a few hours.

Edit: Oh yeah, and apparently there's also something wrong with the identification of Haswell by the package that is provided by Ubuntu. Or else I wouldn't have encountered this problem. Maybe that's another bug? There's really too much going on here...

@martin-frbg
Collaborator

So it looks as if USE_TRMM=1 is not set correctly during the Haswell part of a DYNAMIC_ARCH build?
(0.2.8 is probably old enough to have used completely different code for trmm.) Wouldn't it be easier to make the Makefile write out the current setting of TARGET, CORE and USE_TRMM during each iteration of the DYNAMIC_ARCH build to verify this hypothesis (as I think you already established that not using dtrmm_kernel_4x8_haswell.c is a very bad idea)?

@Sbte
Contributor Author

Sbte commented Oct 18, 2016

The first bad commit is 9bd962f but only in the case that USE_TRMM is undefined. So then I guess it uses the kernel/x86_64/dgemm_kernel_4x8_haswell.S file that was added there? Does that mean that there is a bug in that file? Or is this just wrong in general?

For the Ubuntu build: The logs contain this line

gcc -Wdate-time -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -O2 -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DDYNAMIC_ARCH -DNO_LAPACKE -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=64 -DASMNAME=dtrmm_kernel_LN_HASWELL -DASMFNAME=dtrmm_kernel_LN_HASWELL_ -DNAME=dtrmm_kernel_LN_HASWELL_ -DCNAME=dtrmm_kernel_LN_HASWELL -DCHAR_NAME=\"dtrmm_kernel_LN_HASWELL_\" -DCHAR_CNAME=\"dtrmm_kernel_LN_HASWELL\" -DNO_AFFINITY -DTS=_HASWELL -I.. -DBUILD_KERNEL -DTABLE_NAME=gotoblas_HASWELL -DDOUBLE  -UCOMPLEX -c -DTRMMKERNEL -DDOUBLE -UCOMPLEX -DLEFT -UTRANSA ../kernel/x86_64/dtrmm_kernel_4x8_haswell.c -o dtrmm_kernel_LN_HASWELL.o

So it seems to build the right file.

@martin-frbg
Collaborator

As I see it, USE_TRMM must be set for Haswell so that the correct function is compiled. The bug would have to be either that USE_TRMM is not set automatically in that part of the DYNAMIC_ARCH build process, or (less likely I guess) that the library still manages to call a/the wrong implementation at runtime. If you have the log file from a TARGET=HASWELL build still available, you could check if dtrmm_kernel_LN_HASWELL.o is the only file that dtrmm_kernel_4x8_haswell.c is compiled to or if there are more gcc lines using this file to build different flavors of the trmm function based on the CNAME argument.

@juliantaylor

Can you please post an example reproducing the issue so we can stop guessing what the issue might be? That there is a working code path with USE_TRMM or Haswell is irrelevant; all code paths must work.

@Sbte
Contributor Author

Sbte commented Oct 18, 2016

@Sbte
Contributor Author

Sbte commented Oct 18, 2016

Since Haswell has its own dtrmm implementation, would it be an idea to just remove all #if defined(TRMMKERNEL) stuff from the dgemm kernel? This seems to be the only thing that is bugged.

After this we still have to make sure that the right code path is used in Ubuntu, but I think that's a different issue.
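
For readers not familiar with the pattern: the discussion above is about one kernel source being compiled in two flavors, selected by -DTRMMKERNEL on the compiler command line. The following is a toy illustration only, not OpenBLAS code. The function toy_kernel, its offset parameter, and the idea that the TRMM flavor skips leading terms belonging to the zero part of the triangle are all assumptions made to show the shape of such code.

```c
#include <assert.h>

/* Toy inner-product kernel, hypothetical and much simpler than the real
   assembly. Compiled plainly it sums all k terms (GEMM flavor). Compiled
   with -DTRMMKERNEL it starts at `offset`, on the assumption that earlier
   terms multiply entries from the zero part of a triangular operand. */
static double toy_kernel(int k, const double *a, const double *b, int offset)
{
    assert(k >= 0 && offset >= 0 && offset <= k);
    int start = 0;
#if defined(TRMMKERNEL)
    start = offset;   /* TRMM flavor: skip the structurally zero part */
#else
    (void)offset;     /* GEMM flavor: the offset is not used */
#endif
    double sum = 0.0;
    for (int l = start; l < k; ++l)
        sum += a[l] * b[l];
    return sum;
}
```

If the code inside the TRMMKERNEL branch is wrong, the plain GEMM build of the same file is unaffected, which would fit with DGEMM working while the TRMM flavor of that file does not.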

@martin-frbg
Collaborator

Most likely not - even if the file is called dgemm_something_haswell it may be used for sufficiently similar architectures as well. What appears to be broken is the behaviour on Haswell with USE_TRMM=false - which could be a bug or a feature depending on the intentions of the authors,
and the behaviour of a library build with DYNAMIC_ARCH=1 when it is used on Haswell/Skylake.

@Sbte
Contributor Author

Sbte commented Oct 18, 2016

Yes, exactly.

@martin-frbg
Collaborator

And it seems the Ubuntu maintainer for the OpenBLAS package has joined this thread. :-)

@brada4
Contributor

brada4 commented Oct 18, 2016

First, as a workaround:
Build with make and swap the resulting valid .so in for the one the CMake build produces.

Second, nm -D libopenblas.so should yield the same symbol list for both builds, and the symbol sizes should be identical (though the list is not necessarily in the same order).
There is alignment involved, so it may not work out, but there is some hope it points to differences in the build process.
If you cannot figure out the diff, attach both nm -D results; maybe I will have more luck...

@martin-frbg
Collaborator

@brada4 the cmake problem is fixed already (though the fix was applied on the master branch, not develop, as that appears to be what Sbte's PR was based on).
The remaining issue is primarily why (at least the Ubuntu) builds using DYNAMIC_ARCH (and plain "make") appear to give wrong results for the test case when run on Haswell or later, while the native build works. (And coupled to that is the question of whether Haswell code should work even with USE_TRMM not set.)

@juliantaylor

juliantaylor commented Oct 19, 2016

I can't reproduce it on a Haswell, nor on any of the other x86 core types OpenBLAS supports (using the Ubuntu package and master). So it's maybe Skylake dependent, though that does not really make sense, as Skylake just uses the Haswell code...
Does it work if you disable threading? (OPENBLAS_NUM_THREADS=1)

@Sbte
Contributor Author

Sbte commented Oct 19, 2016

@juliantaylor Huh, it seems to work on my system as well now. Something else must have caused this, then. But this bug certainly existed before, or else I would never have encountered it. Note that I never manually install anything globally on my system, so something in the Ubuntu repositories caused this to happen, but apparently not the openblas package. This just keeps getting weirder...

@Sbte
Contributor Author

Sbte commented Oct 19, 2016

Actually that is not possible because the wrong implementation was never built (as far as I can tell from the build logs and the Makefile logic). So there is no way I could have been using it. Maybe I ran into another bug then that looks very similar to this one, but then found this one while trying to debug it (because I started building my own openblas using cmake). That's the only thing I can think of.

@martin-frbg
Collaborator

Weird. How about closing this issue (as you found and fixed the bug in CMakeLists.txt that this one was originally about) and opening a new one if and when the DYNAMIC_ARCH gremlin reappears?

@Sbte
Contributor Author

Sbte commented Oct 19, 2016

Well, the implementation in kernel/x86_64/dgemm_kernel_4x8_haswell.S is still broken when compiled with -DTRMMKERNEL as far as I can tell. So this bug still exists in that file. I can open a new issue about that so at least it is documented that it is broken? I guess there's no high priority in fixing it since it is currently not used.

@martin-frbg
Collaborator

martin-frbg commented Oct 19, 2016

But is there any combination of build options that actually leads to this getting built with -DTRMMKERNEL (and with wrong results, assuming this might be used by some non-Haswell KERNEL file as well)? The whole file, including the "ifdef TRMMKERNEL" branches, appears to have come from a single commit by wernsaar in mid-2015. Maybe it is a leftover from experiments that he did not think useful enough to upload, or on the contrary a fallback path that only made sense back then.
(By all means open a "code cleanup" issue if you think it useful, but personally, if and when wernsaar returns his attention to OpenBLAS, I would rather see him address issues like 899 or 915(962).)

@brada4
Contributor

brada4 commented Oct 19, 2016

You can break just about any kernel by pre-defining internal defines like TRMMKERNEL or CFLAGS.
You should take Makefile.rule as the reference for public build parameters that at least somebody was able to use.

@Sbte
Contributor Author

Sbte commented Oct 19, 2016

What I meant to say is that Makefile.L3 is able to do that if USE_TRMM is undefined. But as far as I'm concerned this can be closed if there is no need to fix the dgemm kernel. Probably, if someone ever wants to actually use the TRMM part, it will be well tested anyway.

@martin-frbg
Collaborator

Only you as the submitter, or xianyi as the project owner, can close this anyway.
My honest opinion is simply that 136 is an absurdly high number of supposedly unresolved bugs for what is already a quite mature and useful library, and that adding cosmetic tasks that still require the attention of one of the few core developers will only bury more important issues and/or drive potential users away.


5 participants