sdot yields wrong results for odd sizes > 2^24 and all sizes > 2^29 #1326

Closed
kohr-h opened this issue Oct 11, 2017 · 65 comments

@kohr-h

kohr-h commented Oct 11, 2017

When I compute the dot product of a vector of all ones with itself, the result should be the size of the vector. This stops holding at size 2^24 + 1, as the following program shows:

// file tmp.c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "cblas.h"

int main(int argc, char **argv) {
    int exp = atoi(argv[1]);
    int offset = atoi(argv[2]);
    size_t size = ((size_t)1 << exp) + offset;  // shift in size_t: 1 << exp would overflow int for exp >= 31
    float *a = (float *) malloc(size * sizeof(float));
    assert(a != NULL);
    for (size_t i=0; i < size; ++i)
        a[i] = 1;

    float dot = cblas_sdot(size, a, 1, a, 1);
    printf("dot = %f\n", dot);
    printf("size = %zu = (1 << %i) + (%i)\n", size, exp, offset);

    free(a);
    return 0;
}

I get

$ ./tmp 24 0
dot = 16777216.000000
size = 16777216 = (1 << 24) + (0)

$ ./tmp 24 1
dot = 16777216.000000
size = 16777217 = (1 << 24) + (1)

$ ./tmp 24 2
dot = 16777218.000000
size = 16777218 = (1 << 24) + (2)

$ ./tmp 24 3
dot = 16777220.000000
size = 16777219 = (1 << 24) + (3)

$ ./tmp 24 4
dot = 16777220.000000
size = 16777220 = (1 << 24) + (4)

$ ./tmp 24 5
dot = 16777220.000000
size = 16777221 = (1 << 24) + (5)

and then, for sizes larger than 2^29 the result just stays constant at 2^29:

$ ./tmp 29 1
dot = 536870912.000000
size = 536870913 = (1 << 29) + (1)

$ ./tmp 29 2
dot = 536870912.000000
size = 536870914 = (1 << 29) + (2)

$ ./tmp 29 4
dot = 536870912.000000
size = 536870916 = (1 << 29) + (4)

$ ./tmp 30 -1
dot = 536870912.000000
size = 1073741823 = (1 << 30) + (-1)

$ ./tmp 30 0
dot = 536870912.000000
size = 1073741824 = (1 << 30) + (0)

I'm using the latest develop branch of OpenBLAS, and I link the library statically so that I don't accidentally pick up an older version dynamically:

$ gcc -o tmp tmp.c [...]/OpenBLAS/build/lib/libopenblas.a -I[...]/OpenBLAS/build/include -std=c99
$ ldd tmp
        linux-vdso.so.1 (0x00007ffe2ed79000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f58b2fdc000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f58b338c000)

Some more info:

$ uname -a
Linux sabayon 4.11.0-sabayon #1 SMP Wed Jul 26 12:57:06 UTC 2017 x86_64 Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz GenuineIntel GNU/Linux

See also this related Numpy issue: numpy/numpy#9852

@brianborchers

You're running into floating-point round-off errors here. In single precision, IEEE floating-point numbers have only 24 bits of significand precision, so the largest positive integer n such that all integers up to n are exactly representable is 16777216 = 2^24.

You can verify that dot products of this size work fine in double precision.
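
A standalone C demonstration of that limit (nothing BLAS-specific is involved):

#include <stdio.h>

int main(void) {
    float f = 16777216.0f;                    /* 2^24 */
    printf("%.1f\n", f + 1.0f);               /* 16777216.0: the +1 is rounded away */
    printf("%.1f\n", f + 2.0f);               /* 16777218.0: 2^24 + 2 is representable */
    printf("%s\n", f + 1.0f == f ? "lost" : "kept");  /* prints "lost" */
    return 0;
}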

@kohr-h
Author

kohr-h commented Oct 12, 2017

Thanks for the explanation, that makes sense. Unfortunately that conclusion is a bit unsatisfactory, since one would have to switch from float arrays to double arrays just to avoid this issue, giving away both the speed and the storage advantage of singles. The second part of this issue would go away if the accumulator were double precision, at the expense of speed, of course.

Question: Is it completely out of scope for OpenBLAS to solve this issue by deciding upfront whether or not a double accumulator will be necessary?
I can imagine some simple heuristics, like taking n << size samples, computing their average, and extrapolating the expected dot value. But I know too little about floating-point arithmetic, and I also don't know the overall policy (try to return correct results vs. return predictably wrong results).
I do realize that the issue with the representation of the final result doesn't go away as long as the return value has to be float, but the issue from 2^29 on seems to be related to the summation itself, so that should be fixable.
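
To make the idea concrete, here is a rough sketch of the kind of heuristic I mean (the sample count, the threshold, and the helper name needs_double_accumulator are all made up for illustration, not anything OpenBLAS implements):

#include <math.h>
#include <stddef.h>

/* Made-up helper, not part of any BLAS API: estimate the magnitude of the
   dot product from a few samples and decide whether a float accumulator
   would saturate. */
static int needs_double_accumulator(const float *x, const float *y, size_t n) {
    if (n <= ((size_t)1 << 24))
        return 0;                       /* exact integer results still representable */
    const size_t nsamples = 256;        /* illustrative sample count */
    size_t step = n / nsamples;
    double est = 0.0;
    for (size_t i = 0; i < n; i += step)
        est += fabs((double)x[i] * (double)y[i]);
    est *= (double)step;                /* scale the sample sum to length n */
    return est > (double)(1 << 24);     /* 2^24: float integer-precision limit */
}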

@martin-frbg
Collaborator

That suggestion does not sound too convincing to me: if you know exactly what loss of precision your particular use case can tolerate, you may be better off coding your own case-specific trickery to save speed and memory. There is an INTERFACE64 build option to switch all BLAS integers to 64-bit, but this will probably not save you if you want to preserve the speed and footprint of int32 (assuming it actually does what you need, which I have not checked now). I expect you will encounter this problem no matter which BLAS implementation you try. Perhaps you can change your own code to use either single or double precision depending on size; that would keep all the advantages of floats where possible while still yielding a correct result for larger problems.

@adler-j

adler-j commented Oct 12, 2017

A short note of interest from the related Numpy issue, numpy/numpy#9852:

It seems this "bug" does not happen when using Intel MKL, so as it stands OpenBLAS is in this respect "inferior" to Intel MKL, and downstream users get significantly different results depending on the backend used. Perhaps this is something to keep in mind.

@martin-frbg
Collaborator

What CPU are you seeing this with?

@kohr-h
Author

kohr-h commented Oct 12, 2017

That suggestion does not sound too convincing to me: if you know exactly what loss of precision your particular use case can tolerate, you may be better off coding your own case-specific trickery to save speed and memory.

No, that was just a lame suggestion, and I realize that I can do that easily myself.

There is an INTERFACE64 build option to switch all BLAS integers to 64-bit, but this will probably not save you if you want to preserve the speed and footprint of int32 (assuming it actually does what you need, which I have not checked now).

I don't see how this is related since we're not hitting any integer overflow issues yet. Of course, arrays larger than maxint are problematic as well, but that's a separate issue.

I expect you will encounter this problem no matter which BLAS implementation you try.

When using Intel MKL with the same vanilla implementation as above, right, it produces the same results. However, in Numpy the dot product with the MKL backend does not suffer from this issue, as opposed to the OpenBLAS backend. See numpy/numpy#9852.
I don't know what Numpy does on their end to fix this, or whether MKL exposes methods that let you specify the precision of the accumulator and the result, for example. Maybe one of the Numpy people can comment on that.
cuBLAS, for instance, has a dot function with extended capabilities for just that.

Perhaps you can change your own code to use either single or double precision depending on size; that would keep all the advantages of floats where possible while still yielding a correct result for larger problems.

Well, "you" is in this case Numpy, and I don't know about their plans. I as a user will have to resort to workarounds.

@kohr-h
Author

kohr-h commented Oct 12, 2017

When using Intel MKL with the same vanilla implementation as above, right, it produces the same results.

Follow-up to this: the results are the same for the first half (sizes <= 2^28), but the issue with "constant result" from 2^28 + 1 on does not occur with MKL.

@martin-frbg
Collaborator

I don't see how this is related since we're not hitting any integer overflow

That was just in the context of the odl issue you had linked (if I understood that one correctly).

As OpenBLAS uses different assembly kernels for ?dot on different CPUs, it would probably help to know which one(s) you are seeing this with.

@kohr-h
Author

kohr-h commented Oct 12, 2017

That was just in the context of the odl issue you had linked (if I understood that one correctly).

Right, there are two entry points into this issue :-)

As OpenBLAS uses different assembly kernels for ?dot on different CPUs, it would probably help to know which one(s) you are seeing this with.

In the first post I specified everything that matters (I hope):

$ uname -a
Linux sabayon 4.11.0-sabayon #1 SMP Wed Jul 26 12:57:06 UTC 2017 x86_64 Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz GenuineIntel GNU/Linux

and

    [...]
    float dot = cblas_sdot(size, a, 1, a, 1);
    [...]

@kohr-h
Author

kohr-h commented Oct 12, 2017

Here's the last block of /proc/cpuinfo:

processor	: 7
vendor_id	: GenuineIntel
cpu family	: 6
model		: 94
model name	: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
stepping	: 3
microcode	: 0x39
cpu MHz		: 800.048
cache size	: 8192 KB
physical id	: 0
siblings	: 8
core id		: 3
cpu cores	: 4
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs		:
bogomips	: 8020.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

@martin-frbg
Collaborator

Thanks. I had indeed overlooked the uname info above; the cpuinfo is obviously from another system, but both will be using the HASWELL target as far as OpenBLAS is concerned.

@kohr-h
Author

kohr-h commented Oct 12, 2017

the cpuinfo is obviously from another system

Oops, that's me at my laptop yesterday and at my desktop machine today :-)

@brada4
Contributor

brada4 commented Oct 12, 2017

_dot is a naive loop: with even mildly random input you would get a quite different result if you sorted the input one way or the other. Adding in chunks could level out the round-offs a bit (as compiler heuristics hint), but I fail to see how that improves the general case.

Other options:

  • Make an sdot with a quad-precision sum register; all the conversions are slow, but the accuracy is impeccable.
  • Convert en masse to double and use ddot.

Neither is the BLAS API anymore.
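
A sketch of the second option (the helper name and chunk size are arbitrary, and it is of course no longer the BLAS API):

#include <stddef.h>
#include "cblas.h"

/* "Convert en masse to double and use ddot": convert fixed-size chunks
   into double buffers and let cblas_ddot do the accumulation. */
double sdot_via_ddot(size_t n, const float *x, const float *y) {
    enum { CHUNK = 1024 };
    double bx[CHUNK], by[CHUNK];
    double sum = 0.0;
    for (size_t i = 0; i < n; i += CHUNK) {
        size_t m = (n - i < CHUNK) ? (n - i) : CHUNK;
        for (size_t j = 0; j < m; ++j) {
            bx[j] = x[i + j];
            by[j] = y[i + j];
        }
        sum += cblas_ddot((int)m, bx, 1, by, 1);
    }
    return sum;
}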

@kohr-h
Author

kohr-h commented Oct 12, 2017

Note that there are actually two different issues here, which I believe have different sources. The first is the lack of bits to represent the final result (the 2^24 - 2^28 case), which is baked into the API. The second one is the failure to accumulate beyond 2^28, which seems to be due to the implementation rather than the API. So at least the second issue could be addressed without changing the API.

I still don't quite know what happens there. Can anyone solve the mystery for me?

@brada4
Contributor

brada4 commented Oct 12, 2017

The same limitation is present in the reference implementation.

@martin-frbg
Collaborator

To me this still looks as if you are in the territory of undefined behaviour. Maybe MKL switches to double internally, maybe it happens to use a different kind and size of cpu register or splits up the summation task in different ways. If I read the other issue thread correctly, Apple's Accelerate failed as well ? Wonder what plain netlib BLAS does here (though it will probably depend on the compiler).

@brada4
Contributor

brada4 commented Oct 12, 2017

That yields a parity difference, a compiler loop-vectorisation property. I'll try to get to clang-vs-gcc at least.

@brada4
Contributor

brada4 commented Oct 12, 2017

Another option: inline sdot in place of the call; with a fixed loop count, compilers will do better at vectorising.

@brada4
Contributor

brada4 commented Oct 12, 2017

You can add 1.0 to 2^24 and it will still yield 2^24, because the next representable float is 2^24 + 2 (then +4, +6, etc. further up).
The 2^29 comes from the vectorized summation (2^5 = 32 interim sums of floats are accumulated).
Setting the rounding mode to away-from-zero would probably get you to 2^25.
It is a floating-point representation limitation, not exactly the easy-to-understand case of integer overflow you mentioned before, and the FPU does not signal the rounding in action.
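
The 2^29 ceiling can be imitated in plain C; this is a model of the 32 interim sums, not the actual kernel code:

#include <stdio.h>
#include <stddef.h>

/* 32 float "lanes", each saturating at 2^24, give a combined ceiling of
   32 * 2^24 = 2^29 for a sum of ones, which matches the reported behaviour. */
int main(void) {
    size_t n = ((size_t)1 << 29) + 100;  /* past the ceiling; takes a moment */
    float lane[32] = {0};
    for (size_t i = 0; i < n; ++i)
        lane[i % 32] += 1.0f;            /* sticks once a lane reaches 2^24 */
    float total = 0.0f;
    for (int k = 0; k < 32; ++k)
        total += lane[k];
    printf("total = %f\n", total);       /* 536870912.000000 = 2^29 */
    return 0;
}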

@brada4
Contributor

brada4 commented Oct 12, 2017

Actually, kernel/x86_64/sdot.c uses a double accumulator, in addition to passing 32 floats at once to the assembly microkernels. I tend to think the 2^29 behaviour actually is a deviation from the reference implementation, though I agree with you that it is better than the reference (just that I guessed it from external effects, not by looking at the source).

@quickwritereader
Contributor

quickwritereader commented Oct 13, 2017

Could you manually accumulate it in double, as the guys said? Just block it: use one for loop with the desired step and one tail (let's say 8196 or bigger, so that only a few sdots are called).

And please also accumulate it in float manually. Imho, if it works with float accumulation too, we could add that internally. Because in the source code the accumulation is done using the previous sum, at some point it may add 1 to a big number and stay at the previous sum, so adding 1 again later could yield the same previous sum.

(P.S. I'm not able to check this myself.)
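
Untested sketch of what I mean (blocked_sdot is just an illustrative name):

#include <stddef.h>
#include "cblas.h"

/* Blocked accumulation: fixed-size sdot calls, with the block results
   summed in double. The block size is illustrative. */
double blocked_sdot(size_t n, const float *x, const float *y) {
    const size_t block = 8192;
    double sum = 0.0;
    size_t i = 0;
    for (; i + block <= n; i += block)
        sum += (double)cblas_sdot((int)block, x + i, 1, y + i, 1);
    if (i < n)                            /* one tail call */
        sum += (double)cblas_sdot((int)(n - i), x + i, 1, y + i, 1);
    return sum;
}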

@quickwritereader
Contributor

@brada4
The main loop is using FLOAT in sdot.c; you probably meant the "double dot" declaration. The loop accumulator is

FLOAT mydot=0.0;

@brada4
Contributor

brada4 commented Oct 13, 2017

I think I have the same source file as you. Every 32 input elements, the single-precision 'mydot' is added to the double accumulator 'dot':

        double dot = 0.0;
        FLOAT mydot = 0.0;
...
mydot = <kernel on 32 members>;
dot += mydot;
...
return dot;

@quickwritereader
Contributor

quickwritereader commented Oct 13, 2017

@brada4
sdot.c
Yes, I mean that most of the accumulation result is stored in mydot, which is FLOAT, and even looking at the 32-element C kernel we see FLOAT used inside. Then it sums the tail into dot and finally adds mydot to it.

Does this function have an outer block size?

(P.S. My laptop is broken and I'm using my phone to look at the source, so I can't check it by running.)

@brada4
Contributor

brada4 commented Oct 13, 2017

Line 99 in that file adds (float)mydot to (double)dot.

@kohr-h
Author

kohr-h commented Oct 13, 2017

I think @quickwritereader is correct: there is no outer loop in the sdot.c implementation. The line n1 = n & (BLASLONG)(-32); makes n1 a multiple of 32, then sdot_kernel_16(n1, x, y, &mydot); runs the microkernel, which indeed updates the float mydot along the way. So in fact, n1 elements are accumulated in single precision. The remainder is then accumulated in double dot, and mydot is added to it at the end.
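
For reference, the masking line just rounds n down to the next multiple of 32:

#include <stdio.h>

int main(void) {
    long n = 1000;
    long n1 = n & -32L;   /* clear the low 5 bits */
    printf("n1 = %ld, tail = %ld\n", n1, n - n1);   /* n1 = 992, tail = 8 */
    return 0;
}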

So, no double accumulation.

@kohr-h
Author

kohr-h commented Oct 13, 2017

I should add that I'm inferring this from the C code; I can't tell what the microkernel does from the assembly alone.

Follow-up question: Is looping inside the microkernel significantly faster than doing the loop outside? With an outside loop one could actually accumulate in double precision, which would be nice unless it's too slow.

@brada4
Contributor

brada4 commented Oct 13, 2017

The confusion comes from the function-scope variable being called "dot" and the later double accumulator also being called "dot".
Renaming one of them would probably help clarity.

@brada4
Contributor

brada4 commented Oct 13, 2017

You can measure it. dot (and Level 1 BLAS in general) is bound by memory access, not by the CPU.
There are other, invisible assembly microkernels; you are peeking at the C fallback.

@quickwritereader
Contributor

quickwritereader commented Oct 15, 2017

@kohr-h
Also, if you are familiar with inline assembly, you could ignore these loops entirely and change the inner kernel a bit. The inner kernels accumulate sums in several places and then, after the loop, add them all up to get the total. You could change that to use double for the total summation. It would not lose any performance and would handle your case. Fork OpenBLAS and give it a try; after all, it is open source, and it's fine to have your own copy. You could also open a PR so that experienced members can check it and decide whether to merge it.

@brada4
Contributor

brada4 commented Oct 15, 2017

The question is which way to push it: eliminate the double accumulator, or make the double accumulator be honoured in the odd-size cases too? Either is consistent; @kohr-h will certainly appreciate the extra precision even if it is an unintended side effect of the original algorithm.

@brianborchers

It's perhaps worth mentioning that the BLAS Technical Forum produced a standard for an extension to the BLAS in which certain routines, such as xDOT(), would use extended precision internally. See Chapter 4 of the BLAS Technical Forum report. A version of the SDOT() routine that takes its arguments in single precision, computes the dot product in double precision, and returns the result in single precision was included. As far as I can tell, this has never been widely implemented.
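
For what it's worth, the reference BLAS (and CBLAS) do already ship the closely related DSDOT/SDSDOT pair: single-precision inputs, double-precision internal accumulation, returning double and single respectively. A small usage example:

#include <stdio.h>
#include <stdlib.h>
#include "cblas.h"

int main(void) {
    int n = (1 << 24) + 1;                /* the first failing size from above */
    float *a = malloc((size_t)n * sizeof(float));
    for (int i = 0; i < n; ++i) a[i] = 1.0f;
    /* double accumulation, double result: exact here */
    printf("dsdot  = %f\n", cblas_dsdot(n, a, 1, a, 1));
    /* double accumulation, but the sum (plus a scalar) is rounded back to float */
    printf("sdsdot = %f\n", cblas_sdsdot(n, 0.0f, a, 1, a, 1));
    free(a);
    return 0;
}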

@brada4
Contributor

brada4 commented Oct 16, 2017

Thanks for a pointer in an IMO good direction. I will try to make a PR with a minimal change to the odd-size handling, so as not to sabotage the fast code path before it is reached.

@kohr-h
Author

kohr-h commented Oct 16, 2017

@quickwritereader

Also, if you are familiar with inline assembly, you could ignore these loops entirely and change the inner kernel a bit.

Unfortunately I'm not, and while I would love to dig into it a bit, I'm short on time right now. Maybe later this week.

@martin-frbg
Collaborator

martin-frbg commented Oct 16, 2017

It goes to this Fortran code, no optimized kernels being used.

I do not think any of the original netlib BLAS functions are called by OpenBLAS; what does get called, however, seems to be the generic C implementation (kernel/generic/dot.c), which has no inline assembly. As a starting point, the "#if defined(DSDOT)" mechanism could be copied from there, though the microkernels will most likely need changing to return a double-precision result as well.
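
Paraphrasing, the mechanism in kernel/generic/dot.c switches both the accumulator type and the per-element product at compile time:

#if defined(DSDOT)
    double dot = 0.0;       /* double accumulator for the dsdot entry point */
#else
    FLOAT dot = 0.0;
#endif
    ...
#if defined(DSDOT)
    dot += (double)y[iy] * (double)x[ix];   /* promote each product */
#else
    dot += y[iy] * x[ix];
#endif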

@kohr-h
Author

kohr-h commented Oct 16, 2017

I do not think any of the original netlib BLAS functions are called by OpenBLAS

You're right, I went down the wrong code path. But as you write, the implementation actually used is also just a C loop without optimized microkernels. Optimization there would be great.

@martin-frbg
Collaborator

A quick check suggests that a naive implementation leaving the microkernels in their current state, but casting all the floats to double everywhere else, would be sufficient up to 2^29 + 31.

@kohr-h
Author

kohr-h commented Oct 19, 2017

For me that solution would actually be perfectly fine, and the sdot code could remain untouched. I'll check with the Numpy guys. Correction: not quite, since the microkernel still accumulates in float precision for everything except the last size % 32 entries. So it would still make sense to also look at the microkernel.

@martin-frbg
Collaborator

That is why I called it a trivial implementation: it is expected to work for ranges of input such that the addition of any 32 values will not overflow or otherwise lead to a loss of precision. Better than the current state, but not applicable to the general case. As assembly programming is still new to me, my attempts at adding a ddot-like code path to the sdot microkernel have only led to "interesting" results so far.

@brada4
Contributor

brada4 commented Oct 20, 2017

We don't lose much in the general case: if it overflows within such a short range, the actual input values are marginal already, accuracy has probably been lost to the limited range earlier, and doubles are needed there anyway.
It would have to be a new kernel to exactly replicate the reference function, running a cvt* instruction per element to read it into a register; there is no instruction like "read 4 floats from memory into doubles in an AVX register".

@kohr-h
Author

kohr-h commented Oct 20, 2017

That is why I called it a trivial implementation: it is expected to work for ranges of input such that the addition of any 32 values will not overflow or otherwise lead to a loss of precision.

Correct me if I'm wrong, but that's not what's happening with the current microkernel. My experiments with literally copying the microkernel code into a C file and calling it on an array show that it computes the dot product of everything up to the last at most 31 elements. That is, there is a loop inside the microkernel, not just a single dot over 32 floats. So any loss of precision that happens there cannot be fixed by casting mydot to double afterwards.

As @quickwritereader suggested, it should be possible to simplify the kernel to really only do 32 floats at a time, and then loop over the packets of 32 outside in C, without being significantly slower. The DSDOT code path as added in #1329 would then automatically do the right casting.

So the logic in that implementation would be (C sketch after the list):

  • Read 32 floats
  • Compute local dot product in single precision
  • Cast result to double
  • Add result to outside double dot
  • Do next 32 floats
    ...
  • Add tail part with or without casting, depending on sdot vs. dsdot

Finally, sdot would return single and dsdot double. To what extent sdot should also do the double-casting business along the way may be a matter of taste.
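
In plain C the proposed structure would be something like the following; sdot_block32 stands in for the assembly microkernel and is a made-up name:

#include <stddef.h>

/* Stand-in for the microkernel: single-precision dot of exactly 32 floats,
   no loop over the whole array. */
static float sdot_block32(const float *x, const float *y) {
    float s = 0.0f;
    for (int i = 0; i < 32; ++i)
        s += x[i] * y[i];
    return s;
}

double dsdot_sketch(size_t n, const float *x, const float *y) {
    double dot = 0.0;
    size_t n1 = n & ~(size_t)31;                    /* multiple of 32, as in sdot.c */
    for (size_t i = 0; i < n1; i += 32)
        dot += (double)sdot_block32(x + i, y + i);  /* cast each block result */
    for (size_t i = n1; i < n; ++i)                 /* tail, accumulated in double */
        dot += (double)x[i] * (double)y[i];
    return dot;
}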

@brada4
Contributor

brada4 commented Oct 20, 2017

The microkernel is the assembly that is called inside the loop.

@martin-frbg
Collaborator

@kohr-h you are right of course - the current code has the loop inside the microkernel, and my change as uploaded in the PR does not yet contain the added outer loop in the C code. I'll redo this when I am more awake...

@martin-frbg
Collaborator

PR updated now to include the loop around the microkernel. Have not done any timings though.

@kohr-h
Author

kohr-h commented Oct 23, 2017

Nice 👍
I did some rudimentary timing with the existing microkernel and packets-of-32 calls, giving about 8 % slower speed. Some folks here suggest that the microkernel could be greatly simplified to do only 32 floats at a time without any looping, and that this would be much faster than the current one, which was written with looping in mind. @brada4 @quickwritereader

@brada4
Contributor

brada4 commented Oct 23, 2017

It is not exactly 32 or anything like that; it is the 2^n that matches a single optimal memory fetch in compiled code, and 32 is something like 4 or 8 of those.
I doubt you can reliably measure an 8 % memory-bandwidth regression; some graphics popup is enough to eat that (dot is bound by RAM bandwidth).

@martin-frbg
Collaborator

It is trivial to "#ifdef" out the now-unused "loop" from the microkernel, but it is just one subtraction and a comparison, so I suspect the performance gain would be negligible.

@martin-frbg
Collaborator

I have merged my quick-and-dirty implementation (including the "#ifndef DSDOT" in the microkernel to get rid of one each of addq,subq,jz) now and am inclined to leave it at that...

@kohr-h
Author

kohr-h commented Nov 13, 2017

I just did a quick-and-dirty speed test. Overhead is about 1.9 % with 2^30 elements, compared to 8 % using manual chunking. For me that's sufficient.

So unless folks here want to do something similar to sdot, I'd be happy with the current solution.

When will the next release be? I will suggest the dsdot solution to the Numpy guys.

@martin-frbg
Collaborator

Thanks for testing. FWIW, the next major release, according to xianyi's statement in #1245, was to be "about December 2017", while the next minor one has been stalled since the end of July.

@martin-frbg
Collaborator

The new PR expands the DSDOT hack to all x86_64 kernels that use the x86_64 sdot.c file: these are AMD Bulldozer/Steamroller/Piledriver/Excavator/Zen and Intel Nehalem/Sandybridge. The only recent kernel still missing optimized DSDOT support would now appear to be Atom (which has no optimized SDOT, but a pure assembly DDOT that could presumably be modified to take float input).

The ARM64 sdot.S for Cortex-A57 and ThunderX2 T99 appears to have been written with DSDOT support in mind already, but that support appears to be currently unused @ashwinyes ?
Similarly for MIPS P5600 and MIPS64, unless DSDOT support gets picked up automatically when not defined in any KERNEL file.
The POWER8 implementation of sdot.c appears to be fundamentally the same as the previous x86_64 version, so it should be able to make use of the hack as well.

@martin-frbg
Copy link
Collaborator

ARM64 now has optimized DSDOT as well (from #1475, #1469)

@kohr-h
Author

kohr-h commented Mar 5, 2018

Nice! It would be great to have this in a release (#1258) so dependent packages can pick it up.

So is the plan to keep sdot as it is (it behaves according to spec, AFAICT) and advise users to use dsdot for better accuracy? If yes, I guess this issue can be closed.

@martin-frbg martin-frbg added this to the 0.3.0 milestone Apr 2, 2018