sdot yields wrong results for odd sizes > 2^24 and all sizes > 2^29 #1326
You're running into floating-point round-off errors here. In single precision, IEEE floating-point numbers have only 24 bits of precision, so the largest positive integer that can be represented precisely is 16777216 = 2^24. You can verify that dot products of this size work fine in double precision. |
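A minimal sketch of that verification (assuming a standard CBLAS header; the all-ones vectors, buffer sizes, and the ddot comparison are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int n = (1 << 24) + 1;          /* 16777217, one past the float limit */
    float  *a = malloc((size_t)n * sizeof(float));
    double *b = malloc((size_t)n * sizeof(double));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 1.0; }

    /* With a float accumulator the running sum sticks at 2^24 = 16777216;
       ddot carries a 53-bit significand and returns the exact count. */
    printf("sdot: %.1f\n", cblas_sdot(n, a, 1, a, 1));
    printf("ddot: %.1f\n", cblas_ddot(n, b, 1, b, 1));

    free(a);
    free(b);
    return 0;
}
```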
Thanks for the explanation, that makes sense. Unfortunately that conclusion is a bit unsatisfactory, since one would have to switch from single to double precision just to get a correct result. Question: is it completely out of scope for OpenBLAS to solve this issue by deciding upfront whether or not a double accumulator will be necessary? |
That suggestion does not sound too convincing to me - if you know exactly what loss of precision your particular use case can tolerate, you may be better off coding your own case-specific trickery to save speed and memory? There is an INTERFACE64 build option to switch all BLAS integers to long. |
A short note of interest from the related numpy issue numpy/numpy#9852: it seems this "bug" does not happen when using Intel MKL, so as it stands OpenBLAS is in a way "inferior" to Intel MKL, and downstream users get significantly different results depending on the backend used. Perhaps this is something to keep in mind. |
What CPU are you seeing this with? |
No, that was just a lame suggestion, and I realize that I can do that easily myself.
I don't see how this is related since we're not hitting any integer overflow issues yet. Of course, arrays larger than maxint are problematic as well, but that's a separate issue.
When using Intel MKL with the same vanilla implementation as above, right, it produces the same results. However, in Numpy the dot product with MKL backend does not suffer from this issue, as opposed to the OpenBLAS backend. See numpy/numpy#9852.
Well, "you" is in this case Numpy, and I don't know about their plans. I as a user will have to resort to workarounds. |
Follow-up to this: the results are the same for the first half (sizes <= 2^28), but the issue with "constant result" from 2^28 + 1 on does not occur with MKL. |
That was just in the context of the odl issue you had linked (if I understood that one correctly). As OpenBLAS uses different assembly kernels for ?dot on different CPUs, it would probably help to know which one(s) you are seeing this with. |
Right, there are two entry points into this issue :-)
In the first post I specified everything that matters (I hope):
and [...]
float dot = cblas_sdot(size, a, 1, a, 1);
[...] |
Here's the last block of /proc/cpuinfo: [...] |
Thanks. I had indeed overlooked the uname info above, the cpuinfo is obviously from another system but both will be using the HASWELL target as far as OpenBLAS is concerned. |
Oops, that's me at my laptop yesterday and at my desktop machine today :-) |
_dot is a naive loop. Other options: [...] |
Note that there are actually two different issues here, which I believe have different sources. The first is the lack of bits to represent the final result (the 2^24 - 2^28 case), which is baked into the API. The second is the failure to accumulate beyond 2^28, which seems to be due to the implementation rather than the API. So at least the second issue could be addressed without changing the API. I still don't quite know what happens there. Can anyone solve the mystery for me? |
The same limitation is present in the reference implementation. |
To me this still looks as if you are in the territory of undefined behaviour. Maybe MKL switches to double internally, maybe it happens to use a different kind and size of CPU register or splits up the summation task in different ways. If I read the other issue thread correctly, Apple's Accelerate failed as well? Wonder what plain netlib BLAS does here (though it will probably depend on the compiler). |
That yields a parity difference; it is a property of the compiler's loop vectorisation. I'll try to get to a clang-vs-gcc comparison at least. |
Another option: inline sdot in place of the call; with a fixed-length loop, compilers will do better at vectorising. |
You can add 1.0 to 2^24 and it will still yield 2^24, because the next representable floating-point number is 2^24 + 2 (then +4, +6, etc. ahead). |
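A two-line illustration of that gap (plain C, no BLAS involved):

```c
#include <stdio.h>

int main(void) {
    float x = 16777216.0f;           /* 2^24 */
    printf("%.1f\n", x + 1.0f);      /* 16777216.0: 1.0 falls below the ULP here */
    printf("%.1f\n", x + 2.0f);      /* 16777218.0: the next representable float */
    return 0;
}
```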
Actually kernel/x86_64/sdot.c uses a double accumulator, in addition to passing 32 floats at once to the assembly microkernels. I tend to think the 2^29 behaviour actually is a deviation from the reference implementation, though I agree with you it is better than the reference (just that I guessed it from external effects, not from looking at the source). |
Could you manually accumulate it in double, as the guys said? Just block it: you can use one for loop with the desired step and one tail (let's say 8196 or bigger, to call a few sdots; see the sketch below).
(P.S. I'm not able to check this myself.) |
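A sketch of that workaround (the helper name and the 8192 block size are illustrative choices, not from this thread):

```c
#include <cblas.h>

/* Split the vector into fixed-size blocks, call cblas_sdot per block,
   and sum the partial results in a double accumulator. */
double chunked_sdot(int n, const float *x, const float *y) {
    const int block = 8192;
    double acc = 0.0;
    int i = 0;
    for (; i + block <= n; i += block)
        acc += (double)cblas_sdot(block, x + i, 1, y + i, 1);
    if (i < n)  /* tail of fewer than 'block' elements */
        acc += (double)cblas_sdot(n - i, x + i, 1, y + i, 1);
    return acc;
}
```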
@brada4 FLOAT mydot=0.0; |
I think I have the same source file as you. Every 32 input elements, the single-precision 'mydot' is added to the double accumulator 'dot'. |
@brada4
(P.S. My laptop is broken; I'm using my phone to look at the source. I can't check it running.) |
Line 99 in that file adds (float)mydot to (double)dot |
I think @quickwritereader is correct, there is no loop in the sdot.c implementation. The line [...] So, no double accumulation. |
I should add that I infer that from the C code; I don't know what the microkernel does from the assembly alone. Follow-up question: is looping in the microkernel significantly faster than doing the loop outside? With an outside loop one could actually accumulate in double precision, which would be nice unless it's too slow. |
The confusion comes from the fact that the function-scope variable is called "dot" and then later the global double accumulator is also called "dot". |
You can measure. dot (and Level 1 BLAS in general) is bound by memory access, not by the CPU. |
@kohr-h |
The question is which way to push it: eliminate the double accumulator, or make the double accumulator be honoured in the odd-size cases as well? Either is consistent; @kohr-h will certainly appreciate the extra precision even if it is an unintended side effect of the original algorithm. |
It's perhaps worth mentioning that the BLAS Technical Forum produced a standard for an extension to the BLAS in which certain routines, such as xDOT(), would use extended precision internally. See Chapter 4 of the BLAS Report document. A version of the SDOT() routine that takes its arguments in single precision, computes the dot product in double precision, and returns the result in single precision was included. As far as I can tell, this has never been widely implemented. |
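The closest relatives that did make it into the reference BLAS/CBLAS are dsdot and sdsdot, which take single-precision inputs but accumulate in double; a usage sketch:

```c
#include <stdio.h>
#include <cblas.h>

int main(void) {
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    /* dsdot: accumulates in double and returns a double */
    double d = cblas_dsdot(4, x, 1, x, 1);
    /* sdsdot: accumulates in double, adds a scalar, returns a float */
    float s = cblas_sdsdot(4, 0.0f, x, 1, x, 1);
    printf("dsdot = %f, sdsdot = %f\n", d, s);
    return 0;
}
```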
Thanks for a pointer in what is IMO a good direction. I will try to make a PR with a minimal change to the odd-size wrapping, so as not to sabotage the good flow until reaching it. |
Unfortunately I'm not, and while I would love to dig a bit into it I'm a bit short on time now. Maybe later this week. |
I do not think any of the original netlib BLAS functions are called by OpenBLAS; what does get called, however, seems to be the generic C implementation (kernel/generic/dot.c), which has no inline assembly. As a starting point, the "#if defined(DSDOT)" mechanism could be copied from there, though the microkernels will most likely need changing to return a double-precision result as well. |
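A paraphrased sketch of that mechanism (from memory, not a verbatim copy of kernel/generic/dot.c): with DSDOT defined, both the products and the accumulator are promoted to double.

```c
#include <stddef.h>

double dot_generic(size_t n, const float *x, const float *y) {
#if defined(DSDOT)
    double dot = 0.0;                        /* double accumulator */
    for (size_t i = 0; i < n; i++)
        dot += (double)x[i] * (double)y[i];
#else
    float dot = 0.0f;                        /* plain single precision */
    for (size_t i = 0; i < n; i++)
        dot += x[i] * y[i];
#endif
    return (double)dot;
}
```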
You're right, I went down the wrong code path. But as you write, the implementation that is actually used is also just a C loop without optimized microkernels. Optimization there would be great. |
Quick check suggests that a naive implementation leaving the microkernels in their current state, but casting all the floats to double everywhere else, would be sufficient up to 2^29 + 31. |
That is why I called it a trivial implementation: it is expected to work for ranges of input such that the addition of any 32 values will not overflow or otherwise lead to a loss of precision. Better than the current state, but not applicable to the general case. As assembly programming is still new to me, my attempts at adding a ddot-like code path to the sdot microkernel have only led to "interesting" results so far. |
We don't lose much for the general case: if it overflows within a short range, the actual original values are marginal already, accuracy has probably been lost to the limited range earlier, and doubles are needed there anyway. |
Correct me if I'm wrong, but that's not what is happening with the current microkernel. My experiments with literally copying the microkernel code into a C file and calling it on an array show that it computes the dot of everything up to the last max. 31 terms. That is, there is a loop in the microkernel, not just a single dot over 32 floats, so any loss of precision that happens there cannot be fixed by casting outside of it. As @quickwritereader suggested, it should be possible to simplify the kernel to really only do 32 floats at a time, and then loop over the packets of 32 outside in C, without being significantly slower. The logic in that implementation would be as in the sketch after this comment.
Finally, [...] |
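A sketch of that proposed structure (all names are hypothetical, and the C stand-in replaces the real assembly):

```c
#include <stddef.h>

/* Stand-in for a microkernel that handles exactly one 32-float packet. */
static float sdot_kernel_32(const float *x, const float *y) {
    float s = 0.0f;
    for (int i = 0; i < 32; i++)
        s += x[i] * y[i];
    return s;
}

/* Outer loop in C: per-packet partial sums are accumulated in double,
   and the tail of up to 31 elements is handled in plain C. */
double sdot_outer(size_t n, const float *x, const float *y) {
    double dot = 0.0;
    size_t i = 0;
    for (; i + 32 <= n; i += 32)
        dot += (double)sdot_kernel_32(x + i, y + i);
    for (; i < n; i++)
        dot += (double)x[i] * (double)y[i];
    return dot;
}
```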
The microkernel is the assembly that is called inside the loop. |
@kohr-h you are right of course - the current code has the loop inside the microkernel, and my change as uploaded in the PR does not yet contain the added outer loop in the C code. I'll redo this when I am more awake... |
PR updated now to include the loop around the microkernel. Have not done any timings though. |
Nice 👍 |
It is not exactly 32 or anything fixed. It is a power of two that matches a single optimal memory fetch in compiled code; 32 is something like 4 or 8 of those. |
Trivial to "#ifdef" out the now-unused "loop" from the microkernel, but it is just one subtraction and a comparison so I suspect the performance gain will be negligible. |
I have merged my quick-and-dirty implementation (including the "#ifndef DSDOT" in the microkernel to get rid of one each of addq,subq,jz) now and am inclined to leave it at that... |
I just did a quick-and-dirty speed test. Overhead is about 1.9 % with 2^30 elements, compared to 8 % using manual chunking. For me that's sufficient, so unless folks here want to do something similar to [...]. When will the next release be? I will suggest the [...] |
Thanks for testing. FWIW, the next major release, according to xianyi's statement in #1245, was to be "about December 2017", while the next minor one has been stalled since the end of July. |
New PR expands the DSDOT hack to all x86_64 kernels that use the x86_64 sdot.c file - these are AMD Bulldozer/Steamroller/Piledriver/Excavator/Zen and Intel Nehalem/Sandybridge. The only recent kernel missing optimized DSDOT support here would now appear to be Atom (which has no optimized SDOT, but a pure assembly DDOT that could presumably be modified to take float input). The ARM64 sdot.S for Cortex-A57 and ThunderX2 T99 appears to have been written with DSDOT support in mind already, but that support appears to be currently unused @ashwinyes ? |
Nice! It would be great to have this in a release (#1258) so dependent packages can pick it up. So is the plan to keep [...]? |
When I compute a dot product of a vector of all ones with itself, the result should be the size of the vector. This starts to no longer hold at size 2^24 + 1 as the following program shows:
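A minimal sketch of such a program, reconstructed around the cblas_sdot call quoted earlier in the thread (the loop bounds are illustrative; the largest size needs roughly 4 GiB of memory):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    for (int p = 24; p <= 30; p++) {
        int size = (1 << p) + 1;             /* one past each power of two */
        float *a = malloc((size_t)size * sizeof(float));
        for (int i = 0; i < size; i++)
            a[i] = 1.0f;                     /* vector of all ones */
        float dot = cblas_sdot(size, a, 1, a, 1);
        printf("size = 2^%d + 1 = %d, dot = %.1f\n", p, size, dot);
        free(a);
    }
    return 0;
}
```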
I get [...]
and then, for sizes larger than 2^29 the result just stays constant at 2^29:
I'm using the latest develop version of OpenBLAS, and I statically link the library to not accidentally dynamically link an older version.
Some more info: [...]
See also this related Numpy issue