-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LAPACKE_sgesv crashes when run in row major format #717
Comments
As far as I know, the C-style LAPACKE_ wrappers for the LAPACK functions were a one-time contribution from Intel to the netlib lapack project - probably the result of some earlier private project of somebody they hired to work on MKL. Several bugs in them have been exposed lately, some of them related to the sizing of temporary arrays when working with row-major matrices. I suppose it would help if you could provide a minimal example, and/or eventually open a bug in the LAPACK developer forum at ics.cs.utk.edu/lapack-forum if you find that the bug still occurs with a current development snapshot of OpenBLAS (LAPACK-3.6.0 was released two weeks after 0.2.15 was, so you need to build a git snapshot of OpenBLAS to get it) |
Here is the minimal example. I'll try to build OpenBLAS...er, or not. I do not quite feel like going through the pain of installing all the Linux compilers just for this. I think I'll wait for the next release. Thanks for the reply. |
it looks like lapacke_sgesv_work.c (Lapack 3.6.0)miscaluclates size of transpose matrices starting line 48/49, then 64/69, then exceeds allocation in 75/76 |
Shades of #631 ? In any case closing this bug is probably not a good recipe for getting it fixed. |
Very well then, sorry about that. I'll give it a shot again when the new release comes out. |
Have not tried to bring your example into a form I can test on my system, but at least the lapacke_sgesv_row.c from the Intel MKL documentation found on software.intel.com runs flawlessly |
And #712 somehow rhymes too... |
Only recent change in lapacke_sgesv, lapacke_sgesv_work, lapacke_sge_isnan and lapacke_sge_trans appears to be a renaming of a local variable from matrix_order to matrix_layout. |
a is lda * n sized, transpose n * lda, actual code takes n * n |
The bug is in LAPACKE wrapper, that consists of few dozen hundreds of files taken from netlib, and let me bet 10c significant part of them miscalculates LAPACK (fortran call) result sizes... (It is really a hell to dig through to fix al of them) |
Not so fast. I see where things might blow up if lda is actually larger than n, but in the "minimal example" given above (and in the code I copied from the MKL documentation) the two are equal, hence this one should not crash. (Both lda_t and ldb_t get set to n, lda=5, n=5, ldb=3, nrhs=3 so if anything the b matrix has room to spare. Wrong testcase or different bug ?) |
That multi-gigabyte "unmanaged code" debugger could help some. I will count on my fingers to the end of sgesv_work tomorrow. |
I dont see more than this same potentially miscalculated sizes (which is correct in test case) aree used over and over again |
@mrakgr do you really get crashes with the exact code of the "minimal example" you provided, i.e. with a square matrix as input and LDA equal N, or did you change the dimensions as you tried to strip down the code ? |
lapacke_dgesv_work.c has same calculation in place. |
Several others as well probably - though the lapack folks seem to be making some inroads in this mostly thanks to the bug reports from btracey. Strange thing is that one of their developers expressed a desire to undo one such fix - the one for #631 - as he failed to make sense of the new code only a week after approving a related change elsewhere. And the automated spam filter on their phpBB forum is a total mess, made me drop most of the content from my post just to get any message across at all :-( |
@martin-frbg |
Strange. Unfortunately I cannot run the exact same code as I am trying this on Linux, but I see no functional difference between your example and the C one I am using. Just a wild idea - does the call to sgesvr() still crash if you comment out the preceding, successful call to sgesvc() ? (Just to exclude an effect like in #695, where data from an earlier function call somehow manage to slip into a later one- machine register not zeroed or something like that). It could be that the lapacke issue is just a red herring and the actual problem might be a Windows-specific stacksize problem similar to #697. In that case somebody needs to reproduce on Windows and include a backtrace before this can proceed. |
Yeah, it crashes even when I comment the successful call out. |
@mrakgr @martin-frbg , |
@xianyi The first difficulty I have is that when I try to install the mingw build it throws an error saying that it cannot download the repository file and aborts. This is roughly the kind of reason why I did an about face turn last time once I realized I would have to deal with Linux tools. I'll give rubenvb a try. ...Got all the tools running amazingly smoothly. The build went without a hitch. I tried the minimal example with the new libopenblas.dll, but it still crashes. I suppose now that I figured out how to build this, I could go a bit further if have any other ideas. |
I am not that familiar with Windows exception codes - does the exception number contain any hints to the actual error, i.e. illegal numerical operation or illegal memory access ? |
I am decently sure that in this case the error means memory corruption. Here is a direct quote from the Expert IL 2.0 Assembler book:
I tried setting arg_ldb to arg_nrhs, in other words 1, and it does not crash this time. Though I have to wonder, would an error even pop out if I tried running the function from native code and the memory got corrupted? I would guess not. Edit: It might be worth trying to run the function from C++. I do not suppose you have an example of how to call directly from a .dll, do you? |
Can't help with the dll call, but perhaps you could try running the test code I linked to (needs a few trivial changes to use the correct header instead of the mkl one) - that one has the expected result documented in its comments. So far we had looked only for sufficient dimensioning of temporary work arrays as I recall - perhaps the nrhs is used to determine an optimum stride for processing the array, and with ldb larger than life it falls off the end somewhere. (I cannot investigate this right now) |
Er, to be honest I am not sure how to run the example. I know how to program in C++, but I've never tried interfacing with libraries as complex as OpenBLAS. I only really went deeper into programming once I started working with F#. I do not think I'll be able to do it in Visual C++ at any rate. If you could give me more guidance I suppose I could give it a try, otherwise we might as well leave the issue for later. |
have you tried calling the fortran routines directly from C and comparing the result? i need some of these routines but if lapacke is buggy, it's a pain but i'll just build the fortran versions and use openmp. |
No, I haven't. I do not remember what it was, but inside the Diffsharp AD library the author (not me) used a bunch of tricks to make the solvers work even with transposed inputs. Due to that, getting the row major versions of the Lapacke functions was never a priority for either of us. I just noticed them while going through the documentation and decided to try them. I am a bit worried that memory corruption might go unnoticed in native code. If you decide to try building it yourself, it might be good to take an extra precaution and use some of these tools to check for it. |
The above is call stack after the access violation exception for the 0.2.16rc1 that I built, with the OPENBLAS_NUM_THREADS=1 environment variable added. I'll try the 0.2.15 binary next. |
Log:
Call stack:
The above is with the 0.2.15 binary. Should I also get gfortran? Here is the call to the function with the type annotations made explicit.
As F# is a statically typed language, changing the annotation to something else would give a compile time error. It is impossible for me to pass in a variable of a type other than what is declared near the top of the minimal example. |
Can you set OPENBLAS_NUM_THREADS=1 before running sample? |
It is supposed to be set to that. Is it using more than one thread? |
Here is the Threads table at the time of the access violation exception. I am not familiar with this so I have no idea if those threads are created by OpenBLAS or the .NET runtime. Edit: There is a function to set the number of threads manually I think. Should I try that? |
Unless you get any register in crash register file pointing finger to memory are where openblas.dll is loaded (best if EIP is there), problem does not look like related to openblas in any way. |
I would dump the registers, but I can't figure out how to copy them. Really, the only value which stands out before the access violation is this: It does not make sense as I using Windows 10 currently. I think I had the error on Windows 8 as well, but don't hold me on that. |
Hi, Best reagrds |
It looks more and more like corrupt .NET installation. Can you repeat crash of test case on other PC? |
It would be difficult, but I suppose I could give it a try. Give me a day or two. |
Although it was a massive pain, I found an old laptop, recompiled the program for .Net 3 and ran the example on it. As before, the thing crashes. Do you think the error could be in a .dll other than libopenblas.dll? This problem needs for somebody familiar with the library to try running my F# example on Windows. I am not sure I can help much here. |
I installed standalone F# compiler as here http://fsharp.org/use/windows/
Then this .net exception drops me into default debugger. Probably visual studio debug handler will help you to debug .NET code until you get executing openblas kernels contained in openblas dll (_NUM_THREADS=1 is just to make debug output cleaner) I cant help much more either. |
I've never had that exception thrown. I looked around for a bit and it seems to be related to 32/64 bit conflicts. I do not have much experience with the standalone compiler, so I can't help there, the most I can do is to tell you to download the 4GB VS Iso from Microsoft's site and install that. ...Though now that you've mentioned the VS debugger it occurs to me that I have a IL debugger that I installed just for a different issue I've investigated a while ago. I should have tried it before doing all of this, with it I'll be able to trace further than with high level code. Stay tuned... |
I've stepped through the file in the dnSpy debugger. The exception indeed occurs after the call to sgesv when the program tries to load a local variable arg_a (the pinned array), but the corruption occurs immediately after the call. I've redone the function so the order is passed as an additional parameter, but it does not matter either way. In the debugger, the local variable 10 (arg_a)'s fields and values disappear right after the call, which is unlike what happens when I pass in the column major as a parameter. So basically, I think we were both right in a way. I think OpenBLAS is corrupting the array in some way, but memory bugs being what they are, it is only detected in managed code. This thing is going to take some effort to fix if we want to go further, somebody will probably have to trace the call from my example all the way into Lapack code. This here is definite proof of memory corruption. |
I am still wondering if the "let mutable arg_ldb = n" in your example is correct, or if it should be arg_ldb=nrhs (though neither version crashed on my Linux machine, and it may well be just my ignorance showing). Would it make sense to rerun the test with the dimensions and matrix elements |
I suppose I'll try translating the example to F# then. It won't be hard. I'll have it done tomorrow. |
@martin-frbg You were right. I tested it on this example and it works fine. So...is this not a bug then? |
Next see what happens when you set ldb equal n instead of nrhs (which if I understand correctly might be the oversight in your problem case), does it then crash again ? (While it did not fail on Linux even when run under the valgrind debugger, maybe the memory layout of the array is different in your system environment ?) |
The result is incorrect, but apart from that, it is not crashing. |
Interesting insofar as it seems to match my results. What could be tried next, perhaps modify the b array so that it is a 5 by 1 matrix (or rather, vector) like in your crashing testcase to see if that is the corner case that fails with the (in my reasoning incorrect) ldb=n ? |
I tried it. The program runs to completion with nrhs=1 and nrhs=3, but crashes with nrhs set to 2,4 or 5. |
Did you vary nrhs or ldb ? (Experimenting with the value of nrhs could be opening a different can of worms I think) |
Ldb gets set to nrhs, so I varied the nrhs. Well, I expected it would crash if I went over the limit though. |
I must admit I have trouble wrapping my head around this at the moment, but I am not sure if varying nrhs with ldb set from it is equivalent to your original problem. There the value of nrhs was not in question (I think), but you had set ldb equal to n (the other dimension of B, fixed by the size of the square matrix A) where I thought (and the MKL example seems to suggest) it should be equal to nrhs. |
Honestly, I've been varying the nrhs for no good reason. No, actually I think I get now what the problem was. Here is the setup in the original example:
So in the above both lda and ldb (the leading dimension arguments) are set to n which is five. In the C example translation:
So in the translated example the leading dimension of b is set the same as in the C example which is 1, while in my own first example way up at the top the leading dimension of b is set to 5...which I am guessing is incorrect, which is why it is crashing. I think this might be it... Yeah, in the original example when I set arg_ldb to 1 it works even with random inputs. |
Yes, that was what I was getting at - I was just confused by the description in the comments of (LAPACK-without-"E") sgesv.f which state that LDB can be between 1 and N rather than the NRHS used in the MKL lapackE_sgesv example to signify "the entire B". |
@martin-frbg This is good, I was starting to get tired of this problem. I do not think I'll be able to help much from here. I am glad that I could help so far. I'll let you or somebody else decide when or whether to close this issue. |
Thank you for the cooperation. I have convinced myself that the issue is solved - basically you must have told lapacke_sgesv to transpose a bigger matrix than you handed it, and things went downhill from there. Catching the erroneous input in lapacke_sgesv does not seem feasible. |
@martin-frbg , thank you for your investigating this bug. @mrakgr , thank you for the feedback. I am going to close this issue. |
I am using 0.2.15 version of the library(27-10-2015), Windows 10, i7-4690k Intel processor. I am calling the function from F# Interactive and I am sure that I got the parameters right because it works perfectly in column major format. A while ago, I asked on the users group whether these functions are legacy.
I do not have a backtrace as I am calling into unmanaged code. It just throws an execution engine exception. Tell me if I can assist the debugging effort somehow.
The text was updated successfully, but these errors were encountered: