Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LAPACKE_sgesv crashes when run in row major format #717

Closed
mrakgr opened this issue Dec 13, 2015 · 67 comments
Closed

LAPACKE_sgesv crashes when run in row major format #717

mrakgr opened this issue Dec 13, 2015 · 67 comments

Comments

@mrakgr
Copy link

mrakgr commented Dec 13, 2015

I am using 0.2.15 version of the library(27-10-2015), Windows 10, i7-4690k Intel processor. I am calling the function from F# Interactive and I am sure that I got the parameters right because it works perfectly in column major format. A while ago, I asked on the users group whether these functions are legacy.

I do not have a backtrace as I am calling into unmanaged code. It just throws an execution engine exception. Tell me if I can assist the debugging effort somehow.

@martin-frbg
Copy link
Collaborator

As far as I know, the C-style LAPACKE_ wrappers for the LAPACK functions were a one-time contribution from Intel to the netlib lapack project - probably the result of some earlier private project of somebody they hired to work on MKL. Several bugs in them have been exposed lately, some of them related to the sizing of temporary arrays when working with row-major matrices. I suppose it would help if you could provide a minimal example, and/or eventually open a bug in the LAPACK developer forum at ics.cs.utk.edu/lapack-forum if you find that the bug still occurs with a current development snapshot of OpenBLAS (LAPACK-3.6.0 was released two weeks after 0.2.15 was, so you need to build a git snapshot of OpenBLAS to get it)

@mrakgr
Copy link
Author

mrakgr commented Dec 13, 2015

Here is the minimal example. I'll try to build OpenBLAS...er, or not. I do not quite feel like going through the pain of installing all the Linux compilers just for this.

I think I'll wait for the next release. Thanks for the reply.

@mrakgr mrakgr closed this as completed Dec 13, 2015
@brada4
Copy link
Contributor

brada4 commented Dec 13, 2015

it looks like lapacke_sgesv_work.c (Lapack 3.6.0)miscaluclates size of transpose matrices starting line 48/49, then 64/69, then exceeds allocation in 75/76

@martin-frbg
Copy link
Collaborator

Shades of #631 ? In any case closing this bug is probably not a good recipe for getting it fixed.

@mrakgr mrakgr reopened this Dec 13, 2015
@mrakgr
Copy link
Author

mrakgr commented Dec 13, 2015

Very well then, sorry about that. I'll give it a shot again when the new release comes out.

@martin-frbg
Copy link
Collaborator

Have not tried to bring your example into a form I can test on my system, but at least the lapacke_sgesv_row.c from the Intel MKL documentation found on software.intel.com runs flawlessly

@brada4
Copy link
Contributor

brada4 commented Dec 13, 2015

And #712 somehow rhymes too...

@martin-frbg
Copy link
Collaborator

Only recent change in lapacke_sgesv, lapacke_sgesv_work, lapacke_sge_isnan and lapacke_sge_trans appears to be a renaming of a local variable from matrix_order to matrix_layout.
(And I do not yet see a fault in the size calculation for or transposition of matrix b)

@brada4
Copy link
Contributor

brada4 commented Dec 14, 2015

a is lda * n sized, transpose n * lda, actual code takes n * n
b is ldb * nhrs, transpose nhrs * ldb, in code nhrs * n
lda and ldb are both defined as >=n

@brada4
Copy link
Contributor

brada4 commented Dec 14, 2015

The bug is in LAPACKE wrapper, that consists of few dozen hundreds of files taken from netlib, and let me bet 10c significant part of them miscalculates LAPACK (fortran call) result sizes... (It is really a hell to dig through to fix al of them)
Workaround would be to call proper smaller functions but with correct output buffer from your library.

@martin-frbg
Copy link
Collaborator

Not so fast. I see where things might blow up if lda is actually larger than n, but in the "minimal example" given above (and in the code I copied from the MKL documentation) the two are equal, hence this one should not crash. (Both lda_t and ldb_t get set to n, lda=5, n=5, ldb=3, nrhs=3 so if anything the b matrix has room to spare. Wrong testcase or different bug ?)

@brada4
Copy link
Contributor

brada4 commented Dec 14, 2015

That multi-gigabyte "unmanaged code" debugger could help some. I will count on my fingers to the end of sgesv_work tomorrow.

@brada4
Copy link
Contributor

brada4 commented Dec 15, 2015

I dont see more than this same potentially miscalculated sizes (which is correct in test case) aree used over and over again

@martin-frbg
Copy link
Collaborator

@mrakgr do you really get crashes with the exact code of the "minimal example" you provided, i.e. with a square matrix as input and LDA equal N, or did you change the dimensions as you tried to strip down the code ?

@brada4
Copy link
Contributor

brada4 commented Dec 19, 2015

lapacke_dgesv_work.c has same calculation in place.

@martin-frbg
Copy link
Collaborator

Several others as well probably - though the lapack folks seem to be making some inroads in this mostly thanks to the bug reports from btracey. Strange thing is that one of their developers expressed a desire to undo one such fix - the one for #631 - as he failed to make sense of the new code only a week after approving a related change elsewhere. And the automated spam filter on their phpBB forum is a total mess, made me drop most of the content from my post just to get any message across at all :-(

@mrakgr
Copy link
Author

mrakgr commented Dec 20, 2015

@martin-frbg
The above exact example with the square dimensions crashes. Just checked.

@martin-frbg
Copy link
Collaborator

Strange. Unfortunately I cannot run the exact same code as I am trying this on Linux, but I see no functional difference between your example and the C one I am using. Just a wild idea - does the call to sgesvr() still crash if you comment out the preceding, successful call to sgesvc() ? (Just to exclude an effect like in #695, where data from an earlier function call somehow manage to slip into a later one- machine register not zeroed or something like that). It could be that the lapacke issue is just a red herring and the actual problem might be a Windows-specific stacksize problem similar to #697. In that case somebody needs to reproduce on Windows and include a backtrace before this can proceed.

@mrakgr
Copy link
Author

mrakgr commented Dec 20, 2015

Yeah, it crashes even when I comment the successful call out.

@xianyi
Copy link
Collaborator

xianyi commented Feb 26, 2016

@mrakgr @martin-frbg ,
Sorry, I didn't notice this issue before.
Do you meet the same bug on 0.2.16.rc1?

@mrakgr
Copy link
Author

mrakgr commented Mar 2, 2016

@xianyi
Sorry, did not see your message before. When I worked on this a few months ago, I was busy, but I suppose I can give it another whack...

The first difficulty I have is that when I try to install the mingw build it throws an error saying that it cannot download the repository file and aborts. This is roughly the kind of reason why I did an about face turn last time once I realized I would have to deal with Linux tools. I'll give rubenvb a try.

...Got all the tools running amazingly smoothly.

The build went without a hitch. I tried the minimal example with the new libopenblas.dll, but it still crashes.

I suppose now that I figured out how to build this, I could go a bit further if have any other ideas.

@martin-frbg
Copy link
Collaborator

I am not that familiar with Windows exception codes - does the exception number contain any hints to the actual error, i.e. illegal numerical operation or illegal memory access ?
(And picking up from the earlier hunt, could you try setting "ldb" equal nrhs instead of equal n in your minimal example ? At least according to my reading of the lapack code it should not matter memory-wise, but perhaps the extra dimension introduces some other problem).
For reference the C demo code from the MKL documentation that I used for my tests was
https://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_lapack_examples/lapacke_sgesv_row.c.htm

@mrakgr
Copy link
Author

mrakgr commented Mar 2, 2016

I am decently sure that in this case the error means memory corruption. Here is a direct quote from the Expert IL 2.0 Assembler book:

  • ExecutionEngineException. This is the generic exception, indicating that some sequence of IL instructions has brought the execution engine into a state of complete perplexity—as a rule, by corrupting the memory. Verifiable code cannot corrupt the memory and hence does not raise exceptions of this type.

I tried setting arg_ldb to arg_nrhs, in other words 1, and it does not crash this time. Though I have to wonder, would an error even pop out if I tried running the function from native code and the memory got corrupted? I would guess not.

Edit: It might be worth trying to run the function from C++.

I do not suppose you have an example of how to call directly from a .dll, do you?

@martin-frbg
Copy link
Collaborator

Can't help with the dll call, but perhaps you could try running the test code I linked to (needs a few trivial changes to use the correct header instead of the mkl one) - that one has the expected result documented in its comments. So far we had looked only for sufficient dimensioning of temporary work arrays as I recall - perhaps the nrhs is used to determine an optimum stride for processing the array, and with ldb larger than life it falls off the end somewhere. (I cannot investigate this right now)

@mrakgr
Copy link
Author

mrakgr commented Mar 2, 2016

Er, to be honest I am not sure how to run the example. I know how to program in C++, but I've never tried interfacing with libraries as complex as OpenBLAS. I only really went deeper into programming once I started working with F#. I do not think I'll be able to do it in Visual C++ at any rate. If you could give me more guidance I suppose I could give it a try, otherwise we might as well leave the issue for later.

@gaming-hacker
Copy link

have you tried calling the fortran routines directly from C and comparing the result? i need some of these routines but if lapacke is buggy, it's a pain but i'll just build the fortran versions and use openmp.

@mrakgr
Copy link
Author

mrakgr commented Mar 4, 2016

No, I haven't. I do not remember what it was, but inside the Diffsharp AD library the author (not me) used a bunch of tricks to make the solvers work even with transposed inputs. Due to that, getting the row major versions of the Lapacke functions was never a priority for either of us. I just noticed them while going through the documentation and decided to try them.

I am a bit worried that memory corruption might go unnoticed in native code. If you decide to try building it yourself, it might be good to take an extra precaution and use some of these tools to check for it.

@mrakgr
Copy link
Author

mrakgr commented Mar 7, 2016

Address          To               From             Comment                                                                                                                                   
000000E2F9DFEC40 00007FFDD2902CF1 00007FFE31F6139C return to 00007FFDD2902CF1 from clr.00007FFE31F6139C
000000E2F9DFEC48 00000294D953A880 00007FFDD2902CF1 return to 00000294D953A880 from 00007FFDD2902CF1
000000E2F9DFEC50 00000294D953A880 00000294D953A880 return to 00000294D953A880 from 00000294D953A880
000000E2F9DFEC58 00000294D7B317B0 00000294D953A880 return to 00000294D7B317B0 from 00000294D953A880
000000E2F9DFEC60 0000000000000000 00000294D7B317B0 return to 0000000000000000 from 00000294D7B317B0

The above is call stack after the access violation exception for the 0.2.16rc1 that I built, with the OPENBLAS_NUM_THREADS=1 environment variable added. I'll try the 0.2.15 binary next.

@mrakgr
Copy link
Author

mrakgr commented Mar 7, 2016

Log:

...
DLL Loaded: 00007FFE2A460000 C:\Windows\Microsoft.NET\Framework64\v4.0.30319\clrjit.dll
DLL Loaded: 00007FFE46450000 C:\Windows\System32\oleaut32.dll
DLL Loaded: 00007FFE29280000 C:\Windows\Microsoft.NET\Framework64\v4.0.30319\diasymreader.dll
TLS Callbacks: 2
2 invalid TLS callback addresses...
DLL Loaded: 000000006D7C0000 E:\Downloads\OpenBLAS-v0.2.15-Win64-int32\bin\libopenblas.dll
TLS Callbacks: 2
2 invalid TLS callback addresses...
DLL Loaded: 0000000000560000 C:\x86_64-w64-mingw32-gcc-4.8.0-win64_rubenvb\mingw64\bin\libgfortran-3.dll
TLS Callbacks: 2
2 invalid TLS callback addresses...
DLL Loaded: 0000000061440000 C:\x86_64-w64-mingw32-gcc-4.8.0-win64_rubenvb\mingw64\bin\libgcc_s_seh-1.dll
TLS Callbacks: 2
2 invalid TLS callback addresses...
DLL Loaded: 000000006CF00000 C:\x86_64-w64-mingw32-gcc-4.8.0-win64_rubenvb\mingw64\bin\libquadmath-0.dll
First chance exception on 00007FFE31F6139C (C0000005, EXCEPTION_ACCESS_VIOLATION)!

Call stack:

Address          To               From             Comment                                                                                                                                   
000000DD7F5FED60 00007FFDD2932CF1 00007FFE31F6139C return to 00007FFDD2932CF1 from clr.00007FFE31F6139C
000000DD7F5FED68 000002A15912A880 00007FFDD2932CF1 return to 000002A15912A880 from 00007FFDD2932CF1
000000DD7F5FED70 000002A15912A880 000002A15912A880 return to 000002A15912A880 from 000002A15912A880
000000DD7F5FED78 000002A1576D17B0 000002A15912A880 return to 000002A1576D17B0 from 000002A15912A880
000000DD7F5FED80 0000000000000000 000002A1576D17B0 return to 0000000000000000 from 000002A1576D17B0

The above is with the 0.2.15 binary. Should I also get gfortran?

Here is the call to the function with the type annotations made explicit.

let arg_info = LAPACKE_sgesv((arg_order : int), (arg_n : int), (arg_nrhs : int), (arg_a.Ptr : nativeptr<float32>), (arg_lda : int), (arg_ipiv.Ptr : nativeptr<int>), (arg_b.Ptr : nativeptr<float32>), (arg_ldb : int))
[<SuppressUnmanagedCodeSecurity>]
[<DllImport(blas_path)>]
extern int LAPACKE_sgesv(int order, int n, int nrhs, float32 *a, int lda, int *ipiv, float32 *b, int ldb)

As F# is a statically typed language, changing the annotation to something else would give a compile time error. It is impossible for me to pass in a variable of a type other than what is declared near the top of the minimal example.

@brada4
Copy link
Contributor

brada4 commented Mar 7, 2016

Can you set OPENBLAS_NUM_THREADS=1 before running sample?
So far none of addresses fall in openblas.dll memory space.

@mrakgr
Copy link
Author

mrakgr commented Mar 7, 2016

It is supposed to be set to that. Is it using more than one thread?

@mrakgr
Copy link
Author

mrakgr commented Mar 7, 2016

Number   ID       Entry            TEB              RIP              Suspend Count  Priority     Wait Reason  Last Error  Name                                                                
5        20F0     00007FFE31F6EB18 00000050C76DE000 00007FFE490F5C24 0              Highest      Executive    00000000    
4        20BC     00007FFE320468A4 00000050C76DC000 00007FFE490F5C24 0              Normal       Executive    00000000    
Main     1A18     00000000004049DE 00000050C76D4000 00007FFE31F6139C 0              Normal       Executive    0000047E    Main Thread
1        1238     00007FFE4907B350 00000050C76D6000 00007FFE490F87E4 0              Normal       Executive    00000000    
2        B9C      00007FFE4907B350 00000050C76D8000 00007FFE490F87E4 0              Normal       Executive    00000000    
3        2078     00007FFE4907B350 00000050C76DA000 00007FFE490F87E4 0              Normal       Executive    00000000    

Here is the Threads table at the time of the access violation exception. I am not familiar with this so I have no idea if those threads are created by OpenBLAS or the .NET runtime.

Edit: There is a function to set the number of threads manually I think. Should I try that?

@brada4
Copy link
Contributor

brada4 commented Mar 7, 2016

Unless you get any register in crash register file pointing finger to memory are where openblas.dll is loaded (best if EIP is there), problem does not look like related to openblas in any way.

@mrakgr
Copy link
Author

mrakgr commented Mar 7, 2016

I would dump the registers, but I can't figure out how to copy them.

Really, the only value which stands out before the access violation is this:
LastError 0000047E (ERROR_OLD_WIN_VERSION)

It does not make sense as I using Windows 10 currently. I think I had the error on Windows 8 as well, but don't hold me on that.

@wernsaar
Copy link
Contributor

wernsaar commented Mar 8, 2016

Hi,
Yesterday I fixed some bugs for LAPACK and LAPACKE, but not
bug 148 and bug 149, because these fixes are wrong.

Best reagrds
Werner

@brada4
Copy link
Contributor

brada4 commented Mar 8, 2016

It looks more and more like corrupt .NET installation. Can you repeat crash of test case on other PC?

@mrakgr
Copy link
Author

mrakgr commented Mar 8, 2016

It would be difficult, but I suppose I could give it a try. Give me a day or two.

@mrakgr
Copy link
Author

mrakgr commented Mar 8, 2016

Although it was a massive pain, I found an old laptop, recompiled the program for .Net 3 and ran the example on it. As before, the thing crashes. Do you think the error could be in a .dll other than libopenblas.dll?

This problem needs for somebody familiar with the library to try running my F# example on Windows. I am not sure I can help much here.

@brada4
Copy link
Contributor

brada4 commented Mar 8, 2016

I installed standalone F# compiler as here http://fsharp.org/use/windows/
And no matter I compile 32bit binary or 64bit binary (i tried all possible "platforms") with 32bit 64/32bit or 64/64bit openblas DLL result is same:

Unhandled Exception: System.BadImageFormatException: An attempt was made to load a program with an incorrect format. (Exception from HRESULT: 0x8007000B)
   at Lapacke_sgesv_dump.LAPACKE_sgesv(Int32 order, Int32 n, Int32 nrhs, Single* a, Int32 lda, Int32
* ipiv, Single* b, Int32 ldb)
   at Lapacke_sgesv_dump.sgesvc(Single[,] a, Single[] b)
   at <StartupCode$lapacke_sgesv_dump>.$Lapacke_sgesv_dump$fsx.main@()

Then this .net exception drops me into default debugger. Probably visual studio debug handler will help you to debug .NET code until you get executing openblas kernels contained in openblas dll (_NUM_THREADS=1 is just to make debug output cleaner)

I cant help much more either.

@mrakgr
Copy link
Author

mrakgr commented Mar 8, 2016

I've never had that exception thrown. I looked around for a bit and it seems to be related to 32/64 bit conflicts. I do not have much experience with the standalone compiler, so I can't help there, the most I can do is to tell you to download the 4GB VS Iso from Microsoft's site and install that.

...Though now that you've mentioned the VS debugger it occurs to me that I have a IL debugger that I installed just for a different issue I've investigated a while ago. I should have tried it before doing all of this, with it I'll be able to trace further than with high level code. Stay tuned...

@mrakgr
Copy link
Author

mrakgr commented Mar 8, 2016

I've stepped through the file in the dnSpy debugger. The exception indeed occurs after the call to sgesv when the program tries to load a local variable arg_a (the pinned array), but the corruption occurs immediately after the call. I've redone the function so the order is passed as an additional parameter, but it does not matter either way.

In the debugger, the local variable 10 (arg_a)'s fields and values disappear right after the call, which is unlike what happens when I pass in the column major as a parameter.

So basically, I think we were both right in a way. I think OpenBLAS is corrupting the array in some way, but memory bugs being what they are, it is only detected in managed code.

This thing is going to take some effort to fix if we want to go further, somebody will probably have to trace the call from my example all the way into Lapack code. This here is definite proof of memory corruption.

@martin-frbg
Copy link
Collaborator

I am still wondering if the "let mutable arg_ldb = n" in your example is correct, or if it should be arg_ldb=nrhs (though neither version crashed on my Linux machine, and it may well be just my ignorance showing). Would it make sense to rerun the test with the dimensions and matrix elements
taken from the MKL example I linked to above (which might have the added benefit of a well-documented correct result, compared to the random values in the F# code example) ?

@mrakgr
Copy link
Author

mrakgr commented Mar 8, 2016

I suppose I'll try translating the example to F# then. It won't be hard. I'll have it done tomorrow.

@mrakgr
Copy link
Author

mrakgr commented Mar 9, 2016

@martin-frbg You were right. I tested it on this example and it works fine. So...is this not a bug then?

@martin-frbg
Copy link
Collaborator

Next see what happens when you set ldb equal n instead of nrhs (which if I understand correctly might be the oversight in your problem case), does it then crash again ? (While it did not fail on Linux even when run under the valgrind debugger, maybe the memory layout of the array is different in your system environment ?)

@mrakgr
Copy link
Author

mrakgr commented Mar 9, 2016

ldb = n
[[0.634240746f; 0.775075555f; 1.38294423f]
[6.19000006f; 4.0f; 0.126142904f]
[0.206646755f; -0.101255469f; -4.57000017f]
[-7.57000017f; 0.0010392624f; 1.21490288f]
[0.645664513f; 2.8599999f; 8.98999977f]]

ldb = nrhs // Same as C example solution.
[[-0.800714135f; -0.389621407f; 0.955464423f]
[-0.695243657f; -0.554427147f; 0.220659524f]
[0.593914866f; 0.842227399f; 1.90063655f]
[1.32172525f; -0.103801742f; 5.35766125f]
[0.565755785f; 0.105711028f; 4.04060221f]]

The result is incorrect, but apart from that, it is not crashing.

@martin-frbg
Copy link
Collaborator

Interesting insofar as it seems to match my results. What could be tried next, perhaps modify the b array so that it is a 5 by 1 matrix (or rather, vector) like in your crashing testcase to see if that is the corner case that fails with the (in my reasoning incorrect) ldb=n ?
But at least it is probably safe to say now that there is no general problem with lapacke_sgesv or calling it from F# code, and no problem with your dlls or .NET installation.

@mrakgr
Copy link
Author

mrakgr commented Mar 9, 2016

I tried it. The program runs to completion with nrhs=1 and nrhs=3, but crashes with nrhs set to 2,4 or 5.

@martin-frbg
Copy link
Collaborator

Did you vary nrhs or ldb ? (Experimenting with the value of nrhs could be opening a different can of worms I think)

@mrakgr
Copy link
Author

mrakgr commented Mar 9, 2016

Ldb gets set to nrhs, so I varied the nrhs. Well, I expected it would crash if I went over the limit though.

@martin-frbg
Copy link
Collaborator

I must admit I have trouble wrapping my head around this at the moment, but I am not sure if varying nrhs with ldb set from it is equivalent to your original problem. There the value of nrhs was not in question (I think), but you had set ldb equal to n (the other dimension of B, fixed by the size of the square matrix A) where I thought (and the MKL example seems to suggest) it should be equal to nrhs.

@mrakgr
Copy link
Author

mrakgr commented Mar 9, 2016

Honestly, I've been varying the nrhs for no good reason.

No, actually I think I get now what the problem was.

Here is the setup in the original example:

let a = Array2D.init 5 5 (fun _ _ -> rng.NextDouble() |> float32)
let b = Array.init 5 (fun _ -> rng.NextDouble() |> float32)

    let m = Array2D.length1 a // 5
    let n = Array2D.length2 a // 5
    let a' = Array2D.copy a
    let b' = Array.copy b
    let ipiv = Array.zeroCreate n
    let mutable arg_order = LapackeOrder.RowMajor |> int
    let mutable arg_n = n // 5
    let mutable arg_nrhs = 1
    let mutable arg_lda = n // 5
    let mutable arg_ldb = n // 5

So in the above both lda and ldb (the leading dimension arguments) are set to n which is five.

In the C example translation:

let a = [|
        6.80f; -6.05f; -0.45f;  8.32f; -9.67f;
        -2.11f; -3.30f;  2.58f;  2.71f; -5.14f;
        5.66f; 5.36f; -2.70f;  4.35f; -7.26f;
        5.97f; -4.44f;  0.27f; -7.17f; 6.08f;
        8.23f; 1.08f;  9.04f;  2.14f; -6.87f|]
        |> fun x ->
            Array2D.init 5 5 (fun r c -> x.[r*5+c])

let b = [|4.02f;
          6.19f;
          -8.22f;
          -7.57f;
          -3.03f;|]

// So a here is of dimension 5,5 and b is 5

    let n = 5
    let nrhs = 1
    let lda = n
    let ldb = nrhs
    let a' = Array2D.copy a
    let b' = Array.copy b
    let ipiv = Array.zeroCreate n
    let mutable arg_order = order
    let mutable arg_n = n // 5
    let mutable arg_nrhs = nrhs
    let mutable arg_lda = lda // 5
    let mutable arg_ldb = ldb // 1

So in the translated example the leading dimension of b is set the same as in the C example which is 1, while in my own first example way up at the top the leading dimension of b is set to 5...which I am guessing is incorrect, which is why it is crashing.

I think this might be it... Yeah, in the original example when I set arg_ldb to 1 it works even with random inputs.

@martin-frbg
Copy link
Collaborator

Yes, that was what I was getting at - I was just confused by the description in the comments of (LAPACK-without-"E") sgesv.f which state that LDB can be between 1 and N rather than the NRHS used in the MKL lapackE_sgesv example to signify "the entire B".
Apparently the LAPACKE interface (which I never used) differs from regular LAPACK in this regard - but only in the row-major case that is a LAPACKE specialty anyway. And on further digging our "ldb" input to lapacke_sgesv in the row-major case is actually NOT the "ldb" that lapacke_sgesv_work uses to call the actual LAPACK function: When called with "row major", it does some transposing magic to shuffle the input matrices so that they conform to the column major order that sgesv supports exclusively. Only when the input signals column-major operation are the parameters passed through directly, else the now-misleading "ldb" is used to define the subset of nrhs to pull data from for building the temporary transposed matrix, and a new "ldb" is calculated internally for the sgesv call.
Phew. But at least no bug in OpenBLAS or LAPACK, just a peculiarity that your code stumbled over.

@mrakgr
Copy link
Author

mrakgr commented Mar 10, 2016

@martin-frbg This is good, I was starting to get tired of this problem. I do not think I'll be able to help much from here. I am glad that I could help so far.

I'll let you or somebody else decide when or whether to close this issue.

@martin-frbg
Copy link
Collaborator

Thank you for the cooperation. I have convinced myself that the issue is solved - basically you must have told lapacke_sgesv to transpose a bigger matrix than you handed it, and things went downhill from there. Catching the erroneous input in lapacke_sgesv does not seem feasible.
I do not have the priviledges to close this issue though - only you as the original reporter, or one of the developers can. So I guess the final decision rests with @xianyi

@xianyi
Copy link
Collaborator

xianyi commented Mar 11, 2016

@martin-frbg , thank you for your investigating this bug. @mrakgr , thank you for the feedback. I am going to close this issue.

@xianyi xianyi closed this as completed Mar 11, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants