readdlm test failure on 64-bit Fedora with LLVM 3.4 #6757
Cc @tanmaykm |
Strange... Is the |
@tanmaykm As I said, I'm not able to reproduce the bug on my machine, only on the Fedora build VM. Could you suggest a few commands I could add so that the needed debug info is printed when building the package? |
Sure. Start the REPL in the test folder:
And run the following in the REPL:
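A minimal sketch of the idea, assuming the standard julia/test layout and that the readdlm tests live in test/readdlm.jl:

```julia
# Sketch only: with the REPL started inside julia/test, re-run just the
# readdlm tests in isolation.
using Base.Test        # the test files use the @test macros from here
include("readdlm.jl")
```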
|
@tanmaykm I can't reproduce the bug at the REPL. It would be nice if you could think of a few things that could fail, so that I can put a series of debugging statements in my RPM package and get the logs after the build runs. I'll test that the file exists, but I can't see any reason why it could be missing, so I'd like to test a few other possibilities if you have ideas (each build takes some time to start). |
A complete traceback of the exception would help identify the source line where |
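A minimal sketch of a wrapper that could go into the RPM build's test step to capture such a traceback (the file name and output stream are assumptions):

```julia
# Run the failing test under try/catch so the full backtrace ends up in the build log.
try
    include("readdlm.jl")
catch err
    showerror(STDERR, err)
    println(STDERR)
    Base.show_backtrace(STDERR, catch_backtrace())
    println(STDERR)
    rethrow(err)
end
```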
OK, got it. I think this is related to the fact that I build the RPM package against LLVM 3.4. Unfortunately, that's the only available version in Fedora 20. I've eventually simplified the command to this:
And I'm able to reproduce this locally when using LLVM 3.4. Do you think it's worth trying to fix? Without support for this version I'm not sure I'll be able to package Julia for Fedora -- though I need to investigate this issue, since in the long term it may prove problematic. |
I use llvm 3.4 on Mac and haven't seen this failure. |
That's weird. Going back to 3.3 (where possible, it's a little hackish on Fedora 20 now) clearly fixed the problem. |
Any ideas about how I could debug the problem further? |
I can reproduce on Ubuntu 13.10. Here's a backtrace (using #6737)
|
Cool! :-) |
I was able to replicate this on an Ubuntu 13.10 machine as well. Also observed that adding debug statements in the method shifts the problem around. Looks like some corruption? |
Was that in llvm 3.4 or 3.3? |
llvm 3.4 |
Also happens in 3.5. |
Removing 0.3 milestone, as 0.3 uses LLVM 3.3 only. We can reprioritize if this turns out to happen with LLVM 3.3 as well. |
Yet it's going to prevent me from packaging 0.3 in Fedora. Not saying it should be a blocker, but it's still relatively annoying. |
Is that because Fedora will only support LLVM 3.4? They really shouldn't do that. Different versions of LLVM are in general quite incompatible. |
Yeah, that's something I'm going to discuss with them. LLVM maintainers recently updated to 3.4, and for now I seem to be forced to follow this change. |
That is a bummer if we can't be in the next fedora release. |
I've just asked LLVM Fedora maintainers about this problem: https://bugzilla.redhat.com/show_bug.cgi?id=1098534 |
If the fix does not require major work, it would be really nice if we can fix this issue for LLVM 3.4 for the 0.3 release. |
Yeah, that and all the other bugs we don't yet know about. |
I am sure there are others. Let's see what the Fedora LLVM maintainers say. Otherwise, IIUC, we could be in the fedora-updates repository. |
Yeah, there may well be other bugs hidden somewhere... At least the tests should catch the most important ones. @ViralBShah fedora-updates is the normal place where new packages appear, and being there is really not a problem. The question is rather: can Julia be included at all? But we'll likely figure out a solution. |
I completely understand you not wanting to maintain LLVM 3.3 for Debian, but declaring a version that is less than a year old completely dead is kind of strange. We may just have to declare that LLVM is part of the Julia source and not even pretend that linking against it as a shared library provided by the distro is going to work. |
@StefanKarpinski The decision that LLVM 3.3 is dead was taken by the LLVM developers, not by Debian. Embedding a copy of LLVM 3.3 in Julia will not solve the issue, at least for Debian, for reasons explained above. |
I understand that the constraints of a distro like Debian are different than yours. And at a personal level, I cannot do much about the decision to remove LLVM 3.3 (I tried to argue with the Debian maintainer), and I cannot embed a copy of LLVM 3.3 in the Julia package because this completely goes against the spirit of a distro and could create problems with other parties (such as the security team). My current plan is to ship Julia 0.3 with LLVM 3.5. But if this is something that you definitely don't want to see happen because it could give a bad image of Julia, then the only option left is to not have Julia in the next Debian release. This would be sad, but this is not the end of the world either. |
Having a buggy version of Julia in the distribution is probably better and easier to fix later than no version at all, though that's up to Stefan, Jeff, Viral etc to decide. Sounds like this maintenance policy question really needs to be raised on the LLVM list, as more and more downstream projects depend on and distributions need to include specific older versions, and this is starting to interact poorly with LLVM's fast break-everything development style (which has other benefits, but gets in the way here). I know Rust has their own set of problems wrt distributions and bootstrapping, but what happens when Rust hits 1.0 for example? They're probably going to need to support a fixed version of LLVM for a while. |
I wasn't under the impression that Debian liked to ship buggy versions of things, but apparently they prefer that to allowing us to link to, you know, our dependencies. We could stand to get a better sense of how well julia 0.3 works with LLVM 3.5. We know there is a failing test, but we should try commenting out that test and see if everything else passes at least. |
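For what it's worth, a rough sketch of one way to do that from the test directory, assuming the 0.3-era layout where each test lives in its own file (the excluded file names are assumptions):

```julia
# Include every test file in julia/test except the known-bad readdlm.jl
# and the harness files, so the rest of the suite still gets exercised.
for f in sort(readdir())
    endswith(f, ".jl") || continue
    f in ("readdlm.jl", "runtests.jl", "testdefs.jl") && continue
    println("running ", f)
    include(f)
end
```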
I'm currently discussing the same issue for Fedora. I may be allowed to use LLVM 3.3 for the next release, but that's not certain yet. As more out-of-tree compilers like Julia and Rust rely on LLVM, it seems they really should review their release policy. |
@JeffBezanson On my 64-bit machine, I can confirm that |
That's nice. Keno and several others have been going above & beyond to try to keep julia working with LLVM 3.5 reasonably well. |
3.5 has OldJIT still... we could switch that back on with 3.5, although
|
Has anybody tested the performance of the old JIT vs MCJIT? I remember reading that the old JIT was faster. Is it really an issue to have a package with a statically linked LLVM 3.3? How do we do it with libuv? Isn't our version patched so that we cannot use upstream? |
I'll have another go at the readdlm thing next week. Hopefully I can figure it out in time for the 3.5 release. |
@tknopp Indeed, libuv is included as an embedded copy in the Debian package. But first, this is supposed to be a temporary situation (see JuliaLang/libuv#2). And second, libuv is a much smaller codebase than LLVM, which means it is much more manageable to maintain a forked embedded copy of libuv than one of LLVM. |
I've started to run some experiments on which IR passes cause this in order to try to identify the IR pattern that gets miscompiled. First results:
Encouraging, isn't it? ;) EDIT: |
And in reverse:
|
Seems to be a problem with LLVM's Stack Slot coloring:
|
Specifically, stack slot sharing:
Feels like I'm getting closer :) |
Here's a debug dump from the stack coloring pass:
The second one is with sharing disabled. As you can see the first one deletes two stack slots. Of course this isn't necessarily a problem with the stack coloring pass. It might also be a problem with the analysis pass that it uses. |
@ihnorton Why were you thinking the tuple is passed on the stack? From looking at the generated code, the callee is expecting it in xmm. |
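A quick way to check that kind of thing is to inspect the generated code directly; here is a hedged sketch with a stand-in function rather than the actual callee, using the 0.3-era tuple-type notation:

```julia
# Stand-in only: see whether a two-element Int tuple argument arrives in an xmm
# register or via a stack slot on this build.
f(dims) = dims[1] + dims[2]
code_native(f, ((Int, Int),))   # look for moves from %xmm0 vs. stack loads in the prologue
```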
As far as I can tell, the stack coloring transformation is correct, so I wonder what's going on. |
This issue is quite subtle, and originally I had assumed it was a bug in LLVM, because it heavily depends on which optimizations are enabled, but it turns out that it's actually a missing annotation in our IR. Before I describe what the underlying issue is: there is a description of a similar problem in the commit by @isanbard from when this was originally found (in Clang, I presume) at [1].

The dlm_fill function has an interesting pattern where, when it can't parse the given data with the requested type, it will try again by calling itself with a more generic type. This and the call to store_cell, however, go through dynamic dispatch, so the dims tuple that is passed in unboxed needs to be reboxed for dynamic dispatch.

The problem seen in the original issue stemmed from the fact that the dims tuple, which was originally passed in as (2,2), suddenly became (1,2) for the second call. This happened by the following mechanism: dims=(2,2) gets passed in via XMM0 and gets almost immediately spilled to the stack. Sometime before the generic call to store_cell (and potentially earlier functions), dims is reloaded from the stack slot in order to extract the components and box them individually for the generic call. At this point LLVM thinks that there is no use for the original stack slot in which dims was stored anymore and proceeds to reuse it (the optimization pass responsible for this is StackSlotColoring). We now get to store_cell, which throws an exception that longjmps us back to almost the beginning of dlm_fill, where dims was still in the stack slot. However, said stack slot has been overwritten, thus causing the corrupted dims tuple in the second go-around.

The solution to this on the LLVM side is shown in [1] below, i.e. simply disabling the StackSlotColoring pass. However, this check did not fire on our IR, because we weren't marking the call site as returns_twice. For better or for worse, LLVM requires most attributes to be duplicated on the call site as well as on the callee (where we had already set this attribute).

[1] llvm-mirror/llvm@5edfbdc
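To make the shape of that code concrete, here is a heavily stripped-down sketch of the pattern described above; the names and bodies are illustrative rather than the actual Base implementation:

```julia
# Illustrative only: the retry-with-a-more-generic-element-type pattern.
function fill_cells(cells, data, dims)
    try
        for v in data
            store_cell(cells, v, dims)   # generic call: the components of dims are
                                         # boxed here, after which LLVM considers the
                                         # stack slot holding dims dead and reuses it
                                         # (StackSlotColoring).
        end
        return cells
    catch
        # store_cell threw; the longjmp behind Julia's exception handling lands us
        # back here, where dims is reloaded from a stack slot that has already been
        # overwritten, because the setjmp-like call was not marked returns_twice at
        # the call site. Hence the corrupted dims on the retry below.
        return fill_cells(similar(cells, Any), data, dims)
    end
end
```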
Wow, nice sleuthing! (I missed your question last night...)
limping through the disassembly, more or less. |
Also, for posterity and general enjoyment, here's the code I wrote to track this down: https://gist.github.com/Keno/10a0c57dbb266e46df84 |
I see you use [...]. The application of Julia metaprogramming to C++ here is absolutely mind-boggling. |
Whoa. When you go bug-hunting, you carry some heavy weaponry! Poor things don't stand a chance. |
Yeah, I love the
|
This is nuts. @Keno, you've outdone yourself. |
I see this failure when running tests in my RPM package in 64-bit mode on a Fedora build machine. This is with git master as of today. Funnily enough, the test passes on my machine; could it be that it's been fixed in the last few hours?