Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: fatal error: malloc deadlock with Go 1.5.1 #13143

Closed
fjl opened this issue Nov 4, 2015 · 16 comments
Closed

runtime: fatal error: malloc deadlock with Go 1.5.1 #13143

fjl opened this issue Nov 4, 2015 · 16 comments

Comments

@fjl
Copy link

fjl commented Nov 4, 2015

A go-ethereum user reported a crash with the following stack trace:
https://gist.github.com/CJentzsch/31aab5b31fae6805f893

The issue occurred on linux/arm with go-ethereum built using Go 1.5.1.
The binary in the trace can be downloaded here.
It does not seem to be reproducible, but looks similar to #10469 (x/mobile issue on android/arm).

@karalabe
Copy link
Contributor

karalabe commented Nov 4, 2015

Just to add to the above report, although this is the first dump we manage to capture (was running on an Raspberry Pi), we've also seen two eerily similar scenarios (no crash dumps were generated for those so we couldn't place them), where our code crashed after processing the same amount of data. One of them was an Android cross compilation running on on Galaxy Nexus phone (direct cross compile, not through gomobile), and another was an ARM cross compilation running on the Tizen platform.

@ianlancetaylor ianlancetaylor added this to the Go1.5.2 milestone Nov 4, 2015
@ianlancetaylor
Copy link
Member

CC @aclements @dvyukov @rsc

@dvyukov
Copy link
Member

dvyukov commented Nov 4, 2015

The real bug here is this:

runtime.panicindex()
    /usr/local/go/src/runtime/panic.go:12 +0x48
runtime.mHeap_Grow(0xdef668, 0x8, 0x0)
    /usr/local/go/src/runtime/mheap.go:647 +0x25c
runtime.mHeap_AllocSpanLocked(0xdef668, 0x1, 0x0)
    /usr/local/go/src/runtime/mheap.go:532 +0x6c8

I suspect it can happen on OOM, some of the paths in mHeap_SysAlloc are not well tested.
I've mailed a fix a potential cause in cl/16622.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/16622 mentions this issue.

@aclements
Copy link
Member

I have now proven both by hand and using an automated theorem prover* that either:

  1. this index out of bounds is impossible,
  2. the traceback is lying about the npage argument to mHeap_Grow,
  3. there was some horrible memory corruption, or
  4. there are bugs in both of my proofs.

The automated proof did, however, uncover a wrap-around bug where a very large allocation could cause an out of bounds index rather than rejecting the allocation. That would suggest case 2. Assuming I can reproduce this in real code, the fix is trivial.

@fjl, @karalabe, is it possible your code attempted a particularly large allocation (like over 3GB)? Also, could you run "getconf PAGESIZE" on the system that generated the traceback? I assumed it was 4096 because that's what the runtime uses on 32-bit non-NACL ARM, but if it isn't that could be important.

* I used concolic execution of a model of the memory system, which doesn't lend itself to infinite code paths, so specifically I proved that no sequence of four allocations ending in an allocation of 8 pages can cause an index out of bounds on the last allocation. I chose four to exercise all combinations of code paths within the same final program state and because running four already took a really long time. :)

@aclements
Copy link
Member

Scratch the wrap-around I'd found. The wrap-around does happen, but assuming the kernel is correctly bounds-checking the mmap call itself (which I hadn't previously modeled in my proof), the allocation will fail anyway, just at a slightly later point.

@aclements
Copy link
Member

@fjl, @karalabe, what was the GOARM level on the machine that produced the traceback?

@aclements
Copy link
Member

Also, what kernel version were you running?

@aclements
Copy link
Member

I increased the fidelity of the system call model in my proof (and fixed a few bugs in STP) and finally found a way this can happen. I've also managed to reproduce it on linux/386 with a few helping hacks in the runtime.

The cause is ultimately quite simple: In mHeap_SysAlloc, if an allocation at arena_used would exceed arena_end (but wouldn't yet push us past arena_start+_MaxArean32), we try to extend the arena reservation by another 256 MB. Note that this could be a small allocation, which is consistent with this bug. We extend the arena by calling sysReserve, which calls mmap without MAP_FIXED on 32-bit, which means the address is just a hint and the kernel can put the mapping wherever it wants. If it happens to choose an address below arena_start (the kernel also chose arena_start, so there could be lots of space below it), then it's the beginning of the end. We don't detect that case in mHeap_SysAlloc. In fact, mHeap_SysAlloc will totally screw up arena_end and arena_used and then return the low pointer to mHeap_Grow, which will subtract arena_start from the pointer, which will underflow and produce a huge value of p, which we'll then use as an index in to h_spans, which will panic.

The fix is just to check for this case in mHeap_SysAlloc. I'll roll a CL tomorrow or Monday.

@RLH
Copy link
Contributor

RLH commented Nov 15, 2015

Nicely done...

On Saturday, November 14, 2015, Austin Clements [email protected]
wrote:

I increased the fidelity of the system call model in my proof and finally
found a way this can happen and I've managed to reproduce it with a few
helping hacks in the runtime.

The cause is ultimately quite simple: In mHeap_SysAlloc, if an allocation
at arena_used would exceed arena_end (but wouldn't yet push us past
arena_start+_MaxArean32), we try to extend the arena reservation by another
256 MB. Note that this could be a small allocation, which is consistent
with this bug. We extend the arena by calling sysReserve, which calls mmap
without MAP_FIXED on 32-bit, which means the address is just a hint and the
kernel can put the mapping wherever it wants. If it happens to choose an
address below arena_start (the kernel also chose arena_start, so there
could be lots of space below it), then it's the beginning of the end. We
don't detect that case in mHeap_SysAlloc. In fact, mHeap_SysAlloc will
totally screw up arena_end and arena_used and then return the low pointer
to mHeap_Grow, which will subtract arena_start from the pointer, which will
underflow and produce a huge value of p, which we'll then use as an index
in to h_spans, which will panic.

The fix is just to check for this case in mHeap_SysAlloc. I'll roll a CL
tomorrow or Monday.


Reply to this email directly or view it on GitHub
#13143 (comment).

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/16926 mentions this issue.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/16927 mentions this issue.

aclements added a commit that referenced this issue Nov 16, 2015
If the area returned by sysReserve in mheap.sysAlloc is outside the
usable arena, we sysFree it. We pass a fake stat pointer to sysFree
because we haven't added the allocation to any stat at that point.
However, we pass a 0 stat, so sysFree panics when it decrements the
stat because the fake stat underflows.

Fix this by setting the fake stat to the allocation size.

Updates #13143 (this is a prerequisite to fixing that bug).

Change-Id: I61a6c9be19ac1c95863cf6a8435e19790c8bfc9a
Reviewed-on: https://go-review.googlesource.com/16926
Reviewed-by: Ian Lance Taylor <[email protected]>
@gopherbot
Copy link
Contributor

CL https://golang.org/cl/16961 mentions this issue.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/16987 mentions this issue.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/16988 mentions this issue.

aclements added a commit that referenced this issue Nov 17, 2015
Currently, if an allocation is large enough that arena_end + size
overflows (which is not hard to do on 32-bit), we go ahead and call
sysReserve with the impossible base and length and depend on this to
either directly fail because the kernel can't possibly fulfill the
requested mapping (causing mheap.sysAlloc to return nil) or to succeed
with a mapping at some other address which will then be rejected as
outside the arena.

In order to make this less subtle, less dependent on the kernel
getting all of this right, and to eliminate the hopeless system call,
add an explicit overflow check.

Updates #13143. This real issue has been fixed by 0de59c2, but this is
a belt-and-suspenders improvement on top of that. It was uncovered by
my symbolic modeling of that bug.

Change-Id: I85fa868a33286fdcc23cdd7cdf86b19abf1cb2d1
Reviewed-on: https://go-review.googlesource.com/16961
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
aclements added a commit that referenced this issue Nov 17, 2015
If the area returned by sysReserve in mheap.sysAlloc is outside the
usable arena, we sysFree it. We pass a fake stat pointer to sysFree
because we haven't added the allocation to any stat at that point.
However, we pass a 0 stat, so sysFree panics when it decrements the
stat because the fake stat underflows.

Fix this by setting the fake stat to the allocation size.

Updates #13143 (this is a prerequisite to fixing that bug).

Change-Id: I61a6c9be19ac1c95863cf6a8435e19790c8bfc9a
Reviewed-on: https://go-review.googlesource.com/16926
Reviewed-by: Ian Lance Taylor <[email protected]>
Reviewed-on: https://go-review.googlesource.com/16987
Run-TryBot: Austin Clements <[email protected]>
Reviewed-by: Russ Cox <[email protected]>
aclements added a commit that referenced this issue Nov 17, 2015
… below the arena

In mheap.sysAlloc, if an allocation at arena_used would exceed
arena_end (but wouldn't yet push us past arena_start+_MaxArean32), it
trie to extend the arena reservation by another 256 MB. It extends the
arena by calling sysReserve, which, on 32-bit, calls mmap without
MAP_FIXED, which means the address is just a hint and the kernel can
put the mapping wherever it wants. In particular, mmap may choose an
address below arena_start (the kernel also chose arena_start, so there
could be lots of space below it). Currently, we don't detect this case
and, if it happens, mheap.sysAlloc will corrupt arena_end and
arena_used then return the low pointer to mheap.grow, which will crash
when it attempts to index in to h_spans with an underflowed index.

Fix this by checking not only that that p+p_size isn't too high, but
that p isn't too low.

Fixes #13143.

Change-Id: I8d0f42bd1484460282a83c6f1a6f8f0df7fb2048
Reviewed-on: https://go-review.googlesource.com/16927
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Reviewed-on: https://go-review.googlesource.com/16988
Reviewed-by: Russ Cox <[email protected]>
@aclements
Copy link
Member

For reference, this is the symbolic model and execution engine I used: https://gist.github.com/aclements/cd6cd1a93732099ff982

This requires fixes to the STP Python bindings that are currently in STP pull request stp/stp#204

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants