proposal: a faster C-call mechanism for non-blocking C functions #16051
This would probably be more usefully discussed on golang-dev rather than on the issue tracker. In any case I think the first step is straightforward: measure the overhead of a cgo call and get a good estimate of how much could be saved by adopting this proposal.
Sorry for posting this in the wrong place then. Overhead of cgo is about 170ns and up (600-800 cycles). Calling Go functions is about 100 times faster. Using the alternative call mechanism, Dmitry cites a real-world improvement of up to 50% for programs running under the race checker - that's for the entire program, presumably not all of which is spent calling into TSAN. Again, not all programs will benefit from this, but there is a class of programs that do - including Go itself - and more importantly for which there are no practical alternatives.
Note that the performance of cgo calls has regressed since go1 (the reference for this problem is issue #9704). Maybe the easiest thing to do would be to first investigate whether we can recover at least some of the go1 performance for cgo calls.
I like the idea of nullifying the large overhead of cgo, which does affect a class of programs. I'm a little concerned about how hard it would be for a programmer to prove that a C function is safe for a nonblocking call mechanism.
I'm not convinced that a nonblocking annotation can be used safely. We could detect a deadlock easily enough in the sysmon thread: if a supposedly non-blocking cgocall doesn't return within some period of time, we can declare an error. We can't recover, though; all we can do at that point is crash. I don't know of a way to know which standard library functions do blocking syscalls.

Direct syscalls made through the syscall package take a somewhat different path than a cgo call, and should be more efficient--for example, that path doesn't have to switch stacks, as syscalls don't use any stack space. Still, it's true that there is some overhead. It's also true that the syscall package uses a more efficient call path for syscalls that don't block; they are the ones annotated with //sysnb.

It's worth noting that we've been surprised in the past by which syscalls can block. For example, when people use a FUSE file system, Dup can block (see issue #10202). This is the kind of problem that could vex a nonblocking annotation.
Can we clarify, for the ignorant like myself, what the risks are here if on an unusual code path a call blocks that is marked nonblocking? Remember that the alternatives, when cgo performance is not enough, are more dangerous. They carry these exact same risks, plus they are fragile to the Go implementation changing, plus they involve assembly and access to Go runtime internals. Worse yet, if one avoids doing it right by digging into the Go runtime to change stacks, and just does the call on the goroutine stack - now you're making assumptions about how much stack space your C code needs - and that's an extremely unsafe assumption that can't be easily made. Who knows which standard library functions allocate a big buffer on the stack or use recursion.
The most efficient implementation of a nonblocking call would skip the interaction with the scheduler entirely, but then if the call does block after all, the scheduler has no way to know, and the rest of the program can stall behind it. To avoid that problem, we need to notify the scheduler that we are entering a different regime--the steps taken by entersyscall. No matter what, we need to switch to a new stack. A goroutine stack is often very small--as small as 128 bytes. No C function can be expected to run on a stack that small. So we must always change stacks. And, of course, no matter what, we must pass the arguments using the C ABI. But at this point--notify the scheduler, change stacks, pass C ABI arguments--we're pretty much doing everything that cgo does today. So it's not clear how much a nonblocking mechanism could actually save.
Ok, now I understand the implications better. It's clear blocking could cause serious performance problems for the entire process - however, it's still not clear to me that it can deadlock, except in very unusual situations that don't arise in a typical Go program - I can live with that. If #cgo nonblocking is implemented it needs to be performant, as you say. Changing stacks is unavoidable and passing arguments with the C ABI is unavoidable (but a lot easier for the compiler to do efficiently.) The interaction with the scheduler is the part that would have to be dropped. I do know that the Go runtime makes exactly this tradeoff when it calls the TSAN runtime functions. They're known not to block, so that's considered safe enough and worth it for the 50% speedup. I have similar needs with an in-memory database, and I'm confident that those functions don't block, except maybe in a once-in-a-blue-moon scenario. I can live with that if it doesn't cause deadlocks in any real scenario.
Basically, the main problem is that calling C functions from Go is slow, and has only become slower. This unfortunately makes Go less suitable for graphical applications and games that have to call into a C OpenGL or Vulkan API many times per second. I don't have any great ideas on how to solve this problem, but I insist that it is something that needs to be improved.
@aclements mentioned that he (or somebody) is making defers allocated on the stack in Go 1.8. That is #14939. That will help with this bug too, since each cgo call does 2+ defers, IIRC.
There seem to be two things that can be done here:

1. Speed up the existing cgo call path, which safely handles arbitrary C functions.
2. Add a separate, lighter-weight call mechanism for C functions that are known not to block or call back into Go.
This issue should be about the second solution only. There are definitely things that can be done for (1). That will shrink the set of programs that need solution (2). However, (1) will always leave performance on the floor because it has to do extra work to handle ill-behaved C calls. Now maybe CGO can be sped up enough that the performance difference is so small that the set of programs that would benefit from (2) is too small to be worth the effort of implementing and maintaining it. Or maybe it is deemed too dangerous to be used correctly - but then I question why the Go runtime uses it for the race detector.
I don't think one can base any arguments on the race detector, which is a special case. It calls into the support library for every memory read or write. That is vastly more calls than even the most active cgo-using program can possibly do.
I'll grant that, but I'm still not content to leave that much performance on the floor. I'm going to try the race detector C calling approach and see how much of a difference it makes in my use case. I'll report back in a couple weeks with some numbers, hopefully. A faster CGO mechanism allows calling C functions that do less work. E.g. if the C function takes 200ns and we speed up the CGO mechanism to 30ns (which seems to have been true in past Go versions), then the total calling time for that function goes from 400ns to 230ns, almost twice as fast. If the C function itself takes 30ns, then it goes from 230ns to 60ns, which is a fourfold improvement. That opens up more options when designing APIs. I just spent two days implementing a function in assembly in Go because it needed popcnt, prefetch, and bsr. It could have been implemented in two hours in C, but the CGO overhead makes it a non-starter. In other places I duplicate code between C and Go (using very unidiomatic Go with lots of unsafe) to avoid the CGO overhead.
A C function that took 30ns is, optimistically, at most 50 machine instructions. What work can realistically be done in that few cycles that would be worth the cost of the call?
@eloff, there's been some discussion (though no concrete proposals to my knowledge) to add functions for things like popcount and bsr that would be compiled as intrinsics where supported (like math.Sqrt is today). With 1.7 we're trying this out in the runtime, which now has a ctz intrinsic available to it on amd64 SSA. Obviously that doesn't solve the overall problem, but it would chip away at one more reason to use cgo in a low-overhead context.
@aclements That sounds very interesting, I'm looking forward to that! You're right, it would eliminate another reason for calling lightweight C functions from Go.
Well, it would be useful, but it's a C world, and certainly for game programming I think both approaches are needed: cgo calls need to be made faster in general. To throw out a wild idea of my own, maybe consecutive calls into C could be batched and executed with a single Go-to-C transition.
How is that different from writing a function in C that makes those calls in sequence?
The difference is that I would like to do it in go, not C. Sure, I could write such a function in C, but then the logic between the calls has to move into C as well. I tried to write a 2d game engine in go a few years ago, but the cgo call overhead was a problem. Such a C block in go as I propose would be a bit like extern "C" in C++: a marked region where calls into C are cheap, with the expensive transition paid once at the boundary of the block.
Another gamedev here supporting this proposal. As it stands now, Go would really be a good candidate for a gamedev language: the GC is getting really fast and the language has a lot to offer to the industry. Perhaps on Linux there is a difference between a syscall and a cgo call; on Windows it does not seem to be the case though. I wonder if, instead of tagging cgo calls, we could tag the goroutine to use some kind of "special" stack and get ad-hoc treatment from the scheduler, to be able to call into C with as close to 0 cost as possible. I do think this could improve the performance of the entire tech. I agree with rewriting as much as possible in Go, that's fine, but it's still a world of C based operating systems (and it will be for the foreseeable future) and, eventually, you'll have to call C stuff in order to make things happen... be it a triangle on the screen or a packet over the network.
Our current plan is to start by optimizing the existing cgo path and see how far we can get with that before we consider introducing a new mechanism (particularly one that could introduce new classes of bugs).
I'm not sure, but it might be possible to further optimize Windows system calls beyond the regular cgo path. Most other platforms don't use cgo for syscalls, and there's much less overhead. OTOH, Windows syscalls can have callbacks, so the generality may be necessary.
As the OP, I applaud that strategy. I still think a specialized mechanism will be needed for the cases the general path can't make cheap enough, but optimizing the existing path first is the right place to start.
I think syscall cost is not significant here. But I might be wrong. If you have some real numbers to show one way or the other, we could try and make it faster.
Very few Go windows apps use callbacks - the ones that do call syscall.NewCallback. We could optimize the app if it doesn't call syscall.NewCallback. Alex
Actually I did some tests just now: calling a function from a dll using CGO ends up being faster than using syscall.Syscall to call the same function... on Windows. You are right though about the UDP thing; I did some actual tests/calculations, and with a rate of about 20K packets/second I am looking at a 2-3ms overhead to call into Winsock, so nothing to be worried about at all.
That is surprising to me. Do you mind providing instructions to reproduce what you see? Thank you. Alex
Sure, this is the relevant part of the code:
And this is the result on my PC:
Thank you for the source. But I don't think you ran the test for long enough. I did:
and I don't see much difference between syscall and cgo. Alex

PS: On Windows you cannot even measure it if it is less than 15ms
Well I get the same results increasing the loop count:
Syscall: 2.5059244s
Perhaps I have a weird PC.
I prepared a self-contained benchmark that highlights real-world cgo performance issues here: https://github.com/rasky/tdb-cgo-bench. It's an example that uses the cgo wrapper of TrailDB, a real-world case where the overhead of cgo is very high. The included benchmark shows a 6x performance regression when executing the same basic code in Go vs C. I hope this can be useful while working on cgo performance in the 1.8 cycle.
@rasky, that's very helpful. Thanks! I was able to reproduce the 6x slowdown locally with your benchmark. I took a glance at the profile; much like other high-overhead cgo benchmarks (#9704), it spends a lot of time dealing with defers (~28%). We have a plan to fix that (#14939). Much of the remaining overhead goes into entersyscall/exitsyscall, which, unfortunately, are much trickier to optimize.
Let's revisit this once the work to speed up defer is complete. |
CL https://golang.org/cl/29656 mentions this issue. |
This optimizes deferproc and deferreturn in various ways.

The most important optimization is that it more carefully arranges to prevent preemption or stack growth. Currently we do this by switching to the system stack on every deferproc and every deferreturn. While we need to be on the system stack for the slow path of allocating and freeing defers, in the common case we can fit in the nosplit stack. Hence, this change pushes the system stack switch down into the slow paths and makes everything now exposed to the user stack nosplit. This also eliminates the need for various acquirem/releasem pairs, since we are now preventing preemption by preventing stack split checks.

As another smaller optimization, we special case the common cases of zero-sized and pointer-sized defer frames to respectively skip the copy and perform the copy in line instead of calling memmove.

This speeds up the runtime defer benchmark by 42%:

name     old time/op  new time/op  delta
Defer-4  75.1ns ± 1%  43.3ns ± 1%  -42.31%  (p=0.000 n=8+10)

In reality, this speeds up defer by about 2.2X. The two benchmarks below compare a Lock/defer Unlock pair (DeferLock) with a Lock/Unlock pair (NoDeferLock). NoDeferLock establishes a baseline cost, so these two benchmarks together show that this change reduces the overhead of defer from 61.4ns to 27.9ns.

name           old time/op  new time/op  delta
DeferLock-4    77.4ns ± 1%  43.9ns ± 1%  -43.31%  (p=0.000 n=10+10)
NoDeferLock-4  16.0ns ± 0%  15.9ns ± 0%   -0.39%  (p=0.000 n=9+8)

This also shaves 34ns off cgo calls:

name       old time/op  new time/op  delta
CgoNoop-4  122ns ± 1%   88.3ns ± 1%  -27.72%  (p=0.000 n=8+9)

Updates #14939, #16051.

Change-Id: I2baa0dea378b7e4efebbee8fca919a97d5e15f38
Reviewed-on: https://go-review.googlesource.com/29656
Reviewed-by: Keith Randall <[email protected]>
We are going to decline this specific proposal. Marking some cgo calls as non-blocking can fail in too many subtle ways that are hard to understand. We are certainly interested in speeding up cgo calls in general, but this approach is not the one we will take. Note that we have sped up cgo calls in 1.8 in https://golang.org/cl/30080.
@ianlancetaylor what about separating C and Go into two processes connected with a pipe or mmap, and making a C call translate into sending a message to the C process and waiting for a response?
@linkerlin this proposal was declined eight months ago. Please do not continue the conversation on closed issues. Please see https://golang.org/wiki/Questions for good places to ask. Thanks.
(We shouldn't continue this conversation here, as @davecheney pointed out, but just to close the loop...) @linkerlin, putting them in separate processes would make it hard to pass pointers, which is one of the main features of cgo. But even putting them in separate threads in the same process wouldn't help with the overhead: the transition would still have to interact with the Go scheduler and then it would also have to interact with the OS scheduler. Nonetheless, if an application wanted to split up its Go and C components this way, it could certainly do so without runtime help.
I support this idea. Maybe we can use some syntax like this:
@chen3feng This issue is closed. Comments on closed issues are not tracked. Please open a new issue or use a forum; see https://golang.org/wiki/Questions . Thanks. I think it is extremely unlikely that we would ever let the programmer pick whether a cgo call blocks or not, as that is very easy to get wrong, and getting it wrong may cause the entire program to freeze. That said, please do not reply here.
I'm not sure this is the place to submit this, the golang-dev list might also have been a good candidate.
The cgo FFI mechanism is very general, and makes some pessimistic assumptions about the function being called - specifically it has to handle the case where the called function may block, e.g. in a blocking syscall. Having to handle the worst case adds a lot of overhead to each call into C, including just calling a math function or other small function that does no syscalls or does not block.
The problem is that for well-behaved calls, those that neither block nor call back into Go, if the called function does not do much work then the cgo overhead dominates the runtime and causes poor performance. This is a well known limitation (and often complained about on golang-nuts and elsewhere; googling "cgo overhead" returns about 45,000 hits). Some of this can be mitigated by making the C function do more work, e.g. exporting a chunkier vs. chattier API - but that's not always feasible or desirable.
For some classes of problems this overhead is just not admissible, and people resort to doing bad things like calling C from assembly on a Go stack, which just isn't built for it. The Go runtime itself finds the C call overhead too much at times, and Dmitry Vyukov had to work around it when integrating the TSAN race checker with Go. A look at https://golang.org/src/runtime/race_amd64.s?m=text shows a custom C call mechanism that just puts the arguments into the right registers, switches stacks from the Go stack to the C stack, and calls the TSAN runtime function.
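For reference, the fast path in race_amd64.s does roughly the following (simplified pseudocode; register and field names are approximate):

```
racecall(fn, arg0, arg1, ...):
    load args into the C ABI argument registers (RDI, RSI, ...)
    save the current SP
    if g != g0:
        SP <- g0.sched.sp    // switch from the goroutine stack to the big scheduler stack
    call fn                  // direct call into the TSAN runtime, no scheduler notification
    restore the saved SP
```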
Since this seems to be a requirement that cannot always be worked around, and it's a requirement of both Go users and the Go runtime itself, and the workarounds are difficult, fraught, and fragile - I would like to see a solution for this in Go itself. Such a solution should very closely mirror the functionality in race_amd64.s, ideally because that approach is about as low overhead as possible, and because it can be used to replace that code. It could take any form, but since it needs to be backwards compatible and ideally one does not introduce new syntax - something like an attribute on the function declaration, or a cgo directive would be nice. It is up to the programmer to ensure that C functions called in this way do not block, and don't call back into Go.
Is this a reasonable feature for a future version of Go?