proposal: a faster C-call mechanism for non-blocking C functions #16051

Closed
eloff opened this issue Jun 13, 2016 · 38 comments

Comments

@eloff

eloff commented Jun 13, 2016

I'm not sure this is the right place to submit this; the golang-dev list might also have been a good candidate.

The cgo FFI mechanism is very general, and makes some pessimistic assumptions about the function being called - specifically it has to handle the case where the called function may block, e.g. in a blocking syscall. Having to handle the worst case adds a lot of overhead to each call into C, including just calling a math function or other small function that does no syscalls or does not block.

The problem arises for well-behaved calls that neither block nor call back into Go: if the called function does not do much work, the cgo overhead dominates the runtime and causes poor performance. This is a well-known limitation (and often complained about on golang-nuts and elsewhere; googling "cgo overhead" returns about 45,000 hits). Some of this can be mitigated by making the C function do more work, e.g. exporting a chunkier rather than chattier API, but that's not always feasible or desirable.

For some classes of problems this overhead is just not admissible, and people resort to doing bad things like calling C from assembly on a Go stack, which just isn't built for it. The Go runtime itself finds the C call overhead too much at times, and Dmitry Vyukov had to work around it when integrating the TSAN race checker with Go. A look at https://golang.org/src/runtime/race_amd64.s?m=text shows a custom C call mechanism that just puts the arguments into the right registers, switches stacks from the Go stack to the C stack, and calls the TSAN runtime function.

Since this seems to be a requirement that cannot always be worked around, and it's a requirement of both Go users and the Go runtime itself, and the workarounds are difficult, fraught, and fragile - I would like to see a solution for this in Go itself. Such a solution should ideally mirror the functionality in race_amd64.s very closely, because that approach is about as low overhead as possible, and because it could then replace that code. It could take any form, but since it needs to be backwards compatible and should ideally not introduce new syntax, something like an attribute on the function declaration or a cgo directive would be nice. It would be up to the programmer to ensure that C functions called in this way do not block and do not call back into Go.


e.g. __attribute__ ((nonblocking)) in the declaration,
or // #cgo nonblocking directly preceding the call site or the declaration

Is this a reasonable feature for a future version of Go?

@ianlancetaylor ianlancetaylor changed the title A faster C-call mechanism for non-blocking C functions proposal: a faster C-call mechanism for non-blocking C functions Jun 13, 2016
@ianlancetaylor
Member

This would probably be more usefully discussed on golang-dev rather than on the issue tracker.

In any case I think the first step is straightforward: measure the overhead of a cgo call and get a good estimate of how much could be saved by adopting this proposal.
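
One way to take that kind of measurement outside of `go test` is `testing.Benchmark`. The following is only a sketch: `callUnderTest` is a hypothetical stand-in written in pure Go so that the example is self-contained, and in a real measurement its body would be the cgo call being studied. The helper name `nsPerCall` is also made up for illustration.

```go
package main

import (
	"fmt"
	"testing"
)

// sink keeps the compiler from discarding the call results.
var sink int

// callUnderTest is a hypothetical stand-in for the call being measured.
// In a real measurement its body would be the cgo call of interest.
//
//go:noinline
func callUnderTest(x int) int {
	return x + 1
}

// nsPerCall reports the average cost of one call to f, in nanoseconds,
// as measured by the standard benchmark driver.
func nsPerCall(f func(int) int) float64 {
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = f(i)
		}
	})
	return float64(res.T.Nanoseconds()) / float64(res.N)
}

func main() {
	fmt.Printf("%.2f ns/call\n", nsPerCall(callUnderTest))
}
```

Running the same harness once with a plain Go function and once with the cgo call gives the per-call difference that the proposal would have to recover.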

@ianlancetaylor ianlancetaylor added this to the Proposal milestone Jun 13, 2016
@eloff
Author

eloff commented Jun 13, 2016

Sorry for posting this in the wrong place then.

Overhead of cgo is about 170ns and up (600-800 cycles). Calling Go functions is about 100 times faster. Using the alternative call mechanism, Dmitry cites a real-world improvement of up to 50% for programs running under the race checker - that's for the entire program, presumably not all of which is spent calling into TSAN.

Again, not all programs will benefit from this, but there is a class of programs that do - including Go itself - and, more importantly, for which there are no practical alternatives.

@ALTree
Member

ALTree commented Jun 13, 2016

Note that the performance of cgocall has steadily decreased in the past 2 years. The penalty for a cgo call was around 30ns in go1, now (go1.6) we're at ~200ns.

Maybe the easiest thing to do would be to first investigate whether we can recover at least some of the go1 performance for cgocall?

Reference for this problem is issue #9704.

@rasky
Member

rasky commented Jun 15, 2016

I like the idea of nullifying the large overhead of cgo, which does affect a class of programs. I'm a little concerned about how hard it would be for a programmer to prove that a C function is safe for the #cgo nonblocking behavior. Trivial functions are easy, but what about a moderately complex C library? How can I convince myself that there are no code paths that call a blocking syscall? In fact, how can I know which standard library functions call blocking syscalls? I find GODEBUG=cgocheck very useful, so maybe something similar could be attempted (though I can't see how)?

@ianlancetaylor
Member

I'm not convinced that #cgo nonblocking is a good idea, because, as you say, it's very easy to get wrong. In particular it's very easy to get wrong for unlikely error cases, leaving you with a program that normally works but occasionally deadlocks.

We could detect a deadlock easily enough in the sysmon thread: if a supposedly non-blocking cgocall doesn't return within some period of time, we can declare an error. We can't recover, though; all we can do at that point is crash.

I don't know of a way to know which standard library functions do blocking syscalls. That is a somewhat different path than a cgo call, and should be more efficient--for example, it doesn't have to switch stacks, as syscalls don't use any stack space. Still it's true that there is some overhead. It's also true that the syscall package uses a more efficient call path for syscalls that don't block; they are the ones annotated with sysnb comments rather than sys.

It's worth noting that we've been surprised in the past by syscalls that we assumed could not block but can. For example, when people use a FUSE file system, Dup can block (see issue #10202). This is the kind of problem that could vex #cgo nonblocking in unusual cases.

@eloff
Author

eloff commented Jun 15, 2016

Can we clarify, for the ignorant like myself, what the risks are here if a call marked #cgo nonblocking blocks on an unusual code path? As far as I understand it, that's non-optimal, but not going to cause deadlocks. What would cause deadlocks is if the code did something that blocks on another goroutine which won't run now that we're blocking the scheduler. That's still a problem, but a lot more specific than requiring a function that never blocks. It needs to be a function that usually doesn't block, and that, if it does, never blocks on another thread in the process - e.g. on a mutex. That's a lot easier to guarantee. I can call functions from the C standard library without being sure whether they would block, but I can be sure whether they will block on another goroutine in my program - especially since any goroutine blocked in the scheduler cannot be in the middle of another C non-blocking call. For example, the mutex(es) in malloc aren't a problem because a goroutine can't get unscheduled in the middle of a malloc. Any mutexes only used inside C also won't be a problem because it's not possible to put a goroutine to sleep while it's in the middle of a C call. Of course my understanding may well be flawed, but if it's not, then it seems like this is not an unreasonable burden to place on the programmer for use of an opt-in, advanced feature.

Remember that the alternatives, when cgo performance is not enough, are more dangerous. They carry these exact same risks, plus they are fragile to changes in the Go implementation, plus they involve assembly and access to Go runtime internals. Worse yet, if one avoids doing it right by digging into the Go runtime to change stacks, and just does the call on the goroutine stack, one is making assumptions about how much stack space the C code needs - and that's an extremely unsafe assumption that can't be easily made. Who knows which standard library functions allocate a big buffer on the stack or use recursion?

@ianlancetaylor
Member

The most efficient implementation of #cgo nonblocking--and if we don't do an efficient implementation, why bother--can easily deadlock. The scheduler assumes that it can preempt any goroutine, in order to do a garbage collection or simply because the goroutine has been running for a long time. However, a goroutine blocked in C code can not be preempted. That is, a goroutine running in C code is essentially equivalent to a nonpreemptible goroutine, which leads us to the kinds of problems we see in #10958. Those problems are quite bad, but we live with it because it's hard to write a nonpreemptible goroutine. But with an efficient #cgo nonblocking, it's easy.

To avoid that problem, we need to notify the scheduler that we are entering a different regime--the steps taken by entersyscall in runtime/proc.go.

No matter what, we need to switch to a new stack. A goroutine stack is often very small--as small as 128 bytes. No C function can be expected to run on a stack that small. So we must always change stacks.

And, of course, no matter what, we must pass the arguments using the C ABI.

But at this point--notify the scheduler, change stacks, pass C ABI arguments--we're pretty much doing everything that cgo does today. So #cgo nonblocking isn't saving us much time, if any.

@eloff
Author

eloff commented Jun 16, 2016

Ok, now I understand the implications better. It's clear blocking could cause serious performance problems for the entire process - however, it's still not clear to me that it can deadlock, except in very unusual situations that don't arise in a typical Go program - I can live with that.

If #cgo nonblocking is implemented it needs to be performant, as you say. Changing stacks is unavoidable and passing arguments with the C ABI is unavoidable (but a lot easier for the compiler to do efficiently.) The interaction with the scheduler is the part that would have to be dropped.

I do know that the Go runtime makes exactly this tradeoff when it calls the TSAN runtime functions. They're known not to block, so that's considered safe enough and worth it for the 50% speedup. I have similar needs with an in-memory database, and I'm confident that those functions don't block, except maybe in a once-in-a-blue-moon scenario. I can live with that if it doesn't cause deadlocks in any real scenario.

@beoran

beoran commented Jun 30, 2016

Basically, the main problem is that calling C functions from Go is slow, and has only become slower. This unfortunately makes Go less suitable for graphical applications and games that have to call into a C OpenGL or Vulkan API many times per second. I don't have any great ideas on how to solve this problem, but I insist that it is something that needs to be improved.

@bradfitz
Contributor

@aclements mentioned that he (or somebody) would be making defers stack-allocated in Go 1.8. That is #14939. That will help with this bug too, since each cgo call does 2+ defers, IIRC.

@eloff
Author

eloff commented Jun 30, 2016

There seem to be two things that can be done here.

  1. Speed up existing CGO calls
  2. Add an additional way of calling a well-behaved restricted subset of C functions

This issue should be about the second solution only.

There are definitely things that can be done for (1). That will shrink the set of programs which need solution (2). However (1) will always leave performance on the floor because it has to do extra work to handle not well-behaved C calls.

Now maybe CGO can be sped up enough that the performance difference is so small that the set of programs that would benefit from (2) is too small to be worth the effort of implementing and maintaining it. Or maybe it is deemed too dangerous to be used correctly - but then I question why the Go runtime uses it for the race detector.

@ianlancetaylor
Member

I don't think one can base any arguments on the race detector, which is a special case. It calls into the support library for every memory read or write. That is vastly more calls than even the most active cgo using program can possibly do.

@eloff
Author

eloff commented Jul 1, 2016

I'll grant that, but I'm still not content to leave that much performance on the floor. I'm going to try the race detector C calling approach and see how much of a difference it makes in my use case. I'll report back in a couple weeks with some numbers hopefully.

A faster CGO mechanism allows calling C functions that do less work. e.g. if the C function takes 200ns and we speed up the CGO mechanism to 30ns (which seems to have been true in past Go versions), then the total calling time for that function goes from 400 to 230ns, almost twice as fast. If the C function itself takes 30ns, then it goes from 230ns to 60ns, which is a four fold improvement. That opens up more options when designing APIs.
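
The arithmetic above can be checked with a few lines of Go. This sketch assumes, as the figures in the paragraph imply, a current cgo overhead of 200ns per call; the function name `speedup` is made up for illustration.

```go
package main

import "fmt"

// speedup returns how many times faster a call becomes when the per-call
// overhead drops from oldOverhead to newOverhead, for a function whose own
// work takes work nanoseconds.
func speedup(work, oldOverhead, newOverhead float64) float64 {
	return (work + oldOverhead) / (work + newOverhead)
}

func main() {
	fmt.Printf("200ns function: %.2fx\n", speedup(200, 200, 30)) // 400ns -> 230ns
	fmt.Printf("30ns function:  %.2fx\n", speedup(30, 200, 30))  // 230ns -> 60ns
}
```

The smaller the C function, the more the fixed overhead dominates, so the gain grows from roughly 1.7x to roughly 3.8x.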

I just spent two days implementing a function in assembly in Go because it needed popcnt, prefetch, and bsr. It could have been implemented in two hours in C, but the CGO overhead makes it a non-starter. In other places I duplicate code between C and Go (using very unidiomatic Go with lots of unsafe) to avoid the CGO overhead.

@davecheney
Contributor

A C function that takes 30ns is, optimistically, at most 50 machine instructions, assuming the code cache is hot and the code in question performs only memory accesses that are already in the L1 data cache.

What work can realistically be done in so few cycles that cannot be rewritten in Go?


@aclements
Member

@eloff, there's been some discussion (though no concrete proposals to my knowledge) to add functions for things like popcount and bsr that would be compiled as intrinsics where supported (like math.Sqrt is today). With 1.7 we're trying this out in the runtime, which now has a ctz intrinsic available to it on amd64 SSA. Obviously that doesn't solve the overall problem, but it would chip away at one more reason to use cgo in a low-overhead context.

@eloff
Author

eloff commented Jul 1, 2016

@aclements That sounds very interesting, I'm looking forward to that! You're right, it would eliminate another reason for calling lightweight C functions from Go.

@beoran

beoran commented Jul 1, 2016

Well, it would be useful, but it's a C world, and certainly for programming graphically intensive games, I need to call into C APIs very frequently. All that stack swapping adds up a lot.

I think both approaches are needed: cgo calls need to be made faster in general, but a way to mark certain C functions as special cases could equally be useful.

To throw out a wild idea of my own, maybe consecutive calls into C could be grouped together and use the same C stack without swapping back. This would probably require a cgo directive...

@bcmills
Contributor

bcmills commented Jul 1, 2016

To throw out a wild idea of my own, maybe consecutive calls into C could be grouped together and use the same C stack without swapping back.

How is that different from writing a function in C that makes those calls in sequence?

@beoran

beoran commented Jul 1, 2016

The difference is that I would like to do it in Go, not C. Sure, I could write a low-level game engine in C and call into that from Go. But that defeats the purpose of using Go in the first place.

I tried to write a 2D game engine in Go a few years ago, but the cgo overhead was too much. So I was forced to go back to C, and since all the logic was in C, there wasn't much reason left to use Go for that project.

Such a C block in Go as I propose would be a bit like extern "C" in C++: you could still use a subset of Go while handling the lower-level API.

@kunos

kunos commented Jul 9, 2016

Another gamedev here supporting this proposal. As it stands now, Go would be a really good candidate for a gamedev language: the GC is getting really fast and the language has a lot to offer to the industry.
Sadly, the cgo performance makes it a non-starter at the moment. I keep wondering how this cost is also impacting other areas like server applications with heavy UDP traffic. I am using Go for the server in my game, and looking at the code, it eventually ends up calling various WSA* functions on Windows. So there you go: that price is there even when developing software that isn't strictly a game.

Perhaps on Linux there is a difference between a syscall and a cgo call; on Windows that does not seem to be the case, though.

I wonder if, instead of tagging cgo calls we could tag the goroutine to use some kind of "special" stack and get ad-hoc treatment from the scheduler to be able to call into C with as close to 0 cost as possible.

I do think this could improve the performance of the entire tech, I agree with rewriting as much as possible in Go, that's fine, but it's still a world of C based operating systems (and it will be for the foreseeable future) and, eventually, you'll have to call C stuff in order to make things happen.. be it a triangle on the screen or a packet over the network.

@aclements
Member

Our current plan is to start by optimizing the existing cgo path and see how far we can get with that before we consider introducing a new mechanism (particularly one that could introduce new classes of bugs).

I keep wondering how this cost is also impacting other areas like server applications with heavy UDP traffic.

I'm not sure, but it might be possible to further optimize Windows system calls beyond the regular cgo path. Most other platforms don't use cgo for syscalls, and there's much less overhead. OTOH, Windows syscalls can have callbacks, so the generality may be necessary.

@eloff
Author

eloff commented Jul 9, 2016

As the OP, I applaud that strategy. I still think a specialized mechanism may be needed, but it seems relatively easy to roll my own. If I supply my own C stacks then I don't need any fragile hooks into the Go runtime or scheduler, just a little assembly code. So if it's still a problem afterward, there is an option that just requires a little effort - and I think for a potentially unsafe and advanced feature that's even desirable.


@alexbrainman
Member

I keep wondering how this cost is also impacting other areas like server applications with heavy UDP traffic.

I think syscall cost is not significant here. But I might be wrong. If you have some real numbers to show one way or other, we could try and make it faster.

Windows syscalls can have callbacks, so the generality may be necessary

Very few Go Windows apps use callbacks - the ones that do call syscall.NewCallback. We could optimize the app if it doesn't call syscall.NewCallback.

Alex

@kunos

kunos commented Jul 10, 2016

Actually I did some tests right now, calling a function from a dll, using CGO ends up being faster than using syscall.Syscall to call the same function.. on Windows.

You are right tho about the UDP thing, I did some actual tests/calculations and with a rate of about 20K packets/second I am looking at a 2-3ms overhead to call into Winsock so nothing to be worried about at all.

@alexbrainman
Member

calling a function from a dll, using CGO ends up being faster than using syscall.Syscall to call the same function.. on Windows.

That is surprising to me. Do you mind providing instructions to reproduce what you see? Thank you.

Alex

@kunos

kunos commented Jul 10, 2016

Sure, this is the relevant part of the code:


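import "syscall"

/*
#include <windows.h>
#include <mmsystem.h>

#cgo LDFLAGS: -lwinmm
*/
import "C"

// Look up timeGetTime in winmm.dll for the syscall.Syscall path.
var (
    winmm, _        = syscall.LoadLibrary("winmm.dll")
    ftimeGetTime, _ = syscall.GetProcAddress(winmm, "timeGetTime")
)

// tgt calls timeGetTime through syscall.Syscall.
func tgt() uint32 {
    ret, _, _ := syscall.Syscall(ftimeGetTime, 0, 0, 0, 0)

    return uint32(ret)
}

// testCGOCall calls timeGetTime 20000 times through cgo.
func testCGOCall() {
    v := uint32(0)

    for i := 0; i < 20000; i++ {
        v += uint32(C.timeGetTime())
    }
}

// testSyscallCall makes the same 20000 calls through syscall.Syscall.
func testSyscallCall() {
    v := uint32(0)

    for i := 0; i < 20000; i++ {
        v += tgt()
    }
}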

And this is the result on my PC:
Syscall: 4.0056ms
CGO: 3.0032ms

@alexbrainman
Member

And this is the result on my PC:
Syscall: 4.0056ms
CGO: 3.0032ms

Thank you for the source. But I don't think you ran the test for long enough. I did this:

c:\dev\src\issues\issue16051>type foo.go
package issue16051

import (
        "syscall"
)

/*
#include <windows.h>
#include <mmsystem.h>

#cgo LDFLAGS: -lwinmm
*/
import "C"

func Call_timeGetTime_CGO() uint32 {
        return uint32(C.timeGetTime())
}

var (
        winmm        = syscall.MustLoadDLL("winmm.dll")
        ftimeGetTime = winmm.MustFindProc("timeGetTime")
)

func Call_timeGetTime_Syscall() uint32 {
        ret, _, _ := syscall.Syscall(ftimeGetTime.Addr(), 0, 0, 0, 0)
        return uint32(ret)
}

c:\dev\src\issues\issue16051>type foo_test.go
package issue16051_test

import (
        "testing"

        "issues/issue16051"
)

func BenchmarkTimeGetTimeSyscall(b *testing.B) {
        for i := 0; i < b.N; i++ {
                issue16051.Call_timeGetTime_Syscall()
        }
}

func BenchmarkTimeGetTimeCGO(b *testing.B) {
        for i := 0; i < b.N; i++ {
                issue16051.Call_timeGetTime_CGO()
        }
}

c:\dev\src\issues\issue16051>go test -run=none -bench=.
testing: warning: no tests to run
BenchmarkTimeGetTimeSyscall     10000000               228 ns/op
BenchmarkTimeGetTimeCGO         10000000               229 ns/op
PASS
ok      issues/issue16051       5.087s

c:\dev\src\issues\issue16051>

and I see not much difference between syscall and cgo.

Alex

PS: On Windows you cannot even measure it if it is less than 15ms

@kunos

kunos commented Jul 11, 2016

Well I get the same results increasing the loop count:

Syscall: 2.5059244s
CGO: 2.2626265s

Perhaps I have a weird PC.
And no, using the right timer on Windows you can get reasonable results even for <1ms.

@rasky
Member

rasky commented Jul 31, 2016

I prepared a self-contained benchmark that highlights real-world cgo performance issues here: https://github.com/rasky/tdb-cgo-bench. It's an example that uses the cgo wrapper of TrailDB, a real-world case where the overhead of cgo is very high. The included benchmark shows a 6x performance regression when executing the same basic code in Go vs C. I hope this can be useful while working on cgo performance in the 1.8 cycle.

@aclements
Member

@rasky, that's very helpful. Thanks! I was able to reproduce the 6x slowdown locally with your benchmark. I took a glance at the profile; much like other high-overhead cgo benchmarks (#9704), it spends a lot of time dealing with defers (~28%). We have a plan to fix that (#14939). Much of the remaining overhead goes into entersyscall/exitsyscall, which, unfortunately, are much trickier to optimize.

@adg
Contributor

adg commented Sep 12, 2016

Let's revisit this once the work to speed up defer is complete.

@gopherbot
Contributor

CL https://golang.org/cl/29656 mentions this issue.

gopherbot pushed a commit that referenced this issue Sep 26, 2016
This optimizes deferproc and deferreturn in various ways.

The most important optimization is that it more carefully arranges to
prevent preemption or stack growth. Currently we do this by switching
to the system stack on every deferproc and every deferreturn. While we
need to be on the system stack for the slow path of allocating and
freeing defers, in the common case we can fit in the nosplit stack.
Hence, this change pushes the system stack switch down into the slow
paths and makes everything now exposed to the user stack nosplit. This
also eliminates the need for various acquirem/releasem pairs, since we
are now preventing preemption by preventing stack split checks.

As another smaller optimization, we special case the common cases of
zero-sized and pointer-sized defer frames to respectively skip the
copy and perform the copy in line instead of calling memmove.

This speeds up the runtime defer benchmark by 42%:

name           old time/op  new time/op  delta
Defer-4        75.1ns ± 1%  43.3ns ± 1%  -42.31%   (p=0.000 n=8+10)

In reality, this speeds up defer by about 2.2X. The two benchmarks
below compare a Lock/defer Unlock pair (DeferLock) with a Lock/Unlock
pair (NoDeferLock). NoDeferLock establishes a baseline cost, so these
two benchmarks together show that this change reduces the overhead of
defer from 61.4ns to 27.9ns.

name           old time/op  new time/op  delta
DeferLock-4    77.4ns ± 1%  43.9ns ± 1%  -43.31%  (p=0.000 n=10+10)
NoDeferLock-4  16.0ns ± 0%  15.9ns ± 0%   -0.39%    (p=0.000 n=9+8)

This also shaves 34ns off cgo calls:

name       old time/op  new time/op  delta
CgoNoop-4   122ns ± 1%  88.3ns ± 1%  -27.72%  (p=0.000 n=8+9)

Updates #14939, #16051.

Change-Id: I2baa0dea378b7e4efebbee8fca919a97d5e15f38
Reviewed-on: https://go-review.googlesource.com/29656
Reviewed-by: Keith Randall <[email protected]>
unclejack pushed a commit to unclejack/go that referenced this issue Oct 24, 2016
@ianlancetaylor
Member

We are going to decline this specific proposal. Marking some cgo calls as non-blocking can fail in too many subtle ways that are hard to understand. We are certainly interested in speeding up cgo calls in general, but this approach is not the one we will take.

Note that we have sped up cgo calls in 1.8 in https://golang.org/cl/30080.

@linkerlin

@ianlancetaylor what about separating C and Go into two processes connected by a pipe or mmap, so that a C call is translated into sending a message to the C process and waiting for a response?
This way the C code can run faster and the Go runtime stays happy.

@davecheney
Contributor

@linkerlin this proposal was declined eight months ago. Please do not continue the conversation on closed issues. Please see https://golang.org/wiki/Questions for good places to ask. Thanks.

@aclements
Member

aclements commented Sep 12, 2017

(We shouldn't continue this conversation here, as @davecheney pointed out, but just to close the loop...)

@linkerlin, putting them in separate processes would make it hard to pass pointers, which is one of the main features of cgo. But even putting them in separate threads in the same process wouldn't help with the overhead: the transition would still have to interact with the Go scheduler and then it would also have to interact with the OS scheduler. Nonetheless, if an application wanted to split up its Go and C components this way, it could certainly do so without runtime help.
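
As a sketch of what such an application-level split could look like inside one process: a single goroutine pinned to its OS thread with `runtime.LockOSThread` owns all calls into the foreign code and serves requests sent over a channel. `foreignSquare` is a hypothetical pure-Go stand-in for a C function, and the `request` type and helper names are invented for illustration. Each request still pays for a channel round trip through the Go scheduler, which is the overhead described above.

```go
package main

import (
	"fmt"
	"runtime"
)

// foreignSquare is a hypothetical stand-in for a function implemented in C.
func foreignSquare(x int) int { return x * x }

// request carries one argument and a channel for the reply.
type request struct {
	arg   int
	reply chan int
}

// startWorker launches the goroutine that owns all foreign calls and
// returns the channel used to reach it. Closing the channel stops the worker.
func startWorker() chan<- request {
	reqs := make(chan request)
	go func() {
		runtime.LockOSThread() // keep every foreign call on the same OS thread
		defer runtime.UnlockOSThread()
		for r := range reqs {
			r.reply <- foreignSquare(r.arg)
		}
	}()
	return reqs
}

// call sends one request to the worker and waits for the answer.
func call(reqs chan<- request, x int) int {
	reply := make(chan int, 1)
	reqs <- request{arg: x, reply: reply}
	return <-reply
}

func main() {
	reqs := startWorker()
	fmt.Println(call(reqs, 7))
	close(reqs)
}
```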

@chen3feng

chen3feng commented Jun 29, 2018

I support this idea.
But I think it would be better if we could control blocking/nonblocking at each call site.
For example, take the write system call: if it is writing to a disk file, it will not block, but if it writes to a socket/pipe, it may block.

Maybe we can use some syntax like this:
C.write(fd, data, size) // will switch thread
C.nonblocking.write(fd, data, size) // will not switch thread

@ianlancetaylor
Member

@chen3feng This issue is closed. Comments on closed issues are not tracked. Please open a new issue or use a forum; see https://golang.org/wiki/Questions . Thanks.

I think it is extremely unlikely that we would ever let the programmer pick whether a cgo call blocks or not, as that is very easy to get wrong, and getting it wrong may cause the entire program to freeze. That said, please do not reply here.

@golang golang locked as resolved and limited conversation to collaborators Jun 29, 2018