proposal: a faster C-call mechanism for non-blocking C functions #16051

eloff · 2016-06-13T16:51:09Z

I'm not sure this is the place to submit this, the golang-dev list might also have been a good candidate.

The cgo FFI mechanism is very general, and makes some pessimistic assumptions about the function being called - specifically it has to handle the case where the called function may block, e.g. in a blocking syscall. Having to handle the worst case adds a lot of overhead to each call into C, including just calling a math function or other small function that does no syscalls or does not block.

The problem is for well-behaved calls that neither block nor call back into Go, if the called function does not do much work then the cgo overhead dominates the runtime and causes poor performance. This is a well known limitation (and often complained about on golang-nuts and elsewhere, google "cgo overhead" returns about 45,000 hits.) Some of this can be mitigated by making the C function do more work, e.g. exporting a chunkier vs chattier API - but that's not always feasible or desirable.

For some classes of problems this overhead is just not admissible, and people resort to doing bad things like calling C from assembly on a Go stack, which just isn't built for it. The Go runtime itself finds the C call overhead too much at times, and Dmitry Vyukov had to work around it when integrating the TSAN race checker with Go. A look at https://golang.org/src/runtime/race_amd64.s?m=text shows a custom C call mechanism that just puts the arguments into the right registers, switches stacks from the Go stack to the C stack, and calls the TSAN runtime function.

Since this seems to be a requirement that cannot always be worked around, and it's a requirement of both Go users and the Go runtime itself, and the workarounds are difficult, fraught, and fragile - I would like to see a solution for this in Go itself. Such a solution should very closely mirror the functionality in race_amd64.s, ideally because that approach is about as low overhead as possible, and because it can be used to replace that code. It could take any form, but since it needs to be backwards compatible and ideally one does not introduce new syntax - something like an attribute on the function declaration, or a cgo directive would be nice. It is up to the programmer to ensure that C functions called in this way do not block, and don't call back into Go.


e.g. __attribute__ ((nonblocking)) in the declaration,
or // #cgo nonblocking directly preceeding the callsite or the declaration

Is this a reasonable feature for a future version of Go?

The text was updated successfully, but these errors were encountered:

ianlancetaylor · 2016-06-13T17:07:42Z

This would probably be more usefully discussed on golang-dev rather than on the issue tracker.

In any case I think the first step is straightforward: measure the overhead of a cgo call and get a good estimate of how much could be saved by adopting this proposal.

eloff · 2016-06-13T17:29:59Z

Sorry for posting this in the wrong place then.

Overhead of cgo is about 170ns and up (600-800 cycles). Calling Go functions is about 100 times faster. Using the alternative call mechanism, Dmitry cites a real-world improvement of up to 50% of programs running under the race checker - that's for the entire program, presumably not all of which is spent calling into TSAN.

Again not all programs will benefit from this, but there are a class of programs which do - including Go itself - and more importantly for which there are no practical alternatives.

ALTree · 2016-06-13T18:29:16Z

Note that the performance of cgocall has steadily decreased in the past 2 years. The penalty for a cgo call was around 30ns in go1, now (go1.6) we're at ~200ns.

Maybe the easiest thing to do would be to first investigate whether we can recover at least some of the go1 performance for cgocall ?

Reference for this problem is issue #9704.

rasky · 2016-06-15T16:18:26Z

I like the idea of nullifying the large overhead of cgo which does affect a class of programs. I'm a little concerned on how hard it would be for a programmer to prove that a C function is safe for the #cgo nonblocking behavior. Trivial functions are easy, but what about a moderately complex C library? How can I convince myself that there is no codepaths that call a blocking syscall? In fact, how can I know which standard library functions do call blocking syscalls? I find GODEBUG=cgocheck very useful, so maybe something similar could be attempted (thought I can't see how)?

ianlancetaylor · 2016-06-15T16:47:06Z

I'm not convinced that #cgo nonblocking is a good idea, because, as you say, it's very easy to get wrong. In particular it's very easy to get wrong for unlikely error cases, leaving you with a program that normally works but occasionally deadlocks.

We could detect a deadlock easily enough in the sysmon thread: if a supposedly non-blocking cgocall doesn't return within some period of time, we can declare an error. We can't recover, though; all we can do at that point is crash.

I don't know of a way to know which standard library functions do blocking syscalls. That is a somewhat different path than a cgo call, and should be more efficient--for example, it doesn't have to switch stacks, as syscalls don't use any stack space. Still it's true that there is some overhead. It's also true that the syscall package uses a more efficient call path for syscalls that don't block; they are the ones annotated with sysnb comments rather than sys.

It's worth noting that we've been surprised in the past by which syscalls can not block. For example, when people use a FUSE file system, Dup can block (see issue #10202). This is the kind of problem that could vex #cgo nonblocking in unusual cases.

eloff · 2016-06-15T19:56:47Z

Can we clarify for ignorant like myself what the risks are here if on an unusual code path a call blocks that is marked #cgo nonblocking? As far as I understand it it's non-optimal, but not going to cause deadlocks. What would cause deadlocks is if the code did something that blocks on another goroutine which won't run now that we're blocking the scheduler. That's still a problem but a lot more specific than just it needs to be a function that never blocks. It needs to be a function that usually doesn't block, and that if it does it never blocks on another thread in the process - e.g. with a mutex, or something. That's a lot easier to guarantee. I can call functions from the C standard library not sure that they would block or not, but I can be sure if they will or won't block on another goroutine in my program - especially since any goroutine blocked in the scheduler cannot be in the middle of another C non blocking call. Like the mutex(s) in malloc aren't a problem because a goroutine can't get unscheduled in the middle of a malloc. Any mutexes only used inside C also won't be a problem because it's not possible to put a goroutine to sleep while it's in the middle of a C call. Of course my understanding may well be flawed, but if it's not, then it seems like this is not an unreasonable burden to place on the programmer for use of an opt-in, advanced feature.

Remember that the alternatives when cgo performance is not enough, are more dangerous. They carry these exact same risks, plus are fragile to the Go implementation changing, plus involve assembly and access to go runtime internals. Worse yet, if one avoids doing it right by digging into the go runtime to change stacks, and just do the call on the goroutine stack - now you're making assumptions about how much stack space your C program needs - and that's an extremely unsafe assumption that can't be easily made. Who knows which standard library functions allocate a big buffer on the stack or use recursion.

ianlancetaylor · 2016-06-16T04:04:11Z

The most efficient implementation of #cgo nonblocking--and if we don't do an efficient implementation, why bother--can easily deadlock. The scheduler assumes that it can preempt any goroutine, in order to do a garbage collection or simply because the goroutine has been running for a long time. However, a goroutine blocked in C code can not be preempted. That is, a goroutine running in C code is essentially equivalent to a nonpreemptible goroutine, which leads us to the kinds of problems we see in #10958. Those problems are quite bad, but we live with it because it's hard to write a nonpreemptible goroutine. But with an efficient #cgo nonblocking, it's easy.

To avoid that problem, we need to notify the scheduler that we are entering a different regime--the steps taken by entersyscall in runtime/proc.go.

No matter what, we need to switch to a new stack. A goroutine stack is often very small--as small as 128 bytes. No C function can be expected to run on a stack that small. So we must always change stacks.

And, of course, not matter what, we must pass the arguments using the C ABI.

But at this point--notify the scheduler, change stacks, pass C ABI arguments--we're pretty much doing everything that cgo does today. So #cgo nonblocking isn't saving us much time, if any.

eloff · 2016-06-16T17:01:07Z

Ok, now I understand the implications better. It's clear blocking could cause serious performance problems for the entire process - however, it's still not clear to me that it can deadlock, except in very unusual situations that don't arise in a typical Go program - I can live with that.

If #cgo nonblocking is implemented it needs to be performant, as you say. Changing stacks is unavoidable and passing arguments with the C ABI is unavoidable (but a lot easier for the compiler to do efficiently.) The interaction with the scheduler is the part that would have to be dropped.

I do know that the Go runtime makes exactly this tradeoff when it calls the TSAN runtime functions. They're known not to block, so that's considered safe enough and worth it for the 50% speedup. I have similar needs with an in-memory database, and I'm confident that those functions don't block, except maybe in a once-in-a-blue-moon scenario. I can live with that if it doesn't cause deadlocks in any real scenario.

beoran · 2016-06-30T22:03:04Z

Basically, the main problem is that calling C functions from Go is slow, and has only become slower. This unfortunately makes Go less suitable for graphical applications and games that have to call into a C OpenGL or Vulkan API many times per second. I don't have any great ideas on how to solve this problem, but I insist that it is something that needs to be improved.

bradfitz · 2016-06-30T22:06:42Z

@aclements mentioned him (or somebody) making defers allocated on the stack in Go 1.8. That is #14939. That will help with this bug too, since each cgo call does 2+ defers, IIRC.

eloff · 2016-06-30T22:32:30Z

There seems to be two things that can be done here.

Speed up existing CGO calls
Add an additional way of calling a well-behaved restricted subset of C functions

This issue should be about the second solution only.

There are definitely things that can be done for (1). That will shrink the set of programs which need solution (2). However (1) will always leave performance on the floor because it has to do extra work to handle not well-behaved C calls.

Now maybe CGO can be sped up enough that the performance difference is so small that the set of programs that would benefit from (2) is too small to be worth the effort of implementing and maintaining it. Or maybe it is deemed too dangerous to be used correctly - but then I question why the Go runtime uses it for the race detector.

ianlancetaylor · 2016-06-30T22:45:30Z

I don't think one can base any arguments on the race detector, which is a special case. It calls into the support library for every memory read or write. That is vastly more calls than even the most active cgo using program can possibly do.

eloff · 2016-07-01T00:15:35Z

I'll grant that, but I'm still not content to leave that much performance on the floor. I'm going to try the race detector C calling approach and see how much of a difference it makes in my use case. I'll report back in a couple weeks with some numbers hopefully.

A faster CGO mechanism allows calling C functions that do less work. e.g. if the C function takes 200ns and we speed up the CGO mechanism to 30ns (which seems to have been true in past Go versions), then the total calling time for that function goes from 400 to 230ns, almost twice as fast. If the C function itself takes 30ns, then it goes from 230ns to 60ns, which is a four fold improvement. That opens up more options when designing APIs.

I just spent two days implementing a function in assembly in Go because it needed popcnt, prefetch, and bsr. It could have been implemented in two hours in C, but the CGO overhead makes it a non-starter. In other places I duplicate code between C and Go (using very unidiomatic Go with lots of unsafe) to avoid the CGO overhead.

davecheney · 2016-07-01T00:22:54Z

A C function that took 30ns is, optimistically, at most 50 machine
instructions, assuming the code cache is hot and the code in question
performs only memory accesses that are already in the L1 data cache.

What work can realistically be done in that few number of cycles that
cannot be rewritten in Go?

On Fri, 1 Jul 2016, 10:15 Daniel Eloff [email protected] wrote:

I'll grant that, but I'm still not content to leave that much performance
on the floor. I'm going to try the race detector C calling approach and see
how much of a difference it makes in my use case. I'll report back in a
couple weeks with some numbers hopefully.

A faster CGO mechanism allows calling C functions that do less work. e.g.
if the C function takes 200ns and we speed up the CGO mechanism to 30ns
(which seems to have been true in past Go versions), then the total calling
time for that function goes from 400 to 230ns, almost twice as fast. If the
C function itself takes 30ns, then it goes from 230ns to 60ns, which is a
four fold improvement. That opens up more options when designing APIs.

I just spent two days implementing a function in assembly in Go because it
needed popcnt, prefetch, and bsr. It could have been implemented in two
hours in C, but the CGO overhead makes it a non-starter. In other places I
duplicate code between C and Go (using very unidiomatic Go with lots of
unsafe) to avoid the CGO overhead.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#16051 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AAAcA864gdwx_X3mtM1OEZFUon2m5hBaks5qRFwogaJpZM4I0hlW
.

aclements · 2016-07-01T14:49:44Z

@eloff, there's been some discussion (though no concrete proposals to my knowledge) to add functions for things like popcount and bsr that would be compiled as intrinsics where supported (like math.Sqrt is today). With 1.7 we're trying this out in the runtime, which now has a ctz intrinsic available to it on amd64 SSA. Obviously that doesn't solve the overall problem, but it would chip away at one more reason to use cgo in a low-overhead context.

eloff · 2016-07-01T19:18:22Z

@aclements That sounds very interesting, I'm looking forward to that! You're right, it would eliminate another reason for calling lightweight C functions from Go.

beoran · 2016-07-01T19:34:01Z

Well, it would be useful, but it's a C world, and certainly for programming
graphically intensive games, I need to call into C Apis very frequently.
All that stack swapping adds up a lot.

I think both approaches are needed, cgo calls need to be made faster in
general, but a way to mark certain C functions as special cases could
equally be useful.

To throw out a wild idea if myself, maybe consecutive calls into C could be
grouped together and use the same C stack without swapping back. This would
probably require a cgo directive...
On 1 Jul 2016 9:19 pm, "Daniel Eloff" [email protected] wrote:

@aclements https://github.com/aclements That sounds very interesting,
I'm looking forward to that! You're right, it would eliminate another
reason for calling lightweight C functions from Go.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#16051 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AAEWeUDyOm9kqAgwrqBzfLHRe45-llHRks5qRWg4gaJpZM4I0hlW
.

bcmills · 2016-07-01T19:44:39Z

To throw out a wild idea if myself, maybe consecutive calls into C could be grouped together and use the same C stack without swapping back.

How is that different from writing a function in C that makes those calls in sequence?

beoran · 2016-07-01T20:01:24Z

The difference is that I would like to do it in go, not C. Sure, I could
write a low level game engine in C and call into that from go. But that
defeats the purpose of using go in the first place.

I tried to write a 2d game engine in go a few years ago, but the cgo
overhead was too much. So I was forced to go back to C, and since all the
logic was in C, there wasn't much reason left to use go for that project.

Such a C block in go as I propose, would be a bit like extern "C" in C++,
you could still use a subset of go while handling the lower level API.
On 1 Jul 2016 9:45 pm, "Bryan C. Mills" [email protected] wrote:

To throw out a wild idea if myself, maybe consecutive calls into C could
be grouped together and use the same C stack without swapping back.

How is that different from writing a function in C that makes those calls
in sequence?

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#16051 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AAEWeTUDLzs0Fs-SmywjjLMualwbqWF9ks5qRW5ggaJpZM4I0hlW
.

kunos · 2016-07-09T20:06:38Z

Another gamedev here supporting this proposal. As it stands now, Go would be really a good candidate for a gamedev language, the GC is getting really fast and the language has a lot to offer to the industry.
Sadly the cgo performance make it a no starter at the moment. I keep wondering how this cost is also impacting other areas like server applications with heavy UDP traffic. I am using Go for the server in my game and looking at the code it eventually ends up calling various WSA* functions on Windows.. so there you go, that price is there even when developing non strictly games software.

Perhaps on Linux there is a difference between a syscall and a cgo call, on Windows it does not seem to be the case tho.

I wonder if, instead of tagging cgo calls we could tag the goroutine to use some kind of "special" stack and get ad-hoc treatment from the scheduler to be able to call into C with as close to 0 cost as possible.

I do think this could improve the performance of the entire tech, I agree with rewriting as much as possible in Go, that's fine, but it's still a world of C based operating systems (and it will be for the foreseeable future) and, eventually, you'll have to call C stuff in order to make things happen.. be it a triangle on the screen or a packet over the network.

aclements · 2016-07-09T22:56:34Z

Our current plan is to start by optimizing the existing cgo path and see how far we can get with that before we consider introducing a new mechanism (particularly one that could introduce new classes of bugs).

I keep wondering how this cost is also impacting other areas like server applications with heavy UDP traffic.

I'm not sure, but it might be possible to further optimize Windows system calls beyond the regular cgo path. Most other platforms don't use cgo for syscalls, and there's much less overhead. OTOH, Windows syscalls can have callbacks, so the generality may be necessary.

eloff · 2016-07-09T23:12:37Z

As the OP, I applaud that strategy. I still think a specialized mechanism
may be needed, but it seems relatively easy to roll my own. If I supply my
own C stacks then I don't need any fragile hooks into the go runtime or
scheduler, just a little assembly code. So if it's still a problem
afterward, there is an option that just requires a little effort - and I
think for a potentially unsafe and advanced feature that's even desirable.

On Sat, Jul 9, 2016 at 3:57 PM, Austin Clements [email protected]
wrote:

Our current plan is to start by optimizing the existing cgo path and see
how far we can get with that before we consider introducing a new mechanism
(particularly one that could introduce new classes of bugs).

I keep wondering how this cost is also impacting other areas like server
applications with heavy UDP traffic.

I'm not sure, but it might be possible to further optimize Windows system
calls beyond the regular cgo path. Most other platforms don't use cgo for
syscalls, and there's much less overhead. OTOH, Windows syscalls can have
callbacks, so the generality may be necessary.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#16051 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AABrtk08THykI6Q2OnEp9wXn7RlUcgkoks5qUCdagaJpZM4I0hlW
.

alexbrainman · 2016-07-09T23:15:55Z

I keep wondering how this cost is also impacting other areas like server applications with heavy UDP traffic.

I think syscall cost is not significant here. But I might be wrong. If you have some real numbers to show one way or other, we could try and make it faster.

Windows syscalls can have callbacks, so the generality may be necessary

Very few Go windows apps use callbacks - the ones who do call syscall.NewCallback. We could optimize the app if it doesn't calls syscall.NewCallback.

Alex

kunos · 2016-07-10T11:03:32Z

Actually I did some tests right now, calling a function from a dll, using CGO ends up being faster than using syscall.Syscall to call the same function.. on Windows.

You are right tho about the UDP thing, I did some actual tests/calculations and with a rate of about 20K packets/second I am looking at a 2-3ms overhead to call into Winsock so nothing to be worried about at all.

alexbrainman · 2016-07-10T13:03:09Z

calling a function from a dll, using CGO ends up being faster than using syscall.Syscall to call the same function.. on Windows.

That is surprising to me. Do you mind providing instructions to reproduce what you see? Thank you.

Alex

kunos · 2016-07-10T13:26:51Z

Sure, this is the relevant part of the code:


/*
#include <windows.h>
#include <mmsystem.h>

#cgo LDFLAGS: -lwinmm
*/
import "C"

var (
    winmm, _        = syscall.LoadLibrary("winmm.dll")
    ftimeGetTime, _ = syscall.GetProcAddress(winmm, "timeGetTime")
)

func tgt() uint32 {
    ret, _, _ := syscall.Syscall(ftimeGetTime, 0, 0, 0, 0)

    return uint32(ret)
}

func testCGOCall() {
    v := uint32(0)

    for i := 0; i < 20000; i++ {
        v += uint32(C.timeGetTime())
    }

}

func testSyscallCall() {
    v := uint32(0)

    for i := 0; i < 20000; i++ {
        v += tgt()
    }
}

And this is the result on my PC:
Syscall: 4.0056ms
CGO: 3.0032ms

alexbrainman · 2016-07-11T02:00:07Z

And this is the result on my PC:
Syscall: 4.0056ms
CGO: 3.0032ms

Thank you for the source. But I don't think you run test for long enough. I did

c:\dev\src\issues\issue16051>type foo.go
package issue16051

import (
        "syscall"
)

/*
#include <windows.h>
#include <mmsystem.h>

#cgo LDFLAGS: -lwinmm
*/
import "C"

func Call_timeGetTime_CGO() uint32 {
        return uint32(C.timeGetTime())
}

var (
        winmm        = syscall.MustLoadDLL("winmm.dll")
        ftimeGetTime = winmm.MustFindProc("timeGetTime")
)

func Call_timeGetTime_Syscall() uint32 {
        ret, _, _ := syscall.Syscall(ftimeGetTime.Addr(), 0, 0, 0, 0)
        return uint32(ret)
}

c:\dev\src\issues\issue16051>type foo_test.go
package issue16051_test

import (
        "testing"

        "issues/issue16051"
)

func BenchmarkTimeGetTimeSyscall(b *testing.B) {
        for i := 0; i < b.N; i++ {
                issue16051.Call_timeGetTime_Syscall()
        }
}

func BenchmarkTimeGetTimeCGO(b *testing.B) {
        for i := 0; i < b.N; i++ {
                issue16051.Call_timeGetTime_CGO()
        }
}

c:\dev\src\issues\issue16051>go test -run=none -bench=.
testing: warning: no tests to run
BenchmarkTimeGetTimeSyscall     10000000               228 ns/op
BenchmarkTimeGetTimeCGO         10000000               229 ns/op
PASS
ok      issues/issue16051       5.087s

c:\dev\src\issues\issue16051>

and I see not much difference between syscall and cgo.

Alex

PS: On Windows you cannot even measure it if it is less than 15ms

kunos · 2016-07-11T06:59:04Z

Well I get the same results increasing the loop count:

Syscall: 2.5059244s
CGO: 2.2626265s

Perhaps I have a weird PC.
And no, using the right timer on Windows you can get reasonable results even for <1ms .

rasky · 2016-07-31T10:28:53Z

I prepared a self-contained benchmark that highlights real-world cgo performance issues here: https://github.com/rasky/tdb-cgo-bench. It's an example that uses the cgo wrapper of TrailDB, which highlights a real-world case where the overhead of cgo is very high. The included benchmark shows a 6x performance regression when executing the same basic code in Go vs C. I hope this can be useful while working on cgo performances in the 1.8 cycle.

aclements · 2016-08-02T18:50:14Z

@rasky, that's very helpful. Thanks! I was able to reproduce the 6x slowdown locally with your benchmark. I took a glance at the profile; much like other high-overhead cgo benchmarks (#9704), it spends a lot of time dealing with defers (~28%). We have a plan to fix that (#14939). Much of the remaining overhead goes into entersyscall/exitsyscall, which, unfortunately, are much trickier to optimize.

adg · 2016-09-12T23:54:10Z

Let's revisit this once the work to speed up defer is complete.

gopherbot · 2016-09-23T18:00:39Z

CL https://golang.org/cl/29656 mentions this issue.

This optimizes deferproc and deferreturn in various ways. The most important optimization is that it more carefully arranges to prevent preemption or stack growth. Currently we do this by switching to the system stack on every deferproc and every deferreturn. While we need to be on the system stack for the slow path of allocating and freeing defers, in the common case we can fit in the nosplit stack. Hence, this change pushes the system stack switch down into the slow paths and makes everything now exposed to the user stack nosplit. This also eliminates the need for various acquirem/releasem pairs, since we are now preventing preemption by preventing stack split checks. As another smaller optimization, we special case the common cases of zero-sized and pointer-sized defer frames to respectively skip the copy and perform the copy in line instead of calling memmove. This speeds up the runtime defer benchmark by 42%: name old time/op new time/op delta Defer-4 75.1ns ± 1% 43.3ns ± 1% -42.31% (p=0.000 n=8+10) In reality, this speeds up defer by about 2.2X. The two benchmarks below compare a Lock/defer Unlock pair (DeferLock) with a Lock/Unlock pair (NoDeferLock). NoDeferLock establishes a baseline cost, so these two benchmarks together show that this change reduces the overhead of defer from 61.4ns to 27.9ns. name old time/op new time/op delta DeferLock-4 77.4ns ± 1% 43.9ns ± 1% -43.31% (p=0.000 n=10+10) NoDeferLock-4 16.0ns ± 0% 15.9ns ± 0% -0.39% (p=0.000 n=9+8) This also shaves 34ns off cgo calls: name old time/op new time/op delta CgoNoop-4 122ns ± 1% 88.3ns ± 1% -27.72% (p=0.000 n=8+9) Updates #14939, #16051. Change-Id: I2baa0dea378b7e4efebbee8fca919a97d5e15f38 Reviewed-on: https://go-review.googlesource.com/29656 Reviewed-by: Keith Randall <[email protected]>

This optimizes deferproc and deferreturn in various ways. The most important optimization is that it more carefully arranges to prevent preemption or stack growth. Currently we do this by switching to the system stack on every deferproc and every deferreturn. While we need to be on the system stack for the slow path of allocating and freeing defers, in the common case we can fit in the nosplit stack. Hence, this change pushes the system stack switch down into the slow paths and makes everything now exposed to the user stack nosplit. This also eliminates the need for various acquirem/releasem pairs, since we are now preventing preemption by preventing stack split checks. As another smaller optimization, we special case the common cases of zero-sized and pointer-sized defer frames to respectively skip the copy and perform the copy in line instead of calling memmove. This speeds up the runtime defer benchmark by 42%: name old time/op new time/op delta Defer-4 75.1ns ± 1% 43.3ns ± 1% -42.31% (p=0.000 n=8+10) In reality, this speeds up defer by about 2.2X. The two benchmarks below compare a Lock/defer Unlock pair (DeferLock) with a Lock/Unlock pair (NoDeferLock). NoDeferLock establishes a baseline cost, so these two benchmarks together show that this change reduces the overhead of defer from 61.4ns to 27.9ns. name old time/op new time/op delta DeferLock-4 77.4ns ± 1% 43.9ns ± 1% -43.31% (p=0.000 n=10+10) NoDeferLock-4 16.0ns ± 0% 15.9ns ± 0% -0.39% (p=0.000 n=9+8) This also shaves 34ns off cgo calls: name old time/op new time/op delta CgoNoop-4 122ns ± 1% 88.3ns ± 1% -27.72% (p=0.000 n=8+9) Updates golang#14939, golang#16051. Change-Id: I2baa0dea378b7e4efebbee8fca919a97d5e15f38 Reviewed-on: https://go-review.googlesource.com/29656 Reviewed-by: Keith Randall <[email protected]>

ianlancetaylor · 2016-10-24T22:45:58Z

We are going to decline this specific proposal. Marking some cgo calls as non-blocking can fail in too many subtle ways that are hard to understand. We are certainly interested in speeding up cgo calls in general, but this approach is not the one we will take.

Note that we have sped up cgo calls in 1.8 in https://golang.org/cl/30080.

linkerlin · 2017-09-11T12:52:31Z

@ianlancetaylor what about seperating C and Go in two processes which connected with pipe or mmap, and make a C-call is translate to send a msg to C-process and waiting for a response.
This way can make C-code run faster and keep Go runtime happy.

davecheney · 2017-09-11T21:08:05Z

@linkerlin this proposal was declined eight months ago. Please do not continue the conversation on closed issues. Please see https://golang.org/wiki/Questions for good places to ask. Thanks.

aclements · 2017-09-12T02:04:22Z

(We shouldn't continue this conversation here, as @davecheney pointed out, but just to close the loop...)

@linkerlin, putting them in separate processes would make it hard to pass pointers, which is one of the main features of cgo. But even putting them in separate threads in the same process wouldn't help with the overhead: the transition would still have to interact with the Go scheduler and then it would also have to interact with the OS scheduler. Nonetheless, if an application wanted to split up its Go and C components this way, it could certainly do so without runtime help.

chen3feng · 2018-06-29T11:28:55Z

I support this idea.
But I think it will be better if we can control blocking/nonblocking at each invoking point.
For example, say the write system call, if it is writing to a disk file, it will not block, but if it writes to a socket/pile, it may be blocked.

Maybe we can use some syntax like this:
C.write(fd, data, size) // will switch thread
C.nonblocking.write(fd, data, size) // will not switch thread

ianlancetaylor · 2018-06-29T12:55:04Z

@chen3feng This issue is closed. Comments on closed issues are not tracked. Please open a new issue or use a forum; see https://golang.org/wiki/Questions . Thanks.

I think it is extremely unlikely that we would ever let the programmer pick whether a cgo call blocks or not, as that is very easy to get wrong, and getting it wrong may cause the entire program to freeze. That said, please do not reply here.

ianlancetaylor changed the title ~~A faster C-call mechanism for non-blocking C functions~~ proposal: a faster C-call mechanism for non-blocking C functions Jun 13, 2016

ianlancetaylor added the Proposal label Jun 13, 2016

ianlancetaylor added this to the Proposal milestone Jun 13, 2016

eliasnaur mentioned this issue Sep 2, 2016

proposal: expose Java API to gomobile bind programs #16876

Closed

bradfitz assigned aclements Sep 12, 2016

ianlancetaylor closed this as completed Oct 24, 2016

ianlancetaylor added the Proposal-Declined label Oct 24, 2016

thepudds mentioned this issue Jan 14, 2017

math/bits: an integer bit twiddling library #18616

Closed

ayj mentioned this issue Mar 16, 2017

Evaluate the use of a full-fledged language for attribute pattern matching. istio/old_mixer_repo#387

Closed

petermattis mentioned this issue Aug 1, 2017

perf: explore pushing parts of MVCC or KV evaluation into C++ cockroachdb/cockroach#17172

Closed

fdr mentioned this issue Nov 2, 2017

Incremental backup wal-g/wal-g#29

Merged

ianlancetaylor added the FrozenDueToAge label Jun 29, 2018

golang locked as resolved and limited conversation to collaborators Jun 29, 2018

rsc unassigned aclements Jun 23, 2022

proposal: a faster C-call mechanism for non-blocking C functions #16051

proposal: a faster C-call mechanism for non-blocking C functions #16051

Comments

eloff commented Jun 13, 2016 • edited Loading

ianlancetaylor commented Jun 13, 2016

eloff commented Jun 13, 2016 • edited Loading

ALTree commented Jun 13, 2016

rasky commented Jun 15, 2016

ianlancetaylor commented Jun 15, 2016

eloff commented Jun 15, 2016

ianlancetaylor commented Jun 16, 2016

eloff commented Jun 16, 2016

beoran commented Jun 30, 2016

bradfitz commented Jun 30, 2016

eloff commented Jun 30, 2016

ianlancetaylor commented Jun 30, 2016

eloff commented Jul 1, 2016

davecheney commented Jul 1, 2016

aclements commented Jul 1, 2016

eloff commented Jul 1, 2016

beoran commented Jul 1, 2016

bcmills commented Jul 1, 2016

beoran commented Jul 1, 2016

kunos commented Jul 9, 2016

aclements commented Jul 9, 2016

eloff commented Jul 9, 2016

alexbrainman commented Jul 9, 2016

kunos commented Jul 10, 2016

alexbrainman commented Jul 10, 2016

kunos commented Jul 10, 2016 • edited Loading

alexbrainman commented Jul 11, 2016

kunos commented Jul 11, 2016

rasky commented Jul 31, 2016

aclements commented Aug 2, 2016

adg commented Sep 12, 2016

gopherbot commented Sep 23, 2016

ianlancetaylor commented Oct 24, 2016

linkerlin commented Sep 11, 2017

davecheney commented Sep 11, 2017

aclements commented Sep 12, 2017 • edited Loading

chen3feng commented Jun 29, 2018 • edited Loading

ianlancetaylor commented Jun 29, 2018

eloff commented Jun 13, 2016 •

edited

Loading

eloff commented Jun 13, 2016 •

edited

Loading

kunos commented Jul 10, 2016 •

edited

Loading

aclements commented Sep 12, 2017 •

edited

Loading

chen3feng commented Jun 29, 2018 •

edited

Loading