
[DP-108] Investigate why spawn is slow #206

Open · qnikst opened this issue Jun 17, 2015 · 17 comments

@qnikst (Contributor) commented Jun 17, 2015

[Imported from JIRA. Reported by Niklas Hambuechen ([email protected]) as DP-108 on 2015-03-15 02:48:31]
Copying my posts from the #haskell-distributed IRC channel:

spawn seems to be very slow for me, even though I'm on localhost. Doing it in a loop gets me to almost 50 ms per spawn; why would it be so high? I can't use spawnAsync in my case, but why would a spawn on localhost take this long in the first place? My ethernet latency is 0.5ms and localhost latency is 0.1ms, so that can't be it. CPU usage is low too.

I have a suspicion: using strace -f -c -w on the node onto which I spawn the processes (a slave using simplelocalnet), I see 179596 calls to the select syscall. That doesn't seem right given that I only do 100 spawns and nothing else. Might this be that the master is sending a lot of small numbers, which it recvs one after the other? I think this is the only way to trigger so many selects, and I've seen that recvInt32 does exactly such a thing (recv'ing 4 bytes at a time), and it does appear in my profiling output.
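
For reference, the recvInt32-style framing being described reads a 4-byte length header with its own recv call before reading the payload. Below is a minimal illustrative sketch of that pattern, not the actual Network.Transport.Internal code:

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftL, (.|.))
import Network.Socket (Socket)
import Network.Socket.ByteString (recv)

-- Keep calling recv until exactly n bytes have arrived.
recvExact :: Socket -> Int -> IO BS.ByteString
recvExact sock n = go BS.empty n
  where
    go acc 0 = return acc
    go acc k = do
      chunk <- recv sock k  -- may return fewer than k bytes
      if BS.null chunk
        then ioError (userError "peer closed connection")
        else go (BS.append acc chunk) (k - BS.length chunk)

-- Read a 4-byte length header (big-endian in this sketch), then the
-- payload it announces. Every frame therefore costs at least two recv
-- round-trips through the IO manager, which is one way to generate a
-- flood of select calls for small messages.
recvFrame :: Socket -> IO BS.ByteString
recvFrame sock = do
  hdr <- recvExact sock 4
  let len = BS.foldl' (\acc b -> (acc `shiftL` 8) .|. fromIntegral b) 0 hdr
  recvExact sock len
```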

Further, the 50ms that each spawn takes are suspiciously close to the 40ms TCP ACK delay on Linux (I'm on Linux), as mentioned here: http://stackoverflow.com/a/2253620/263061.

I have found something different though that fixes the problem: setting +RTS -V0 on the slave reduces the time for each spawn to 3ms. How can it be that this has such a huge effect?

I can get the same good results with +RTS -C0.001. But why? This sets the context switch interval; if that has such a positive effect, doesn't that mean there are other Haskell threads around that actually run and thus stop my recv from immediately being rescheduled? Assume there's only one recv that I'm running; when a context switch interrupt arrives, interrupting the recv, the scheduler should see that there are no other Haskell threads to run and immediately go back into my recv. I can't see a reason why it should do anything else that's not my recv ...

Also, setting +RTS -C to something very high does not make it slower than 50ms per spawn, e.g. setting +RTS -C1 does not make it take 1 second per spawn, it's still 50ms.

Setting +RTS -N2/-N3/-N4 helps, too: I get down to 6 ms, compared to the 50 ms for -N1.

nh2: could it be that there are actually 2 recvs going on, but only one can be active at a time when running on -N1, so the system toggles between them at the context switch interval -C, which defaults to 20ms, and two of these switches make up the ~50ms that I'm seeing?

@qnikst changed the title from "Investigate why spawn is slow" to "[DP-108] Investigate why spawn is slow" on Jun 18, 2015
@teh (Contributor) commented Feb 5, 2017

@qnikst - did you ever get anywhere with this?

I see similar 50-100ms delays for simple calls like whereis, but I'm not sure it's the same problem. I can't fix it with any of the suggestions above, so I was wondering whether you discovered either the cause or a solution?

@hyperthunk (Member) commented

Could this be related to the issue with network when it is compiled with -threaded? Do you both see this for spawnLocal too?

@qnikst (Contributor, Author) commented Feb 5, 2017

@teh unfortunately we didn't get very far in this direction.

We discussed the problem and possible solutions with @nh2. If I recall correctly, the large timings were due to TCP's Nagle algorithm. One solution is to set the TCP_NODELAY option in network transport TCP (this is possible today); another that was discussed was to extend the network-transport API with a flush call that would flush all pending messages. If I recall correctly, everybody agreed that this approach could work, but it didn't go any further. In our project we use the TCP_NODELAY option; we are prepared to pay the associated overhead, which is fine for our use case but may not be for others.
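
For reference, a minimal sketch of enabling the option, assuming the network-transport-tcp 0.5-era API (the tcpNoDelay field is the option referred to above; later versions of createTransport take an extra address-mapping argument, so check your version):

```haskell
import Network.Transport.TCP (createTransport, defaultTCPParameters,
                              TCPParameters (tcpNoDelay))

main :: IO ()
main = do
  -- tcpNoDelay asks the transport to set TCP_NODELAY on the sockets it
  -- creates, so small messages go out without waiting for Nagle/ACKs.
  result <- createTransport "127.0.0.1" "8081"
              defaultTCPParameters { tcpNoDelay = True }
  case result of
    Left err         -> print err
    Right _transport -> putStrLn "transport created with TCP_NODELAY"
```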

If you have a minimal example, we could revisit this issue and try to investigate it further.

@teh (Contributor) commented Feb 5, 2017

Created https://github.com/teh/dh-minimal-slow-testcase - see README for details

@teh (Contributor) commented Feb 6, 2017

I think I narrowed this down to 1-byte writes and reads. Looking at strace:

strace -c -f binary
[...]
 62.79    0.000054           0      9796         5 write
 37.21    0.000032           0      9791           poll
[...]

And the actual trace looks like this:

select(5, [3 4], [], NULL, {tv_sec=0, tv_usec=0}) = 0 (Timeout)
poll([{fd=2, events=POLLOUT}], 1, 0)    = 1 ([{fd=2, revents=POLLOUT}])
write(2, "1", 1)                        = 1
poll([{fd=2, events=POLLOUT}], 1, 0)    = 1 ([{fd=2, revents=POLLOUT}])
write(2, ":", 1)                        = 1
poll([{fd=2, events=POLLOUT}], 1, 0)    = 1 ([{fd=2, revents=POLLOUT}])
write(2, "8", 1)                        = 1
poll([{fd=2, events=POLLOUT}], 1, 0)    = 1 ([{fd=2, revents=POLLOUT}])
write(2, "0", 1)                        = 1
poll([{fd=2, events=POLLOUT}], 1, 0)    = 1 ([{fd=2, revents=POLLOUT}])
write(2, "8", 1)                        = 1
poll([{fd=2, events=POLLOUT}], 1, 0)    = 1 ([{fd=2, revents=POLLOUT}])
write(2, "1", 1)                        = 1
poll([{fd=2, events=POLLOUT}], 1, 0)    = 1 ([{fd=2, revents=POLLOUT}])
write(2, ":", 1)                        = 1
poll([{fd=2, events=POLLOUT}], 1, 0)    = 1 ([{fd=2, revents=POLLOUT}])
write(2, "0", 1)                        = 1
poll([{fd=2, events=POLLOUT}], 1, 0)    = 1 ([{fd=2, revents=POLLOUT}])
write(2, "\\", 1)                       = 1
poll([{fd=2, events=POLLOUT}], 1, 0)    = 1 ([{fd=2, revents=POLLOUT}])
write(2, "2", 1)                        = 1
poll([{fd=2, events=POLLOUT}], 1, 0)    = 1 ([{fd=2, revents=POLLOUT}])
write(2, "3", 1)                        = 1
poll([{fd=2, events=POLLOUT}], 1, 0)    = 1 ([{fd=2, revents=POLLOUT}])
write(2, "0", 1)                        = 1
poll([{fd=2, events=POLLOUT}], 1, 0)    = 1 ([{fd=2, revents=POLLOUT}])
write(2, "k", 1)                        = 1
poll([{fd=2, events=POLLOUT}], 1, 0)    = 1 ([{fd=2, revents=POLLOUT}])
write(2, "\\", 1)                       = 1

@qnikst (Contributor, Author) commented Feb 6, 2017

This strace looks like logs being dumped to stderr (fd=2). It isn't pretty, since every single write becomes a poll>>write pair, but it may be masking problems in d-p itself. Can you try export DISTRIBUTED_PROCESS_TRACE_EVENT=y instead? That should reduce this kind of overhead.

@hyperthunk (Member) commented

> Can you try export DISTRIBUTED_PROCESS_TRACE_EVENT=y instead? That should reduce this kind of overhead.

@qnikst do you mean DISTRIBUTED_PROCESS_TRACE_EVENTLOG or DISTRIBUTED_PROCESS_TRACE_CONSOLE? Also, if you don't set any trace flags, there will be nothing in the output. See the section on DISTRIBUTED_PROCESS_TRACE_FLAGS in the documentation.

@hyperthunk (Member) commented

Also, tracing might still be a bit broken for some cases, as per #265. That is high on my stack to fix, and I'm hoping to get to it once I've got CI fixed for the repos I'm maintainer for.

@teh (Contributor) commented Feb 6, 2017

Ah, that was dumb, apologies. It looks like unbuffered one-character-at-a-time logging, which is not ideal, but almost certainly a separate issue.
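
For what it's worth, GHC leaves stderr unbuffered by default, which is what turns each logged character into its own write(2). A minimal sketch of the usual mitigation, assuming the logging goes through the Handle API:

```haskell
import System.IO (BufferMode (LineBuffering), hPutStrLn, hSetBuffering, stderr)

main :: IO ()
main = do
  -- Switch stderr from NoBuffering to LineBuffering so each trace line
  -- becomes a single write(2) instead of one syscall per character.
  hSetBuffering stderr LineBuffering
  hPutStrLn stderr "127.0.0.1:8081:0 example trace line"
```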

Without logging to CONSOLE (export -n DISTRIBUTED_PROCESS_TRACE_CONSOLE) I can reproduce the large number of select calls. While that's not great either, the time doesn't seem to be lost as CPU time.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
[...]
100.00    0.002215           2      1058       304 select

Another thing I noticed is that I see an unusually high number of SIGVTALRM compared to "normal" GHC processes. Could that cause the slowdown somehow?

--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=0, ptr=NULL}} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)

BTW, can you reproduce the slowness with my repo above? If you can't, I'll focus on my system (build flags etc.); if you can, I'd start looking into the h-d library code.

@teh (Contributor) commented Feb 6, 2017

Here's an illustration of the issue as a strace log. Apologies for so much text; let me know if you'd prefer the data in a separate gist or as an attachment.

Specifically, the client sends data at [06.701756] C: the four bytes \0\0\5\332.

But this data only arrives at the server at [06.742397] S, i.e. ~40ms later. As mentioned in the initial post, that's suspiciously close to the default Linux TCP ACK delay.

legend:

timestamp   [C]lient/[S]erver 
|           | strace output 
|           | |
[06.701576] C recvfrom(6, 0x42005ff150, 4, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[06.701607] C select(7, [3 4 6], [], NULL, {tv_sec=0, tv_usec=0}) = 0 (Timeout)
[06.701635] C writev(6, [{iov_base="\0\0\0\0", iov_len=4}, {iov_base="\0\0\4\0", iov_len=4}], 2) = 8
[06.701664] C select(7, [3 4 6], [], NULL, {tv_sec=0, tv_usec=0}) = 0 (Timeout)
[06.701684] C writev(6, [{iov_base="\0\0\4\1", iov_len=4}, {iov_base="\0\0\0\31", iov_len=4}, {iov_base="\1\0\0\0\0\0\0\0\020127.0.0.1:8081:0", iov_len=25}], 3) = 33
[06.701704] S select(5, [3 4], [], NULL, {tv_sec=0, tv_usec=0}) = 0 (Timeout)
[06.701756] C writev(6, [{iov_base="\0\0\4\1", iov_len=4}, {iov_base="\0\0\5\332", iov_len=4}, {iov_base="\0\0\0\0\0\0\0\0\020127.0.0.1:8001:0\361;\374\242\0\0\0"..., iov_len=1498}], 3) = 1506
[06.701790] C select(7, [3 4 6], [], NULL, {tv_sec=0, tv_usec=0}) = 0 (Timeout)
[06.701829] C select(7, [3 4 6], [], NULL, {tv_sec=0, tv_usec=0}) = 0 (Timeout)
[06.701860] C writev(6, [{iov_base="\0\0\4\0", iov_len=4}, {iov_base="\0\0\0\31", iov_len=4}, {iov_base="\1\0\0\0\0\0\0\0\020127.0.0.1:8081:0", iov_len=25}], 3) = 33
[06.701891] C writev(6, [{iov_base="\0\0\4\0", iov_len=4}, {iov_base="\0\0\0?", iov_len=4}, {iov_base="\0\0\0\0\0\0\0\0\020127.0.0.1:8001:0\361;\374\242\0\0\0"..., iov_len=63}], 3) = 71
[06.709408] C select(7, [3 4 6], [], NULL, {tv_sec=2, tv_usec=882318}) = ? ERESTARTNOHAND (To be restarted if no handler)
[06.709539] C rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
[06.710627] S select(5, [3 4], [], NULL, NULL)        = ? ERESTARTNOHAND (To be restarted if no handler)
[06.710696] S rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
[06.719499] C select(7, [3 4 6], [], NULL, {tv_sec=2, tv_usec=874859}) = ? ERESTARTNOHAND (To be restarted if no handler)
[06.719718] C rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
[06.720603] S select(5, [3 4], [], NULL, NULL)        = ? ERESTARTNOHAND (To be restarted if no handler)
[06.720681] S rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
[06.729506] C select(7, [3 4 6], [], NULL, {tv_sec=2, tv_usec=864945}) = ? ERESTARTNOHAND (To be restarted if no handler)
[06.729670] C rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
[06.730703] S select(5, [3 4], [], NULL, NULL)        = ? ERESTARTNOHAND (To be restarted if no handler)
[06.730804] S rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
[06.739496] C select(7, [3 4 6], [], NULL, {tv_sec=2, tv_usec=855265}) = ? ERESTARTNOHAND (To be restarted if no handler)
[06.739681] C rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
[06.740644] S select(5, [3 4], [], NULL, NULL)        = ? ERESTARTNOHAND (To be restarted if no handler)
[06.740723] S rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
[06.742081] S select(5, [3 4], [], NULL, NULL)        = 1 (in [4])
[06.742171] S recvfrom(4, "\0\0\0\0", 4, 0, NULL, NULL) = 4
[06.742194] S recvfrom(4, "\0\0\4\0", 4, 0, NULL, NULL) = 4
[06.742214] S recvfrom(4, "\0\0\4\1", 4, 0, NULL, NULL) = 4
[06.742231] S recvfrom(4, "\0\0\0\31", 4, 0, NULL, NULL) = 4
[06.742249] S recvfrom(4, "\1\0\0\0\0\0\0\0\020127.0.0.1:8081:0", 25, 0, NULL, NULL) = 25
[06.742266] S recvfrom(4, "\0\0\4\1", 4, 0, NULL, NULL) = 4
[06.742283] S select(4, [3], [], NULL, {tv_sec=0, tv_usec=0}) = 0 (Timeout)
[06.742351] S select(4, [3], [], NULL, {tv_sec=0, tv_usec=0}) = 0 (Timeout)
[06.742397] S recvfrom(4, "\0\0\5\332", 4, 0, NULL, NULL) = 4

@teh (Contributor) commented Feb 6, 2017

Summary for the next person hitting this issue: The problem is TCP_NODELAY not being set. TCP_NODELAY needs to be set on all transports, not just the "calling" side.

The cause is a bad interaction between the small initial window size and delayed TCP ACKs: the sending side waits for an ACK before sending more data, but the receiving side delays its ACK.

This is a good explanation of the issue.
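
For illustration, setting the option on a raw socket with the network package looks roughly like this (a sketch only; network-transport-tcp exposes it as a transport parameter instead, as in the earlier example):

```haskell
import Network.Socket

main :: IO ()
main = do
  sock <- socket AF_INET Stream defaultProtocol
  -- TCP_NODELAY disables Nagle's algorithm, so small segments are sent
  -- immediately instead of waiting for the peer's (possibly delayed) ACK.
  setSocketOption sock NoDelay 1
  close sock
```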

@qnikst (Contributor, Author) commented Feb 6, 2017

Yeah, as I said in #206 (comment), we solved a similar issue by using TCP_NODELAY (and exposing that option in the n-t-tcp API). But TCP_NODELAY buys latency at the cost of throughput, quite possibly including network throughput if your nodes are not close to each other. So if your communication streams enough data to keep TCP flowing, this option is undesirable and can even hurt performance (for this reason it's not the default). This is why the flush solution was discussed.

@hyperthunk (Member) commented

@qnikst can you point me to the discussion about flush please? I'm going to look into this after the v0.8 release is done.

@qnikst (Contributor, Author) commented Nov 21, 2018

Here is a summary: #206 (comment). The discussion itself happened a few years ago, in person, at ZuriHac.

@hyperthunk (Member) commented

@qnikst I saw that, yeah; the question is what the semantics are for deciding when to call flush. I wouldn't have thought implementing flush is too onerous, but knowing when to empty your buffers is another matter entirely...

@hyperthunk (Member) commented

Also, does this issue persist with the example repository/project you wrote, @teh, if you use a different network-transport library? For example the network-transport-cci library, or network-transport-zeromq? Those libraries may be somewhat better behaved since the stacks they're running on top of are doing a lot more work to manage communication between endpoints...

@qnikst (Contributor, Author) commented Nov 21, 2018

@hyperthunk for flush, the idea was to introduce a new method on the connection and allow the user to flush explicitly. Wrt TCP_NODELAY, it's up to the user whether to enable it or not. I'm not sure this issue is a blocker for a release: it's currently possible to set NODELAY, and adding a new method to the connection would be an API break and would require more thought and testing.

I'm not sure that I see good places for an implicit flush in the project.
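
For concreteness, a purely hypothetical sketch of the kind of API extension being discussed; none of this exists in network-transport, and all names here are invented:

```haskell
import Data.ByteString (ByteString)
import Network.Transport (SendErrorCode, TransportError)

-- Hypothetical shape only: 'flush' is the proposed new method; the other
-- fields mirror Network.Transport.Connection.
data BufferedConnection = BufferedConnection
  { send  :: [ByteString] -> IO (Either (TransportError SendErrorCode) ())
    -- ^ queue messages; the transport is free to coalesce them
  , flush :: IO ()
    -- ^ force buffered bytes onto the wire; deciding when to call this
    --   is the open design question raised above
  , close :: IO ()
  }
```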
