
Dp 110 fd #199

Closed
wants to merge 1 commit into from

Conversation

@qnikst (Contributor) commented Jun 16, 2015

No description provided.

@qnikst (Contributor, Author) commented Jun 16, 2015

Please rebase on top of master.

@mboes (Contributor) commented Jun 17, 2015

I'd like to see a proper performance analysis of whether the thread pool improves or impedes performance in any significant way before this goes in.

@qnikst (Contributor, Author) commented Jun 17, 2015

This issue is not about performance; it is about non-blocking operations, and the thread pool here guarantees ordering of operations. So I'm not sure what we can measure here.
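For context, the approach under discussion amounts to something like the following sketch (illustrative only, not the code in this PR; the names SendPool and enqueue are made up): one queue and one worker thread per key (here, per NodeId), so operations destined for one node are performed in order, and an operation that blocks on an unreachable node does not stall the caller.

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (MVar, newMVar, modifyMVar)
import Control.Monad (forever)
import qualified Data.Map.Strict as Map

-- One Chan of queued IO actions per key; a dedicated thread drains each Chan.
newtype SendPool k = SendPool (MVar (Map.Map k (Chan (IO ()))))

newSendPool :: IO (SendPool k)
newSendPool = fmap SendPool (newMVar Map.empty)

-- Enqueue an action for the worker owning the given key (e.g. a NodeId).
-- The worker thread is created lazily on first use; actions queued for the
-- same key run sequentially, so their order is preserved.
enqueue :: Ord k => SendPool k -> k -> IO () -> IO ()
enqueue (SendPool ref) key act = do
  chan <- modifyMVar ref $ \m ->
    case Map.lookup key m of
      Just c  -> return (m, c)
      Nothing -> do
        c <- newChan
        _ <- forkIO $ forever $ readChan c >>= id  -- run queued actions in order
        return (Map.insert key c m, c)
  writeChan chan act

Note that the Chan in this sketch is an unbounded buffer; that property is exactly the concern raised later in the thread.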

@facundominguez (Contributor) commented

We could measure a node with 2 processes each sending many messages to a process in one of two other nodes. Isolate one of the target nodes and observe how fast the messages arrive at the other.

This patch should make it orders of magnitude faster.

qnikst added this to the distributed-process-0.6 milestone Jun 18, 2015
@qnikst (Contributor, Author) commented Jun 18, 2015

Related to #204

@qnikst (Contributor, Author) commented Jun 18, 2015

@mboes I've implemented a benchmark in the setting suggested by @facundominguez:

a main program that pings 2 nodes (one is hardcoded), and a reply service (code below).
I started them in 2 different containers; communication was disabled by docker rules.
The difference was an order of magnitude:

30s without the patch
and 3s with the patch.

I'm not sure whether I need to include this benchmark, as it requires many manual steps to run.

UPD. With -threaded I don't observe this difference. Possibly better benchmarks are needed.

-- Main program: floods a hardcoded, unreachable process with usend while
-- measuring a ping-pong with a reply service on a second local node.
import System.Environment
import Control.Monad
import Control.Concurrent
import Control.Distributed.Process
import Control.Distributed.Process.Internal.Types
import Control.Distributed.Process.Node
import Network.Transport (EndPointAddress(..))
import Network.Transport.TCP (createTransport, defaultTCPParameters)
import qualified Data.ByteString.Char8 as BS

num :: Int
num = 10000

main :: IO ()
main = do
  [h1, _h2, h3] <- getArgs
  [Right t1, Right t3] <- mapM (\h -> createTransport h "0" defaultTCPParameters) [h1, h3]
  [n1, n3] <- mapM (\t -> newLocalNode t initRemoteTable) [t1, t3]

  -- Pid of the remote reply service; the address is hardcoded and becomes
  -- unreachable once the docker rules cut the connection.
  let p1 = ProcessId (NodeId $ EndPointAddress $ BS.pack "172.17.42.1:9999:0") (LocalProcessId 0 10)
  print p1

  addr3 <- newEmptyMVar
  -- Local reply service: publish its pid and echo () back to every sender.
  _ <- forkIO $ runProcess n3 $
    getSelfPid >>= (liftIO . putMVar addr3) >> forever (expect >>= flip send ())

  -- Flood the unreachable process with unreliable sends.
  _ <- forkIO $ runProcess n1 $ do
    x <- getSelfPid
    forever $ usend p1 x

  -- Measured part: ping-pong num times with the local reply service.
  runProcess n1 $ do
    p <- liftIO $ takeMVar addr3
    x <- getSelfPid
    replicateM_ num $ do
      send p x
      expect :: Process ()

-- Reply service: runs in the second container, prints every sender and
-- echoes () back.
import System.Environment
import Control.Monad
import Control.Distributed.Process
import Control.Distributed.Process.Node
import Network.Transport.TCP (createTransport, defaultTCPParameters)

main :: IO ()
main = do
  [h] <- getArgs
  Right t <- createTransport h "9999" defaultTCPParameters
  n <- newLocalNode t initRemoteTable
  runProcess n $ do
    say . show =<< getSelfPid
    forever $ do
      p <- expect
      say $ show p
      send p ()

@qnikst (Contributor, Author) commented Jun 18, 2015

Another update: without the patch, the sends between local processes never complete (I couldn't wait long enough); with the patch everything works reliably.

import Control.Monad
import Control.Distributed.Process
import Control.Distributed.Process.Node
import Control.Concurrent
import Network.Transport.TCP (createTransport, defaultTCPParameters)
import Network.Transport (EndPointAddress(..))
import qualified Data.ByteString.Char8 as BS

num :: Int
num = 10000

main :: IO ()
main = do
  Right t <- createTransport "127.0.0.1" "0" defaultTCPParameters
  n <- newLocalNode t initRemoteTable

  -- Keep the node controller busy: repeatedly monitor an unreachable node,
  -- forcing connection attempts that block without the patch.
  _ <- forkIO $ runProcess n $ do
    forever $ do
      mref <- monitorNode $ NodeId $ EndPointAddress (BS.pack "198.164.0.4:8923:0")
      unmonitor mref

  -- Meanwhile, ping-pong between two local processes, printing a dot per
  -- round trip.
  runProcess n $ do
    self <- getSelfPid
    localPid <- spawnLocal $ forever $ send self () >> (expect :: Process ())
    replicateM_ num $ do
      () <- expect
      say "."
      send localPid ()

@facundominguez (Contributor) commented

The last benchmark looks like what I was expecting.

@mboes we had to change the benchmark because in my initial proposal (and in @qnikst's first attempt) the NC was not involved.

A pool of threads is used for network operations. One thread is used per NodeId.
@qnikst (Contributor, Author) commented Jun 22, 2015

PTAL, I have rebased the patch on top of the current master, so it can be merged automatically.

@mboes (Contributor) commented Jun 22, 2015

This issue is not about performance; it is about non-blocking operations, and the thread pool here guarantees ordering of operations. So I'm not sure what we can measure here.

Except that the introduction of yet another layer of synchronized communication between threads is something you have to pay for each send, not just the problematic ones when there are network problems. The "benchmarks" you wrote earlier after @facundominguez proposed them are not, in my book, benchmarks (case in point, you don't have meaningful relative performance comparisons) - they are tests exhibiting the problem that this patch proposes one way to fix.

A benchmark here would demonstrate that communication latencies are not adversely affected in a high speed network, even using say n-t-inmemory, under a variety of stressful scenarios (many processes each sending 1 message simultaneously to the same target, processes sending to multiple target nodes simultaneously, one process sending many messages etc).

In any case we've discussed this patch in the past and I'm not convinced that this is the right approach. Before investing significant resources into this and going full speed ahead, can we get @dcoutts and @edsko's feedback? Guys, the problem is this: current implementations of the network-transport API are not fully asynchronous, nor does the API documentation mandate that. Therefore @facundominguez is proposing to have distributed-process maintain per-endpoint threads to manage the send queue, so that the node controller is never affected by the lack of asynchronicity. But one could also argue that the problem here is that network-transport backends should be fully asynchronous when connecting and sending messages, so that upstream libraries like distributed-process don't have to paper over things in this way, and effectively introduce yet another (potentially unbounded) buffer in the communication path. What do you think?
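As a rough illustration (not part of this PR) of one of the stress scenarios mentioned above, a minimal harness for "many processes each sending 1 message simultaneously to the same target" could look like the following; the sender count and the timing via Data.Time are assumptions:

import Control.Monad (forM_, replicateM_)
import Control.Distributed.Process
import Control.Distributed.Process.Node
import Network.Transport.TCP (createTransport, defaultTCPParameters)
import Data.Time.Clock (getCurrentTime, diffUTCTime)

senders :: Int
senders = 1000

main :: IO ()
main = do
  Right t <- createTransport "127.0.0.1" "0" defaultTCPParameters
  node <- newLocalNode t initRemoteTable
  start <- getCurrentTime
  runProcess node $ do
    self <- getSelfPid
    -- Many processes, each sending a single message to the same target.
    forM_ [1 .. senders] $ \i -> spawnLocal $ send self (i :: Int)
    -- The target drains all of them.
    replicateM_ senders (expect :: Process Int)
  end <- getCurrentTime
  print (diffUTCTime end start)

Running this with and without the patch (and likewise for the other scenarios: multiple target nodes, one process sending many messages) would give the relative figures being asked for.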

mboes added the discuss label Jun 22, 2015
@qnikst (Contributor, Author) commented Jun 22, 2015

@mboes we have generic microbenchmarks in d-p, so we could run that code and check the difference; in my first benchmark I didn't see any measurable difference in the general case. So if running benchmarks is worth it before other discussions, I can gather the data.

@edsko (Member) commented Jun 22, 2015

I haven't looked at the detail yet, but I'm not very keen on introducing an unbounded buffer into the pipeline. We tried hard to avoid unbounded buffers.

@mboes (Contributor) commented Oct 19, 2015

@edsko @facundominguez ok then, based on @edsko's latest feedback, do we agree that this issue is best solved at the network-transport level, and this PR should be closed?

@facundominguez (Contributor) commented

We agree that we should solve this at the network-transport level.
The alternatives are either to merge this in the meantime or to live with d-p as is. Either way is fine with me.

@facundominguez (Contributor) commented

Either way is fine with me.

On second thought, I think merging in the meantime might be more useful to users of CH than keeping the code as is.

@mboes (Contributor) commented Oct 26, 2015

This is a fair amount of imprecisely quantified technical debt we'd be committing to master. I say imprecisely quantified because the benchmarks discussion above is still unresolved. Could documenting this infelicity in the design of d-p/n-t for the user instead be a good enough stopgap? There are ways for the user to mitigate (not remove, but mitigate) this problem in the meantime, in particular by setting aggressive timeouts in n-t-tcp.
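For reference, the mitigation mentioned above might look roughly like this; the TCPParameters fields shown (transportConnectTimeout, tcpUserTimeout) come from later network-transport-tcp releases and may not exist, or may differ, in the versions current at the time of this discussion:

import Network.Transport.TCP (createTransport, defaultTCPParameters, TCPParameters(..))

main :: IO ()
main = do
  let params = defaultTCPParameters
        { transportConnectTimeout = Just 1000000  -- give up on connect attempts after ~1s (microseconds)
        , tcpUserTimeout          = Just 1000     -- TCP_USER_TIMEOUT in ms, where the OS supports it
        }
  Right _t <- createTransport "127.0.0.1" "0" params
  return ()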

@dcoutts (Member) commented Feb 29, 2016

Yes, without having gone into the details too much it does sound like something for the NT level API. We do not want to adversely affect the fast path of sending because of occasional expensive slow operations like establishing new heavyweight connections.
