Make system UUID determination atomic #317

jpsamaroo · 2021-12-14T15:09:17Z

@mattwigway please review if you get the chance

Fixes #316

mattwigway · 2021-12-14T17:07:03Z

I think this should work, as mv on a single file system should in general be atomic. I don't think there would be any issue with multiple Dagger jobs sharing the same UUID per machine.

You mentioned in #316 that IP addresses can be the same between workers - but isn't the IP how Distributed knows how to connect to each worker, so shouldn't they be distinct? If we want to make sure things are unique per-machine, we could also use the MAC address of the first hardware network interface—although that may not work with Infiniband-fabric clusters like we have here at UNC, because the Infiniband hardware addresses are 20 bytes but are truncated when reading with ifconfig

jpsamaroo · 2021-12-14T19:20:20Z

I don't think there would be any issue with multiple Dagger jobs sharing the same UUID per machine.

I'd even consider that a feature, since in the future we might want to be able to bind disparate Julia clusters together, so having the UUID would let us find matching nodes across clusters.

we could also use the MAC address of the first hardware network interface—although that may not work with Infiniband-fabric clusters like we have here at UNC, because the Infiniband hardware addresses are 20 bytes but are truncated when reading with ifconfig

MAC address would be a solid approach. Do we have an easy way to access that via libuv? If not, we might need to settle on IP addresses (although the approach I have here will generally work fine, since you don't typically share /tmp drives between networked machines).

Also, I don't want to be in the business of parsing script outputs, so using a library is the only reasonable way forward.

DrChainsaw · 2021-12-14T20:42:08Z

but isn't the IP how Distributed knows how to connect to each worker, so shouldn't they be distinct?

I might be wrong here, but I think that port number is used as well so that the IP:portnr tuple must be distinct. One case when IPs are not distinct are clusters where a single machine can have multiple workers. Iirc I even had to replace the default port randomization as the birthday problem makes it so that large machines have a too large chance to fail due to port collisions.

mattwigway · 2021-12-14T21:20:12Z

Isn't the point of this field to have a value that's unique to the physical machine, not the worker process?

…

On Tue, Dec 14, 2021 at 3:42 PM DrChainsaw ***@***.***> wrote: but isn't the IP how Distributed knows how to connect to each worker, so shouldn't they be distinct? I might be wrong here, but I think that port number is used as well so that the IP:portnr tuple must be distinct. One case when IPs are not distinct are clusters where a single machine can have multiple workers. Iirc I even had to replace the default port randomization as the birthday problem makes it so that large machines have a too large chance to fail due to port collisions. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#317 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAEKNLX3P53BXD2NU2N3JR3UQ6T2VANCNFSM5KBEQW4Q> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

DrChainsaw · 2021-12-14T21:31:42Z

Isn't the point of this field to have a value that's unique to the physical machine, not the worker process?

Sorry, I might have misunderstood what this is about. It seemed to me that the issue was that multiple workers where trying to use the same resource and that having them use different resources would be desirable. It then seemed to me that using IP to give them different resources would not help in this case. Please ignore me if what I wrote makes no sense :)

vchuravy · 2021-12-14T22:01:29Z

Yeah we need a UUID independent of IP since we might have many NICs, and the goal is to identify the machine so that you could figure out which processes are running on the same machine as you do and accelerate communication.

I am still not sure that Dagger needs to worry about this and instead we should use a network library like UCX that will do this for us.

mattwigway · 2021-12-14T22:02:39Z

Good point, we'd have that problem with MACs as well. The approach used in this PR is probably fine for now.

jpsamaroo · 2021-12-15T14:51:05Z

Thanks all for the reviews!

I am still not sure that Dagger needs to worry about this and instead we should use a network library like UCX that will do this for us.

Even if UCX can do optimized transfers via IPC, the scheduler still wants to know when those transfer types will be used, so that (in the future, at least) it can guesstimate their costs during scheduling.

Make system UUID determination atomic

aec9e96

jpsamaroo added the bug label Dec 14, 2021

jpsamaroo requested review from vchuravy and krynju December 14, 2021 15:09

vchuravy approved these changes Dec 14, 2021

View reviewed changes

jpsamaroo merged commit a26e261 into master Dec 15, 2021

jpsamaroo deleted the jps/system-uuid-atomic branch December 15, 2021 14:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make system UUID determination atomic #317

Make system UUID determination atomic #317

jpsamaroo commented Dec 14, 2021

mattwigway commented Dec 14, 2021

jpsamaroo commented Dec 14, 2021

DrChainsaw commented Dec 14, 2021

mattwigway commented Dec 14, 2021 via email

DrChainsaw commented Dec 14, 2021

vchuravy commented Dec 14, 2021

mattwigway commented Dec 14, 2021

jpsamaroo commented Dec 15, 2021

Make system UUID determination atomic #317

Make system UUID determination atomic #317

Conversation

jpsamaroo commented Dec 14, 2021

mattwigway commented Dec 14, 2021

jpsamaroo commented Dec 14, 2021

DrChainsaw commented Dec 14, 2021

mattwigway commented Dec 14, 2021 via email

DrChainsaw commented Dec 14, 2021

vchuravy commented Dec 14, 2021

mattwigway commented Dec 14, 2021

jpsamaroo commented Dec 15, 2021