Skip to content

Make system UUID determination atomic #317

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 15, 2021
Merged

Conversation

jpsamaroo
Copy link
Member

@mattwigway please review if you get the chance

Fixes #316

@jpsamaroo jpsamaroo added the bug label Dec 14, 2021
@jpsamaroo jpsamaroo requested review from vchuravy and krynju December 14, 2021 15:09
@mattwigway
Copy link
Contributor

I think this should work, as mv on a single file system should in general be atomic. I don't think there would be any issue with multiple Dagger jobs sharing the same UUID per machine.

You mentioned in #316 that IP addresses can be the same between workers - but isn't the IP how Distributed knows how to connect to each worker, so shouldn't they be distinct? If we want to make sure things are unique per-machine, we could also use the MAC address of the first hardware network interface—although that may not work with Infiniband-fabric clusters like we have here at UNC, because the Infiniband hardware addresses are 20 bytes but are truncated when reading with ifconfig

@jpsamaroo
Copy link
Member Author

I don't think there would be any issue with multiple Dagger jobs sharing the same UUID per machine.

I'd even consider that a feature, since in the future we might want to be able to bind disparate Julia clusters together, so having the UUID would let us find matching nodes across clusters.

we could also use the MAC address of the first hardware network interface—although that may not work with Infiniband-fabric clusters like we have here at UNC, because the Infiniband hardware addresses are 20 bytes but are truncated when reading with ifconfig

MAC address would be a solid approach. Do we have an easy way to access that via libuv? If not, we might need to settle on IP addresses (although the approach I have here will generally work fine, since you don't typically share /tmp drives between networked machines).

Also, I don't want to be in the business of parsing script outputs, so using a library is the only reasonable way forward.

@DrChainsaw
Copy link
Contributor

but isn't the IP how Distributed knows how to connect to each worker, so shouldn't they be distinct?

I might be wrong here, but I think that port number is used as well so that the IP:portnr tuple must be distinct. One case when IPs are not distinct are clusters where a single machine can have multiple workers. Iirc I even had to replace the default port randomization as the birthday problem makes it so that large machines have a too large chance to fail due to port collisions.

@mattwigway
Copy link
Contributor

mattwigway commented Dec 14, 2021 via email

@DrChainsaw
Copy link
Contributor

Isn't the point of this field to have a value that's unique to the physical machine, not the worker process?

Sorry, I might have misunderstood what this is about. It seemed to me that the issue was that multiple workers where trying to use the same resource and that having them use different resources would be desirable. It then seemed to me that using IP to give them different resources would not help in this case. Please ignore me if what I wrote makes no sense :)

@vchuravy
Copy link
Member

Yeah we need a UUID independent of IP since we might have many NICs, and the goal is to identify the machine so that you could figure out which processes are running on the same machine as you do and accelerate communication.

I am still not sure that Dagger needs to worry about this and instead we should use a network library like UCX that will do this for us.

@mattwigway
Copy link
Contributor

Good point, we'd have that problem with MACs as well. The approach used in this PR is probably fine for now.

@jpsamaroo jpsamaroo merged commit a26e261 into master Dec 15, 2021
@jpsamaroo jpsamaroo deleted the jps/system-uuid-atomic branch December 15, 2021 14:48
@jpsamaroo
Copy link
Member Author

Thanks all for the reviews!

I am still not sure that Dagger needs to worry about this and instead we should use a network library like UCX that will do this for us.

Even if UCX can do optimized transfers via IPC, the scheduler still wants to know when those transfer types will be used, so that (in the future, at least) it can guesstimate their costs during scheduling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race in system_uuid_fallback
4 participants