Skip to content

Race in system_uuid_fallback #316

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mattwigway opened this issue Dec 14, 2021 · 1 comment · Fixed by #317
Closed

Race in system_uuid_fallback #316

mattwigway opened this issue Dec 14, 2021 · 1 comment · Fixed by #317
Assignees
Labels

Comments

@mattwigway
Copy link
Contributor

I just got this stacktrace while initializing Dagger on a large cluster:

ERROR: LoadError: On worker 3:
InitError: ArgumentError: Malformed UUID string: ""
Stacktrace:
  [1] throw_malformed_uuid
    @ ./uuid.jl:80
  [2] parse
    @ ./uuid.jl:83 [inlined]
  [3] #154
    @ ~/.julia/packages/Dagger/xtTHY/src/utils/system_uuid.jl:13
  [4] get!
    @ ./dict.jl:464
  [5] system_uuid_fallback
    @ ~/.julia/packages/Dagger/xtTHY/src/utils/system_uuid.jl:4 [inlined]
  [6] system_uuid
    @ ~/.julia/packages/Dagger/xtTHY/src/utils/system_uuid.jl:20 [inlined]
  [7] __init__
    @ ~/.julia/packages/Dagger/xtTHY/src/Dagger.jl:69
  [8] _include_from_serialized
    @ ./loading.jl:768
  [9] _require_search_from_serialized
    @ ./loading.jl:854
 [10] _tryrequire_from_serialized
    @ ./loading.jl:783
 [11] _require_search_from_serialized
    @ ./loading.jl:843
 [12] _require
    @ ./loading.jl:1097
 [13] require
    @ ./loading.jl:1013
 [14] #1
    @ ~/dcm/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/Distributed.jl:79
 [15] #103
    @ ~/dcm/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:274
 [16] run_work_thunk
    @ ~/dcm/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
 [17] run_work_thunk
    @ ~/dcm/julia-1.7.0/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:72
 [18] #96
    @ ./task.jl:423
during initialization of module Dagger

I think the issue is here. If multiple workers are running on the same machine, they may try to write to dagger-system-uuid simultaneously. Locking won't help us here, because the race is across processes.

Is there a reason we need to use UUIDs here? Could we just use the IP address of the worker machine?

@jpsamaroo
Copy link
Member

Ahh, I guess we should instead write to a tempfile and move it into place non-forcefully to ensure we get an error if another process moved their file first. We should also change the name of the file to include the username, so that different users can have their own UUIDs.

Is there a reason we need to use UUIDs here? Could we just use the IP address of the worker machine?

Local IP addresses can be identical between different workers, IIUC. However, if you want to submit a patch to calculate the UUID from some system IP address, I'd accept this as an opt-in feature (enabled via environment variable).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants