
Activating environment on remote workers on a cluster fails on v1.7 (works on 1.6) #42405

Closed · jishnub (Member) opened this issue Sep 28, 2021 · 10 comments
Labels: parallelism (Parallel or distributed computation)

jishnub commented Sep 28, 2021

This works on Julia v1.6.3 but fails on v1.7.0-rc1 and on nightly, on a Slurm cluster (using ClusterManagers v0.4.2).

The main Julia script, named `slurmtrial.jl`:

```julia
using Distributed, ClusterManagers
addprocs_slurm(parse(Int, ENV["SLURM_NTASKS"]));
@everywhere begin
    using Pkg
    Pkg.activate(Base.dirname(Base.active_project()))
end
rmprocs.(workers())
```
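As an aside, a possible workaround (a hedged sketch, not something confirmed in this thread) is to avoid calling `Pkg.activate` on the workers at all, by launching them with the project already set via the `exeflags` keyword that `addprocs_slurm` forwards to `addprocs`:

```julia
using Distributed, ClusterManagers

# Launch each worker with the current project pre-activated via the
# --project command-line flag, so no worker calls Pkg.activate at
# runtime and the concurrent updates to manifest_usage.toml never race.
addprocs_slurm(parse(Int, ENV["SLURM_NTASKS"]);
               exeflags = "--project=$(Base.active_project())")

@everywhere @assert Base.active_project() !== nothing
rmprocs(workers())
```

Whether this sidesteps the v1.7 failure is an assumption based on the stack trace below, which implicates `Pkg.activate` itself.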

The job script I use to submit this (change the Julia path and the output file names to run the same code on different Julia versions):

```bash
#!/bin/bash
#SBATCH --time="10"
#SBATCH --job-name=test
#SBATCH -o test18.out
#SBATCH -e test18.err
#SBATCH --ntasks=56

cd $SCRATCH/jobs
julia18="$SCRATCH/julia/julia-82d8a36491/bin/julia"
julia17="$SCRATCH/julia/julia-1.7.0-rc1/bin/julia"
julia16="$SCRATCH/julia/julia-1.6.3/bin/julia"
$julia18 -e 'include("$(ENV["HOME"])/slurmtrial.jl")'
```

I am using 2 nodes with 28 cores each, for a total of 56 workers. The error sometimes doesn't occur if I use only a few cores on one node (e.g. 2 cores).

Output on v1.6 (this is what is expected):

```
$ cat test16.err
  Activating environment at `/scratch/username/.julia/environments/v1.6/Project.toml`
```

Output on v1.7 and v1.8:

```
$ cat test17.err
  Activating project at `/scratch/username/.julia/environments/v1.7`
ERROR: LoadError: On worker 2:
IOError: unlink("/scratch/username/.julia/logs/manifest_usage.toml"): no such file or directory (ENOENT)
Stacktrace:
  [1] uv_error
    @ ./libuv.jl:97 [inlined]
  [2] unlink
    @ ./file.jl:958
  [3] #rm#12
    @ ./file.jl:276
  [4] #checkfor_mv_cp_cptree#13
    @ ./file.jl:323
  [5] #mv#17
    @ ./file.jl:411 [inlined]
  [6] write_env_usage
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/Types.jl:495
  [7] EnvCache
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/Types.jl:337
  [8] EnvCache
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/Types.jl:317 [inlined]
  [9] add_snapshot_to_undo
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/API.jl:1627
 [10] add_snapshot_to_undo
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/API.jl:1623 [inlined]
 [11] #activate#282
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/API.jl:1589
 [12] activate
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/API.jl:1552
 [13] top-level scope
    @ ~/slurmtrial.jl:5
 [14] eval
    @ ./boot.jl:373
 [15] #103
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:274
 [16] run_work_thunk
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
 [17] run_work_thunk
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:72
 [18] #96
    @ ./task.jl:411

...and 39 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:369
 [2] macro expansion
   @ ./task.jl:388 [inlined]
 [3] remotecall_eval(m::Module, procs::Vector{Int64}, ex::Expr)
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/macros.jl:223
 [4] top-level scope
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/macros.jl:207
 [5] include(fname::String)
   @ Base.MainInclude ./client.jl:451
 [6] top-level scope
   @ none:1
in expression starting at /home/username/slurmtrial.jl:3
```

Note that the number of exceptions raised here is 40, not 56 (this number varies between runs).
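For context, the failure mode visible in the stack trace is a read-modify-rename race: each worker reads `manifest_usage.toml`, writes an updated temporary copy, and renames it over the original via `mv`. Once one process has unlinked the old file, a second process's `unlink` of the same path fails with ENOENT. A minimal hypothetical sketch of that non-atomic pattern (not Pkg's actual code, just an illustration of the race):

```julia
# Hypothetical sketch of the racy update in write_env_usage.
# Two processes running this concurrently on the same `path` can both
# pass the isfile check; mv(...; force=true) removes `path` before
# renaming, so the slower process's unlink can hit ENOENT.
function racy_append(path::String, entry::String)
    old = isfile(path) ? read(path, String) : ""
    tmp = tempname()
    write(tmp, old * entry)
    mv(tmp, path; force = true)
end
```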

@KristofferC (Member)

Should be fixed by #42255.

@DilumAluthge (Member)

Is it not fixed on master?

@KristofferC (Member)

If Pkg is bumped.

@DilumAluthge (Member)

> If Pkg is bumped.

#42407

@DilumAluthge (Member) commented Sep 28, 2021

> If Pkg is bumped.

Are you sure that master does not already include the fix? According to #42407, Julia master is only one commit behind Pkg master, and that commit doesn't seem to fix this issue.

Was the fix in question merged to the master branch of Pkg?

@DilumAluthge (Member)

The OP says that the error is happening on Julia nightly, which is why I'm concerned.

@KristofferC (Member) commented Sep 28, 2021

Oh yeah, that was only reverted for 1.6 / 1.7: JuliaLang/Pkg.jl#2731.

@IanButterworth (Member)

Potential fix, with tests, in JuliaLang/Pkg.jl#2732.

@jishnub (Member, Author) commented Oct 27, 2021

I can confirm that this is now fixed on v1.7.0-rc2 as expected, but it still exists on nightly.

@IanButterworth (Member)

There's a proper, robust fix upcoming for 1.8 in JuliaLang/Pkg.jl#2793.
It's unlikely to be backported, as it relies on a new stdlib; but 1.6 and 1.7 only have the seemingly rare issue of usage TOML files getting corrupted under process concurrency, so they should be fine.
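The fix above guards the usage-file update with an OS-level pidfile lock. As a hedged sketch of that style of guarded update, assuming the `mkpidlock` do-block API from Pidfile (vendored as `FileWatching.Pidfile` in recent Julia, which is the "new stdlib" dependency mentioned):

```julia
using FileWatching.Pidfile: mkpidlock  # assumption: available on Julia ≥ 1.8

# Serialize updates across processes with a pidfile lock, so only one
# process rewrites the usage file at a time and the mv/unlink race
# cannot occur.
function locked_append(path::String, entry::String)
    mkpidlock(path * ".pid") do
        old = isfile(path) ? read(path, String) : ""
        write(path, old * entry)
    end
end
```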
