Add the "auto-detect the current allocation" feature #234

Draft: wants to merge 2 commits into base: master
4 changes: 4 additions & 0 deletions Project.toml
@@ -4,14 +4,18 @@ version = "1.1.0"

[deps]
Distributed = "8ba89e20-285c-5b6f-9357-94700520ee1b"
LSFClusterManager = "af02cf76-cbe3-4eeb-96a8-af9391005858"
Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
Pkg = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f"
SlurmClusterManager = "c82cd089-7bf7-41d7-976b-6b5d413cbe0a"
Sockets = "6462fe0b-24de-5631-8697-dd941f90decc"

[compat]
Distributed = "< 0.0.1, 1"
LSFClusterManager = "1.0.0"
Logging = "< 0.0.1, 1"
Pkg = "< 0.0.1, 1"
SlurmClusterManager = "0.1.3"
Sockets = "< 0.0.1, 1"
julia = "1.2"

8 changes: 8 additions & 0 deletions src/ClusterManagers.jl
@@ -4,16 +4,24 @@ using Distributed
using Sockets
using Pkg

import LSFClusterManager
import SlurmClusterManager

export launch, manage, kill, init_worker, connect
import Distributed: launch, manage, kill, init_worker, connect

# Bring some other names into scope, just for convenience:
using Distributed: addprocs

worker_cookie() = begin Distributed.init_multi(); cluster_cookie() end
worker_arg() = `--worker=$(worker_cookie())`


# PBS doesn't have the same semantics as SGE wrt to file accumulate,
# a different solution will have to be found
include("qsub.jl")

include("auto_detect.jl")
include("scyld.jl")
include("condor.jl")
include("slurm.jl")
143 changes: 143 additions & 0 deletions src/auto_detect.jl
@@ -0,0 +1,143 @@
function addprocs_autodetect_current_scheduler(; kwargs...)
sched = autodetect_current_scheduler()

if sched == :slurm
res = Distributed.addprocs(SlurmClusterManager.SlurmManager(); kwargs...)

elseif sched == :lsf
np = _lsf_get_numtasks()
res = LSFClusterManager.addprocs_lsf(np; kwargs...)

# elseif sched == :sge
# # SGE is not currently maintained.
# np = _sge_get_numtasks()
# res = addprocs_sge(np; kwargs...)

# elseif sched == :pbs
# # PBS is not currently maintained.
# np = _torque_get_numtasks()
# res = addprocs_pbs(np; kwargs...)

else
error("Unable to auto-detect cluster scheduler: $(sched)")
end

return res
end

function autodetect_current_scheduler()
if _autodetect_is_slurm()
return :slurm
elseif _autodetect_is_lsf()
return :lsf
elseif _autodetect_is_sge()
return :sge
elseif _autodetect_is_pbs()
return :pbs
end
return nothing
end

##### Slurm:

function _autodetect_is_slurm()
has_SLURM_JOB_ID = _has_env_nonempty("SLURM_JOB_ID")
has_SLURM_JOBID = _has_env_nonempty("SLURM_JOBID")
res = has_SLURM_JOB_ID || has_SLURM_JOBID
return res
end

##### LSF:

function _autodetect_is_lsf()
# https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=variables-environment-set-job-execution
has_LSB_JOBNAME = _has_env_nonempty("LSB_JOBNAME")
return has_LSB_JOBNAME
end

function _lsf_get_numtasks()
# https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=variables-environment-variable-reference
#
# See also:
# https://portal.supercomputing.wales/index.php/index/slurm/lsf-to-slurm-ref/
name = "LSB_DJOB_NUMPROC"
# _getenv_parse_int already validates and strips the value, so ENV is not read directly here.
value_int = _getenv_parse_int(name)
return value_int
end

##### SGE (Sun Grid Engine):

function _autodetect_is_sge()
# https://docs.oracle.com/cd/E19957-01/820-0699/chp4-21/index.html
has_SGE_O_HOST = _has_env_nonempty("SGE_O_HOST")
return has_SGE_O_HOST

# Important note:
# The "job ID" environment variable in SGE is just named `JOB_ID`.
# This is obviously too vague, because the variable name is not specific to SGE.
# Therefore, we can't use that variable for our SGE auto-detection.
end

function _sge_get_numtasks()
# https://docs.oracle.com/cd/E19957-01/820-0699/chp4-21/index.html
name = "NHOSTS"
value_int = _getenv_parse_int(name)

# Read the value before building the message, so that the interpolated variable is defined.
msg = "Because this is Sun Grid Engine (SGE), ClusterManagers.jl is not able " *
"to correctly auto-detect the number of tasks. " *
"Therefore, ClusterManagers.jl will instead use the value of the " *
"NHOSTS environment variable: $(value_int)"
@warn msg

return value_int
end

##### PBS and Torque:

function _autodetect_is_pbs()
# https://docs.adaptivecomputing.com/torque/2-5-12/help.htm#topics/2-jobs/exportedBatchEnvVar.htm
has_PBS_JOBID = _has_env_nonempty("PBS_JOBID")
return has_PBS_JOBID
end

function _torque_get_numtasks()
# https://docs.adaptivecomputing.com/torque/2-5-12/help.htm#topics/2-jobs/exportedBatchEnvVar.htm
name = "PBS_TASKNUM"
value_int = _getenv_parse_int(name)
# Log before returning; the original @info came after `return` and referenced an undefined `np`.
@info "Using auto-detected num_tasks: $(value_int)"
return value_int
end

##### General utility functions:

function _has_env_nonempty(name::AbstractString)
stripped_value = strip(get(ENV, name, ""))
res_b = !isempty(stripped_value)
return res_b
end

function _getenv_parse_int(name::AbstractString)
if !haskey(ENV, name)
msg = "Environment variable is not defined: $(name)"
error(msg)
end
original_value = ENV[name]
if isempty(original_value)
msg = "Environment variable is defined, but is empty: $(name)"
error(msg)
end
stripped_value_str = strip(original_value)
if isempty(stripped_value_str)
msg = "Environment variable is defined, but contains only whitespace: $(name)"
error(msg)
end
value_int = tryparse(Int, stripped_value_str)
if !(value_int isa Int)
msg =
"Environment variable \"$(name)\" is defined, " *
"but its value \"$(stripped_value_str)\" could not be parsed as an integer."
error(msg)
end
return value_int
end
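
The environment-variable checks in `src/auto_detect.jl` all read the global `ENV`, so they are hard to exercise outside a real scheduler allocation. One possible way to make them testable is sketched below. It passes the environment in as an argument, so that a plain `Dict` can stand in for `ENV`. The names `has_env_nonempty` and `detect_scheduler` are hypothetical and are not part of this PR. The variable names checked are the same ones the diff uses.

```julia
# Hypothetical sketch (not part of this PR): the same detection order as
# `autodetect_current_scheduler`, but with the environment injected.

# True if `name` is set in `env` and is not empty or whitespace-only.
has_env_nonempty(env, name::AbstractString) = !isempty(strip(get(env, name, "")))

# Returns :slurm, :lsf, :sge, :pbs, or `nothing` if no scheduler is detected.
function detect_scheduler(env = ENV)
    if has_env_nonempty(env, "SLURM_JOB_ID") || has_env_nonempty(env, "SLURM_JOBID")
        return :slurm
    elseif has_env_nonempty(env, "LSB_JOBNAME")
        return :lsf
    elseif has_env_nonempty(env, "SGE_O_HOST")
        return :sge
    elseif has_env_nonempty(env, "PBS_JOBID")
        return :pbs
    end
    return nothing
end
```

With this shape, a unit test could call `detect_scheduler(Dict("SLURM_JOB_ID" => "123"))` without touching the real environment, while production code keeps calling `detect_scheduler()` with the default `ENV`.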