Clockwork was developed on machines with the following configuration:
- 2x 10Gbit network links (bonded to 20Gbit)
- 32 CPU cores
- 768GB RAM
- 2x 32GB NVIDIA Tesla V100 GPUs per worker
This page describes what you can expect and what you will need to do if you run Clockwork in a different environment.
Clockwork's worker requires `3 + 9 * {gpu_count}` CPU cores, so for a node with 1 GPU, a worker requires 12 cores.
Clockwork's controller requires approximately 18 cores.
These core counts are hard requirements.
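As a quick check, the formula above is easy to evaluate for your node. A minimal sketch (the script argument and default are illustrative):

```bash
#!/bin/bash
# Sketch: compute the CPU cores a Clockwork worker needs, using the
# 3 + 9 * gpu_count formula above. gpu_count defaults to 1 for illustration.
gpu_count=${1:-1}
echo "worker cores required: $((3 + 9 * gpu_count))"
```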
The RAM required by Clockwork's worker grows with the number of models loaded. For example, loading approximately 4,000 models required around 500GB of RAM.
Depending on the workload you wish to run, you will need to ensure you have enough RAM. The amount of RAM needed by a model is approximately equal to the size of its weights.
Alternatively, for most workloads you can scale down the number of models used to fit the RAM you have available:
- Run `./client -h` to read workload descriptions; most workloads are parameterized and can be adjusted to use fewer models.
- The experiments in the `clockwork-results` repository describe how you can adjust the workloads to use fewer models and less RAM.
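To sanity-check RAM provisioning against the figures above, a rough estimate can be computed from your model count. This is a sketch; the 125MB average weight size is an assumption back-derived from the 4,000-models/500GB example, not a measured value:

```bash
#!/bin/bash
# Rough worker RAM estimate, assuming RAM per model ~= size of its weights.
# avg_weights_mb=125 is an assumption (500GB / 4000 models); measure your own models.
num_models=${1:-4000}
avg_weights_mb=125
echo "approx RAM needed: $((num_models * avg_weights_mb / 1024)) GB"
```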
Clockwork can work on any GPU architecture. However, the models in `clockwork-modelzoo-volta` are specifically compiled for the `sm_70` (Volta) architecture. Instructions for compiling models for other architectures are TODO.
Clockwork was developed and run on machines with 32GB Tesla V100 GPUs. However, many cloud providers commonly offer 16GB Tesla V100 GPUs. Clockwork can run with less GPU memory, but it requires configuration changes:
- Edit `config/default.cfg` on worker machines and set `weights_cache_size` to `10737418240L` (10GiB).
- The maximum number of models will be approximately 2000. Some of the experiments in `clockwork-results` will not be able to run with a full complement of models.
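For reference, the edit might look like the following. This is a sketch; the surrounding structure of `config/default.cfg` is omitted and the exact nesting may differ in your checkout:

```
# In config/default.cfg on each worker machine (surrounding keys omitted):
weights_cache_size = 10737418240L;   # 10737418240 bytes = 10GiB
```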
- A `CUDA: out of memory` error from `src/clockwork/cache.cpp` indicates that `weights_cache_size` is too large for your GPU.
- A `CUDA_ERROR_OUT_OF_MEMORY` error from `src/clockwork/model/cuda.cpp` indicates that too many models are loaded. You can either reduce the number of models you use, or reduce the value of `weights_cache_size` (which by proxy increases the memory available for kernels).
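If you are unsure what value will fit, one way to pick a starting point is to query the GPU's total memory and subtract headroom for kernels and IO pools. A sketch, where the 6GiB headroom figure is an assumption to tune, not a documented constant:

```bash
#!/bin/bash
# Suggest a weights_cache_size from the GPU's total memory (first GPU only).
# headroom_mib=6144 (6GiB) for kernels/IO pools is an assumption; tune it.
total_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)
headroom_mib=6144
cache_bytes=$(( (total_mib - headroom_mib) * 1024 * 1024 ))
echo "suggested weights_cache_size = ${cache_bytes}L"
```

On a 16GB V100 (roughly 16160MiB reported) this lands close to the 10GiB value recommended above.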
Clockwork experiments were performed on a private cluster. We have not extensively experimented with Clockwork in a cloud setting, and unfamiliar issues may arise due to network contention.
A network slower than 10Gbit is not recommended; Clockwork may not work properly. Clockwork's controller expects only a small amount of network latency for each action, and if your network is too slow, all actions will fail. If you see `throughput=0` and your worker nodes are only reporting errors, then network latency is the likely culprit. (Note: this is solvable by removing some hard-coding in the controller. Please consider contributing!)
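Before digging into Clockwork itself, it may be worth verifying the link between the controller and a worker. A sketch using standard tools (`ping` and `iperf3` are assumed installed; `volta02` is the example worker host from this page):

```bash
# Round-trip latency from the controller to the worker host
ping -c 5 volta02
# Throughput: run `iperf3 -s` on volta02 first, then from the controller:
iperf3 -c volta02 -t 10
```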
Do not fear! Clockwork experiments can still run without good network support, so long as the inference results are not important (i.e., if you only care about measuring performance). With the following configuration, inputs are generated by workers rather than sent by clients or the controller:
- On the client machine, `export CLOCKWORK_DISABLE_INPUTS=1`
- On the controller, pass `0` for the `generate_inputs` argument, i.e., `./controller INFER5 volta02:12345 0`
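Putting the two steps together (hostnames and ports follow the example above):

```bash
# On the client machine, before starting any workload:
export CLOCKWORK_DISABLE_INPUTS=1

# On the controller machine: the trailing 0 is the generate_inputs argument,
# so inputs are generated by the workers themselves.
./controller INFER5 volta02:12345 0
```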
You should see the following when you run the controller:
```
Network Status: Client ✘✘✘✘✘ Controller ✘✘✘✘✘ Workers (inputs generated by worker)
```
Alternatively, with `CLOCKWORK_DISABLE_INPUTS=1` and `./controller INFER5 volta02:12345 1`, you will see:

```
Network Status: Client ✘✘✘✘✘ Controller ✔✔✔✔✔ Workers (inputs generated by controller)
```
With `CLOCKWORK_DISABLE_INPUTS=0` and any value for `generate_inputs`, you will see:

```
Network Status: Client ✔✔✔✔✔ Controller ✔✔✔✔✔ Workers (inputs generated by client)
```