You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
communication method: NCCL for intra-communication, GRPC for inter-communication
..gpipe.sh: baseline
..modulo.sh: OOO-Backprop
AWS common setup
For multi-node experiments, you must set same Security Group.
Security Group must allows all TCP ports within itself.
For EC2 instanse, Use Deep Learning AMI (Ubuntu 18.04) Version 56.0 image which already contains everything for experiments(NVIDIA driver, docker, git, etc...)
Pre-Training Experiments
Launch 8, 16, 24, or 32 AWS instances and edit the script file... Then run the script on the first node (NODE_HOST_LIST[0]).