Tutorial (please watch the linked videos below!!!)

Helpful Commands, Tips, and Tricks

  • Docs and main allocations:
    • Delta (A100s, A40s, etc.)
    • DeltaAI (GH200s, which are more capable than H100s)
  • helpful commands and info: PLEASE READ THROUGH, THEY ARE SUPER HELPFUL!!!
  • squeue -u $USER --start
    • shows the estimated start time of your queued jobs
  • sinfo -s gives a summary, and the more detailed sinfo shows how many nodes are available/in use
    • a bash script to better estimate available A100s is a work in progress
    • the A/I/O/T column stands for allocated, idle, other, and total nodes
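    • in the meantime, a quick sketch for checking idle A100 nodes directly (assuming the gpuA100x4 partition name used in the srun examples below):
      • sinfo -p gpuA100x4 -t idle -o "%D %N"
        • prints the count and names of currently idle nodes in that partition
        • note: nodes in the mixed state may still have free GPUs; this only counts fully idle nodes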
  • support request link
  • increase the SSH connection timeout for VS Code so you have time to approve the Duo prompt
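    • a minimal sketch, assuming the Remote - SSH extension (the timeout value of 120 seconds is an arbitrary choice), added to VS Code settings.json:
      • "remote.SSH.connectTimeout": 120
        • gives you about two minutes to approve the Duo push before the connection attempt is dropped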
  • to see your available partitions/accounts:
    • use the accounts command
  • to launch a Slurm script that still uses your conda env:
    • When using your own custom conda environment with a batch job, submit the batch job from within the environment and do not add conda activate commands to the job script; the job inherits your environment.
    • THIS IS VERY IMPORTANT
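    • a minimal sketch (the environment and script names are placeholders):
      • conda activate my_env
        • activate the environment in your login shell first
      • sbatch train.slurm
        • submit from inside the environment; the job inherits it, so train.slurm should contain no conda activate lines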
  • if using a JupyterLab or VS Code session, run:
    • unset SLURM_NTASKS
    • prevents issues with this error
  • for creating a Singularity container
    • you may need to temporarily bind-mount your requirements file to /tmp/requirements.txt
    • you may need to remove the nvidia/triton packages from the requirements file
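    • for example, installing from a bind-mounted requirements file (a minimal sketch; the image name and paths are placeholders):
      • singularity exec --bind ./requirements.txt:/tmp/requirements.txt pytorch.sif pip install --user -r /tmp/requirements.txt
        • --bind src:dest makes the host file visible inside the container; --user is needed because the .sif image itself is read-only, so packages land in your home directory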
  • you cannot use --mem-per-gpu; use --mem-per-cpu instead
    • it needs an integer value
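    • e.g., in a batch script (the value is illustrative):
      • #SBATCH --mem-per-cpu=3G
        • integer amounts only; do not use --mem-per-gpu here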
  • you can make reservations for large amounts of resources/different lengths of time if you submit a support ticket :)
  • use quota, not df -h, to get an accurate view of the file storage you have left
  • sacct can show the compute allocated for a job
    • sacct -j job_id --format=JobID,JobName,AllocCPUS,AllocTRES,ReqMem,MaxRSS,State
  • scontrol show config shows the scheduler configuration, including the resource limits you are allowed to request
    • or similarly, scontrol show job job_id shows important info about a job, such as why it crashed
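    • e.g., to pull out just the state and the pending/failure reason (a sketch; job_id is a placeholder):
      • scontrol show job job_id | grep -E "JobState|Reason"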
  • to launch an interactive job with srun in the terminal (replace xxxx with your allocation):
    • for GH200
      • srun --account=xxxx-dtai-gh --partition=ghx4 --time=48:00:00 --mem-bind=verbose,local --gpu-bind=verbose,closest --nodes=1 --mem-per-cpu=1G --cpus-per-gpu=72 --gpus-per-node=4 --pty /bin/bash
      • srun --account=xxxx-dtai-gh --partition=ghx4-interactive --time=1:00:00 --mem-bind=verbose,local --gpu-bind=verbose,closest --nodes=1 --mem-per-cpu=1G --cpus-per-gpu=72 --gpus-per-node=1 --pty /bin/bash
        • interactive
    • for A100 NCSA
      • srun --account=xxxx-delta-gpu --partition=gpuA100x4 --time=48:00:00 --mem-bind=verbose,local --gpu-bind=verbose,closest --nodes=1 --mem-per-cpu=3G --cpus-per-gpu=16 --gpus-per-node=4 --pty /bin/bash
      • srun --account=xxxx-delta-gpu --partition=gpuA100x4-interactive --time=1:00:00 --mem-bind=verbose,local --gpu-bind=verbose,closest --nodes=1 --mem-per-cpu=3G --cpus-per-gpu=16 --gpus-per-node=1 --pty /bin/bash
        • interactive
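    • once the interactive shell starts, a quick sanity check (a minimal sketch):
      • nvidia-smi
        • confirms the allocated GPUs are visible
      • echo $SLURM_JOB_ID $SLURM_JOB_NODELIST
        • confirms the job ID and the node you landed on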
  • GH200 env setup
    • see the DeltaAI docs for the PyTorch version to install (nightly)
    • see this for the Triton install (courtesy of Revanth): https://drive.google.com/file/d/162mESS9BOXDxWLzj--xRU_D2u1YsEtyc/view
      • if it doesn't work the first time, use setup.py to clean and retry
    • use loose_requirements
    • the current xformers is not compatible with installing Triton, so you can't use Triton
    • for xformers (install it after Triton, PyTorch, and the other packages)
      • export TORCH_CUDA_ARCH_LIST="9.0"
      • pip install xformers==0.0.27.post2 --no-deps --no-cache-dir
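    • after everything is installed, a quick verification sketch:
      • python -c "import torch, xformers; print(torch.__version__, torch.cuda.is_available(), xformers.__version__)"
        • torch.cuda.is_available() should print True on a GH200 node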
  • use screen when running jobs from a terminal to prevent disconnects from killing them
    • screen -S session_name
      • start a new named session
    • screen -ls
      • list existing sessions
    • echo $STY
      • check whether you are currently inside a screen session (empty if not)
    • Ctrl-a d
      • detach from the current session
    • screen -r session_name
      • reattach to a session
    • screen -S session_name -X quit
      • kill a session
    • screen -d session_name
      • detach a session remotely
      • for when screen says it is attached but you can't access it
    • Ctrl-a Esc
      • enter scrollback/copy mode so you can scroll the output
  • slurm priority things
    • scontrol show job <job_id> | grep Priority
      • shows priority of specific job and its ID
    • sprio
      • then search (Ctrl+F) the terminal output for your job ID; shows your job's priority relative to others
    • sshare -l
      • you can search (Ctrl+F) for your account and see the FairShare and LevelFS values
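    • e.g., to narrow these down (a sketch; job_id is a placeholder):
      • sprio -j job_id
        • priority factors for just your job
      • sshare -l -u $USER
        • fairshare info for just your user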
  • grad students can only request Discover-tier allocations or lower
    • professors can request Accelerate
  • for this error
    • FATAL: While checking container encryption: could not open image /work/hdd/bcsi/agladstone/containers/pytorch_2.4.sif: the image's architecture (amd64) could not run on the host's (arm64)
      • the amd64 image cannot run on the arm64 GH200 host; you need to switch directories and then rerun the command...
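    • a quick way to check the host architecture (a sketch):
      • uname -m
        • prints aarch64 on DeltaAI (GH200) nodes and x86_64 on Delta nodes; the container image must match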

ZSH

  • zsh caused issues with both Blender and NCSA
    • if you want zsh, manually open a zsh terminal; otherwise use bash

Conda
