Updated Dockerfiles #63

Closed (1 commit)
48 changes: 25 additions & 23 deletions docker/jax/training/0.4/Dockerfile.neuronx
@@ -1,14 +1,16 @@
FROM public.ecr.aws/docker/library/ubuntu:22.04

LABEL dlc_major_version="1"

Check failure on line 3 (GitHub Actions / dockerfile-linter): DL3048 style: Invalid label key.
LABEL maintainer="Amazon AI"

# Neuron SDK components version numbers
ARG NEURONX_RUNTIME_LIB_VERSION=2.23.112.0-9b5179492
ARG NEURONX_COLLECTIVES_LIB_VERSION=2.23.135.0-3e70920f2
ARG NEURONX_TOOLS_VERSION=2.20.204.0
ARG NEURONX_CC_VERSION=2.16.372.0
ARG NEURONX_JAX_TRAINING_VERSION=0.1.2
# Neuron SDK pre-release packages
ARG NEURON_ARTIFACT_PATH=/root/neuron_artifacts
ARG NEURONX_RUNTIME_LIB_VERSION
ARG NEURONX_COLLECTIVES_LIB_VERSION
ARG NEURONX_TOOLS_VERSION
ARG NEURONX_CC_VERSION
ARG NEURONX_JAX_TRAINING_VERSION
ARG NEURON_XLA

ARG PYTHON=python3.10
ARG PYTHON_VERSION=3.10.12
@@ -31,7 +33,7 @@
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/amazon/openmpi/lib64"
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/lib"

RUN apt-get update \

Check failure on line 36 (GitHub Actions / dockerfile-linter): DL3008 warning: Pin versions in apt get install. Instead of `apt-get install <package>` use `apt-get install <package>=<version>`
&& apt-get upgrade -y \
&& apt-get install -y --no-install-recommends \
build-essential \
@@ -74,7 +76,7 @@
&& apt-get clean

# Install Open MPI
RUN mkdir -p /tmp/openmpi \

Check failure on line 79 (GitHub Actions / dockerfile-linter): DL3003 warning: Use WORKDIR to switch to a directory
Check failure on line 79 (GitHub Actions / dockerfile-linter): SC2046 warning: Quote this to prevent word splitting.
&& cd /tmp/openmpi \
&& wget --quiet https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-${OMPI_VERSION}.tar.gz \
&& tar zxf openmpi-${OMPI_VERSION}.tar.gz \
@@ -86,7 +88,7 @@
&& rm -rf /tmp/openmpi

# Install packages and configure SSH for MPI operator in k8s
RUN apt-get update && apt-get install -y openmpi-bin openssh-server \

Check failure on line 91 (GitHub Actions / dockerfile-linter): DL3008 warning: Pin versions in apt get install. Instead of `apt-get install <package>` use `apt-get install <package>=<version>`
Check failure on line 91 (GitHub Actions / dockerfile-linter): DL3015 info: Avoid additional packages by specifying `--no-install-recommends`
&& mkdir -p /var/run/sshd \
&& echo " UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config \
&& echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config \
@@ -95,7 +97,7 @@
&& apt-get clean

# install Python
RUN wget -q https://www.python.org/ftp/python/$PYTHON_VERSION/Python-$PYTHON_VERSION.tgz \

Check failure on line 100 (GitHub Actions / dockerfile-linter): SC2046 warning: Quote this to prevent word splitting.
Check failure on line 100 (GitHub Actions / dockerfile-linter): DL3003 warning: Use WORKDIR to switch to a directory
&& tar -xzf Python-$PYTHON_VERSION.tgz \
&& cd Python-$PYTHON_VERSION \
&& ./configure --enable-shared --prefix=/usr/local \
@@ -114,33 +116,33 @@
# ompi_info to fail. This is only observed in CPU containers
ENV PATH="$PATH:/home/.openmpi/bin"
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/home/.openmpi/lib/"
RUN ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

Check failure on line 119 (GitHub Actions / dockerfile-linter): DL4006 warning: Set the SHELL option -o pipefail before RUN with a pipe in it. If you are using /bin/sh in an alpine image or if your shell is symlinked to busybox then consider explicitly setting your SHELL to /bin/ash, or disable this check

RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt

# Install Neuron Driver, Runtime and Tools
RUN echo "deb https://apt.repos.neuron.amazonaws.com focal main" > /etc/apt/sources.list.d/neuron.list
RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add -
# Install AWS CLI
RUN ${PIP} install --no-cache-dir -U \
"awscli<2"

RUN apt-get update \
&& apt-get install -y \
aws-neuronx-tools=$NEURONX_TOOLS_VERSION \
aws-neuronx-collectives=$NEURONX_COLLECTIVES_LIB_VERSION \
aws-neuronx-runtime-lib=$NEURONX_RUNTIME_LIB_VERSION \
&& rm -rf /var/lib/apt/lists/* \
&& rm -rf /tmp/tmp* \
&& apt-get clean
# Copy Neuron artifacts into container for local installation
COPY pip ${NEURON_ARTIFACT_PATH}/pip/
COPY apt ${NEURON_ARTIFACT_PATH}/apt/

# Install Neuron Driver, Runtime and Tools
RUN apt-get install -y \

Check failure on line 132 (GitHub Actions / dockerfile-linter): DL3015 info: Avoid additional packages by specifying `--no-install-recommends`
${NEURON_ARTIFACT_PATH}/apt/${NEURONX_TOOLS_VERSION} \
${NEURON_ARTIFACT_PATH}/apt/${NEURONX_COLLECTIVES_LIB_VERSION} \
${NEURON_ARTIFACT_PATH}/apt/${NEURONX_RUNTIME_LIB_VERSION}

# Add Neuron PATH
ENV PATH="/opt/aws/neuron/bin:${PATH}"

# Install AWS CLI
RUN ${PIP} install --no-cache-dir -U "awscli<2"
# Install JAX and Neuron dependencies
RUN ${PIP} install --force-reinstall --find-links ${NEURON_ARTIFACT_PATH}/pip \
${NEURON_ARTIFACT_PATH}/pip/${NEURONX_CC_VERSION} \
${NEURON_ARTIFACT_PATH}/pip/${NEURONX_JAX_TRAINING_VERSION}

# Install JAX & Neuron CC
RUN ${PIP} config set global.extra-index-url https://pip.repos.neuron.amazonaws.com \
&& ${PIP} install --force-reinstall neuronx-cc==$NEURONX_CC_VERSION --extra-index-url https://pip.repos.neuron.amazonaws.com \
&& ${PIP} install --force-reinstall jax-neuronx==$NEURONX_JAX_TRAINING_VERSION --extra-index-url https://pip.repos.neuron.amazonaws.com
RUN rm -rf ${NEURON_ARTIFACT_PATH}

# EFA Installer does apt get. Make sure to run apt update before that
RUN apt-get update
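The dockerfile-linter annotations on this file (DL4006, DL3015, DL3003) could be addressed roughly as follows. This is only a sketch and not part of the PR: it assumes `/bin/bash` is available in the Ubuntu base image, reuses only commands already shown in the diff, and leaves out the Open MPI build steps that the diff view collapses.

```dockerfile
# Sketch only, not part of this PR.

# DL4006: with pipefail set, a failure on the left side of a pipe fails the RUN step.
SHELL ["/bin/bash", "-o", "pipefail", "-c"]
RUN ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

# DL3015: skip recommended packages when installing the SSH/MPI tooling.
RUN apt-get update \
 && apt-get install -y --no-install-recommends openmpi-bin openssh-server \
 && rm -rf /var/lib/apt/lists/*

# DL3003: switch directories with WORKDIR instead of `cd` inside RUN.
WORKDIR /tmp/openmpi
RUN wget --quiet "https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-${OMPI_VERSION}.tar.gz" \
 && tar zxf "openmpi-${OMPI_VERSION}.tar.gz"
WORKDIR /
```

DL3008 (pinning apt package versions) is not covered here, since suitable version strings are not given anywhere in the diff.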
66 changes: 33 additions & 33 deletions docker/pytorch/inference/2.5.1/Dockerfile.neuronx
@@ -4,16 +4,18 @@ LABEL dlc_major_version="1"
LABEL maintainer="Amazon AI"
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true

# Neuron SDK components version numbers
ARG NEURONX_CC_VERSION=2.16.372.0
ARG NEURONX_FRAMEWORK_VERSION=2.5.1.2.4.0
ARG NEURONX_TRANSFORMERS_VERSION=0.13.380
ARG NEURONX_COLLECTIVES_LIB_VERSION=2.23.135.0-3e70920f2
ARG NEURONX_RUNTIME_LIB_VERSION=2.23.112.0-9b5179492
ARG NEURONX_TOOLS_VERSION=2.20.204.0
ARG NEURONX_DISTRIBUTED_VERSION=0.10.1
ARG NEURONX_DISTRIBUTED_INFERENCE_VERSION=0.1.1

# Neuron SDK pre-release packages
ARG NEURON_ARTIFACT_PATH=/root/neuron_artifacts
ARG NEURONX_RUNTIME_LIB_VERSION
ARG NEURONX_COLLECTIVES_LIB_VERSION
ARG NEURONX_TOOLS_VERSION
ARG NEURONX_FRAMEWORK_VERSION
ARG NEURONX_TRANSFORMERS_VERSION
ARG NEURONX_CC_VERSION
ARG NEURONX_DISTRIBUTED_VERSION
ARG NEURONX_DISTRIBUTED_INFERENCE_VERSION

ARG PIP=pip3
ARG PYTHON=python3.10
ARG PYTHON_VERSION=3.10.12
ARG TORCHSERVE_VERSION=0.11.0
@@ -37,7 +39,6 @@ RUN apt-get update \
curl \
emacs \
git \
gnupg2 \
gpg-agent \
jq \
libgl1-mesa-glx \
@@ -56,17 +57,14 @@ RUN apt-get update \
&& rm -rf /tmp/tmp* \
&& apt-get clean

RUN echo "deb https://apt.repos.neuron.amazonaws.com focal main" > /etc/apt/sources.list.d/neuron.list
RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add -
# Copy Neuron artifacts into container for local installation
COPY pip ${NEURON_ARTIFACT_PATH}/pip/
COPY apt ${NEURON_ARTIFACT_PATH}/apt/

RUN apt-get update \
&& apt-get install -y \
aws-neuronx-tools=$NEURONX_TOOLS_VERSION \
aws-neuronx-collectives=$NEURONX_COLLECTIVES_LIB_VERSION \
aws-neuronx-runtime-lib=$NEURONX_RUNTIME_LIB_VERSION \
&& rm -rf /var/lib/apt/lists/* \
&& rm -rf /tmp/tmp* \
&& apt-get clean
RUN apt-get install -y \
${NEURON_ARTIFACT_PATH}/apt/${NEURONX_TOOLS_VERSION} \
${NEURON_ARTIFACT_PATH}/apt/${NEURONX_COLLECTIVES_LIB_VERSION} \
${NEURON_ARTIFACT_PATH}/apt/${NEURONX_RUNTIME_LIB_VERSION}

# https://github.com/docker-library/openjdk/issues/261 https://github.com/docker-library/openjdk/pull/263/files
RUN keytool -importkeystore -srckeystore /etc/ssl/certs/java/cacerts -destkeystore /etc/ssl/certs/java/cacerts.jks -deststoretype JKS -srcstorepass changeit -deststorepass changeit -noprompt; \
@@ -102,7 +100,7 @@ RUN conda install -c conda-forge \
enum-compat \
ipython

RUN pip install --no-cache-dir -U \
RUN ${PIP} install --no-cache-dir -U \
"opencv-python>=4.8.1.78" \
"numpy<1.24,>1.21" \
"scipy>=1.8.0" \
@@ -113,17 +111,19 @@ RUN pip install --no-cache-dir -U \
boto3 \
cryptography

RUN pip install -U --extra-index-url https://pip.repos.neuron.amazonaws.com \
neuronx-cc==$NEURONX_CC_VERSION \
torch-neuronx==$NEURONX_FRAMEWORK_VERSION \
transformers-neuronx==$NEURONX_TRANSFORMERS_VERSION \
&& pip install -U "protobuf>=3.18.3,<4" \
RUN ${PIP} install --no-cache-dir --find-links ${NEURON_ARTIFACT_PATH}/pip \
${NEURON_ARTIFACT_PATH}/pip/${NEURONX_CC_VERSION} \
${NEURON_ARTIFACT_PATH}/pip/${NEURONX_FRAMEWORK_VERSION} \
${NEURON_ARTIFACT_PATH}/pip/${NEURONX_TRANSFORMERS_VERSION} \
&& ${PIP} install -U "protobuf>=3.18.3,<4" \
"transformers==4.45.*" \
torchserve==${TORCHSERVE_VERSION} \
torch-model-archiver==${TORCHSERVE_VERSION} \
&& pip install --no-deps --no-cache-dir -U torchvision==0.20.* \
&& pip install --no-deps -U --extra-index-url https://pip.repos.neuron.amazonaws.com neuronx_distributed==$NEURONX_DISTRIBUTED_VERSION \
&& pip install -U --extra-index-url https://pip.repos.neuron.amazonaws.com neuronx_distributed_inference==$NEURONX_DISTRIBUTED_INFERENCE_VERSION
&& ${PIP} install --no-deps --no-cache-dir -U torchvision==0.20.* \
&& ${PIP} install --no-deps --find-links ${NEURON_ARTIFACT_PATH}/pip -U ${NEURON_ARTIFACT_PATH}/pip/${NEURONX_DISTRIBUTED_VERSION} \
&& ${PIP} install --no-deps --find-links ${NEURON_ARTIFACT_PATH}/pip -U ${NEURON_ARTIFACT_PATH}/pip/${NEURONX_DISTRIBUTED_INFERENCE_VERSION}

RUN rm -rf ${NEURON_ARTIFACT_PATH}

RUN useradd -m model-server \
&& mkdir -p /home/model-server/tmp /opt/ml/model \
@@ -138,11 +138,11 @@ RUN chmod +x /usr/local/bin/dockerd-entrypoint.py \
&& chmod +x /usr/local/bin/neuron-monitor.sh \
&& chmod +x /usr/local/bin/entrypoint.sh

ADD https://raw.githubusercontent.com/aws/deep-learning-containers/master/src/deep_learning_container.py /usr/local/bin/deep_learning_container.py
ADD https://raw.githubusercontent.com/aws-neuron/deep-learning-containers/main/docker/common/deep_learning_container.py /usr/local/bin/deep_learning_container.py

RUN chmod +x /usr/local/bin/deep_learning_container.py

RUN pip install --no-cache-dir "sagemaker-pytorch-inference==${SM_TOOLKIT_VERSION}"
RUN ${PIP} install --no-cache-dir "sagemaker-pytorch-inference==${SM_TOOLKIT_VERSION}"

# patch default_pytorch_inference_handler.py to import torch_neuronx
RUN DEST_DIR=$(python -c "import os.path, sagemaker_pytorch_serving_container; print(os.path.dirname(sagemaker_pytorch_serving_container.__file__))") \
@@ -167,4 +167,4 @@ EXPOSE 8080 8081
ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"]
CMD ["/usr/local/bin/entrypoint.sh"]

HEALTHCHECK CMD curl --fail http://localhost:8080/ping || exit 1
HEALTHCHECK CMD curl --fail http://localhost:8080/ping || exit 1
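
A note on image size: `COPY pip ...` followed by a later `RUN rm -rf ${NEURON_ARTIFACT_PATH}` removes the artifacts from the final filesystem, but the files still exist in the earlier `COPY` layer. One possible alternative, assuming the image is built with BuildKit, is a bind mount that exposes the build-context `pip` directory only for the duration of the `RUN` step. This is a sketch, not the PR's approach; the literal path mirrors the default value of `NEURON_ARTIFACT_PATH`, and the ARGs are assumed to hold wheel file names, as the paths in the diff suggest.

```dockerfile
# syntax=docker/dockerfile:1
# Sketch only: requires BuildKit, and the syntax directive above must be the
# first line of the Dockerfile. The "pip" directory from the build context is
# mounted read-only for this step and is never written into an image layer.
RUN --mount=type=bind,source=pip,target=/root/neuron_artifacts/pip \
    ${PIP} install --no-cache-dir --find-links /root/neuron_artifacts/pip \
        "/root/neuron_artifacts/pip/${NEURONX_CC_VERSION}" \
        "/root/neuron_artifacts/pip/${NEURONX_FRAMEWORK_VERSION}"
```

With this pattern neither the `COPY` nor the trailing `rm -rf` step is needed for the wheels; the same idea would apply to the `apt` directory.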
63 changes: 36 additions & 27 deletions docker/pytorch/training/2.5.1/Dockerfile.neuronx
@@ -3,21 +3,22 @@ FROM public.ecr.aws/docker/library/ubuntu:22.04
LABEL maintainer="Amazon AI"
LABEL dlc_major_version="1"

# Neuron SDK components version numbers
ARG NEURONX_DISTRIBUTED_VERSION=0.10.1
ARG NEURONX_DISTRIBUTED_TRAINING_VERSION=1.1.1
ARG NEURONX_CC_VERSION=2.16.372.0
ARG NEURONX_FRAMEWORK_VERSION=2.5.1.2.4.0
ARG NEURONX_COLLECTIVES_LIB_VERSION=2.23.135.0-3e70920f2
ARG NEURONX_RUNTIME_LIB_VERSION=2.23.112.0-9b5179492
ARG NEURONX_TOOLS_VERSION=2.20.204.0
# Neuron SDK pre-release packages
ARG NEURON_ARTIFACT_PATH=/root/neuron_artifacts
ARG NEURONX_RUNTIME_LIB_VERSION
ARG NEURONX_COLLECTIVES_LIB_VERSION
ARG NEURONX_TOOLS_VERSION
ARG NEURONX_FRAMEWORK_VERSION
ARG NEURONX_CC_VERSION
ARG NEURONX_DISTRIBUTED_VERSION
ARG NEURONX_DISTRIBUTED_TRAINING_VERSION

ARG PYTHON=python3.10
ARG PYTHON_VERSION=3.10.12
ARG PIP=pip3
ARG OMPI_VERSION=4.1.5

# This arg required to stop docker build waiting for region configuration while installing tz data from ubuntu 20
# This arg required to stop docker build waiting for region configuration while installing tz data from ubuntu 22
ARG DEBIAN_FRONTEND=noninteractive

# Python won’t try to write .pyc or .pyo files on the import of source modules
@@ -77,17 +78,14 @@ RUN apt-get update \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean

RUN echo "deb https://apt.repos.neuron.amazonaws.com focal main" > /etc/apt/sources.list.d/neuron.list
RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add -
# Copy Neuron artifacts into container for local installation
COPY pip ${NEURON_ARTIFACT_PATH}/pip/
COPY apt ${NEURON_ARTIFACT_PATH}/apt/

RUN apt-get update \
&& apt-get install -y \
aws-neuronx-tools=$NEURONX_TOOLS_VERSION \
aws-neuronx-collectives=$NEURONX_COLLECTIVES_LIB_VERSION \
aws-neuronx-runtime-lib=$NEURONX_RUNTIME_LIB_VERSION \
&& rm -rf /var/lib/apt/lists/* \
&& rm -rf /tmp/tmp* \
&& apt-get clean
RUN apt-get install -y \
${NEURON_ARTIFACT_PATH}/apt/${NEURONX_TOOLS_VERSION} \
${NEURON_ARTIFACT_PATH}/apt/${NEURONX_COLLECTIVES_LIB_VERSION} \
${NEURON_ARTIFACT_PATH}/apt/${NEURONX_RUNTIME_LIB_VERSION}

# Install Open MPI
RUN mkdir -p /tmp/openmpi \
@@ -101,6 +99,15 @@ RUN mkdir -p /tmp/openmpi \
&& ldconfig \
&& rm -rf /tmp/openmpi

# Install packages and configure SSH for MPI operator in k8s
RUN apt-get update && apt-get install -y openmpi-bin openssh-server \
&& mkdir -p /var/run/sshd \
&& echo " UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config \
&& echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config \
&& sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean

# install Python
RUN wget -q https://www.python.org/ftp/python/$PYTHON_VERSION/Python-$PYTHON_VERSION.tgz \
&& tar -xzf Python-$PYTHON_VERSION.tgz \
@@ -140,16 +147,18 @@ RUN ${PIP} install --no-cache-dir -U \
transformers==4.36.2 \
Pillow

RUN ${PIP} config set global.extra-index-url https://pip.repos.neuron.amazonaws.com \
&& ${PIP} install --force-reinstall torch-neuronx==$NEURONX_FRAMEWORK_VERSION --extra-index-url https://pip.repos.neuron.amazonaws.com \
&& ${PIP} install --force-reinstall neuronx-cc==$NEURONX_CC_VERSION --extra-index-url https://pip.repos.neuron.amazonaws.com
RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt

RUN ${PIP} install --force-reinstall --find-links ${NEURON_ARTIFACT_PATH}/pip \
${NEURON_ARTIFACT_PATH}/pip/${NEURONX_CC_VERSION} \
${NEURON_ARTIFACT_PATH}/pip/${NEURONX_FRAMEWORK_VERSION}

RUN ${PIP} install --force-reinstall --no-deps neuronx_distributed==$NEURONX_DISTRIBUTED_VERSION --extra-index-url https://pip.repos.neuron.amazonaws.com
RUN ${PIP} install --no-deps --find-links ${NEURON_ARTIFACT_PATH}/pip -U ${NEURON_ARTIFACT_PATH}/pip/${NEURONX_DISTRIBUTED_VERSION}

## Installation for Neuronx Distributed Training framework
# Install Cython & wheel
RUN ${PIP} install --no-cache-dir Cython \
&& ${PIP} install --no-cache-dir wheel
&& ${PIP} install --no-cache-dir wheel

# Copy the apex_setup.py file
COPY apex_setup.py /root/apex_setup.py
@@ -168,8 +177,9 @@ RUN wget https://raw.githubusercontent.com/aws-neuron/neuronx-distributed-traini
"dill==0.3.8" \
"torch==2.5.1"

RUN ${PIP} install --no-deps --find-links ${NEURON_ARTIFACT_PATH}/pip -U ${NEURON_ARTIFACT_PATH}/pip/${NEURONX_DISTRIBUTED_TRAINING_VERSION}

RUN ${PIP} install --force-reinstall --no-deps neuronx_distributed_training==$NEURONX_DISTRIBUTED_TRAINING_VERSION --extra-index-url https://pip.repos.neuron.amazonaws.com
RUN rm -rf ${NEURON_ARTIFACT_PATH}

# attrs, neuronx-cc required: >=19.2.0, sagemaker <24,>=23.1.0
# protobuf neuronx-cc<4, sagemaker-training >=3.9.2,<=3.20.3
@@ -186,7 +196,7 @@ RUN ${PIP} install --no-cache-dir -U \
"urllib3>=1.26.0,<1.27"

# Install extra packages needed by sagemaker (for passing test_utility_packages_using_import)
RUN pip install --no-cache-dir -U \
RUN ${PIP} install --no-cache-dir -U \
"bokeh>=3.0.1,<4" \
"imageio>=2.22,<3" \
"opencv-python>=4.8.1.78" \
@@ -206,7 +216,6 @@ RUN cd $HOME \
&& ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify \
&& cd $HOME


# Clean up after apt update
RUN rm -rf /var/lib/apt/lists/* \
&& rm -rf /tmp/tmp* \
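
Because the `NEURONX_*` ARGs no longer carry defaults in this PR, a build that omits one `--build-arg` would expand to a path such as `${NEURON_ARTIFACT_PATH}/apt/` with an empty file name and fail later with a less obvious error. A fail-fast guard is one way to surface that early. The snippet below is a sketch under that assumption and is not part of the PR; only two of the ARGs are shown.

```dockerfile
# Sketch only, not part of this PR.
ARG NEURONX_TOOLS_VERSION
ARG NEURONX_CC_VERSION

# `${VAR:?message}` is POSIX shell parameter expansion: the step aborts with the
# message when the variable is unset or empty, so a missing --build-arg stops
# the build here instead of at the install step.
RUN : "${NEURONX_TOOLS_VERSION:?NEURONX_TOOLS_VERSION build arg is required}" \
 && : "${NEURONX_CC_VERSION:?NEURONX_CC_VERSION build arg is required}"
```

The guard works because ARG values are visible as environment variables inside `RUN` steps that follow the ARG declaration in the same build stage.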