Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CUDA 11 environment installation config? #20

Closed
lmmx opened this issue Apr 4, 2021 · 5 comments
Closed

CUDA 11 environment installation config? #20

lmmx opened this issue Apr 4, 2021 · 5 comments

Comments

@lmmx
Copy link

lmmx commented Apr 4, 2021

Hey there, been having some trouble installing on RTX 30 series (which require CUDA 11), and was hoping I could get some tips on how to get it working (either from the developers or other users who've encountered the same problem?)

Not sure if this is being run on a CUDA 11 environment in your group at present (if so please let me know what I've missed!), I've not been able to run the global/ subdirectory code, beginning with GetCode.py on either Python 3.6 or 3.7 (while 3.8 and above are incompatible, they require TensorFlow 2.2).

I installed via conda as the pip installed TensorFlow was built against CUDA 10, and raised errors about missing *.so.10 libraries as a result, which disappeared when using the conda-forge package.

After getting an error that "Setting up TensorFlow plugin "fused_bias_act.cu": Failed!" I tried some advice on an NVIDIA forum post for StyleGAN2, to change line 135 of global/dnnlib/tflib/custom_ops.py to

compile_opts += f' --compiler-options \'-fPIC -D_GLIBCXX_USE_CXX11_ABI=1\''

However this had no effect: there still seems to be a failure to register the GPU.

To check whether I can use the environment I'm running

StyleCLIP/global $ python GetCode.py --code_type "w"

My environment setups (3.6 and 3.7 respectively) are as follows (after each is the error output for that environment)

Click to show setup for Python 3.6, CUDA 11.0.221, PyTorch 1.7.1, TensorFlow 1.14.0

conda create -n styleclip4
conda activate styleclip4
conda install -y "python<3.7" -c conda-forge # Python 3.6.13 (restricted by TensorFlow 1.x dependency)
conda install -y pytorch==1.7.1 torchvision "cudatoolkit<11.2" -c pytorch
# PyPi tensorflow-gpu package is built for CUDA 10, incompatible with 11, use conda-forge community package
conda install "tensorflow-gpu<2" -c conda-forge # 1.14.0
pip install git+https://github.com/openai/CLIP.git # forces pytorch 1.7.1 install
pip install pandas requests opencv-python matplotlib scikit-learn gdown
gdown https://drive.google.com/u/0/uc?id=1EM87UquaoQmk17Q8d5kYIAHqu0dkYqdT&export=download
git clone https://github.com/omertov/encoder4editing.git

Gives:

/home/louis/miniconda3/envs/styleclip4/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/louis/miniconda3/envs/styleclip4/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/louis/miniconda3/envs/styleclip4/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/louis/miniconda3/envs/styleclip4/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/louis/miniconda3/envs/styleclip4/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/louis/miniconda3/envs/styleclip4/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/louis/miniconda3/envs/styleclip4/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/louis/miniconda3/envs/styleclip4/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/louis/miniconda3/envs/styleclip4/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/louis/miniconda3/envs/styleclip4/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/louis/miniconda3/envs/styleclip4/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/louis/miniconda3/envs/styleclip4/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Setting up TensorFlow plugin "fused_bias_act.cu": Failed!
Traceback (most recent call last):
  File "GetCode.py", line 284, in <module>
    GetCode(Gs,random_state,num_img,num_once,dataset_name)
  File "GetCode.py", line 109, in GetCode
    dlatent_avg=Gs.get_var('dlatent_avg')
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 396, in get_var
    return self.find_var(var_or_local_name).eval()
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 391, in find_var
    return self._get_vars()[var_or_local_name] if isinstance(var_or_local_name, str) else var_or_local_name
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 297, in _get_vars
    self._vars = OrderedDict(self._get_own_vars())
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 286, in _get_own_vars
    self._init_graph()
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 151, in _init_graph
    out_expr = self._build_func(*self._input_templates, **build_kwargs)
  File "<string>", line 187, in G_main
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 232, in input_shape
    return self.input_shapes[0]
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 219, in input_shapes
    self._input_shapes = [t.shape.as_list() for t in self.input_templates]
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 267, in input_templates
    self._init_graph()
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 151, in _init_graph
    out_expr = self._build_func(*self._input_templates, **build_kwargs)
  File "<string>", line 491, in G_synthesis_stylegan2
  File "<string>", line 455, in layer
  File "<string>", line 99, in modulated_conv2d_layer
  File "<string>", line 68, in apply_bias_act
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/ops/fused_bias_act.py", line 72, in fused_bias_act
    return impl_dict[impl](x=x, b=b, axis=axis, act=act, alpha=alpha, gain=gain, clamp=clamp)
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/ops/fused_bias_act.py", line 132, in _fused_bias_act_cuda
    cuda_op = _get_plugin().fused_bias_act
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/ops/fused_bias_act.py", line 18, in _get_plugin
    return custom_ops.get_plugin(os.path.splitext(__file__)[0] + '.cu')
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/custom_ops.py", line 139, in get_plugin
    compile_opts += f' --gpu-architecture={_get_cuda_gpu_arch_string()}'
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/custom_ops.py", line 60, in _get_cuda_gpu_arch_string
    raise RuntimeError('No GPU devices found')
RuntimeError: No GPU devices found

Click to show setup for Python 3.7, CUDA 11.0.221, PyTorch 1.7.1, TensorFlow 1.14.0

conda create -n styleclip3
conda activate styleclip3
conda install -y "python<3.8" -c conda-forge # Python 3.7.10 (restricted by TensorFlow 1.x dependency)
conda install -y pytorch==1.7.1 torchvision "cudatoolkit<11.2" -c pytorch
# PyPi tensorflow-gpu package is built for CUDA 10, incompatible with 11, use conda-forge community package
conda install "tensorflow-gpu<2" -c conda-forge # 1.14.0
pip install git+https://github.com/openai/CLIP.git # forces pytorch 1.7.1 install
pip install pandas requests opencv-python matplotlib scikit-learn gdown
gdown https://drive.google.com/u/0/uc?id=1EM87UquaoQmk17Q8d5kYIAHqu0dkYqdT&export=download
git clone https://github.com/omertov/encoder4editing.git
/home/louis/miniconda3/envs/styleclip3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/louis/miniconda3/envs/styleclip3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/louis/miniconda3/envs/styleclip3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/louis/miniconda3/envs/styleclip3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/louis/miniconda3/envs/styleclip3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/louis/miniconda3/envs/styleclip3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/louis/miniconda3/envs/styleclip3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/louis/miniconda3/envs/styleclip3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/louis/miniconda3/envs/styleclip3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/louis/miniconda3/envs/styleclip3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/louis/miniconda3/envs/styleclip3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/louis/miniconda3/envs/styleclip3/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Setting up TensorFlow plugin "fused_bias_act.cu": Failed!
Traceback (most recent call last):
  File "GetCode.py", line 284, in <module>
    GetCode(Gs,random_state,num_img,num_once,dataset_name)
  File "GetCode.py", line 109, in GetCode
    dlatent_avg=Gs.get_var('dlatent_avg')
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 396, in get_var
    return self.find_var(var_or_local_name).eval()
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 391, in find_var
    return self._get_vars()[var_or_local_name] if isinstance(var_or_local_name, str) else var_or_local_name
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 297, in _get_vars
    self._vars = OrderedDict(self._get_own_vars())
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 286, in _get_own_vars
    self._init_graph()
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 151, in _init_graph
    out_expr = self._build_func(*self._input_templates, **build_kwargs)
  File "<string>", line 187, in G_main
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 232, in input_shape
    return self.input_shapes[0]
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 219, in input_shapes
    self._input_shapes = [t.shape.as_list() for t in self.input_templates]
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 267, in input_templates
    self._init_graph()
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/network.py", line 151, in _init_graph
    out_expr = self._build_func(*self._input_templates, **build_kwargs)
  File "<string>", line 491, in G_synthesis_stylegan2
  File "<string>", line 455, in layer
  File "<string>", line 99, in modulated_conv2d_layer
  File "<string>", line 68, in apply_bias_act
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/ops/fused_bias_act.py", line 72, in fused_bias_act
    return impl_dict[impl](x=x, b=b, axis=axis, act=act, alpha=alpha, gain=gain, clamp=clamp)
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/ops/fused_bias_act.py", line 132, in _fused_bias_act_cuda
    cuda_op = _get_plugin().fused_bias_act
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/ops/fused_bias_act.py", line 18, in _get_plugin
    return custom_ops.get_plugin(os.path.splitext(__file__)[0] + '.cu')
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/custom_ops.py", line 139, in get_plugin
    compile_opts += f' --gpu-architecture={_get_cuda_gpu_arch_string()}'
  File "/home/louis/dev/cv/StyleCLIP/global/dnnlib/tflib/custom_ops.py", line 60, in _get_cuda_gpu_arch_string
    raise RuntimeError('No GPU devices found')
RuntimeError: No GPU devices found

Requiring packages come from the anaconda channel instead for some reason is enforcing TensorFlow 1.10.0, which (once you get past some interface changes) still leads to the same GPU registration problem.

Click to show setup for Python 3.6, CUDA 11.0.221, PyTorch 1.7.1, TensorFlow 1.10.0 (anaconda channel)

conda create -n styleclip6
conda activate styleclip6
conda install -y "python<3.7" -c anaconda # Python 3.6.13 (restricted by TensorFlow 1.x dependency)
conda install -y pytorch==1.7.1 torchvision "cudatoolkit>=11,<11.2" -c pytorch
# PyPi tensorflow-gpu package is built for CUDA 10, incompatible with 11, use conda-forge community package
conda install "tensorflow-gpu<2" -c anaconda # 1.10.0
pip install git+https://github.com/openai/CLIP.git # forces pytorch 1.7.1 install
pip install pandas requests opencv-python matplotlib scikit-learn gdown
gdown https://drive.google.com/u/0/uc?id=1EM87UquaoQmk17Q8d5kYIAHqu0dkYqdT&export=download
git clone https://github.com/omertov/encoder4editing.git

CUDA can also be downloaded from conda-forge but upon installing pytorch, the package is superseded by the higher-priority cudatoolkit package in that channel (making it equivalent to the attempt above which failed)

I'm out of ideas so giving up at this point, please let me know if there's a solution!

@betterze
Copy link
Collaborator

betterze commented Apr 4, 2021

Dear lmmx,

Thank you for your interest in our work. Would you mind trying our colab?

In colab, the CUDA Version is 11.2, so I dont think there is a problem to use CUDA 11. Would you mind trying the setting we use in colab? By the way, we update the code a bit, maybe use the new version of the code.

Best Wishes,

Alex

@lmmx
Copy link
Author

lmmx commented Apr 4, 2021

I looked more closely, specifically at the version of TensorFlow (the tensorflow % 'magic' command specifying 1.x gives TF 1.15.2) so I tried to install this (I was previously getting 1.15.0), as follows

Click to show installation with CUDA 11 and Python 3.7

conda create -y -n styleclip10
conda activate styleclip10
conda install "cudatoolkit>=11,<11.2" -c pytorch
conda install -y "python<3.8" -c conda-forge # Python 3.7.10 (restricted by TensorFlow 1.x dependency)
pip install tensorflow-gpu==1.15.2
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 -f https://download.pytorch.org/whl/torch_stable.html
pip install git+https://github.com/openai/CLIP.git # forces pytorch 1.7.1 install
pip install pandas requests opencv-python matplotlib scikit-learn gdown
gdown https://drive.google.com/u/0/uc?id=1EM87UquaoQmk17Q8d5kYIAHqu0dkYqdT&export=download
git clone https://github.com/omertov/encoder4editing.git

cd global
python GetCode.py --code_type "w"

... running this gave errors about missing libcudart.so.10 and similar *.10 libraries (which indicate a requirement for CUDA 10).

I ran ! find / -iname "libcudart*" 2> /dev/null and found that the colab instance has both CUDA 10 and CUDA 11 libraries, and so I strongly suspect it is quietly picking up CUDA 10 libraries even though ! nvcc --version reports CUDA 11. This is the type of reproducibility bug that conda avoids (I won't be able to run something with a CUDA 10 dependency).

As far as I know, the only way to run TensorFlow for CUDA 11 is nvidia-tensorflow (Google is no longer updating/maintaining CUDA 11 binaries for 1.x, which I suspect the colab arrangement is a workaround for).

This changes the installation to:

conda create -y -n styleclip11
conda activate styleclip11
conda install -y "python<3.7" -c conda-forge # Python 3.6.13 (restricted by TensorFlow 1.x dependency)
pip install nvidia-pyindex
pip install nvidia-tensorflow==1.15.2 # only available for Python 3.6, replaces tensorflow-gpu==1.15.2
pip install nvidia-tensorboard
conda install "cudatoolkit>=11,<11.2" -c pytorch # 11.0.221
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 -f https://download.pytorch.org/whl/torch_stable.html
pip install git+https://github.com/openai/CLIP.git # forces pytorch 1.7.1 install
pip install pandas requests opencv-python matplotlib scikit-learn gdown
gdown https://drive.google.com/u/0/uc?id=1EM87UquaoQmk17Q8d5kYIAHqu0dkYqdT&export=download
git clone https://github.com/omertov/encoder4editing.git

cd global
python GetCode.py --code_type "w"

With this replacement for TensorFlow, the GPU successfully registered, however it still didn't succeed.

I received an error from cuBLAS, Blas GEMM launch failed. I couldn't figure out what was causing this, but eventually I just tried again with a newer version of nvidia-tensorflow (guessing that it may have been patched in a bugfix in a later version) and successfully got it working with 1.15.4 ! This is the last version of nvidia-tensorflow for Python 3.6 (and I don't think you can run 3.8 due to the restriction from PyTorch 1.7.1 which CLIP requires)

Long story short, this is the installation to get a working StyleCLIP setup for CUDA 11!

conda create -y -n styleclip12
conda activate styleclip12
conda install -y "python<3.7" -c conda-forge # Python 3.6.13 (restricted by TensorFlow 1.x dependency)
pip install nvidia-pyindex
pip install nvidia-tensorflow==1.15.4 # only available for Python 3.6, replaces tensorflow-gpu==1.15.2
conda install -y "cudatoolkit>=11,<11.2" -c pytorch # 11.0.221
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 -f https://download.pytorch.org/whl/torch_stable.html
pip install git+https://github.com/openai/CLIP.git # forces pytorch 1.7.1 install
pip install pandas requests opencv-python matplotlib scikit-learn gdown
gdown https://drive.google.com/u/0/uc?id=1EM87UquaoQmk17Q8d5kYIAHqu0dkYqdT&export=download
git clone https://github.com/omertov/encoder4editing.git

cd global
python GetCode.py --code_type "w"
  • This time nvidia-tensorboard was installed automatically with nvidia-tensorflow

Only took a dozen attempts 😅 Sorry for the overly detailed report, just wanted to explain how I arrived at this setup 😺

@lmmx lmmx closed this as completed Apr 4, 2021
@qo4on
Copy link

qo4on commented Jul 10, 2021

I also have troubles running StyleCLIP in Kaggle on Tesla P100.
Colab and Kaggle both use Python 3.7.10. In Kaggle, the CUDA Version is 11.0 (Build cuda_11.0_bu.TC445_37.28845127_0). I didn't find CUDA 10 here.

The setup throws some warnings but has completed without exception.

Setup code
import os

content_path = '/content'
os.makedirs(content_path, exist_ok=True)
os.chdir(content_path)
CODE_DIR = 'encoder4editing'

!apt-get install sudo > /dev/null
!pip install tensorflow-gpu==1.15.5 > /dev/null
!pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f   https://download.pytorch.org/whl/torch_stable.html > /dev/null
!git clone -q https://github.com/omertov/encoder4editing.git $CODE_DIR
!wget -q https://github.com/ninja-build/ninja/releases/download/v1.8.2/ninja-linux.zip > /dev/  null
!sudo unzip -qq ninja-linux.zip -d /usr/local/bin/
!sudo update-alternatives --quiet --install /usr/bin/ninja ninja /usr/local/bin/ninja 1 --force
!pip install -U ftfy regex tqdm gdown > /dev/null
!pip install git+https://github.com/openai/CLIP.git > /dev/null
!git clone -q https://github.com/orpatashnik/StyleCLIP

os.chdir(f'./{CODE_DIR}')

from argparse import Namespace
import time
import os
import sys
import numpy as np
from PIL import Image
import torch
import torchvision.transforms as transforms

sys.path.append(".")
sys.path.append("..")

from utils.common import tensor2im
from models.psp import pSp

%load_ext autoreload
%autoreload 2
debconf: delaying package configuration, since apt-utils is not installed
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-cloud 0.1.13 requires tensorflow<3.0,>=1.15.0, which is not installed.
fancyimpute 0.5.5 requires tensorflow, which is not installed.
dask-cudf 21.6.1+2.g101fc0fda4 requires cupy-cuda112, which is not installed.
cudf 21.6.1+2.g101fc0fda4 requires cupy-cuda110, which is not installed.
tensorflow-probability 0.13.0 requires gast>=0.3.2, but you have gast 0.2.2 which is incompatible.
tensorflow-cloud 0.1.13 requires tensorboard>=2.3.0, but you have tensorboard 1.15.0 which is incompatible.
pytorch-lightning 1.3.8 requires tensorboard!=2.5.0,>=2.2.0, but you have tensorboard 1.15.0 which is incompatible.
pyldavis 3.3.1 requires numpy>=1.20.0, but you have numpy 1.18.5 which is incompatible.
plotnine 0.8.0 requires numpy>=1.19.0, but you have numpy 1.18.5 which is incompatible.
pdpbox 0.2.1 requires matplotlib==3.1.1, but you have matplotlib 3.4.2 which is incompatible.
osmnx 1.1.1 requires numpy>=1.19, but you have numpy 1.18.5 which is incompatible.
matrixprofile 1.1.10 requires protobuf==3.11.2, but you have protobuf 3.17.3 which is incompatible.
imbalanced-learn 0.8.0 requires scikit-learn>=0.24, but you have scikit-learn 0.23.2 which is incompatible.
dask-cudf 21.6.1+2.g101fc0fda4 requires dask<=2021.5.1,>=2021.4.0, but you have dask 2021.6.2 which is incompatible.
dask-cudf 21.6.1+2.g101fc0fda4 requires distributed<=2021.5.1,>=2.22.0, but you have distributed 2021.6.2 which is incompatible.
WARNING: Running pip as root will break packages and permissions. You should install packages reliably by using venv: https://pip.pypa.io/warnings/venv
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pytorch-lightning 1.3.8 requires tensorboard!=2.5.0,>=2.2.0, but you have tensorboard 1.15.0 which is incompatible.
WARNING: Running pip as root will break packages and permissions. You should install packages reliably by using venv: https://pip.pypa.io/warnings/venv
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-cloud 0.1.13 requires tensorflow<3.0,>=1.15.0, which is not installed.
tensorflow-cloud 0.1.13 requires tensorboard>=2.3.0, but you have tensorboard 1.15.0 which is incompatible.
pytorch-lightning 1.3.8 requires tensorboard!=2.5.0,>=2.2.0, but you have tensorboard 1.15.0 which is incompatible.
WARNING: Running pip as root will break packages and permissions. You should install packages reliably by using venv: https://pip.pypa.io/warnings/venv
  Running command git clone -q https://github.com/openai/CLIP.git /tmp/pip-req-build-ox56xk7n
WARNING: Running pip as root will break packages and permissions. You should install packages reliably by using venv: https://pip.pypa.io/warnings/venv

But GetCode.py returns an error:

dataset_name='ffhq'
os.chdir(os.path.join(content_path, "StyleCLIP/global_directions"))
!python GetCode.py --dataset_name $dataset_name --code_type 'w'
Error
--2021-07-10 16:39:26--  https://nvlabs-fi-cdn.nvidia.com/stylegan2/networks/stylegan2-ffhq-config-f.pkl
Resolving nvlabs-fi-cdn.nvidia.com (nvlabs-fi-cdn.nvidia.com)... 52.84.169.105, 52.84.169.104, 52.84.169.37, ...
Connecting to nvlabs-fi-cdn.nvidia.com (nvlabs-fi-cdn.nvidia.com)|52.84.169.105|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 381673535 (364M) [application/x-www-form-urlencoded]
Saving to: ‘./model/stylegan2-ffhq-config-f.pkl’

stylegan2-ffhq-conf 100%[===================>] 363.99M  41.7MB/s    in 9.1s    

2021-07-10 16:39:37 (39.9 MB/s) - ‘./model/stylegan2-ffhq-config-f.pkl’ saved [381673535/381673535]

2021-07-10 16:39:37.238757: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/usr/local/cuda/lib64:/opt/conda/lib
2021-07-10 16:39:37.239082: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/usr/local/cuda/lib64:/opt/conda/lib
2021-07-10 16:39:37.239345: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/usr/local/cuda/lib64:/opt/conda/lib
2021-07-10 16:39:37.239599: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/usr/local/cuda/lib64:/opt/conda/lib
2021-07-10 16:39:37.239848: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/usr/local/cuda/lib64:/opt/conda/lib
2021-07-10 16:39:37.240127: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/usr/local/cuda/lib64:/opt/conda/lib
2021-07-10 16:39:37.240381: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/usr/local/cuda/lib64:/opt/conda/lib
2021-07-10 16:39:37.240417: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1662] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Setting up TensorFlow plugin "fused_bias_act.cu": 2021-07-10 16:39:37.884038: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/usr/local/cuda/lib64:/opt/conda/lib
2021-07-10 16:39:37.884184: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/usr/local/cuda/lib64:/opt/conda/lib
2021-07-10 16:39:37.884298: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/usr/local/cuda/lib64:/opt/conda/lib
2021-07-10 16:39:37.884427: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/usr/local/cuda/lib64:/opt/conda/lib
2021-07-10 16:39:37.884535: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/usr/local/cuda/lib64:/opt/conda/lib
2021-07-10 16:39:37.884648: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/usr/local/cuda/lib64:/opt/conda/lib
2021-07-10 16:39:37.884752: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib64:/usr/local/cuda/lib64:/opt/conda/lib
2021-07-10 16:39:37.884774: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1662] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Failed!
Traceback (most recent call last):
  File "GetCode.py", line 210, in <module>
    GetCode(Gs,random_state,num_img,num_once,dataset_name)
  File "GetCode.py", line 82, in GetCode
    dlatent_avg=Gs.get_var('dlatent_avg')
  File "/content/StyleCLIP/global_directions/dnnlib/tflib/network.py", line 396, in get_var
    return self.find_var(var_or_local_name).eval()
  File "/content/StyleCLIP/global_directions/dnnlib/tflib/network.py", line 391, in find_var
    return self._get_vars()[var_or_local_name] if isinstance(var_or_local_name, str) else var_or_local_name
  File "/content/StyleCLIP/global_directions/dnnlib/tflib/network.py", line 297, in _get_vars
    self._vars = OrderedDict(self._get_own_vars())
  File "/content/StyleCLIP/global_directions/dnnlib/tflib/network.py", line 286, in _get_own_vars
    self._init_graph()
  File "/content/StyleCLIP/global_directions/dnnlib/tflib/network.py", line 151, in _init_graph
    out_expr = self._build_func(*self._input_templates, **build_kwargs)
  File "<string>", line 187, in G_main
  File "/content/StyleCLIP/global_directions/dnnlib/tflib/network.py", line 232, in input_shape
    return self.input_shapes[0]
  File "/content/StyleCLIP/global_directions/dnnlib/tflib/network.py", line 219, in input_shapes
    self._input_shapes = [t.shape.as_list() for t in self.input_templates]
  File "/content/StyleCLIP/global_directions/dnnlib/tflib/network.py", line 267, in input_templates
    self._init_graph()
  File "/content/StyleCLIP/global_directions/dnnlib/tflib/network.py", line 151, in _init_graph
    out_expr = self._build_func(*self._input_templates, **build_kwargs)
  File "<string>", line 491, in G_synthesis_stylegan2
  File "<string>", line 455, in layer
  File "<string>", line 99, in modulated_conv2d_layer
  File "<string>", line 68, in apply_bias_act
  File "/content/StyleCLIP/global_directions/dnnlib/tflib/ops/fused_bias_act.py", line 72, in fused_bias_act
    return impl_dict[impl](x=x, b=b, axis=axis, act=act, alpha=alpha, gain=gain, clamp=clamp)
  File "/content/StyleCLIP/global_directions/dnnlib/tflib/ops/fused_bias_act.py", line 132, in _fused_bias_act_cuda
    cuda_op = _get_plugin().fused_bias_act
  File "/content/StyleCLIP/global_directions/dnnlib/tflib/ops/fused_bias_act.py", line 18, in _get_plugin
    return custom_ops.get_plugin(os.path.splitext(__file__)[0] + '.cu')
  File "/content/StyleCLIP/global_directions/dnnlib/tflib/custom_ops.py", line 139, in get_plugin
    compile_opts += f' --gpu-architecture={_get_cuda_gpu_arch_string()}'
  File "/content/StyleCLIP/global_directions/dnnlib/tflib/custom_ops.py", line 60, in _get_cuda_gpu_arch_string
    raise RuntimeError('No GPU devices found')
RuntimeError: No GPU devices found

I tried

!pip install nvidia-pyindex
!pip install nvidia-tensorflow==1.15.4

instead of !pip install tensorflow-gpu==1.15.5, and got different warnings in setup:

Setup warnings
debconf: delaying package configuration, since apt-utils is not installed
Collecting nvidia-pyindex
  Downloading nvidia-pyindex-1.0.9.tar.gz (10 kB)
Building wheels for collected packages: nvidia-pyindex
  Building wheel for nvidia-pyindex (setup.py) ... done
  Created wheel for nvidia-pyindex: filename=nvidia_pyindex-1.0.9-py3-none-any.whl size=8399 sha256=4e6e0fc2e1aadf5ee6ee40c16f85ac678628f544d1535a1418e51ee076538e4b
  Stored in directory: /root/.cache/pip/wheels/f1/a1/a1/6cc45cc1ae6b1876f12ef399c0d0d6e18809e9ced611c7c2a7
Successfully built nvidia-pyindex
Installing collected packages: nvidia-pyindex
Successfully installed nvidia-pyindex-1.0.9
WARNING: Running pip as root will break packages and permissions. You should install packages reliably by using venv: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
ERROR: Could not find a version that satisfies the requirement nvidia-tensorflow==1.15.4 (from versions: 0.0.1.dev4, 0.0.1.dev5)
ERROR: No matching distribution found for nvidia-tensorflow==1.15.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kornia 0.5.5 requires numpy<=1.19, but you have numpy 1.19.5 which is incompatible.
WARNING: Running pip as root will break packages and permissions. You should install packages reliably by using venv: https://pip.pypa.io/warnings/venv
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-cloud 0.1.13 requires tensorflow<3.0,>=1.15.0, which is not installed.
WARNING: Running pip as root will break packages and permissions. You should install packages reliably by using venv: https://pip.pypa.io/warnings/venv
  Running command git clone -q https://github.com/openai/CLIP.git /tmp/pip-req-build-ecayw2pl
WARNING: Running pip as root will break packages and permissions. You should install packages reliably by using venv: https://pip.pypa.io/warnings/venv

And another error in GetCode.py:

`GetCode.py` error
2021-07-10 17:15:00.744825: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
File "GetCode.py", line 7, in <module>
  from dnnlib import tflib  
File "/content/StyleCLIP/global_directions/dnnlib/tflib/__init__.py", line 9, in <module>
  from . import autosummary
File "/content/StyleCLIP/global_directions/dnnlib/tflib/autosummary.py", line 32, in <module>
  from . import tfutil
File "/content/StyleCLIP/global_directions/dnnlib/tflib/tfutil.py", line 18, in <module>
  import tensorflow.contrib   # requires TensorFlow 1.x!
ModuleNotFoundError: No module named 'tensorflow.contrib'

Do you have any idea how to fix this?

@qo4on
Copy link

qo4on commented Jul 11, 2021

It looks like nvidia-tensorflow does not work with Python 3.7.
Any workaround for running StyleCLIP_global.ipynb in Kaggle?

@armheb
Copy link

armheb commented Jul 13, 2021

I looked more closely, specifically at the version of TensorFlow (the tensorflow % 'magic' command specifying 1.x gives TF 1.15.2) so I tried to install this (I was previously getting 1.15.0), as follows

Click to show installation with CUDA 11 and Python 3.7
... running this gave errors about missing libcudart.so.10 and similar *.10 libraries (which indicate a requirement for CUDA 10).

I ran ! find / -iname "libcudart*" 2> /dev/null and found that the colab instance has both CUDA 10 and CUDA 11 libraries, and so I strongly suspect it is quietly picking up CUDA 10 libraries even though ! nvcc --version reports CUDA 11. This is the type of reproducibility bug that conda avoids (I won't be able to run something with a CUDA 10 dependency).

As far as I know, the only way to run TensorFlow for CUDA 11 is nvidia-tensorflow (Google is no longer updating/maintaining CUDA 11 binaries for 1.x, which I suspect the colab arrangement is a workaround for).

This changes the installation to:

conda create -y -n styleclip11
conda activate styleclip11
conda install -y "python<3.7" -c conda-forge # Python 3.6.13 (restricted by TensorFlow 1.x dependency)
pip install nvidia-pyindex
pip install nvidia-tensorflow==1.15.2 # only available for Python 3.6, replaces tensorflow-gpu==1.15.2
pip install nvidia-tensorboard
conda install "cudatoolkit>=11,<11.2" -c pytorch # 11.0.221
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 -f https://download.pytorch.org/whl/torch_stable.html
pip install git+https://github.com/openai/CLIP.git # forces pytorch 1.7.1 install
pip install pandas requests opencv-python matplotlib scikit-learn gdown
gdown https://drive.google.com/u/0/uc?id=1EM87UquaoQmk17Q8d5kYIAHqu0dkYqdT&export=download
git clone https://github.com/omertov/encoder4editing.git

cd global
python GetCode.py --code_type "w"

With this replacement for TensorFlow, the GPU successfully registered, however it still didn't succeed.

I received an error from cuBLAS, Blas GEMM launch failed. I couldn't figure out what was causing this, but eventually I just tried again with a newer version of nvidia-tensorflow (guessing that it may have been patched in a bugfix in a later version) and successfully got it working with 1.15.4 ! This is the last version of nvidia-tensorflow for Python 3.6 (and I don't think you can run 3.8 due to the restriction from PyTorch 1.7.1 which CLIP requires)

Long story short, this is the installation to get a working StyleCLIP setup for CUDA 11!

conda create -y -n styleclip12
conda activate styleclip12
conda install -y "python<3.7" -c conda-forge # Python 3.6.13 (restricted by TensorFlow 1.x dependency)
pip install nvidia-pyindex
pip install nvidia-tensorflow==1.15.4 # only available for Python 3.6, replaces tensorflow-gpu==1.15.2
conda install -y "cudatoolkit>=11,<11.2" -c pytorch # 11.0.221
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 -f https://download.pytorch.org/whl/torch_stable.html
pip install git+https://github.com/openai/CLIP.git # forces pytorch 1.7.1 install
pip install pandas requests opencv-python matplotlib scikit-learn gdown
gdown https://drive.google.com/u/0/uc?id=1EM87UquaoQmk17Q8d5kYIAHqu0dkYqdT&export=download
git clone https://github.com/omertov/encoder4editing.git

cd global
python GetCode.py --code_type "w"
  • This time nvidia-tensorboard was installed automatically with nvidia-tensorflow

Only took a dozen attempts Sorry for the overly detailed report, just wanted to explain how I arrived at this setup

Thanks for your solution and explanation. I have tried your recommended instruction for installing the tensorflow on RTX3090, now the tensorflow recognizes the GPU but when trying to run the GetCode.py, I get the following error:
nvcc fatal : Value 'sm_86' is not defined for option 'gpu-architecture'
any recommendations?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants