
CUBLAS compilation issue on 4090 with make : "Unsupported gpu architecture 'compute_89'" Works with cmake or without -arch=native #1420

Closed
TheBloke opened this issue May 12, 2023 · 7 comments

Comments

@TheBloke
Contributor

TheBloke commented May 12, 2023

Current Behavior

When building llama.cpp with LLAMA_CUBLAS=1 make on a system with a 4090 or L40 GPU, I get the following failure:
nvcc fatal : Unsupported gpu architecture 'compute_89'

However, if I remove -arch=native from the Makefile, it compiles fine.

Do I just need to update the CUDA toolkit? But then why does it compile and work without that flag? CUDA 11.x is listed as compatible with compute 8.x.
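For reference, a quick way to compare what the installed nvcc supports with what the GPU reports (the nvidia-smi compute_cap query assumes a reasonably recent driver):

# virtual architectures this nvcc can target
nvcc --list-gpu-arch
# compute capability of the installed GPU (8.9 for a 4090 or L40)
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

If compute_89 is missing from the first list, -arch=native will fail exactly as shown below.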

Compilation failure:

g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c examples/common.cpp -o common.o
nvcc --forward-unknown-to-host-compiler -arch=native -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
nvcc fatal   : Unsupported gpu architecture 'compute_89'
make: *** [Makefile:124: ggml-cuda.o] Error 1

I can build it instead with: mkdir build && cd build && cmake -DLLAMA_CUBLAS=1 .. && cmake --build . --config Release and this works.

Workaround / fix

If I remove -arch=native from the Makefile line NVCCFLAGS = --forward-unknown-to-host-compiler -arch=native, then it compiles.

But will this result in a less optimised executable?
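A possible middle ground, instead of deleting the flag entirely, is to pin an architecture that this nvcc does understand; this is a sketch only, assuming the existing NVCCFLAGS line and a CUDA 11.x toolkit (sm_86 is an Ampere target that CUDA 11.6 accepts):

# Makefile: target Ampere explicitly; the embedded compute_86 PTX can be
# JIT-compiled for the 4090 (sm_89) by the driver at run time
NVCCFLAGS = --forward-unknown-to-host-compiler -arch=sm_86

This avoids the compute_89 error while keeping a newer code target than the nvcc default.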

Environment and Context

I'm using a Docker container based on Ubuntu 20.04 with CUDA 11.6.

CUDA is in PATH:

root@4f2326844e8c:~# echo $PATH
/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

Steps to Reproduce


  1. On a system with an NVIDIA 4090 or L40 GPU, run make clean && LLAMA_CUBLAS=1 make and observe the failure.
  2. On the same system, try CMake (mkdir build && cd build && cmake -DLLAMA_CUBLAS=1 .. && cmake --build . --config Release) and it works.
  3. Remove -arch=native from NVCCFLAGS = --forward-unknown-to-host-compiler -arch=native and try step 1 again; it will now compile.

Failure Logs

nvcc

root@4f2326844e8c:~/llama.cpp# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0

Log of failed compile with make:

root@4f2326844e8c:~/llama.cpp# make clean LLAMA_CUBLAS=1 make
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state build-info.h
make: *** No rule to make target 'make'.  Stop.
root@4f2326844e8c:~/llama.cpp# make clean && LLAMA_CUBLAS=1 make
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state build-info.h
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c llama.cpp -o llama.o
llama.cpp: In function ‘size_t llama_set_state_data(llama_context*, const uint8_t*)’:
llama.cpp:2624:36: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2624 |             kin3d->data = (void *) in;
      |                                    ^~
llama.cpp:2628:36: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2628 |             vin3d->data = (void *) in;
      |                                    ^~
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c examples/common.cpp -o common.o
nvcc --forward-unknown-to-host-compiler -arch=native -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
nvcc fatal   : Unsupported gpu architecture 'compute_89'
make: *** [Makefile:124: ggml-cuda.o] Error 1

Log of compiling successfully with CMake:

root@4f2326844e8c:~/llama.cpp# rm -rf build && mkdir build && cd build && cmake -DLLAMA_CUBLAS=1 .. && cmake --build . --config Release
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.25.1")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Check if compiler accepts -pthread
-- Check if compiler accepts -pthread - yes
-- Found Threads: TRUE
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.6.124")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 11.6.124
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- GGML CUDA sources found, configuring CUDA architecture
-- Configuring done (2.3s)
-- Generating done (0.0s)
-- Build files have been written to: /root/llama.cpp/build
[  3%] Built target BUILD_INFO
[  6%] Building C object CMakeFiles/ggml.dir/ggml.c.o
[  9%] Building CUDA object CMakeFiles/ggml.dir/ggml-cuda.cu.o
[  9%] Built target ggml
[ 12%] Building CXX object CMakeFiles/llama.dir/llama.cpp.o
/root/llama.cpp/llama.cpp: In function ‘size_t llama_set_state_data(llama_context*, const uint8_t*)’:
/root/llama.cpp/llama.cpp:2624:36: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2624 |             kin3d->data = (void *) in;
      |                                    ^~
/root/llama.cpp/llama.cpp:2628:36: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2628 |             vin3d->data = (void *) in;
      |                                    ^~
[ 15%] Linking CXX static library libllama.a
[ 15%] Built target llama
[ 18%] Building CXX object tests/CMakeFiles/test-quantize-fns.dir/test-quantize-fns.cpp.o
[ 21%] Linking CXX executable ../bin/test-quantize-fns
[ 21%] Built target test-quantize-fns
[ 25%] Building CXX object tests/CMakeFiles/test-quantize-perf.dir/test-quantize-perf.cpp.o
[ 28%] Linking CXX executable ../bin/test-quantize-perf
[ 28%] Built target test-quantize-perf
[ 31%] Building CXX object tests/CMakeFiles/test-sampling.dir/test-sampling.cpp.o
/root/llama.cpp/tests/test-sampling.cpp: In function ‘void test_top_k(const std::vector<float>&, const std::vector<float>&, int)’:
/root/llama.cpp/tests/test-sampling.cpp:22:44: warning: unused parameter ‘expected_probs’ [-Wunused-parameter]
   22 |                 const std::vector<float> & expected_probs,
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/root/llama.cpp/tests/test-sampling.cpp: In function ‘void test_top_p(const std::vector<float>&, const std::vector<float>&, float)’:
/root/llama.cpp/tests/test-sampling.cpp:46:44: warning: unused parameter ‘expected_probs’ [-Wunused-parameter]
   46 |                 const std::vector<float> & expected_probs,
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/root/llama.cpp/tests/test-sampling.cpp: In function ‘void test_tfs(const std::vector<float>&, const std::vector<float>&, float)’:
/root/llama.cpp/tests/test-sampling.cpp:71:44: warning: unused parameter ‘expected_probs’ [-Wunused-parameter]
   71 |                 const std::vector<float> & expected_probs,
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/root/llama.cpp/tests/test-sampling.cpp: In function ‘void test_typical(const std::vector<float>&, const std::vector<float>&, float)’:
/root/llama.cpp/tests/test-sampling.cpp:94:44: warning: unused parameter ‘expected_probs’ [-Wunused-parameter]
   94 |                 const std::vector<float> & expected_probs,
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/root/llama.cpp/tests/test-sampling.cpp: In function ‘void test_repetition_penalty(const std::vector<float>&, const std::vector<int>&, const std::vector<float>&, float)’:
/root/llama.cpp/tests/test-sampling.cpp:119:44: warning: unused parameter ‘expected_probs’ [-Wunused-parameter]
  119 |                 const std::vector<float> & expected_probs,
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/root/llama.cpp/tests/test-sampling.cpp: In function ‘void test_frequency_presence_penalty(const std::vector<float>&, const std::vector<int>&, const std::vector<float>&, float, float)’:
/root/llama.cpp/tests/test-sampling.cpp:148:44: warning: unused parameter ‘expected_probs’ [-Wunused-parameter]
  148 |                 const std::vector<float> & expected_probs,
      |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
[ 34%] Linking CXX executable ../bin/test-sampling
[ 34%] Built target test-sampling
[ 37%] Building CXX object tests/CMakeFiles/test-tokenizer-0.dir/test-tokenizer-0.cpp.o
/root/llama.cpp/tests/test-tokenizer-0.cpp:19:2: warning: extra ‘;’ [-Wpedantic]
   19 | };
      |  ^
[ 40%] Linking CXX executable ../bin/test-tokenizer-0
[ 40%] Built target test-tokenizer-0
[ 43%] Building CXX object examples/CMakeFiles/common.dir/common.cpp.o
[ 43%] Built target common
[ 46%] Building CXX object examples/main/CMakeFiles/main.dir/main.cpp.o
[ 50%] Linking CXX executable ../../bin/main
[ 50%] Built target main
[ 53%] Building CXX object examples/quantize/CMakeFiles/quantize.dir/quantize.cpp.o
[ 56%] Linking CXX executable ../../bin/quantize
[ 56%] Built target quantize
[ 59%] Building CXX object examples/quantize-stats/CMakeFiles/quantize-stats.dir/quantize-stats.cpp.o
[ 62%] Linking CXX executable ../../bin/quantize-stats
[ 62%] Built target quantize-stats
[ 65%] Building CXX object examples/perplexity/CMakeFiles/perplexity.dir/perplexity.cpp.o
[ 68%] Linking CXX executable ../../bin/perplexity
[ 68%] Built target perplexity
[ 71%] Building CXX object examples/embedding/CMakeFiles/embedding.dir/embedding.cpp.o
[ 75%] Linking CXX executable ../../bin/embedding
[ 75%] Built target embedding
[ 78%] Building CXX object examples/save-load-state/CMakeFiles/save-load-state.dir/save-load-state.cpp.o
[ 81%] Linking CXX executable ../../bin/save-load-state
[ 81%] Built target save-load-state
[ 84%] Building CXX object examples/benchmark/CMakeFiles/benchmark.dir/benchmark-matmult.cpp.o
[ 87%] Linking CXX executable ../../bin/benchmark
[ 87%] Built target benchmark
[ 90%] Building CXX object pocs/vdot/CMakeFiles/vdot.dir/vdot.cpp.o
[ 93%] Linking CXX executable ../../bin/vdot
[ 93%] Built target vdot
[ 96%] Building CXX object pocs/vdot/CMakeFiles/q8dot.dir/q8dot.cpp.o
[100%] Linking CXX executable ../../bin/q8dot
[100%] Built target q8dot

Log of successful compile with make after removing -arch=native

root@4f2326844e8c:~/llama.cpp# make clean && LLAMA_CUBLAS=1 make
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state build-info.h
removed 'common.o'
removed 'ggml-cuda.o'
removed 'ggml.o'
removed 'llama.o'
removed 'main'
removed 'quantize'
removed 'quantize-stats'
removed 'perplexity'
removed 'embedding'
removed 'build-info.h'
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c llama.cpp -o llama.o
llama.cpp: In function ‘size_t llama_set_state_data(llama_context*, const uint8_t*)’:
llama.cpp:2624:36: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2624 |             kin3d->data = (void *) in;
      |                                    ^~
llama.cpp:2628:36: warning: cast from type ‘const uint8_t*’ {aka ‘const unsigned char*’} to type ‘void*’ casts away qualifiers [-Wcast-qual]
 2628 |             vin3d->data = (void *) in;
      |                                    ^~
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c examples/common.cpp -o common.o
nvcc --forward-unknown-to-host-compiler -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/main/main.cpp ggml.o llama.o common.o ggml-cuda.o -o main  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib

====  Run ./main -h for help.  ====

g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize/quantize.cpp ggml.o llama.o ggml-cuda.o -o quantize  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-cuda.o -o quantize-stats  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-cuda.o -o perplexity  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-cuda.o -o embedding  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include pocs/vdot/vdot.cpp ggml.o ggml-cuda.o -o vdot  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
@TheBloke TheBloke changed the title CUBLAS compilation issue on 4090 with make - "Unsupported gpu architecture 'compute_89'" ; works with cmake CUBLAS compilation issue on 4090 with make : "Unsupported gpu architecture 'compute_89'" Works with cmake or without -arch=native May 12, 2023
@slaren
Member

slaren commented May 12, 2023

It seems that your nvcc doesn't support compute_89, which is the compute capability of your 4090. Updating your CUDA toolkit should fix this.

@TheBloke
Contributor Author

> It seems that your nvcc doesn't support compute_89, which is the compute capability of your 4090. Updating your CUDA toolkit should fix this.

But why does it work fine when I use cmake, or remove -arch=native?

@slaren
Member

slaren commented May 12, 2023

That option is meant to configure the architectures for which the CUDA code will be compiled. For local use, native should be best, since it will use the compute capability of your GPU. If you remove that option, nvcc will default to some older architecture, which will still work with your GPU since its capability is higher than anything your nvcc can compile to anyway, but performance may be lower. In practice the difference is probably going to be small, however: the code will still be JIT-compiled to your GPU architecture when you run the program, but it may not use all the capabilities of your GPU.

I am not sure why the result is different with cmake; it uses CUDA_SELECT_NVCC_ARCH_FLAGS "Auto", which should do the same as -arch=native (this should probably be changed for the CI builds). You would need to check the flags that cmake passes to nvcc to understand why it works.
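For anyone who wants to see those flags, a verbose rebuild prints the full nvcc command line; this assumes the Makefile generator used above (cmake --build --verbose requires CMake 3.14 or newer):

cd build
cmake --build . --config Release --clean-first --verbose 2>&1 | grep -E "nvcc|gencode"
# or, with the generated Makefiles directly:
make clean && make VERBOSE=1 2>&1 | grep -E "nvcc|gencode"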

@TheBloke
Contributor Author

OK, thank you very much for the detailed explanation. That makes sense now. I thought an unsupported compute arch would always be a hard failure, and therefore that I must have the right toolkit if it compiled with some arguments.

I will try to upgrade my Docker to CUDA 11.8 or 12.x for future compilations.
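For the record, a minimal sketch of that upgrade, assuming the container is built from NVIDIA's public CUDA images (the tag here is an example, not necessarily the one used in this setup); as far as I know, CUDA 11.8 is the first toolkit whose nvcc accepts compute_89:

# Dockerfile: base image whose nvcc knows the Ada (compute_89) architecture
FROM nvidia/cuda:11.8.0-devel-ubuntu20.04
# ... rest of the image unchanged ...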

@SigmaTAMU

Could you please tell me where the file "Makefile" is? I do not know which file needs to be modified.

@Backendmagier

Backendmagier commented Aug 26, 2024

fixed it for me: export TORCH_CUDA_ARCH_LIST=8.7
source: https://forums.developer.nvidia.com/t/nvcc-fatal-unsupported-gpu-architecture-compute-89/257060
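Note that TORCH_CUDA_ARCH_LIST only affects PyTorch CUDA extension builds, not the llama.cpp Makefile; a sketch of the usual pattern (the values and install command are illustrative):

# cap the arch list to something the installed toolkit supports,
# and include PTX so newer GPUs can still JIT the kernels
export TORCH_CUDA_ARCH_LIST="8.6+PTX"
pip install -v .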

@shivanraptor

shivanraptor commented Sep 23, 2024

> fixed it for me: export TORCH_CUDA_ARCH_LIST=8.7 source: https://forums.developer.nvidia.com/t/nvcc-fatal-unsupported-gpu-architecture-compute-89/257060

This didn't work for me. The core issue is the nvcc version: 11.x releases before 11.8 do not support compute_89 (i.e. the 4090's architecture). You need to update to CUDA Toolkit 12.6 and update the environment paths in .bashrc or .zshrc:

export PATH=/usr/local/cuda-12.6/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_HOME=/usr/local/cuda-12.6
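
After updating the paths, a quick sanity check (assuming the cuda-12.6 install location above):

source ~/.bashrc
which nvcc       # should print /usr/local/cuda-12.6/bin/nvcc
nvcc --version   # should report release 12.6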
